A Small Y-DNA Haplogroup:  E1b1b1a1b1a6 (L540)

22-Jul-15

Peter Gwozdz

pete2g2@comcast.net

News

 

            22 Jul 2015: A new haplogroup branch, A6295, has just been added to the L540 Tree.  We now have two branches defined for L540, and one of those is further subdivided into two, for 4 haplogroup divisions total, with 10 samples (men) assigned, and a couple more with results expected soon.

 

            1 Jun 2015:  Now is a good time to do some SNP testing.  The topic “Dividing L540;  Discovering New SNPs”, below, has a discussion about this.

 

            10 Feb 2015:  Simplified L540 Tree added.

 

            9 Feb 2015:  We now have the 5th Big Y result (mine) from L540.  This produced a new SNP and haplogroup defined by the code name Y7026.  This is our third division of L540.

 

            2 Feb 2015 News:  We now have 4 Big Y results from L540.  They split L540 into two haplogroup branches, each with two of these 4 Big Y samples (men).  These branches are defined by the corresponding two new SNPs with code names A783 and Z29042.

 

Abstract

            Rewrite 14 Dec 2014:

            This web document is a summary of my information on a small haplogroup of Y-DNA based on an SNP mutation named L540.  The subject is genetic genealogy.

            There is a Neighborhood table below with a list of samples (men) predicted to belong to the L540 haplogroup, and also samples predicted to be in the Neighborhood just outside L540.  The samples near the cutoff (borderline fit) are the ones that should be tested for the L540 SNP to see if they belong to the L540 haplogroup;  probability of belonging decreases with the STR step number, as explained in the discussion below that table.

            This Abstract is for people reasonably familiar with the jargon of genetic genealogy.  If you are new to genetic genealogy you might prefer to first read an Introduction that I wrote for another of my web documents.

            L540 was discovered in my Walk Through the Y (My WTY).  I purchased WTY, a commercial product (no longer available - replaced by Big Y).

            The clade that we now call L540 was originally called cluster C, a hypothetical cluster proposed as a division of haplogroup E1b1b1a1b1a, which is defined by the SNP named V13.  Hence I coined the name V13C.html for this document about cluster C in early 2010.  I rewrote this document and renamed it L540.html on 30 Apr 2011.  E1b1b1a1b1a (V13) is the largest haplogroup division of haplogroup E, but L540 is relatively small.

            I am not planning a separate L540 project, because it is more convenient to run this informally through the E-M35 project.  Villarreal and Lancaster have been very helpful.

            My equivalent name for V13C was “C type”, or just “C”.  I independently verified C type on 9 Jan 2010 as a good candidate type.  I use the word “type” to mean an STR cluster with statistical validity as established by my Mountain Method.  I published my methods in the Fall 2009 issue of JoGG.

            Two of the L540 positive samples in the Neighborhood table do not belong to C type, so C type is a predicted branch of L540.  I use “L type” to mean samples predicted to be L540 using STR values.

            C type is quite young, perhaps less than 1,000 years old.  The L540 mutation is older, maybe more than twice the age of C.  Such age estimates are uncertain, particularly with so little data in this case.

            L540 seems to be roughly 90% C type plus 10% other, older branches, as explained in the next topic about L540.

            Note 2 Feb 2015:  One of those two L type samples came out in the new Z29042 branch, along with a C type sample, so I need to update those last 3 paragraphs about L vs C.

            Watch this document.  I’ll add more information as data accumulates.

 

L540

            Update 31 Jan 2015:

            L540 is the code name for an SNP that was discovered in my WTY.  L540 was announced 29 March 2011.  On 27 Apr 2011 I demonstrated that L540 defines a new haplogroup branch of E1b1b1a1b1a (V13).

            I use the code name L540 for the SNP, for the associated haplogroup, and for the samples (men) in that haplogroup.

            The L540 haplogroup includes C type, and C type is most of L540.  The C type samples that have been tested are all L540+.

            The Neighborhood table below has my predictions for L540 and for C type.

            My sample was tested negative in 2011 for all the 7 ISOGG confirmed branches of V13, so L540 was an 8th branch of V13, not a branch of one of the prior 7 branches.

            Three of the 7 branches are too small - few or no samples available on-line for testing:  M35.2, V27, and P65.  In April 2014, V27 and P65 were dropped from the ISOGG tree (those two considered “private”) so L540 is now considered the 6th known branch of V13 in the ISOGG tree.

            Samples from the other 4 main branches of V13 were available;  I recruited and paid for testing as needed:  L143, L250, L17, and L241.  All are L540-.  That means none of these are branches of L540.

            ISOGG names change as new SNP divisions are discovered.  L540 was officially added 10 Jul 2011.  The code name is now E1b1b1a1b1a6 (L540).  Code names change as new branches are discovered.  For example in early 2011 V13, the father of L540, was called E1b1b1a2, so the original code for L540 was E1b1b1a2h, but the codes for V13 and L540 changed as new older SNP branch nodes were discovered.

            Steve Fix is now heading a project to discover new branches of V13, using Big Y data.  New branches are showing up more than one per month.  His V13 tree and discussion can be found at:

http://community.haplozone.net/index.php?topic=3657.msg36427#new

            The Fix tree, concentrating on newer branches, is very different than ISOGG’s.  V13 at ISOGG has not been updated in more than a year.  In Fix’s tree, L540 is a branch of a new haplogroup SNP named S3003, which is a branch of V13.

            The FTDNA tree was updated in mid 2014, with a large number of new branches for V13, but these are independent of the ISOGG tree, and these seem to be mostly very small branches (I call them twigs), not main branches of V13.  S3003 (and L540) seems to be a branch independent of all the FTDNA branches.

            I am working with others in an effort to find new SNP branches (twigs) for L540.  Forum topic:

http://community.haplozone.net/index.php?topic=3815.0

            We have identified 27 SNPs in the S3003 branch.  L540 seems to be equivalent to 18 of these.  S3003, older than the L540 set, seems to be equivalent to 6 others.  One is of undetermined location.

            We have Big Y tests on order for two more L540 samples.  Watch this web page and that haplozone topic for updates.

            Neighborhood table:  There are 19 samples confirmed positive with the L540 test;  17 of these are predicted C type and 2 are not.  There are 7 additional samples confidently predicted C type based on STR values;  these need the L540 test for confirmation.  In addition there are 26 more samples predicted C type with lower probability due to few STR markers;  these also need the L540 test.  The table also lists 8 STR neighbors that have tested L540- or L241+ (L241 is a neighbor haplogroup).

            In addition, many more neighbors are listed in the table where the L540 C type probability is low for each individual sample;  these would benefit from the L540 test.  Some of these would probably come out positive, meaning L540 but not C type, and most would come out L540 negative, meaning STR neighbors.  L540 negative samples in that table should try the S3003 test, which includes L540 and surely other branches not yet defined.

 

L540 Tree

 

            Updated 22 Jul 2015:

            L540 Tree in conventional outline format.  Click on a link in this tree for more discussion about that SNP or sample (male ancestral line).

            The Yfull tree http://www.yfull.com/tree/E-V13/ does not yet show the new A6295 branch.

 

V13

                                                            V13 has about 80 equivalent SNPs

            Other V13 branches

                                                            Z5018 and Z5016 are by far the largest branches of V13

            S3003

                                                            PGP89

                        L540

                                    A6295

                                                            Gwozdz

                                                            Nowak

                                    Y7026

                                                            Stavbom

                                                Z29042

                                                            Gebert

                                                            Roider

                                                            Donovan

                                                A783

                                                            Hochreiter

                                                            Svercl

                                                            Burlik - Stelz

 

Appell  Z29042-;  A783-;  needs A6295 & Y7026

 

V13

            Update 10 Feb 2015

            For a detailed V13 tree, see:

http://www.yfull.com/tree/E-V13/

http://isogg.org/tree/ISOGG_HapgrpE.html

            V13, in the E haplogroup, is a major branch of the human Y-DNA tree.  The L540 branch is a relatively small branch of V13.

            There are about 80 known SNP equivalents to V13.  V13 was the first to be discovered and the one used in most discussions about this haplogroup.  All but a very few V13 samples belong to L142 and CTS5856, so technically L540 is the main branch of the S3003 haplogroup, which is the main branch of the CTS5856 haplogroup, which is the main branch of the L142 haplogroup, which is the main branch of V13.  For simplicity the L540 Tree above minimizes these details.  I usually just say in this web page that L540 is a branch of V13.
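
            The branch structure described in this web page can be written down as a simple parent lookup.  The following is only a sketch in Python:  the PARENT table is my transcription of the L540 Tree above plus the L142 and CTS5856 chain from this topic, and the function names are made up for illustration.

```python
# Parent of each SNP-defined haplogroup, transcribed from this web page.
PARENT = {
    "L142": "V13",
    "CTS5856": "L142",
    "S3003": "CTS5856",
    "L540": "S3003",
    "A6295": "L540",
    "Y7026": "L540",
    "Z29042": "Y7026",
    "A783": "Y7026",
}

def lineage(snp):
    """Return the chain of haplogroups from V13 down to the given SNP."""
    chain = [snp]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return list(reversed(chain))

def untested_below(positive, negatives):
    """List child branches of `positive` not ruled out by a negative result."""
    return sorted(child for child, parent in PARENT.items()
                  if parent == positive and child not in negatives)

print(" > ".join(lineage("Z29042")))
# A sample that is L540+ but Z29042- and A783- still needs these tests:
print(untested_below("L540", {"Z29042", "A783"}))
```

            The second call reproduces the Appell entry below the L540 Tree:  negative results for Z29042 and A783 leave A6295 and Y7026 still to be tested.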

 

PGP89

            New Topic 10 Feb 2015:

            PGP89 is a sample from the Personal Genome Project (search Google for details).  PGP89 is S3003+ but L540-, so this sample represents an older node in the branch leading to L540.  So far there are no such S3003+ L540- results in the E-M35 Project.

 

Z29042

            New Topic 14 Jan 2015:

            This SNP was discovered by Steve Fix on 10 Jan 2015, from the Big Y data of Roider, compared to Gebert.  These two samples have this SNP, but Hochreiter does not, so Z29042 defines a new Haplogroup, the first branch to be found for L540.  Steve assigned the Z series code number.  Actually, there are 6 new SNP locations common to Roider and Gebert, but only Z29042 was assigned a code;  some of those others may be needed in the future.
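
            The comparison behind such a discovery amounts to set arithmetic:  keep the novel variants shared by the candidate branch members and drop any that also show up outside the branch.  Here is a minimal sketch in Python.  The variant names other than Z29042 are placeholders that I made up, not real Big Y results, and a real comparison would also have to check that each location was actually read (covered) in every sample.

```python
# Hypothetical novel-variant lists; every name except Z29042 is a placeholder.
novel = {
    "Roider": {"Z29042", "var_a", "var_b", "var_r1"},
    "Gebert": {"Z29042", "var_a", "var_b", "var_g1"},
    "Hochreiter": {"var_h1", "var_h2"},
}

def branch_candidates(members, outsiders, calls):
    """Variants shared by all members and absent from every outsider."""
    shared = set.intersection(*(calls[m] for m in members))
    seen_outside = set().union(*(calls[o] for o in outsiders))
    return shared - seen_outside

candidates = branch_candidates(["Roider", "Gebert"], ["Hochreiter"], novel)
print(sorted(candidates))
```

            Variants private to one sample (var_r1, var_g1 in this made-up data) drop out of the intersection;  they stay on file until a future sample matches one of them.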

            I’m a bit surprised.  I expected Roider to fall into a branch with Hochreiter, because they are closest in STRs.  Also, I have been predicting an older node for Gebert, based on his DYS389 value, and his STR values that differ from other L540 samples, more than they differ from each other.  STR predictions are statistical, because STRs mutate relatively rapidly.  So this is a surprise, but such surprises are expected from time to time when making predictions based on STRs.  I’ll have to rewrite a few paragraphs in this web page, to reflect our current knowledge of the structure of L540.

 

A783

            Update 10 Feb 2015:

            This SNP was noticed by Steve Fix and me in Hochreiter’s Big Y data, our first L540 Big Y.  Actually, there were 10 new SNPs;  I tested myself for them but came out negative.  Yseq assigned A series code numbers to them.  None of the 10 showed up in the Big Y data for Roider or Gebert.  In February I noticed this one in the Big Y data for Svercl, so it defines a new haplogroup branch for L540, with Hochreiter and Svercl, not me, not Roider, not Gebert.

 

Y7026

            New Topic 8 Feb 2015:

            This SNP is present in the Big Y Data for all 4 samples from Z29042 and A783.  But Y7026 is not present in my Big Y.  So Y7026 defines a haplogroup composed of both Z29042 and A783.

 

A6295

            New Topic 22 Jul 2015:

            This SNP has just been defined.  It is present in Nowak’s recent Big Y, and is also present in my Big Y from earlier this year.  So A6295 defines a new haplogroup, with only the two of us so far.

 

S3003

            Update Feb 2015:

            This SNP is in the L540 branch, but older.  PGP89 is a sample from the Personal Genome Project (search Google for details).  PGP89 is S3003+ but L540-, so this sample represents an older node in the branch leading to L540.  So far there are no such S3003+ L540- results in the E-M35 Project.  Technically, S3003 defines a haplogroup with branches PGP89 and also L540, but for simplicity I just say in this web page that L540 is a branch of V13.

 

Determining Your L540 Twig;  Dividing L540;  Discovering New SNPs

 

            News 22 Jul 2015:  FTDNA has a 15% off sale this week on Big Y (described in the following paragraphs).  Use code “Jerusalem2015”.  Also for men who have never tested Y-DNA, the Y-DNA37 is available for $119 ($30 off);  use code “Jerusalem30”;  this test will place you in a major haplogroup;  if you fall in V13 you should follow that with the V13 test panel, described in the following paragraphs.  For men already confirmed in L540, Yseq sells SNP tests at only $17.50 each (plus $5 for the sample kit for new customers);  all 4 SNPs dividing L540 are available at Yseq.

            Update 22 Jul 2015:

            I highly recommend the new V13 panel at Yseq.  This is a test for several SNP branches of V13;  it costs $88, plus $5 for a cheek swab kit if you have not already tested at Yseq.  This test will determine your branch position on the V13 Y-DNA tree, including L540 and 3 of its branches.  I recommend this for anyone in the V13 haplogroup, and I particularly recommend it for anyone in my L540 Neighborhood table below.  Of course, if you are planning Big Y, this V13 panel is not needed.

            Link with a description of this V13 panel test:  http://www.yseq.net/product_info.php?products_id=2486.  That description has a nice V13 tree.

            Please let me know if you are L540 and order this panel, so I can keep track of results.

            Big Y:  Discovering new SNPs is part of my genetic genealogy hobby.  I have been recently recruiting L540 members to purchase Big Y in order to discover new SNPs.  We hope to discover more SNPs to subdivide L540.  It’s not cheap.  $575 for Big Y.  Anyone interested in joining this L540 project can order Big Y;  please contact me so I can keep track of the status.  With Big Y, there is no need for individual SNP testing.  In fact, with Big Y, many men immediately discover their own new small branch haplogroup.  I call these twigs of the Y-DNA tree.  If you don’t discover a new twig with Big Y immediately, that is because two samples with the same SNP are required to officially define a new haplogroup;  it is almost certain that a future Big Y test will match one of the new SNPs in your Big Y data, thereby defining your common haplogroup twig.  Any future haplogroup twigs that branch from your twig will also be already present in your Big Y data (unless someone uses a test that covers more of the Y chromosome).

            A783, Z29042, Y7026:  These are the first three known branches of L540.  These three SNPs are included in the V13 panel at Yseq, and they are also available as individual tests at Yseq.  I requested these SNPs from FTDNA, but only A783 and Z29042 are available, at $39 each;  FTDNA has announced that rare SNPs will no longer be added to their inventory.

            Click on SNP ordering for detailed instructions.  If your STRs closely match one of the samples already known to belong to one of the L540 SNP branches, you might save a few dollars by ordering only that one SNP test to verify your branch.  But at $88 the V13 panel (see the top paragraph of this topic) has all three of these plus many more at a price lower than three individual tests.  If you purchase Big Y, your results will include all the L540 SNPs (and many more) as “novel” SNPs if you are positive for any of them.

            Obviously, you can save money by waiting for other men to test first;  in a year we should have multiple branches discovered for L540;  you can wait and order an SNP for the branch that best matches your STRs.  But if you are interested in the hobby of discovering your twig on the Y-DNA tree, now is a good time to do some testing.

            For more specific discussion, click on  L540, A783, Z29042, Y7026, A6295, SNP ordering, and Big Y.

 

Cluster C

            Rewrite 15 Dec 2014:

            Clusters are based on STR correlation.  There are 22 samples predicted C type in the Neighborhood table, labeled “C”, with greater than 80% probability.  An additional 26 samples, most with fewer than 67 markers, are labeled “C” with less than 80% probability;  some of those have only about 50% probability of belonging to C.  Most of the C samples are listed at the E-M35 site and at the haplozone site, in the V13 + L540 branch.

            Friedman proposed cluster C in 2008, based on STR correlations, when the data was less than what is available today. 

            Cluster C includes me and my 3rd cousin (Gwozdz).

 

C Type

            Rewrite 15 Dec 2014:

            I use the word type for an STR cluster with statistical validity as established by my Mountain Method.  “Type” is my own term.  I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters.  By “type” I mean the cluster data, the hypothetical clade, the modal haplotype, and the set of all possible haplotypes, at any number of markers.  Accordingly, by “C type” I mean any or all of these 4 things.  I sometimes use just “C” as short for “C type”.  I also have a previous C type identified in R1a;  unrelated;  please don’t get confused.

            My analysis files define C type.  Sorry, it can be a bit confusing because I have multiple STR definitions for C type, for various marker sets.  The number of markers in my definitions changes slightly when new samples show up with unusual STR values.  I hope the meanings are clear from the context of my discussions in this web document.

            I also provide STR definitions for L540, discussed below, treating L540 as L type.

            C type is roughly 90% of L540.  My evidence:  Considering only the independent samples with 67 or more STR markers in the Neighborhood, there are 17 type L samples, and only 2 of them do not fit C type but have tested positive (L540+) in the L540 SNP test.  I say “independent” because some C type samples have been recruited based on known genetic relationship to C type men (my Gwozdz cousin, Kargul, Svercel);  these should not be counted in this estimate.  I do not include samples with fewer than 67 markers in this estimate, because type prediction has lower confidence with fewer markers.  Notice in the table that most borderline samples (near the L type cutoff) have L540 test results.  I say “roughly” for this estimate because I have been recruiting samples for this table at 67 markers with equal emphasis on borderline samples, in order to properly sample the STR borderline, but it is difficult to prove no bias toward samples that fit well;  also there is a chance of outliers showing up in the future:  either L540- among those samples that fit C type very well, or L540+ among samples beyond the borderline.  Also, with only 2 L540+ outside C type, the sampling confidence is not good.
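
            To give a feel for how loose this estimate is, here is a small Python sketch that puts a 95% interval around the fraction.  It assumes the counts above mean 15 C type out of 17 independent L type samples.  The Wilson score interval is a standard textbook choice that I am using here for illustration;  it is not part of my Mountain Method, and it does not account for the recruiting bias discussed above.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed counts: 17 independent L type samples, 2 of them not C type.
low, high = wilson_interval(15, 17)
print(f"C type share: {15 / 17:.0%}, interval {low:.0%} to {high:.0%}")
```

            With so few samples the interval is wide, which is why I say “roughly”.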

            The evidence today indicates that L540 is much older than C type - details below.

 

V13C

            Rewrite 15 Dec 2014:

            I coined the name V13C in 2010 to represent C type, cluster C, the hypothetical haplogroup, and the samples (men) in the hypothetical haplogroup.  This web document used to be named V13C.html.

            Now that C type is a subdivision of L540 I am editing away the name “V13C”, but I’ll continue to use “C type” for the hypothetical clade that is part of (most of) L540.

            V13 is the defining SNP for E1b1b1a1b1a, so I similarly use “V13” to mean the “father” haplogroup - the large branch in the Y-DNA tree from which L540 is a small branch.  I also use “V13” to mean the associated database of V13 samples at E-M35 or at Haplozone, or at other databases.

 

111 Markers

            Rewrite 17 Dec 2014:

            FTDNA provides STR markers in various sets.  The largest, a set of 111, was introduced in 2011.  Upgrades can be purchased for samples with fewer markers.  Obviously, matches and predictions are more accurate using more markers.  For 3 years, I had been recommending the 111 set to L540 members, hoping to discover STR correlations good enough to divide the L540 haplogroup into clusters with high confidence.

            I do not have a confident STR based division of L540.  L540 is dominated by C type, which seems to be young, and more like a bush than a tree.

            Today, SNPs are more important than STRs.  This is because the cost of discovering new SNPs has come down a lot.  SNPs define haplogroup divisions;  STRs only provide statistical predictions for haplogroups.

            Still, the set of 111 Markers is the most accurate way to find out which samples in the database for your haplogroup are your best matches, and statistically most likely to form a recent branch (node) in your male line ancestry, and a prediction of their order (older vs younger nodes).  As an example of the value of 111 STRs, I discovered DYS445=11 as an unusual mutation in my own Y, shared by my cousin, and also shared by Kargul, adding evidence that we form a twig in the L540 tree, perhaps restricted to Poland, perhaps only a few centuries old.  DYS445 is not available at less than 111 markers at FTDNA.  The rest of L540 has the value DYS445=10.  The value 11 does show up rarely elsewhere in V13, as an independent mutation, so although DYS445 is very slowly mutating it is not as slow as a typical SNP, so not as statistically reliable as an SNP.

            New clusters can still be discovered with STRs, as predictions for new haplogroups, which still need confirmation by discovery of a corresponding SNP.  However, STR analysis is yielding diminishing returns for the effort.  SNP discovery is now accelerating instead.

            C111Type.xls is my C type analysis using 111 Markers.  L111Type.xls is my L type analysis using 111 Markers.  I consider L type statistically equivalent to L540, although there may be rare exceptions in the future.  Those files use data from E-M35, downloaded March 2014;  since then two more L540 samples have been upgraded to 111 markers, and a few new samples have shown up in the Neighborhood, but the result of recent analysis is the same.  The Neighborhood table has been more recently updated with all data available.  The columns in that table use the results from these analysis files.  There are notes below that table that provide explanations.

            My Type.xls master file has instruction sheets explaining how my xls analysis files work.

 

67 Markers

            Update 17 Dec 2014:

            FTDNA provides a 67 marker standard set of STR markers.  I have been using this 67 set for analysis for more than 5 years.  Although the 111 set is more accurate, this 67 set is valuable because there are more samples at 67, and all samples with 111 are included.

            The 67 marker data includes a few L540+ samples where I paid for the extra data, but it also includes one sample that was in the neighborhood table but has been removed because I paid for the extra data, which showed it is not an L540 neighbor.  So my help is causing a slight bias in the data, if you compare the size (number of samples) to other haplogroups.  (Other haplogroups may or may not have similar bias.)

            L67Type.xls is my analysis of L type STRs, which are predicted equivalent to L540.  C67Type.xls is my analysis of C type STRs, which is a predicted branch of L540.  Those files use data from E-M35, downloaded March 2014;  since then a few more 67 marker samples have shown up but the result of recent analysis is the same.

            L45(67) means a modal haplotype for the L540 haplogroup using 45 of the 67 standard markers.  My definition for L540 is L45(67), all samples less than the cutoff (genetic distance, or step) 6.  L45(67) was new on 21 Mar 2013.  I also typed L45 into Ysearch as 479H7.  My Mar 2014 analysis L67Type.xls indicates that L45(67) is still my best STR definition for the L540 haplogroup.  L67(67) refers to the modal haplotype using all 67 markers.

            Similarly, C42(67) with cutoff 3 is my definition of C type, introduced 11 Mar 2014, Ysearch QAZ7P.
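
            A definition such as L45(67) with cutoff 6 can be read as:  count the steps between a sample and the modal haplotype, using only the markers in the definition, and accept the sample if the total is less than the cutoff.  The Python sketch below assumes the step is the plain sum of absolute differences in repeat counts;  my xls files may treat some markers differently.  The three markers and their values are made up to keep the example short.

```python
def step(sample, modal, mask):
    """Sum of absolute repeat-count differences over the masked markers."""
    return sum(abs(sample[m] - modal[m]) for m in mask)

def fits(sample, modal, mask, cutoff):
    """True when the sample is closer to the modal than the cutoff."""
    return step(sample, modal, mask) < cutoff

# Made-up three-marker illustration (a real definition uses 42 or 45 markers).
modal = {"DYS393": 13, "DYS390": 24, "DYS19": 13}
mask = ["DYS393", "DYS390", "DYS19"]
sample = {"DYS393": 13, "DYS390": 25, "DYS19": 13}

print(step(sample, modal, mask), fits(sample, modal, mask, cutoff=6))
```

            Markers left out of the mask simply do not contribute to the step, which is how a 45 marker definition can be applied to 67 marker data.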

            Please refer to the Neighborhood table, where these definitions and others provide columns.

            My previous definitions are available in those xls files, in the sheet “Haplotypes & Masks”.

 

37 Markers

            Update 17 Dec 2014:

            This and the following 2 topics are based on my Feb 2011 analysis.  These 3 topics are C type only;  data is not good enough yet for L540 analysis below 67 markers.

            C37.xls is my C type analysis using 37 Markers.  I no longer update my on-line xls files at less than 67 markers, but I do the analysis sufficiently to update the Neighborhood table, which has columns for the results.  The table shows how 37 marker predictions compare to 67 and 111 markers.  The 37 marker results predict all the C cluster samples, but the L samples are not distinguished, and there is no gap in the step separating the L540- samples.  The C cluster just barely qualifies to be called a “type” using my Mountain Method at 37 markers.

            FTDNA no longer sells the 25 and 12 STR marker standard sets.  The 37 marker set is sufficient as a first test for men curious to see in which main branch haplogroup they belong.  In the past, I highly recommended 67 markers, and I encouraged 111 markers, but SNPs are now more important.  37 STR markers are enough to identify the main branch of the Y-DNA tree;  any number of STR markers usually needs to be followed by SNP tests to identify the smaller branches, and even the twigs to which the sample belongs.  If cost is not an issue the 67 and 111 sets can more accurately distinguish closest matches.

            Actually, you can do better by comparing your 37 markers to the database;  if your best 3 or 4 fits (samples at lowest step from you) are all C type, you are very likely C type.  Even if only some of your best fits are C type, that is enough evidence to try the L540 and S3003 SNP tests for confirmation.
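
            That comparison can be sketched in Python as follows.  The database entries, kit names, marker names, and values are all hypothetical, and the step is assumed to be the plain sum of absolute differences;  the point is only to show sorting the database by step and looking at the labels of the closest few.

```python
def best_fits(me, database, markers, k=4):
    """Return the k database entries at lowest step from `me`."""
    def step(entry):
        return sum(abs(me[m] - entry["strs"][m]) for m in markers)
    return sorted(database, key=step)[:k]

# Hypothetical data: three markers only, made-up kits and values.
markers = ["m1", "m2", "m3"]
me = {"m1": 13, "m2": 24, "m3": 32}
database = [
    {"kit": "kit1", "label": "C", "strs": {"m1": 13, "m2": 24, "m3": 32}},
    {"kit": "kit2", "label": "C", "strs": {"m1": 13, "m2": 25, "m3": 32}},
    {"kit": "kit3", "label": "C", "strs": {"m1": 14, "m2": 24, "m3": 32}},
    {"kit": "kit4", "label": "other", "strs": {"m1": 13, "m2": 24, "m3": 30}},
    {"kit": "kit5", "label": "other", "strs": {"m1": 12, "m2": 22, "m3": 30}},
]

closest = best_fits(me, database, markers, k=3)
print([entry["label"] for entry in closest])
```

            If the closest few labels are all C, as in this made-up case, the sample is very likely C type;  a mix of labels is still enough reason to order the SNP tests.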

 

25 Markers

            Update 17 Dec 2014:

            C25.xls is my C type analysis using 25 Markers.  Again, the Neighborhood table has more recent results, where you can see that 25 markers provide a reasonable but not accurate prediction for C type. 

            The same 25 marker STR set has been used by other Y-DNA testing companies.  The Haplozone data include Sorenson data that is not in the E-M35 data, and 3 of these land in cluster C.  Those samples have kit numbers starting with “S” in the Neighborhood table below.

 

12 Markers

            Update 17 Dec 2014:

            I also have a file using only 12 markers for analysis:  C12.xls.  I used the full database but truncated that analysis to the closest 33 samples to keep the file small.  The 3 Sorenson samples in the table below are not in this database.

 

Best STR Markers

            Update 17 Dec 2014:

            STR markers that mutate relatively slowly are statistical indicators for clades in which they are recently mutated, but they are not perfect because of subsequent independent mutations.  When a clade has a few such good STR markers those provide a signature set of STR markers.  A signature is statistically expected to be a more probable indicator of a clade than just one marker.  Indeed cluster C is characterized by the Friedman Signature.  My definitions of C type and L540 use other helpful markers, not just the signature.

            My analysis files automatically rank markers, as useful for a definition, using a method that I published.  You can view my ranking in those xls files linked in the previous topics.  See row 11 of the Calculator sheet, and row 17 of the TypeRank sheet.  The exact ranking of markers varies slightly from month to month due to the random nature of mutation values in new samples, and due to the somewhat arbitrary cutoff that I use to restrict the database to the neighborhood (using too many samples provides a ranking of the father clade instead of the clade of interest).  For example a marker that ranks 6th one month might come out 4th or 5th or 7th or 8th the next month.

            An SNP that defines a haplogroup is very unlikely to have happened exactly at the time of the most recent common ancestor (TMRCA) of a haplogroup.  Most likely the SNP is somewhat older, because usually there are many generations between nodes.  By definition an SNP cannot be younger than the TMRCA.  Similarly, we can consider a hypothetical clade defined by a particular STR mutation, which is likely somewhat older than the TMRCA of that clade.  However, for clusters defined by signatures, and for types defined by definitions, one rare STR mutation that contributes to the signature might have happened shortly before or after the TMRCA of that cluster or type.

            Very slow mutators should make the best markers.  However the slowest are rarely mutated, so those with intermediate mutation rate show up more often as signature markers.  My Type.xls master file has the Chandler STR mutations rates, in the ASD sheet, row 5.  The ASD sheet is not usually included in my analysis files.

            Usually it is silly to speculate about clusters defined by a single STR value.  In this case, however, we have a hypothetical haplogroup, C type, which seems quite young, with relatively little STR variation, so some speculation is in order:

 

DYS389II = 32  (389II minus 389I = 19);  Best Marker for C type

            Update 1 Jan 2015:

            DYS389II=32 is one of the Friedman markers for the C cluster.  It is the best marker for C type.

            [Technical detail:  DYS389 is a compound marker, where 389I is the first STR chain and (389II minus 389I) is the second STR chain.  For C type the first chain is 389-1 = 389I = 13.  The second chain is 389-2 = 19.  389II = 13 + 19 = 32.  The marker of interest here is really 389-2 = 19 (389II minus 389I = 19).  However, 389I mutates more slowly and has the value 13 for all but one C sample so far and for almost all samples in the L540 neighborhood.  At Ysearch or Haplozone, both 389 markers need to be used together;  if one is omitted both are ignored.  My analysis files allow the 389-2 chain to be used alone in analysis, using 389-I only to calculate the difference.  However, I use both 389 values (or neither in some cases) in my published definitions to be compatible with other web sites.  In this discussion topic, by “32” I really mean 19 for the delta value.]
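
            The arithmetic in that technical detail is just a subtraction, shown here as a tiny Python sketch.  The function name is made up.  The second call assumes 389I = 13 for a sample with 389II = 30, which holds for almost all samples in the L540 neighborhood but is an assumption for any one sample.

```python
def dys389_second_chain(dys389_i, dys389_ii):
    """Length of the second STR chain: 389II minus 389I."""
    return dys389_ii - dys389_i

print(dys389_second_chain(13, 32))  # C type values: 32 - 13
print(dys389_second_chain(13, 30))  # 389II = 30, assuming 389I = 13
```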

            All STR marker sets by all DNA companies include the 389 pair (I have not noticed any exceptions).

            The two L540+ samples that do not fit cluster C, Fredeen and Gebert, have the ancestral value 30.  Butman, the closest STR match with L540-, also has 30.  So far, there are no C type with the ancestral 30 value, although a few have the value 31, which is not common in the neighborhood.  On this basis, it seems likely that the mutations from 30 to 31 to 32 happened close to the TMRCA for C type, and some time after the L540 mutation.  However, we cannot confidently conclude that those samples with 31 are older, because they may have independently mutated back from 32 to 31.  (We also cannot rule out a C type unique double size mutation incident, from 30 to 32, followed by back mutations to 31.)

            DYS389II is the only signature marker that clearly distinguishes C type from the 2 known L540+ samples that are not C type.  None of the other 110 markers in the standard set at FTDNA does this.  The other signature markers work well for both C type and L540.  It is possible with more data that another marker might statistically distinguish (weak correlation due to relatively rapid mutations).  On this basis, it seems likely (not certain) that the clade of descendants of the initial 389II mutation is the same as the C type clade.

            Those two non C L540 samples differ from C type by other markers that are not signature markers.  C type has only this one very good marker to distinguish from the other L540 samples.  Actually, a type does not need any very good markers if it is well isolated in haplospace.  A correlation using several markers can sometimes be very good, as it is here for C vs L at 111.

            The 32 value is rare throughout V13 but shows up in E-M35 branches outside V13.

            DYS389II (actually the delta value 389-2) ranks 43rd in Chandler mutation rates.  Near the middle.  So exceptions are expected, due to recent mutations.

            Speculation:

            Model A:  A mutation from 30 to 31 happened long after the L540 mutation not long before the TMRCA for C type.  A short time later another mutation happened from 31 to 32.  Most C type samples with 31 represent the oldest nodes, and only a minority are back mutations from 32 to 31.  The samples in the STR neighborhood with 31 that do not match C are independent mutations.  This seems to me the simplest model, so I favor it, but only tentatively.

            Model B:  There was a double mutation from 30 to 32 in one man not long before the TMRCA for C type.  Or two single mutations too close in time to be distinguished.  Almost all 31 in C are back mutations, most of them from a single subclade.

            Model C:  I can think of more complicated models.

            The data is not good enough to distinguish these models.  Maybe more data in the future will show correlation with other markers to distinguish models like A vs B.

            More likely, now that we are experiencing a deluge of new SNPs, we will soon find SNPs to divide L540 into official haplogroup twigs, and one of these might be equivalent to C type.

            Eventually, due to mutation luck, samples with the 30 value will show up in C type, isolated in haplospace together with the C samples.  Eventually, due to mutation luck, L540+ samples with 31 or 32 will show up that are too old to fit C type.

            Summary (for now):  The initial mutation to 31 (or double to 32) is very close to the same age as C type, so the mutation(s) defines C type, with high statistical confidence, but not with certainty.

 

DYS594 = 12;  Best Marker for L540 at 67 Markers

            Rewrite 1 Jan 2015:

            Reminder:  “L540” means here samples that have tested positive for the L540 SNP test (or will test positive in the future).  L540 is an official haplogroup.  “L type” is a statistically confident type based on STRs;  L type seems equivalent to L540 today, but there should be exceptions in the future, just due to the luck of mutations.  C type is my hypothetical branch of L540, not yet confirmed by an SNP.

            In my analysis, DYS594=12 is the best marker for L540 using the 67 marker set, and is also a good marker for C type.  594 is not in the 37 marker set.  594 helps a lot in defining C type and L540 using the 67 marker set, but does not distinguish C from the rest of L540.

            The 12 L540+ samples with 67 or more markers, including 2 that are not C type, all have the 594=12 value.  Butman, the closest STR match not predicted L540, indeed tested L540-, and has the ancestral 11.

            There are 18 L540 samples;  those 12 plus 6 more without the L540 test that fit C type very well;  all 18 have the 594=12 value.

            Two samples in the STR neighborhood have 594=12 but are L540-.  These are not a random sample;  I recruited them for the L540 test to find out if all 594=12 in the neighborhood are L540;  no, not all.

            The 594=12 value is more common in the L540 neighborhood than in the rest of the V13 data.  So I was wondering if 594=12 is an old mutation in the S3003 branch.  So I tested one of those two L540- samples with 594=12;  it came out S3003-, so it is clearly an independent mutation.  Also, considering the L241 haplogroup, some of those samples are in the neighborhood, but they have 594=11 except one sample that has the value 12, so that is also independent.

            DYS594 ranks 12th from the slowest in the 67 Chandler mutation rates.  Quite slow, so independent recent mutations should be rare.

            Summary:  The DYS594=12 STR mutation seems to be roughly the same age as the L540 SNP mutation.  There is not enough data to say if the 12 value is a little younger or a little older.

 

DYS636 = 12;  DYS504 = 14;  DYS561 = 17

Excellent Signature Markers for L540;  Available Only in the 111 Set

            Rewrite 8 Jan 2015:

            These three are not in the FTDNA 67 STR marker set, but are available in the extended 111 STR marker set.  They are each about as good as DYS594=12, previous topic.  They do not distinguish C from L.  There are other markers almost as good in the 111 set.  That’s why L77(111), my 111 marker definition for L type, works very well.

            Using the analysis file L111Type.xls to analyze these 4 best markers (including 594, previous topic):

            10 samples L type at 111 markers are all confirmed L540+.  (My cousin and I - both Gwozdz - do not show L540+ at the E-M35 SNP web page because my WTY discovery of L540 does not show at that page.)

            8 samples are statistically independent because I recruited my cousin and Kargul.

            That 111 marker analysis file includes the 42 nearest STR neighbors at 111.  52 samples total.

            DYS594 = 12:  All 10 L540 samples have 594=12.  41 neighbors have 594=11 and one has 10.  See the previous topic for discussion of 67 marker neighbors with the 12 value, most confirmed L540-.

            DYS636 = 12:  All 10 have 636=12, and all 42 neighbors have 636=11.

            DYS504 = 14:  8 of the 10 have 504=14, but the two exceptions are not really exceptions, because they have 504=15, representing an additional mutation.  One of the neighbors has 14;  the rest are all 13.

            DYS561 = 17:  9 of the 10 have 561=17;  one has 16.  Only 1 of the 42 nearest neighbors has 17;  the rest are mostly 16 with a few 15.
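            The counts just listed can be tabulated as a quick check.  The sketch below is only an illustration using the numbers quoted in this topic (10 L540+ samples and 42 neighbors at 111 markers);  the names counts and summarize are made up for this example and are not part of L111Type.xls.

```python
# Sample sizes quoted in the text for the 111 marker analysis.
N_L540 = 10
N_NEIGHBORS = 42

# marker -> (signature value, L540+ samples with it, neighbors with it)
counts = {
    "DYS594": (12, 10, 0),
    "DYS636": (12, 10, 0),
    "DYS504": (14, 8, 1),
    "DYS561": (17, 9, 1),
}


def summarize(counts, n_in, n_out):
    """Return marker -> (fraction of L540+ with value, fraction of neighbors with value)."""
    return {
        marker: (hits_in / n_in, hits_out / n_out)
        for marker, (_value, hits_in, hits_out) in counts.items()
    }


for marker, (frac_in, frac_out) in summarize(counts, N_L540, N_NEIGHBORS).items():
    print(f"{marker}: {frac_in:.0%} of L540+, {frac_out:.1%} of neighbors")
```

            A good signature marker has a high first fraction and a low second fraction.  The two DYS504 samples with the value 15 are counted here as misses, although the text treats them as an additional mutation on top of 14.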

            Kargul is that sole exception, with 561=16.  The other 9 L540+ samples match on all 4 of these signature markers (counting 504=15 as a match, as explained above).  As discussed in the Kargul topic below, Kargul is obviously a male line relative of mine from the past few centuries, so this exception seems to be an independent mutation back to the ancestral value.

            Butman is the closest STR neighbor at 111 (and also closest at 67).  Butman is confirmed L540-.  Butman has the ancestral values for all 4 of these, which is evidence that all 4 of these signature markers are close in age to the L540 mutation, not much older mutations in the S3003 father branch.

            Consider those two signature values among the neighbors:  Those are two different neighbors.  Those are not among the 9 closest STR neighbors.  This is evidence that those two are independent mutations.  This is further evidence that these signature markers are not much older than L540.

            Summary:  Same as the DYS594=12 Summary, previous topic.  These three STR mutations in the 111 set also seem to be roughly the same age as the L540 mutation.  Again, there is not enough data to say if any of them are a little younger or a little older.

 

Other Good Markers:  DYS390=25; DYS444=13; DYS406=11; DYS456=15; CDY=29,33; DYS447=25

            Rewrite 1 Jan 2015:

            These typically rank among the best in my analysis file at 67 markers, usually in about the order listed in the title here.  Good for both C type and L type (L540) prediction.  The exact ranking is very sensitive to the choice of database.  In the close neighborhood of L540 each of these markers does well, but using the entire E-M35 database these need to be used in combination to statistically distinguish L540 samples.  There is a cluster in E1b1b1c1a (M84) that has samples with the L540 signature (389II minus 389I)=19 and 594=12, but that interfering cluster in M84 has 444=11, two steps away from L540 at that 444 marker.

            Chandler rank for these, in the same order:  47th, 49th, 35th, 60th, 67th, 45th.  (67th means fastest;  discussion next topic)

            None of the other markers in the title here are as good at 67 markers as 389II for C type or as good as 594 for L540.  Each has some confounding independent mutations in the data.

 

CDY = 29,33;  Discussion

            Update 1 Jan 2015:

            The CDY pair is the most rapid mutator of the FTDNA standard STR markers.  It is unusual for CDY to be useful in a definition.

            In my C type and L type analysis files using 37 or more markers, the CDY pair usually ranks well in the definition.  Not good enough to be called a signature, but useful in helping to define a statistically significant cluster, which I call a type.

            (In previous versions of this web page, I speculated that CDY might have a mutation, in the CDY chain corresponding to CDYb, that effectively decreases the CDYb mutation rate in L540.  That was based on all samples coming out with CDYb=33 a few years ago.  Recent data has more mutations, so I no longer consider that speculation relevant;  I deleted that discussion.)

            (Check the web for “P1 palindrome” and “recLOH” if you are interested in the complications associated with CDY.  CDY is also called DYS724.)

            A simple explanation for why CDY works well:  Most markers work well for STR definitions of relatively young clusters.  The only markers that don’t work well are the ones that mutated in the oldest branches of the Y-DNA tree of that cluster, so that different branches have different values at those markers.  Usually different markers for different clusters.  Usually CDY mutates in various branches because CDY is a rapid mutator, but for relatively young clusters it is possible, just by luck, that the oldest branches all have the ancestral values.  This seems to be the case for C type, which is apparently not very old.  In fact, analysis file C111Type.xls shows that any number of markers from 5 (best 5 out of 111) to 111 can be used to extract C type with reasonable confidence;  C90(111) is simply my best choice.  Apparently, just due to the luck of random mutations, no CDY mutations showed up in the oldest branches of C type, so CDYa and CDYb rank 15th and 22nd in that file.  L540 is probably quite a bit older than C type, but there are only 2 L540 samples outside C type in the database so far;  in fact one of these does have a mutation at CDYa.

 

Signature C4

            Update 15 July 2011:

            An excellent signature for C type is (389I, 389II, 594, 444) = (13, 32, 12, 13).  Seven of 9 C type samples with 67 markers have this signature, and the two that miss are at step 1;  no other samples in the neighborhood have step 1;  Gebert is the only one at step 2.  In the vast E-M35 Haplozone database there is only one confounding sample at step 1, but that one is from E1b1b1c1a;  all others differ from this signature by 2 or more steps.  In other words, this 4 marker signature, cutoff 2, extracts all the C type samples and no others from V13 data.  Eventually, of course, exceptions will turn up.
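            A signature test of this kind can be sketched as follows.  This is a minimal illustration, not the Ysearch or Haplozone calculation:  it assumes step is the sum of absolute differences, with DYS389II converted to the delta value (389II minus 389I) before comparing.  The three sample haplotypes are hypothetical, made up for the example.

```python
C4_SIGNATURE = {"DYS389I": 13, "DYS389II": 32, "DYS594": 12, "DYS444": 13}
C4_CUTOFF = 2  # step below the cutoff means predicted C type


def signature_step(sample: dict, signature: dict) -> int:
    """Sum of absolute differences, treating DYS389II as the delta 389II - 389I."""
    s, m = dict(sample), dict(signature)
    if "DYS389I" in m and "DYS389II" in m:
        s["DYS389II"] = s["DYS389II"] - s["DYS389I"]
        m["DYS389II"] = m["DYS389II"] - m["DYS389I"]
    return sum(abs(s[marker] - m[marker]) for marker in m)


# Hypothetical samples for illustration only.
samples = {
    "exact match": {"DYS389I": 13, "DYS389II": 32, "DYS594": 12, "DYS444": 13},
    "389II = 31": {"DYS389I": 13, "DYS389II": 31, "DYS594": 12, "DYS444": 13},
    "444 = 11": {"DYS389I": 13, "DYS389II": 32, "DYS594": 12, "DYS444": 11},
}

for name, haplotype in samples.items():
    step = signature_step(haplotype, C4_SIGNATURE)
    print(name, step, "C type" if step < C4_CUTOFF else "not C type")
```

            The third hypothetical sample mimics the interfering M84 cluster mentioned earlier, which matches at 389 and 594 but sits two steps away at 444.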

            There are better markers than 389I.  I included that one because it enables C4 in the search function at the Haplozone site.

 

Friedman Signature

            Update 29 Mar 2014:

            The signature is (390, 389II, 447) = (25, 32, 25).

            Friedman had been calling this the “characteristic marker values” for cluster C at the Haplozone site before I started working on this, back in 2008, when there were only 9 samples available in cluster C, including mine.

            This original Friedman signature works surprisingly well by itself for samples with only 25 of the standard markers, but not with high confidence.  For more details, see the discussion about C3(25) below the Neighborhood Table.

            In early 2011 Friedman added 594=12 to the “characteristic marker values”, for 67 marker samples.  For more details, see the discussion about C4(67) below the Neighborhood Table.

            DYS389 is a compound marker, discussed above.

            Friedman used a more complicated analysis than just this simple signature in her C type assignments.  I do not know what her method was exactly, but most definitions (not all) that I tried, selecting well ranked markers, extracted the same samples that she did.

 

L540 Neighborhood

            7 Feb 2015 a few updates of markers

            5 Feb 2015 minor correction

            1 Feb 2015 change Gebert from L to C

            31 Jan 2015 one more L540+

            See the last paragraphs of the L540 topic at the top of this web page for a brief summary of this table

            5 Jan 2015 one L540+ added

            Update 14 Nov 2014:  complete table update;  check the News topic at the top of this web page

            Update 4 Oct 2014:  a few minor updates

            Update 10 Aug 2014:  a few minor updates

            Update 21 Jun 2014:  a few minor updates

            Update 24 May 2014:  3 more L540+ results added to the table

            Update 21 Apr 2014:

            L540 is small enough that I can insert a complete table here, including neighbors just beyond in STR values.

            Those numbers are STR step, which is mutation count from the Modal Haplotype;  the columns are explained more in the notes below the table.

            Violet numbers are L “paratype” (L540 samples not in C type), where step less than the cutoff means predicted members of the L540 haplogroup.

            Boldface Ancestor indicates a sample with Big Y data

            Boldface Type means confirmed:

                        + vs --- means confirmed positive vs negative by the L540 test.

                        L241 means positive for another haplogroup, implying negative for L540.

                        There are many more negative L540 results from outside this neighborhood (higher step).

            Red step numbers are predicted C type, a predicted branch of the L540 haplogroup.  Boldface means 80% or higher confidence that a future SNP will be discovered, confirming these samples in a future haplogroup branch.  C? means predicted C type at less than 80% confidence, based on the STR step number in boldface.

            Black step numbers are greater than the cutoff.  Without SNP testing, even with high step number, there is a low probability that a sample might be an outlier member of L540. 

            Data sources:  e = E-M35 project, h = Haplozone,  y = Ysearch

 

 

 

 

 

 

 

 

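| | | | | | | | Modal > | C90(111) | C111(111) | L77(111) | L3(111) | C42(67) | C67(67) | L45(67) | C4(67) | LnotC3(67) | C15(37) | C37(37) | C12(25) | C25(25) | C3(25) | C12(12) | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | Cutoff > | 7 | 16 | 8 | 2 | 3 | 12 | 6 | 2 | 1 | 6 | 8 | 2 | 3 | 2 | 1 | |
| Kit | Ysearch | L540 | Ancestor | Origin | Data | Type | Markers | | | | | | | | | | | | | | | | |
| N16800 | KFKGM | +(WTY) | Gwozdz | Poland | ehy | C | 111 | 3 | 9 | 3 | 0 | 0 | 7 | 1 | 1 | 5 | 3 | 5 | 1 | 4 | 1 | 2 | |
| N45041 | UQR4B | + | Hochreiter | Germany | ehy | C | 111 | 4 | 7 | 2 | 0 | 0 | 4 | 2 | 0 | 4 | 4 | 4 | 0 | 3 | 0 | 1 | |
| 155155 | | + | Svercl | Czech | eh | C | 111 | 4 | 15 | 5 | 0 | 1 | 8 | 1 | 0 | 5 | 1 | 5 | 0 | 2 | 0 | 2 | |
| N81304 | | (+) | Gwozdz | Poland | eh | C | 111 | 4 | 11 | 4 | 0 | 1 | 9 | 1 | 1 | 6 | 4 | 7 | 1 | 5 | 1 | 3 | 1 |
| 140927 | 9JM9U | + | Donovan | Prussia | ehy | C | 111 | 4 | 11 | 3 | 0 | 1 | 6 | 2 | 1 | 3 | 2 | 3 | 1 | 1 | 1 | 1 | |
| 175213 | 5XP46 | + | Burlik Stelz | Germany | ey | C | 111 | 5 | 12 | 5 | 0 | 2 | 7 | 2 | 0 | 4 | 2 | 5 | 1 | 2 | 0 | 0 | |
| 51282 | A9FVE | + | Wion | Germany | eh | C | 111 | 6 | 13 | 1 | 0 | 2 | 10 | 3 | 0 | 4 | 5 | 7 | 1 | 4 | 0 | 1 | |
| 199446 | TK98K | + | Kargol | Poland | ehy | C | 111 | 6 | 10 | 6 | 1 | 1 | 6 | 2 | 1 | 5 | 2 | 4 | 1 | 4 | 1 | 2 | |
| 225596 | 6S4J6 | + | Nowak | Poland | ehy | C | 111 | 6 | 12 | 5 | 0 | 1 | 9 | 4 | 0 | 4 | 2 | 5 | 0 | 0 | 0 | 0 | |
| 171456 | 79QF7 | + | Glasser | Germany | ehy | C | 111 | 6 | 9 | 3 | 0 | 0 | 3 | 1 | 0 | 4 | 1 | 3 | 0 | 1 | 0 | 0 | |
| 166692 | 8FTXT | + | Gebert | Germany | ehy | C | 111 | 13 | 17 | 7 | 0 | 6 | 10 | 5 | 2 | 0 | 2 | 6 | 2 | 4 | 2 | 3 | |
| 162917 | | + | Fredeen | Sweden | eh | L | 111 | 17 | 23 | 6 | 0 | 8 | 18 | 5 | 2 | 0 | 10 | 13 | 3 | 6 | 2 | 4 | |
| N91348 | | --- | Butman | England | e | X | 111 | 17 | 20 | 15 | 3 | 6 | 11 | 8 | 3 | 2 | 4 | 6 | 2 | 2 | 2 | 2 | |
| N39989 | 5N5MF | --- | Hohnloser | Germany | ehy | X | 111 | 20 | 28 | 20 | 4 | 11 | 18 | 12 | 3 | 2 | 7 | 9 | 2 | 3 | 2 | 3 | |
| 5960 | V93B3 | -(BigY) | Bartlett | England | ehy | X | 111 | 20 | 27 | 18 | 3 | 9 | 16 | 11 | 3 | 2 | 7 | 11 | 3 | 6 | 2 | 4 | |
| N58717 | CV7WB | (--) | Bartlett | Unknown | ehy | X | 111 | 21 | 29 | 19 | 3 | 9 | 18 | 12 | 3 | 2 | 8 | 13 | 3 | 8 | 2 | 5 | |
| 98212 | | L241 | Baber | England | ehy | X | 111 | 22 | 33 | 21 | 3 | 10 | 22 | 11 | 5 | 3 | 11 | 15 | 5 | 9 | 4 | 5 | |
| 105741 | 3FVPX | -(BigY) | Malay | Slovakia | ehy | X | 111 | 22 | 32 | 20 | 4 | 9 | 20 | 15 | 5 | 1 | 11 | 16 | 4 | 8 | 3 | 4 | |
| | | | 5 samples | | e | X | 111 | 23 | | | | | | | | | | | | | | | |
| | | --- | 5 samples | | e | X | 111 | 24 | | | | | | | | | | | | | | | |
| | | | 3 more | | e | X | 111 | 24 | | | | | | | | | | | | | | | |
| | | | 2 more | | e | X | 111 | | | 19 | | | | | | | | | | | | | |
| | | | V13 Modal | | e | X | 111 | 23 | 25 | 21 | 3 | 9 | 17 | 14 | 6 | 3 | 8 | 12 | 4 | 7 | 4 | 4 | |
| | | | L241 Modal | | e | X | 111 | 25 | 33 | 25 | 3 | 10 | 24 | 14 | 5 | 3 | 13 | 17 | 4 | 8 | 4 | 5 | |
| | | | L143 Modal | | e | X | 111 | 29 | 34 | 25 | 3 | 10 | 20 | 15 | 5 | 2 | 11 | 15 | 4 | 7 | 4 | 4 | |
| 320415 | | + | Micek | Russia | e | C | 67 | | | | | 0 | 4 | 1 | 0 | 4 | 1 | 4 | 0 | 2 | 0 | 1 | |
| 229581 | | | Zinin | Unknown | eh | C | 67 | | | | | 1 | 4 | 2 | 1 | 3 | 1 | 3 | 1 | 2 | 1 | 2 | |
| 200924 | | + | Ratuszni | Hungary | e | C | 67 | | | | | 2 | | 1 | | | | | | | | | |
| 262750 | | + | Svercel | Slovakia | eh | C | 67 | | | | | 2 | 8 | 2 | 0 | 4 | 1 | 4 | 0 | 1 | 0 | 1 | |
| 243901 | FSQXZ | | Stubblefield | Unknown | ehy | C | 67 | | | | | 2 | 10 | 2 | 0 | 3 | 5 | 9 | 1 | 6 | 0 | 2 | |
| E10751 | | | Schulz | Germany | | C | 67 | | | | | 2 | 7 | 3 | 2 | 5 | 5 | 7 | 2 | 5 | 2 | 4 | 3 |
| 6104 | 4HJ3D | | Boyd | Unknown | ehy | C | 67 | | | | | 2 | 10 | 3 | 0 | 4 | 4 | 9 | 0 | 1 | 0 | 0 | |
| 207878 | | | Frind | Germany | eh | C | 67 | | | | | 2 | 10 | 3 | 0 | 4 | 4 | 7 | 1 | 4 | 0 | 2 | |
| 70482 | 6HMRD | + | Ostholm | Sweden | ehy | C | 67 | | | | | 2 | 10 | 4 | 1 | 4 | 4 | 6 | 1 | 2 | 1 | 1 | |
| 226416 | | + | Sabieka | Belarus | eh | C | 67 | | | | | 2 | 11 | 5 | 2 | 5 | 7 | 10 | 3 | 7 | 2 | 5 | |
| 174240 | | | | Unknown | | C | 67 | | | | | 2 | 3 | 2 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 3 |
| | WHFQB | | Froetscher | Germany | y | C? | 67 | | | | | 3 | 15 | 6 | 1 | 3 | 8 | 12 | 1 | 4 | 1 | 2 | |
| B3807 | | | Stavbom | Sweden | eh | C? | 67 | | | | | 4 | | 5 | | | 4 | 9 | 1 | 5 | 0 | 4 | |
| B2670 | X2JH9 | | Blind | Germany | e | C? | 67 | | | | | 3 | | 7 | | | 5 | 7 | 2 | 5 | 2 | 2 | |
| 97005 | CBF87 | | Strejc | Austria | ehy | X? | 67 | | | | | 6 | 16 | 9 | 4 | 3 | 8 | 11 | 3 | 8 | 3 | 5 | |
| 310951 | | | Petrov | Russia | e | X? | 67 | | | | | 6 | 18 | 9 | 3 | 2 | 9 | 13 | 2 | 7 | 2 | 5 | |
| E7459 | 8K6VZ | | Casado | Croatia | ehy | X? | 67 | | | | | 6 | 17 | 11 | 5 | 2 | 9 | 14 | 5 | 9 | 4 | 6 | |
| E8272 | | L241 | Abdurrah | Kosovo | eh | X | 67 | | | | | 7 | 15 | 9 | 4 | 2 | 8 | 10 | 4 | 7 | 4 | 4 | |
| 25780 | 5DQ2B | | Wilson | England | ehy | X | 67 | | | | | 7 | 20 | 12 | 3 | 2 | 9 | 14 | 3 | 7 | 2 | 2 | |
| 143479 | | | Mastel | United K | eh | X | 67 | | | | | 7 | 17 | 10 | 4 | 3 | 8 | 12 | 4 | 8 | 4 | 3 | |
| 24437 | | | Harvison | Scotland | h | X | 67 | | | | | 7 | 23 | 16 | 3 | 3 | 14 | 17 | 2 | 6 | 2 | 3 | |
| 199300 | EJ4B6 | --- | McKrell | Unknown | ehy | X | 67 | | | | | 7 | 18 | 10 | 3 | 2 | 11 | 15 | 3 | 8 | 3 | 5 | |
| | | | 4 more | | y | X | 67 | | | | | 7 | | | | | | | | | | | |
| | | | 27 samples | | ey | X | 67 | | | | | 8 | | | | | | | | | | | |
| N109412 | BYHHR | | Howe | Unknown | eh | C? | 37 | | | | | | | | | | 1 | 3 | 0 | 2 | 0 | 0 | |
| 350864 | | + | Appell | Germany | e | C | 37 | | | | | | | | | | 1 | 3 | | | | | |
| 158091 | QHU8Y | + | Kline | Germany | eh | C | 37 | | | | | | | | | | 2 | 5 | 1 | 2 | 1 | 2 | |
| 141863 | W5JHS | | Pohl | Germany | eh | C? | 37 | | | | | | | | | | 5 | 8 | 1 | 3 | 1 | 3 | |
| N106293 | GJNU6 | | Beasley | USA | e | X? | 37 | | | | | | | | | | 6 | 14 | 7 | 12 | 1 | 8 | |
| 275510 | 3K5CF | + | Roider | Germany | e | C | 37 | | | | | | | | | | 7 | 9 | 0 | 4 | 0 | 1 | |
| | | | 8 more | | e | X? | 37 | | | | | | | | | | 7 | | | | | | |
| 177898 | B6CUR | | Miller | Germany | e | L? | 37 | | | | | | | | | | 8 | 13 | 5 | 8 | 4 | 5 | |
| | | | 7 more | | e | | 37 | | | | | | | | | | 7 | | | | | | |
| | Q8JRJ | | Spooner | USA | y | C? | 37 | | | | | | | | | | 1 | 3 | 0 | 1 | 0 | 0 | |
| | 9P4Z5 | | Sager | Germany | y | C? | 37 | | | | | | | | | | 2 | 5 | 1 | 1 | 1 | 1 | |
| | 2N3UM | | Oppitz | Germany | y | C? | 37 | | | | | | | | | | 3 | 7 | 1 | 4 | 1 | 2 | |
| | EDS4E | | Haenicke | Germany | y | C? | 37 | | | | | | | | | | 3 | 5 | 1 | 2 | 1 | 1 | |
| | V6X4V | | Fitze | Germany | y | C? | 37 | | | | | | | | | | 3 | 6 | 0 | 1 | 0 | 0 | |
| | 3K4Y2 | | Lintner | Germany | y | C? | 37 | | | | | | | | | | 4 | 8 | 0 | 4 | 0 | 1 | |
| | 4Q933 | | Kephart | USA | y | C? | 37 | | | | | | | | | | 4 | 6 | 2 | 3 | 2 | 2 | |
| | WME5S | | Cervenka | Hungary | y | C? | 37 | | | | | | | | | | 5 | 10 | 1 | 6 | 0 | 2 | |
| | UF6K3 | | Spatz | Poland | y | X? | 37 | | | | | | | | | | 6 | 9 | 3 | 6 | 3 | 4 | |
| | | | More | | y | X? | 37 | | | | | | | | | | 7 | | 0 | | | | |
| S10193 | | | Engel | Germany | h | C? | 34 | | | | | | | | | | | | 0 | 1 | 0 | 1 | |
| S10194 | | | Kochtitizky | Hungary | h | C? | 34 | | | | | | | | | | | | 0 | 3 | 0 | 1 | |
| 294225 | | | Belinskiy | Russia | eh | C? | 34 | | | | | | | | | | | | 0 | | | 1 | |
| A10196451 | | | Stavbom | Sweden | h | C? | 34 | | | | | | | | | | | | 1 | 6 | 0 | 4 | |
| 285764 | | | Stavbom | Sweden | e | C? | 12 | | | | | | | | | | | | | | | | |
| A2983 | | | Undisclosed | Austria | h | C? | 34 | | | | | | | | | | | | 1 | 4 | 1 | 1 | |
| S10195 | | | Jordan | Germany | h | C? | 34 | | | | | | | | | | | | 2 | | | | |
| | | | 22 more | | eh | X? | 25 | | | | | | | | | | | | 2 | | | | |
| N26163 | R38X2 | | Fritsch | Czech | ehy | C? | 12 | | | | | | | | | | | | | | | 0 | 2 |
| N39377 | | | Obendorf | Germany | eh | C? | 12 | | | | | | | | | | | | | | | 0 | 2 |
| N57225 | XKCE3 | | Livingston | Germany | ehy | C? | 12 | | | | | | | | | | | | | | | 0 | 2 |
| 284871 | | | Knotz | Austria | eh | C? | 12 | | | | | | | | | | | | | | | 0 | 2 |
| | Ysearch | | 6 more | | y | C? | 12 | | | | | | | | | | | | | | | 0 | 2 |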

 

Summary:  See the News topic at the top of this web page for a summary of this table.

 

Explanation of the modal haplotype columns in the table; update 29 Mar 2014:

            C90(111) is my new best modal haplotype definition for prediction of C type, a hypothetical branch of L540, using 90 of the 111 standard STR markers.  The cutoff is 7;  notice that there are no samples in the gap at steps 7 through 12.  No doubt future samples will show up in the gap, at which time the cutoff and the definition may need slight adjustment.  My analysis file http://www.gwozdz.org/C111Type.xls is available if you are interested in the details.

            C111 is the modal haplotype using the full 111 STR set.  It just barely works, with a gap of step one at the cutoff value of 16.

            L77(111) is my new best modal haplotype definition for prediction of L540 type, using 77 of the 111 standard STR markers.  L540 type is predicted equivalent to the haplogroup L540.  The cutoff for L540 type is 8;  notice that there are no samples in the gap at steps 8 through 14.  No doubt future samples will show up in the gap, at which time the cutoff and the definition may need slight adjustment.  My analysis file http://www.gwozdz.org/L111Type.xls is available if you are interested in the details.  Notice that two samples in the table meet L77 but not C90, so those two are predicted to belong to older branches of the Y-DNA tree - older than the node for C type.  C type samples do not differ much in STR values, which means the C type clade is quite young.  Those two L type samples that are not C type differ a lot more in STR values, meaning their nodes are probably older.  Future data may provide more such samples, at which time the L type definition will likely need adjustment.  In this respect, L type predictions are not as reliable as C type predictions;  there may be future samples with L77 step greater than 15 that turn out L540+.  Such a probability is low for each particular sample, so I used boldface “X” for samples in the table with step greater than 15, even without the L540 test, because each particular sample has greater than 80% confidence, in my estimation, of not belonging to L540.  The probability might be 1% or 20% that any one sample at L77 steps 15 to 20 might be an L540 outlier;  I do not know how to estimate such probability, except that I’m sure such probability is not very high and decreases with increasing step.
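            The cutoff rule can be written out as a small sketch.  The L77(111) step values below are copied from the table for six samples;  the function name predict_l540 is only a label for this illustration, and the rule (step less than the cutoff means predicted member) is the one stated above the table.

```python
L77_CUTOFF = 8  # cutoff for the L77(111) definition

# L77(111) step values copied from the Neighborhood table.
l77_steps = {
    "Gwozdz": 3,
    "Hochreiter": 2,
    "Gebert": 7,
    "Fredeen": 6,
    "Butman": 15,
    "Hohnloser": 20,
}


def predict_l540(step: int, cutoff: int = L77_CUTOFF) -> bool:
    """Predict L540 membership: True when step is below the cutoff."""
    return step < cutoff


for name, step in l77_steps.items():
    label = "predicted L540" if predict_l540(step) else "predicted outside L540"
    print(f"{name}: step {step}, {label}")
```

            For these six samples the prediction agrees with the L540 test results in the table:  the first four are L540+, and Butman and Hohnloser are L540-.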

            L3(111) is the L540 STR signature for L540 type, using only 3 markers, and it nicely selects only the confirmed L540+ samples.  L3 is (594,636,561) = (12,12,17).  That file L111Type.xls shows that any number of markers from 3 to 105 can be used to distinguish the L540 samples, but this will surely change in the future as L540 samples with unusual STR mutations show up in the database.  Actually, any one of those three L3 markers is sufficient to identify the L540 samples in this L540 Neighborhood table, but there are a few samples outside the V13 father haplogroup that match at any one of the three due to the large number of samples in the full E-M35 database.  Notice that the L3 signature markers are not among the standard 37 set.  The signature markers that strongly distinguish C type from the rest of L540 are all in the standard 67 set (see C4(67) below);  the data so far do not suggest more signature markers in the 111 set for C type, although a few are useful in the C90 definition discussed above.

            L111, the modal haplotype using the full 111 STR set, is not in the table, because L111 differs from C111 at only two STR markers, which are very variable, so for most samples L111 differs from C111 by step < 2. L77 step numbers differ from C90, because the same markers are not used;  C90 markers are selected to best distinguish C type, while L77 markers are selected to best predict L540+.

            Nearest Neighbors at 111 markers:  The table includes a few samples just beyond the C90 cutoff of step 7.  These are all the samples I could find with C90 step less than 23, or with L77 step less than 19.  These “near neighbors” act as a calibration of the modal haplotypes using 67 or fewer markers - data on the right side of the table at 111 markers.  I hope the meanings of those code names are obvious in light of the 5 examples explained above.  For more discussion, see the topics C67Type.xls, L67Type.xls, C37Type.xls, C25, and C12.  Notice that the definitions C42(67) and L45(67) work, but without a wide gap.  As expected, 67 STR markers provide less confidence in assignments than 111 markers.  At 37 or fewer markers there is yet less confidence, and the definitions fail for a few samples.

            LnotC3(67) is a signature that seems to predict L “paratype” samples (L540 not C type):  (439, Δ389, 413b) = (11, 17, 25).  My confidence is not high in this signature, because it is based on only two samples, which might be just the luck of random mutations.

            C4(67) is the signature used by Haplozone cluster C since before L540 was discovered:  (390, Δ389, 447, 594) = (25, 19, 25, 12).

            C3(25) is the original Friedman signature, proposed years ago.  For this table I used the difference for DYS389: (390, Δ389, 447) = (25, 19, 25).  The table shows that it still works remarkably well.  However, there is selection bias, because some samples at C3 step 1 were not included in the table because so far all these have 37 or more markers and do not fit C type using the corresponding definition in the table;  some of these belong to V13, the father haplogroup of L540.  At C3 step 0 there are only a few more samples in the database but these are from outside V13 so these are not listed in the table.  Note that Sabieka is L540+ at C3 step 2, and a few others at step 2 are predicted C type based on more markers.  On the other hand, the table has at C3 step 2 two samples L540- and a few more predicted well outside C type.  The table indicates 22 more samples with only 25 markers at C3 step 2;  no doubt a few of these might eventually test L540+, but the probability for each one individually seems to be low.

 

Explanation:  The simplest explanation for what this table means: Butman (N91348) has a male line with node in the Y-DNA tree slightly older than the L540 mutation, so Butman is the closest “neighbor” to L540 but not in the L540 haplogroup.  Gebert (166692) and Fredeen (162917) each have male lines with nodes in the Y-DNA tree younger than the L540 mutation.  The node for C type is much younger.  More complex explanations are possible;  for more discussion see the Structure topic.

 

Notes;  column in the table at the far right:

            1:  Sample from a 3rd cousin of an L540+ sample, so assumed L540+ without testing.

            2:  The 12 marker set at step zero (perfect match to the 12 marker modal haplotype) provides a low confidence prediction of which samples might benefit from the L540 test.

            3:  Two samples, E10751 and 174240, are not in the E-M35 Project;  these two were brought to my attention by Paul Svercl (in the table), who noticed them in an E haplogroup tree by Marko Heinila, but that tree is no longer on-line.

            I use the Ysearch method for calculating step, which gives a result slightly different than the Haplozone method.  There is also a one marker discrepancy mentioned in the Ysearch topic.

            If you are a neighbor and wish to be added to this table, please let me know.

 

Gwozdz

            My sample is kit N16800.  N81304 is my 3rd cousin Gwozdz.

 

Kargul

 

            Aloysius Kargul (Kargol) is my closest STR match available on the web.  Kit 199446.  In May 2010, his daughter noticed, on ancestry.com, that he and I are perfect matches at 12 markers.  I studied the LDS microfilms and located his 1820’s Kargul ancestor living in a village in Poland only 20 miles away from the village of my Gwozdz ancestor.  I paid for his FTDNA sample.  Kargul is in the table above.  His L540 test came out positive, placing him in that new haplogroup.  We are 5 steps apart at 67 markers;  9 at 111 (4 Jul 2011 update).

            For estimating the size of C type or L540, my cousin and Kargul should not be included, because I recruited them, paying for their tests.  Family sets such as these distort size estimates, compared to random data.  In other words, C type really has only 16 samples, not 18 (on 4 July 2011) if compared to other clusters (which should also be adjusted for family sets).

 

Butman

 

            New topic 13 May 2011.  Update 17 Jul 2011:  Butman’s L540 SNP test just came out negative;  that means he is not a member of the new L540 haplogroup.

            Raymond Butman, kit N91348, is right on the edge of the predicted C type using the old 61 marker definition.  This sample is a recent addition to the M35 database.  His step using the 61 marker definition is 8.  My cutoff for the definition before this sample showed up was 7 because of the gap - no samples from step 7 to 12 at that time.  When I wrote this topic after Butman’s sample showed up, I changed my cutoff to 9 and I pointed out here that this sample might land in V13C (L540), but it was a close call, not a confident prediction.

            Although this sample matches most of the markers of the definition, it misses at the two best signature markers, DYS389-2 and DYS594.

            I subsequently developed a new L540 definition that excludes Butman.

            What does this mean?  The simplest explanation:  Butman’s node in the Y-DNA tree (his male line common ancestor - branch joint) seems slightly older than the nodes of L540 members (in the database so far today).  His node is older than the L540 mutation, and also seems older than the DYS389-2 mutation, and also older than the DYS594 mutation.  Most of his other STR values match the L540 definition because his node is not much older, so there has not been much time for more mutations.

            This simplest explanation is a good statistical prediction, not a proof.  Other less likely explanations are possible.  For example, Butman might be an outlier from a clade very distant from L540, where most men have different STR values, where he might have many matching STR values to L540 due to the luck of random mutations in his male line.  Yet another possible explanation:  Butman might belong to a very small clade with a much older node with L540, even older than the node for the men at steps 7 and 8, but the ancestor at that node might just happen to have STR values very close to the values for the L540 ancestor, due to the luck of random mutations, and now Butman is the only sample available from that very small clade.

 

Gebert

            Update 19 Dec 2014:

            I noticed Gebert’s sample on Ysearch and encouraged him to join the E-M35 project, which he did in 2011, kit 166692 in the table.  I helped pay for the orders for the L540 test and for the 111 extension.  He purchased Big Y in 2014.  Although he is L540+, his STR values do not quite fit C type.  Also, his STR values produce a greater estimated genetic distance (step) from all C type samples than C type samples are from each other.  On this basis, I predicted that Gebert’s node in the tree is younger than L540 but older than C type.  Hochreiter’s Big Y produced 10 SNPs where Gebert is negative, so I hoped one of these would be a marker for C type, or at least younger than L540.  I tested all 10 of these for my sample, but I came up negative for all.  One explanation is that Gebert’s node is not much older than Hochreiter’s node, with all 10 of these in Hochreiter’s branch but not in my branch;  this would be a surprise because Gebert’s STRs imply an older node, so one or more of those 10 are expected on the main branch.  Another explanation is that Gebert has a statistically very unusual number of STR mutations, with back mutations in C type signature markers, and he is really a member of C type, with my node older than both of theirs;  this explanation also seems surprising.  So I don’t have a convincing analysis.  In retrospect, I should have purchased Big Y instead of just those 10 SNPs.  My Big Y is on order.  So is Svercl’s.

 

Fredeen

            Update 30 Sep 2011:

            Kit 162917, Fredeen, came up L540+ in May 2011.  This was significant because this was the first L540+ sample outside C type, later joined by Gebert.

            Logically, Fredeen and Gebert might be outliers from cluster C, with back mutations in 389II and multiple other mutations just due to the luck of random mutations.  This is very unlikely.

            These two samples most likely represent two old branches of L540, with older nodes than the C type node.  They do not match each other well at 67 markers.  Their closest matches at 67 markers are each other and C type samples.  There are no close STR matches to either in the 67 marker data.  Any future close STR matches would of course be predicted L540+ in the corresponding branch.

            Three markers are of interest here:  (385a,406,444) = (17,11,13) are the values for L540 and C type.  Most samples in the STR neighborhood just outside L540 have the ancestral values (16,12,12).  Fredeen is ancestral for two of these, (16,11,13) while Gebert is ancestral for one of them (17,12,13).  This is evidence that Fredeen’s node is older, but the evidence is statistically unconvincing.  We realize that these 3 markers may have experienced mutations after the two nodes of interest.  Both Fredeen and Gebert have 111 marker data, which does not help with this question.

 

Hohnloser

 

            Hohnloser (kit N39989) fell into C type at 37 markers in 2010 (marginally), but not at 67 markers (not particularly close).  He is not a member of cluster C because his sample does not match the Friedman signature (originally at 25 markers).  He provides an interesting example of how statistics works - in this case, due to the luck of random STR mutations, the sample is close to C type only at 37 markers.

            Hohnloser does not belong to the L540 haplogroup because his SNP test came out negative.

            Hohnloser has extensive family tree research results.  He administers a Hohnloser project at FTDNA.  He exchanged helpful email discussions with me.

 

Hochreiter

            New topic 12 Dec 2014:

            His STRs are a good fit for C type.  Kit N45041 in the table.  He purchased Big Y in 2014.  Hochreiter is my closest STR match (excluding a cousin and an apparent distant cousin), so I expected to be on the same twig of the tree with Hochreiter.  Hochreiter’s Big Y produced 10 unique SNPs (not found in Gebert);  I tested all 10 and came up negative for all.  See the Gebert discussion above for details.

 

Structure of the L540 Haplogroup

            Complete rewrite 16 Jul 2011.

            C type is a hypothetical haplogroup within L540.  The evidence is presented throughout this web document, particularly in topics about my 67 marker analysis files and about DYS389II.  My estimate is about 99% confidence that my C type definition corresponds to a clade that will be proven to be a haplogroup by a newly discovered SNP - someday when tests for new SNPs are more comprehensive and lower cost - continuation of the current trend.  My estimate is that 98% (80% confidence range more than 90%) of the samples predicted C type by my definition will end up in such a haplogroup.

            This confidence is based on a combination of statistical calculation, plus judgment where calculation is not possible;  for more discussion see my confidence topic.

            C type seems to be about 80% of L540, based on only 2 L540+ not C type, compared to 7 independent C type samples, for 77.8%, in the 67 marker data.  Confidence in this 80% is not high because it is based on only two samples.  There may be more samples in the STR neighborhood that will test out L540+ in the future, with STR values quite different than my current L540 definition.  Also, there may be L540- samples in the future that match my current L540 definition.
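
            The 77.8% figure and the weakness of that estimate can be illustrated with a short calculation.  This is only a sketch:  it uses a Wilson score interval, a standard textbook choice that is not taken from the analysis files, and it treats the 9 samples as if they were random draws from L540, which is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

c_type = 7        # independent C type samples in the 67 marker data
not_c_type = 2    # L540+ samples outside C type
total = c_type + not_c_type

share = c_type / total                      # 7/9, about 77.8%
low, high = wilson_interval(c_type, total)  # wide interval for so few samples
print(f"share = {share:.1%}, interval = {low:.0%} to {high:.0%}")
```

            With only 9 samples the interval is wide, roughly 45% to 94%, which is consistent with treating the 80% as a rough figure.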

            In other words, I have high confidence that C type predictions are L540+, but not high confidence in the L540 predictions outside C type.

            I expect to update my L540 definition as data accumulates, thereby improving the confidence for L540 STR predictions outside C.

            For now, I recommend the L540 test to everyone in the neighborhood table, with particular emphasis on those with closest step to my current L540 definition, and with the caveat that there may be more L540 beyond that table.

            C type is close to the same as a clade defined by a particular mutation(s) at the DYS389II marker, from value 30 to 32.  We can even speculate that clade is identical to C type data.

            A particular mutation at the DYS594 marker, from 12 to 11, seems to define a “father” clade that is slightly older than L540.  The same might be true for a particular mutation at the DYS636 marker, although more 111 marker data is needed to estimate if 636 is younger or older than 594 and / or L540.  With more data, these two markers will provide a foundation from which the age of other mutations can be estimated, because independent mutations in other markers are unlikely to also have mutations in these two foundation markers.

            The L540 data is bimodal in a number of markers.  Each of these bimodal markers is evidence that there is a significant subclade division that might be determined for L540 structure.  However, these various suggestions point to different divisions;  so far no two of them are strongly correlated.  With more data, it may be possible to split L540 based on statistical correlation of STR markers, using my mountain method.

 

Ysearch

 

            Update 17 Jul 2011:

            479H7 is a direct link to my modal haplotype for L540.

            QAZ7P is a direct link to my modal haplotype for C type.

            If you are not listed in the table above you can compare your data on Ysearch.  You can compare your step genetic distance to these modal haplotypes if you have the standard 12, 25, 37, or 67 markers.  The comparison may not work if you have a non standard marker set.  The cutoff for each marker set is given in the legend in the table above.
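
            For readers who want to check a step count by hand, here is a minimal sketch of a step genetic distance:  the sum of absolute differences in repeat counts over the markers two haplotypes share.  This is an assumption about the simplest form of the calculation;  Ysearch and Haplozone treat some markers differently, so their numbers can differ slightly.  The marker values and the cutoff below are hypothetical, not real kit data and not the cutoffs from the table legend.

```python
def step_distance(sample, modal):
    """Sum of absolute repeat-count differences over markers present in both."""
    shared = sample.keys() & modal.keys()
    return sum(abs(sample[m] - modal[m]) for m in shared)

# Hypothetical values for illustration only.
modal = {"DYS393": 13, "DYS390": 24, "DYS19": 13, "DYS391": 10}
sample = {"DYS393": 13, "DYS390": 25, "DYS19": 13, "DYS391": 12}

CUTOFF = 2  # hypothetical cutoff for this 4 marker toy set

step = step_distance(sample, modal)   # 1 step at DYS390 plus 2 at DYS391
within_cutoff = step <= CUTOFF
print(step, within_cutoff)
```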

            Brief description of Ysearch.  Link to the site home:  http://www.ysearch.org.

            To join Ysearch, click on the Create A New User tab, where you can upload your Y-DNA STR data from a number of testing services.  Or, you can type in your data.  You end up with a “User ID”.

 

            Instructions for comparison to V13C at Ysearch:

            Click here:  Research Tools (or click on the tab with that name)

            Copy the following line into the “UserIDs” bar at the Research Tools page:

                                    USEID, 479H7, QAZ7P

            Change USEID to your User ID.

            You need to type the Captcha puzzle for access.

            Click on “Show genetic distance report” to see your step genetic distance from C type and from L540.

 

Ancestry.com

 

            www.Ancestry.com  is the web page for a commercial DNA testing company.  Men with Y-DNA test results can choose to make results available for matching to others.  Kargul originally matched with me at this site.

            I last checked for matches 16 May 2011.  There are 9 close matches of Y-DNA to Kargul & me, but these are not close enough to include in my Neighborhood Table.

 

Age of C Type

Age of L540

 

            Comment 25 Mar 2013:  this topic needs to be updated using the latest data.  The new on-line Excel file versions do not yet have the ASD sheets.  Coming soon.

            Topic update 11 Jul 2011:

            The discussion in this topic is based on the sheet “ASD” in the two 67 marker analysis files.

            Average Squared Distance (ASD) is equivalent to variance of STR values.  Most people use ASD to calculate age in genetic genealogy, as I explain in my publications.  The ASD method has large known systematic uncertainties, discussed in my publications, which make age calculation uncertain.  It is not possible to calculate a confidence range because the systematic errors might be larger than the statistical errors, even for small samples of data.
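
            A minimal sketch of an ASD age calculation follows.  It assumes a founder (modal) value for each marker, a per-marker mutation rate, and a generation length;  all the numbers are hypothetical placeholders.  It averages per-marker ages, which is one of several conventions and may not be what the ASD sheets do.

```python
def asd(values, founder):
    """Average squared distance of the sample values from the founder value."""
    return sum((v - founder) ** 2 for v in values) / len(values)

def marker_ages(samples, founder, rates, years_per_generation):
    """Age in years for each marker: ASD / mutation rate * generation length."""
    ages = {}
    for marker, founder_value in founder.items():
        values = [s[marker] for s in samples]
        ages[marker] = asd(values, founder_value) / rates[marker] * years_per_generation
    return ages

# Hypothetical data: 4 samples, 2 markers, made-up rates per generation.
founder = {"M1": 13, "M2": 30}
samples = [
    {"M1": 13, "M2": 30},
    {"M1": 13, "M2": 30},
    {"M1": 14, "M2": 30},
    {"M1": 13, "M2": 30},
]
rates = {"M1": 0.002, "M2": 0.004}

ages = marker_ages(samples, founder, rates, years_per_generation=30)
overall_age = sum(ages.values()) / len(ages)   # markers with zero age count too
print(ages, overall_age)
```

            In this toy data one marker has a single mutation and the other has none, so the overall age is the average of one "old" marker and one zero.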

            C type is quite young.  The age using all 67 markers comes out 805 years, cell N12 on the “ASD” sheet in C67.xls.

            Although I do not have high confidence in estimating the exact age of C type, there is additional evidence that C type and L540 are young.  The fact that C37, using the first 37 markers, provides a reasonable definition of C type is evidence of youth.  Old haplogroups do not provide reasonable modal definitions using all 37 markers, because of the wide variation in the rapidly mutating markers.  Another way of saying this:  using all 37 markers, there is a lot of overlap of old haplogroups.  Another way of saying it:  Isolation in STR values is evidence of youth.  My publication elaborates on this.

            The fact that C type samples can be extracted from V13 using only a 4 marker signature is also evidence of isolation and youth.

            Low SBP is evidence that C type and L540 are well isolated;  see the analysis for SBP.

            I expect the age to creep up somewhat as new data is discovered.  The well known statistical way to correct for this expectation:  divide by N-1 instead of N when figuring ASD.  My files do not use N-1 because that is not the common practice in genetic genealogy.  Using N-1 the age is 939 years instead of 805.  This estimate includes a best guess for future samples;  eventually samples will show up that fit C type but have more mutations than the samples so far - either because of true older nodes in the Y-DNA tree or just due to bad luck in random STR mutations.
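
            The N-1 correction is a simple rescaling, sketched here.  Since the age is proportional to ASD, dividing by N-1 instead of N multiplies the age by N/(N-1).  The sample counts are taken to be 7 independent samples, and 9 when the two recruited samples are included;  the 9 is an inference from the quoted ages, not a number stated with them.

```python
def n_minus_1_age(age_using_n, n):
    """Rescale an ASD age computed with N to the N-1 (unbiased variance) form."""
    return age_using_n * n / (n - 1)

independent = round(n_minus_1_age(805, 7))     # 7 independent C type samples
with_recruits = round(n_minus_1_age(806, 9))   # assumed 9 samples including recruits
print(independent, with_recruits)              # 939 907
```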

            Conclusion:  My best guess for the age of C type is about 1,000 years. 

            Of course, this result has very low confidence, because it is based on only 7 samples.  The true age might be up to a factor of 4 older.  Or it might be a lot younger.  Although it is not possible to calculate a confidence range that includes systematic errors in age estimation, I suppose a reasonable high confidence range estimate would be 500 to 2000 years old.

            Discussion of C age estimation.  My calculation excludes two samples, Gwozdz and Kargul, whom I recruited, because a random selection is required.  Including them yields 806 years, corrected to 907, not significantly different.  All 67 markers are valid for use in a young clade like this, because there are no recLOH issues in the C data.  Issues in recLOH usually cause problems using all 67.  Those analysis files include comparison to a classic “Thomas” method, which I explain in my publication, and which yields 292 (vs 805) years, but I consider that just a demonstration that the old method uses too few markers, which by luck do not vary much in the C data.  My files allow the user to easily vary the markers and easily vary the samples.

            Boyd seems to be the oldest sample;  removing him lowers 805 to 712, corresponding to the age without his hypothetically older clade.  The “oldest” two markers (far right of the ASD sheet has a sort by age) are DYS460 and DYS385a, with 3975 and 3058 years respectively;  this is evidence that those markers might correspond to clades within C type, but the evidence is not compelling yet because some markers are bound to come out old just by luck.  My definition of C type of course excludes such old markers, which is a proper procedure for a good definition.  But it is not fair to exclude old markers in age estimates because the zeros balance the old ones statistically;  C type has 45 markers with zero age - the lucky ones with no mutations.  Many zeros is evidence of youth but also evidence of few samples - only 7.  With more samples there will be fewer zeros.

            Most published ASD age calculations include a correction making the result older, but the reason for that correction is population bottlenecks, which reduce ASD;  since C type is large for a very young clade, I’m guessing the age corresponds to a rapid population expansion after which there were no significant bottlenecks.  If I am right, no such correction would be appropriate.  Also, most published ASD ages use N instead of N-1, so a published correction method should be applied to my first number, 805.

            An age calculation for L540 is not appropriate, although L67.xls does it, getting 971 years in cell N12 of the ASD sheet.  That’s not fair.  C type dominates the data, but C type is really only one man, the MRCA.  Our data for L540 is really only 3 men:  the hypothetical C type MRCA, Fredeen, and Gebert.  It is well known that estimating the TMRCA of two men is highly uncertain, just due to the luck of random mutations.  For 3 men it is not much better.  The L540 age (TMRCA) is surely older than C type, just a bit older based on the meager data here of 2 men with STR values different than C type - but not very different at 67 markers than the C type samples with highest step.

            Age of a mutation is of course older than the TMRCA because there should almost always be multiple generations between nodes.  We know the L540 mutation happened after the node for Butman and before the nodes for Fredeen and Gebert.  That assessment will get better with more data.

            I said in the Abstract here that L540 might be twice as old as C type.  That is just a guess.