A New Y-DNA Haplogroup:  E1b1b1a1b8 (L540)

25 Mar 2013

Peter Gwozdz

pete2g2@comcast.net

News

 

           24 Mar 2013 update of the L540 Neighborhood table using new definitions for L540 and C type

           17 July 2011 complete rewrite of this web page

           10 Jul 2011 ISOGG added E1b1b1a1b8 (L540) to the official Y-DNA tree

 

Abstract

           Rewrite 25 Mar 2013:

           This web document is a summary of my information on a small haplogroup of Y-DNA based on an SNP mutation named L540.  The subject is genetic genealogy.

           There is a Neighborhood table below with a list of samples (men) predicted to belong to the L540 haplogroup, and also samples predicted to be in the Neighborhood just outside L540.  The samples near the cutoff (borderline fit) are the ones that should be tested for the L540 SNP to see if they belong to the L540 haplogroup;  probability of belonging decreases with the step number, as explained in the discussion below that table.

           This Abstract is for people reasonably familiar with the jargon of genetic genealogy.  If you are new to genetic genealogy you might prefer to first read an Introduction that I wrote for another of my web documents.

           L540 was discovered in my Walk Through the Y (My WTY).  I purchased WTY, a commercial product that reads more than 200,000 base pairs of the Y chromosome.

           The clade that we now call L540 was originally called cluster C, a hypothetical cluster proposed as a division of haplogroup E1b1b1a1b, which is defined by the SNP named V13.  Hence I coined the name V13C.html for this document about cluster C in early 2010.  I rewrote this document and renamed it L540.html on 30 Apr 2011.  E1b1b1a1b (V13) is the largest haplogroup division of haplogroup E, but cluster C is small.  There are 50 samples predicted C type in that Neighborhood table.

           I am not planning a separate L540 project, because it is more convenient to run this informally through the E-M35 project.  Villarreal and Lancaster are very helpful.

           My equivalent name for V13C was “C type”, or just “C”.  I independently verified C type on 9 Jan 2010 as a good candidate type.  I use the word “type” to mean an STR cluster with statistical validity as established by my Mountain Method.  I published my methods in the Fall 2009 issue of JoGG.

           C type is quite young, maybe only about 1,000 years old.  The L540 mutation is older, maybe more than twice the age of C.  Such age estimates are uncertain, particularly with so little data in this case.

           L540 seems to be roughly 90% C type plus 10% other, older branches, as explained in the next topic about L540.

           Watch this document.  I’ll add more information as data accumulates.

 

L540

           Rewrite 25 Mar 2013:

           L540 is the code name for an SNP that was discovered in my WTY, announced 29 March 2011.  On 27 Apr 2011 it was demonstrated that L540 defines a new haplogroup within E1b1b1a1b (V13).

           I use the code name L540 for the SNP, for the associated haplogroup, and for the samples (men) in that haplogroup.

           The L540 haplogroup includes C type, and C type is most of L540.  The C that have been tested are all L540+.

           The table below has my predictions for L540 and for C type.

           My sample has been tested negative for all the 7 confirmed branches of V13, so L540 is an 8th branch of V13, not a branch of one of those prior 7 branches.

           Three of the 7 branches are too small - few or no samples available on-line for testing:  M35.2, V27, and P65.

           Samples from the other 4 main branches of V13 are available;  I recruited and paid for testing as needed:  L143, L250, L17, and L241.  All are L540-.  That means none of these are branches of L540.  There are L241 samples in the Neighborhood table;  this is evidence (not proof) that L540 and L241 might be brothers, with a common undiscovered SNP branch of V13.

           Anyone in the Neighborhood would benefit from ordering the L540 test.  In the discussion following the table, I give my recommendations for which samples would most benefit from the L540 test.

           ISOGG code names change as new SNP divisions and branches are discovered.  L540 was officially added 10 Jul 2011, code name E1b1b1a1b8.  For example, in early 2011 V13, the father of L540, was changed to E1b1b1a1b from E1b1b1a2.  The official FTDNA tree does not have L540 yet (Mar 2013), but the unofficial FTDNA tree does have L540 listed.

           Update 24 Mar 2013:  More results added to the Neighborhood.  Total 50 Cluster C.  Two L540+ that are not C type.  Total of 12 L540+ test results, 23 L540-.  Most negative results are not in the table, because they are not close in STR step value.  Many V13 samples in WTY have received L540 results during the past two years;  mine is positive (included in the count of 12) and the others are all negative (not included in the count of 23, and not in the table).

 

Cluster C

           Rewrite 25 Mar 2013:

           Clusters are based on STR correlation.  There are 50 samples predicted C type in the Neighborhood table, labeled “C”.  Some of these are marginal, with fewer than 67 markers, with only 80% confidence of belonging.  26 of those are listed at the haplozone site, in the V13 + L540 branch.  Cluster C includes me and my 3rd cousin (Gwozdz).

           Friedman proposed cluster C more than 4 years ago, based on STR and SNP correlations, when the data was less than what is available today. 

           New samples appear when Friedman updates cluster C, or when I update the Neighborhood.

 

C Type

           Rewrite 24 Mar 2013:

           I use the word type for an STR cluster with statistical validity as established by my Mountain Method.  “Type” is my own term.  I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters.  By “type” I mean the cluster data, the hypothetical clade, the modal haplotype, and the set of all possible haplotypes, at any number of markers.  Accordingly, by “C type” I mean any or all of these 4 things.  I sometimes use just “C” as short for “C type”.  I also have a previous C type identified in R1a;  unrelated;  please don’t get confused.

           My analysis files define C type.  Sorry, it can be a bit confusing because I have multiple STR definitions for C type, for various marker sets.  The number of markers in my definitions changes slightly when new samples show up with unusual STR values.  I hope the meanings are clear from the context of my discussions in this web document.

           I also provide STR definitions for L540, discussed below, treating L540 as L type.

           C type is roughly 90% of L540.  My evidence:  Considering only the independent samples with 67 or more STR markers in the Neighborhood, there are 17 L type samples, and only 2 of them do not fit C type but have tested positive (L540+) for the L540 SNP.  I say “independent” because some C type have been recruited based on known genetic relationship to C type men (my Gwozdz cousin, Kargol, Svercel);  these should not be counted in this estimate.  I do not include samples with fewer than 67 markers in this estimate, because type prediction has lower confidence with fewer markers, so I have been actively encouraging the L540 test primarily for samples with 67 markers.  Notice in the table that most borderline samples (near the L type cutoff) have L540 test results.  I say “roughly” for this estimate because I have been recruiting samples for this table at 67 markers with equal emphasis on borderline samples, in order to properly sample the STR borderline, but it is difficult to prove no bias toward samples that fit well.  Also, there is a chance of outliers showing up in the future:  either L540- among those samples that fit C type very well, or L540+ among samples beyond the borderline.  Finally, with only 2 L540+ outside C type, the sampling confidence is not good.
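
           The arithmetic behind that “roughly 90%” can be sketched in a few lines of Python.  The two counts come from the paragraph above.  The Wilson score interval is an assumption added here only to illustrate how wide the sampling uncertainty is with so few samples;  it is not part of the Mountain Method, and it does nothing to correct for recruiting bias.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

l_type_total = 17   # independent L type samples with 67+ markers (from the text)
not_c_type = 2      # L540+ samples that do not fit C type (from the text)
c_type = l_type_total - not_c_type

fraction = c_type / l_type_total
low, high = wilson_interval(c_type, l_type_total)
print(f"C type fraction of L540: {fraction:.0%}")
print(f"Illustrative interval: {low:.0%} to {high:.0%}")
```

           The interval is wide, which is consistent with the caution above that the sampling confidence is not good.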

           The evidence today indicates that L540 is much older than C type - details below.

 

V13C

           Rewrite 24 Mar 2013:

           I coined the name V13C in 2010 to represent C type, cluster C, the hypothetical haplogroup, and the samples (men) in the hypothetical haplogroup.  This web document used to be named V13C.html.

           Now that C type is a subdivision of L540 I am editing away the name “V13C”, but I’ll continue to use “C type” for the hypothetical clade that is part of (most of) L540.

           V13 is the defining SNP for E1b1b1a1b, so I similarly use “V13” to mean the “father” haplogroup - the large branch in the Y-DNA tree from which L540 is a small twig.  I also use “V13” to mean the associated database of V13 samples at E-M35 or at Haplozone, or at other databases.

 

L67.xls Analysis File

C67.xls Analysis File

 

           www.gwozdz.org/L67.xls is analysis of L540 STRs.

           www.gwozdz.org/C67.xls is analysis of C type STRs.

           L45(67) means a modal haplotype for the L540 haplogroup using 45 of the 67 standard markers.  My definition for L540 is L45(67) with cutoff 6:  all samples with genetic distance (step) less than 6.  L45(67) is new as of 21 Mar.  I also typed L45 into Ysearch as 479H7.

           Similarly, C46(67) with cutoff 5 is my definition of C type, introduced 23 Mar 2013, Ysearch QAZ7P.
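
           A minimal Python sketch of how a definition of this kind works:  a modal haplotype, a mask that selects which markers count, and a cutoff.  It assumes the step is the sum of absolute repeat differences over the masked markers.  The marker subset, the mask, the cutoff of 3, and both samples are hypothetical, chosen only to keep the example short;  the real definitions and any special handling of compound markers are in the analysis files.

```python
def step_distance(sample, modal, mask):
    """Sum of absolute repeat-count differences over the masked markers only."""
    return sum(abs(sample[m] - modal[m]) for m in mask)

def fits(sample, modal, mask, cutoff):
    """A sample fits the definition when its step is less than the cutoff."""
    return step_distance(sample, modal, mask) < cutoff

# Illustrative subset only; the real L45(67) definition uses 45 markers.
modal = {"DYS594": 12, "DYS390": 25, "DYS444": 13,
         "DYS406": 11, "DYS456": 15, "DYS447": 25}

# Hypothetical mask: markers left out of the mask are ignored entirely.
mask = ["DYS594", "DYS390", "DYS444", "DYS406"]

# Hypothetical samples, not real kits.
near = {"DYS594": 12, "DYS390": 25, "DYS444": 12,
        "DYS406": 11, "DYS456": 16, "DYS447": 25}
far = {"DYS594": 11, "DYS390": 24, "DYS444": 11,
       "DYS406": 12, "DYS456": 15, "DYS447": 26}

cutoff = 3  # hypothetical cutoff for this small subset

for name, sample in (("near", near), ("far", far)):
    step = step_distance(sample, modal, mask)
    print(name, "step", step, "fits" if fits(sample, modal, mask, cutoff) else "outside")
```

           Note that the “near” sample differs at DYS456, but that marker is not in the mask, so it does not add to the step.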

           My previous definitions are available in those files, sheet “Haplotypes & Masks”.

           Comment:  This topic is currently being rewritten;  coming soon.  The remainder of this topic was written 10 Jul 2011, so it needs modification.

           Both those files have a sheet named “Comments” with technical discussion.

           Those files have “ASD” sheets, discussed below in the Age topic.

           The C49 definition is slightly better than the older C61 because of the new data that accumulated over a year.  Any reasonable C type definition using 30 to 55 markers does a great job extracting C type, because C type is very well isolated in STR values - a mountain in haplospace.

           The reason for a separate L540 definition:  Three new samples showed up, closest to C type but not C type; Butman, Gebert, and Fredeen.  They are in the table below.  Butman tested L540- and the other two are L540+, members of the L540 haplogroup.  An optimized C type definition does not distinguish Butman from the other two.  The L540 definition does.

           The simplest explanation: Butman has a male line with node in the Y-DNA tree slightly older than the L540 mutation.  Gebert and Fredeen each have male lines with nodes in the Y-DNA tree younger than the L540 mutation.  The node for C type is much younger.  More complex explanations are possible;  for more discussion see the Structure topic.

           L540 Summary.  L540 is isolated:  All samples predicted L540 have step less than the cutoff 5.  The gap is steps 5 and 6 with no samples.

           SBP = 12.4% for the definition L27(67) with cutoff 5.  However, two samples should be excluded from SBP calculation because I recruited them - Kargol and my 3rd cousin Gwozdz;  recruitment artificially increases the size of the clade (as % of database), lowering SBP.  Adjusting for this, SBP = 15.9%.  This is good.  See the Excel file for details.  In other words, I have confidence that less than 16% of samples predicted as L540 by this definition will not come out positive for the L540 SNP test, most of them at step 4.

           Two of the 5 samples at step 7 have been tested L540-;  I suppose the other 3 should be tested, since I do not have high confidence they will all come out negative.

           The table below has the samples for which I recommend ordering the L540 SNP test.  The discussion below the table provides my estimate of probability for L540+ vs step.

           C type Summary.  C type is very well isolated:  All C type samples have step less than cutoff 3.  There is a gap with no samples from step 3 to 6.  Steps 7 and 8 have only one sample each.  Step 9 has 2.  10 has 3.  11 has 12.  Step 12 has 26 samples.
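
           The isolation described above can be checked mechanically.  This Python sketch uses only the counts quoted in that paragraph (steps 7 through 12);  the function name and the text histogram are illustrative, not taken from the analysis file.

```python
# Sample counts by step beyond the C49(67) cutoff, as quoted in the text.
# Steps 3 to 6 are absent from the dictionary because no samples land there.
counts_beyond_cutoff = {7: 1, 8: 1, 9: 2, 10: 3, 11: 12, 12: 26}
cutoff = 3

def empty_gap(counts, cutoff):
    """Return the run of consecutive empty steps starting at the cutoff."""
    gap = []
    step = cutoff
    while counts.get(step, 0) == 0 and step <= max(counts):
        gap.append(step)
        step += 1
    return gap

gap = empty_gap(counts_beyond_cutoff, cutoff)
print(f"Empty steps: {gap[0]} to {gap[-1]}")

# Crude text histogram: the background rises steeply only far from the cutoff.
for step in sorted(counts_beyond_cutoff):
    print(f"step {step:2d} {'#' * counts_beyond_cutoff[step]}")
```

           A wide empty gap followed by a slowly rising background is what a mountain in haplospace looks like in step counts.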

           SBP = 7.9% for the definition C49(67) with cutoff 3.  This is very good.  See the Excel file for details.  Adjusting for those two recruited samples SBP = 10.9%.  In other words, I have confidence that less than 10% of samples identified as C type by this definition will not belong to the C type haplogroup if and when an SNP marker is discovered for C type in the future.  Again, most of those outsiders will be at step 2.

           See the validity topic below.

           History.  My quality factor is Statistical Background Percent (SBP).  My SBP is very sensitive to the confidence interval due to small sample sizes;  a valid cluster usually starts out with high SBP when few samples are available and then gets better, with lower SBP, as more samples accumulate over time for better statistics.  I introduced this web page (as V13C.html) on 14 Jan 2010 using 37 markers with SBP = 37%, which was not very good.  By April 2010, C type using C61(67) came out with SBP = 11.4%, which is very good.  On 7 Feb 2011, with more data, C61(67) improved to SBP = 7.3%, excellent.  That C61 definition now has SBP = 7.0% (before correction), showing steady improvement as data accumulated.  However, that old C61 definition does not discriminate the new samples well;  see column BZ in that C67.xls file.  The new C49 definition has higher SBP but discriminates better.  It is likely that as more data accumulates an even better definition will become available, with a slightly different number of markers.

 

37 Markers

 

           17 July 2011 comment:  This and the following 2 topics are based on my Feb 2011 analysis.  I update less often at fewer than 37 markers.  These 3 topics are C type only;  data is not good enough yet for L540 analysis below 67 markers.

           I also have a file using only 37 markers for analysis:  www.gwozdz.org/C37.xls.  189 samples.

           The SBP using all 37 markers is 28%.  That 37 marker column, cutoff 9, captures the 10 samples assigned to cluster C and none others.  However, the other columns, using the best markers, consistently capture only 8 of those samples, casting some doubt on the 2 marginal ones.  It seems 67 markers are statistically required for marginal cases.

           The C67.xls file does a correlation;  the data is copied here below:  The samples with all 67 markers are evaluated using only the first 37.

 

25 Markers

 

           I also have a file using only 25 markers for analysis:  www.gwozdz.org/C25.xls.  228 samples.

           The standard 25 marker STR set is used by a number of Y-DNA testing companies.  The Haplozone data include Sorenson data that is not in the E3b data, and 3 of these land in cluster C.  Those samples have kit numbers starting with “S” in the Neighborhood table below.

           At 25 markers there is no valid C type.  SBP comes out greater than 100% for any combination of markers.  The SBP formula gives a result regardless, but the result is meaningless far above 50%.

           The modal haplotype C25 (all 25 markers) captures all 13 cluster C samples plus 7 others, cutoff 5.

           C18(25) is the best definition I found from the 25 marker set, but even that definition is not satisfactory, because it does not correlate with the results at 37 and 67 markers;  see the Excel analysis file.  Just about any definition using 25 markers captures the samples that fit well at 67 markers, but at 25 markers different definitions capture different marginal samples.

 

12 Markers

 

           I also have a file using only 12 markers for analysis:  www.gwozdz.org/C12.xls.  I used the full database but truncated that analysis to the closest 33 samples to keep the file small.  The 3 Sorenson samples in the table below are not in this database.

 

Best STR Markers

           New topic 14 July 2011:

           STR markers that mutate relatively slowly are statistical indicators for clades in which they are recently mutated, but they are not perfect because of subsequent independent mutations.  When a clade has a few such good STR markers those provide a signature set of STR markers.  A signature is statistically expected to be a more probable indicator of a clade than just one marker.  Indeed cluster C is characterized by the Friedman Signature.  The definitions of C type and L540 use other helpful markers, not just the signature.

           For example DYS389II is the best STR indicator for cluster C and C type because all but three of the cluster C and C type samples identified so far at the Haplozone site have the 32 value and very few other samples in the STR neighborhood have the 32.  The ancestral 30 value is most common in the neighborhood.  Those three exceptions have a 31 value, which is not common in the neighborhood.  We expect that subsequent mutations from 32 to 31 to 30 must occur rarely within C type, so eventually C type samples with 30 should show up, as data accumulates with time.

           My analysis files automatically rank markers using a method that I published.  The exact ranking of markers varies slightly from month to month due to the random nature of mutation values in new samples, and due to the somewhat arbitrary cutoff that I use to restrict the database to the neighborhood (using too many samples provides a ranking of the father clade instead of the clade of interest).  For example, a marker that ranks 6th one month might come out 4th or 7th the next month.  The ranking also depends on the clade:  389II always comes out 1st for C type but ranks 5th (11 Jul 2011 analysis) for the father haplogroup L540.

           An SNP that defines a haplogroup is very unlikely to have happened exactly at the time of the most recent common ancestor (TMRCA) of a haplogroup.  Most likely the SNP is somewhat older, because usually there are many generations between nodes.  By definition an SNP cannot be younger than the haplogroup.  Similarly, we can consider a clade defined by a particular STR mutation, which is likely somewhat older than the TMRCA of that clade.  However, for clusters defined by signatures, and for types defined by definitions, one rare STR mutation that contributes to the signature might have happened shortly before or after the TMRCA of that cluster or type.

           Very slow mutators should make the best markers.  However the slowest are rarely mutated, so those with intermediate rank show up more often as signature markers.  My “Haplotypes & Mask” sheet in my analysis files has the mutation rate rank (slowest is 1st) for the 67 standard markers.  My publication has the Chandler reference.

           Usually it is silly to speculate about clusters defined by a single STR value.  In this case, however, we have a hypothetical haplogroup, C type, which seems quite young, with relatively little STR variation, so some speculation is in order:

 

DYS389II = 32  (389II minus 389I = 19);  Best Marker for C type

           Update 15 July 2011:

           DYS389II=32 is one of the Friedman markers for the C cluster.  It always ranks 1st in my C type analysis files.  The standard 12 marker set, used by all samples at most DNA companies, includes 389II.

           [Technical detail:  DYS389 is a compound marker, where 389I is the first STR chain and (389II minus 389I) is the second STR chain.  So the marker of interest here is really delta = 19 (389II minus 389I = 19).  However, 389I mutates more slowly and has the value 13 for all C samples and for most samples in the neighborhood.  At Ysearch or Haplozone, both 389 markers need to be used together;  if one is omitted both are ignored.  My analysis file allows 389-2 to be used alone, using 389-1 only to calculate the delta for comparison;  this is signaled by using a negative number in the “mask” in the analysis file.  In this discussion topic, by “32” I really mean 19 for the delta value.]
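
           The delta arithmetic in that technical detail can be sketched in Python.  The values 13, 32, and 30 are the ones quoted in this topic;  the function names and the sample with 389I = 14 are hypothetical, for illustration only.

```python
def dys389_delta(dys389_i, dys389_ii):
    """Length of the second STR chain: 389II minus 389I."""
    return dys389_ii - dys389_i

# C type modal values quoted in the text: 389I = 13, 389II = 32, so delta = 19.
C_TYPE_DELTA = dys389_delta(13, 32)

def delta_steps(dys389_i, dys389_ii, modal_delta=C_TYPE_DELTA):
    """Steps from the C type modal, counted on the delta rather than on 389II."""
    return abs(dys389_delta(dys389_i, dys389_ii) - modal_delta)

# Hypothetical sample: 389I is one repeat longer, second chain unchanged.
# The reported 389II is 33, yet the delta is still 19, so the step is 0.
print(delta_steps(14, 33))

# Ancestral neighborhood values (13, 30) give delta 17, two steps away.
print(delta_steps(13, 30))
```

           Comparing on the delta avoids counting a single 389I mutation twice, once in 389I and again in the reported 389II value.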

           The two L540+ samples that do not fit cluster C, Fredeen and Gebert, have the ancestral value 30.  Butman, the closest STR match with L540-, also has 30.  On this basis, it seems likely that the mutation(s) from 30 to 32 happened close to the TMRCA for C type, and some time after the L540 mutation.

           DYS389II is the only signature marker that distinguishes C type from the 2 known L540+ samples that are not C type.  None of the other 110 markers in the standard set at FTDNA does this.  The other signature markers work well for both C type and L540.  It is possible with more data that another marker might statistically distinguish (weak correlation due to relatively rapid mutations).  On this basis, it seems likely (not certain) that the clade of descendants of the initial 389II mutation is the same as the C type clade.

           Those two non C L540 samples differ from C type by other markers that are not signature markers.  C type has only this one very good marker.  Actually, a type does not need any very good markers if it is very young and very isolated, so that neighborhood samples all differ at a significant number of STR markers, even if not the same markers for each sample.

           The 32 value is rare throughout V13 but shows up in E-M35 branches outside V13.

           DYS389II (actually the delta value) ranks 43rd in Chandler mutation rate.  Near the middle.  So exceptions are expected, due to recent mutations.

           Speculation:

           Model A:  A mutation from 30 to 31 happened close to the TMRCA for C type.  A little later in the history of C type another mutation happened from 31 to 32.  Most C type samples with 31 represent the oldest nodes, and only a minority are back mutations from 32 to 31.  The samples in the STR neighborhood with 31 that do not match C are independent mutations.  This seems to me the simplest model, so I favor it, but only tentatively.

           Model B:  There was a double mutation from 30 to 32 in one man close to the TMRCA.  Or two single mutations too close in time to be distinguished.  Almost all 31 in C are back mutations, most of them from a single subclade.

           Model C:  The 32’s that do not match C belong to the same 389II=32 clade, but there was a population bottleneck.  C is only one of two or more nodes, from MRCA’s who survived to produce descendants with corresponding STR clusters today.  Only C is large enough to be noticed so far, due to a population expansion for C.  This model predicts at least one other small 389II=32 clade will be discovered as L540+ branches with STR values different than C.

           Model D:  I can think of more complicated models.

           The data is not good enough to distinguish these models.  Maybe more data in the future will show correlation with other markers to distinguish a model like A through D.

           Model I:  The initial mutation to 31 (or double to 32) is very close to the same age as C type, so the mutation(s) defines C type.

           Model II:  Mutation(s) younger than C type.  Eventually samples with the 30 value will show up, isolated in haplospace together with the C samples.

           Model III:  Mutation(s) older than C type.  Eventually L540+ samples with 31 or 32 will show up that are too old to fit C type.

           Models I vs II vs III cannot be distinguished from back mutations and outliers until a new SNP is discovered to distinguish them.

 

DYS594 = 12;  Best Marker for L540

           Update 15 July 2011:

           In my analysis, DYS594=12 is the best marker for L540, and is also a good marker for C type.  594 is not in the 37 marker set.  594 helps a lot in defining C type and L540 using the 67 marker set, but does not distinguish C from the rest of L540.

           The 11 L540 samples with 67 markers, including 2 that are not C type, all have the 594=12 value.  Butman, the closest STR match with L540-, has the ancestral 11.

           Two samples in the neighborhood have 594=12 but are L540-.  These are not a random sample;  I recruited them based on STR matches closest to, but beyond the 10 closest matches to, C type at 67 markers.  Other 12 values have not been tested for L540.

           The 594=12 value is more common in the L540 neighborhood than in the rest of the V13 data.

           DYS594 ranks 12th in Chandler mutation rate.  Quite slow, so independent recent mutations should be rare.

           Speculation:

           Model A:  The 11 to 12 mutation in DYS594 is significantly older than L540.

           Model A1:  Quite a few branches, both younger and older than the 11 to 12 mutation, survived the population bottlenecks.  Only C had a significant subsequent population expansion, so only C stands out today.  The other 11 vs 12 branches will not be distinguished by STR values because they are too small and too old to be isolated in STR haplospace.

           Model A2:  The 11 vs 12 branches in the neighborhood will be distinguished by STR values when they are all evaluated for the L540 SNP, and when enough STR data is available to identify the signatures.

           Model B:  The 11 to 12 mutation is not much older than L540.  There is only one other significant independent 12 mutation in the neighborhood outside L540.  By luck.  That clade is the reason there are more 12s in the neighborhood.

           Future data will probably eliminate one or more of these models, and perhaps suggest other models.

           Model C:  The 11 to 12 is younger than L540.  An old 11 branch with L540+ will be found as data accumulates.  If this happens, STR data will not likely have enough correlation to distinguish if such a branch is really due to a back mutation.  A new SNP would probably need to be discovered.

 

Other Good Markers:  DYS390=25, DYS444=13, DYS406=11; DYS456=15, CDYb=33, DYS447=25

           Update 15 July 2011:

           These typically rank among the best in my analysis files, usually in about the order listed in the title here.  Good for both C type and L540 prediction.  The exact ranking is very sensitive to the choice of database.  In the close neighborhood of L540, 390=25 and 406=11 do very well.  Using the entire E-M35 database, 444=13 does better;  by luck 444 does not have any major clades with the 11 value;  there is a cluster in E1b1b1c1a (M84) that has samples with the L540 signature (389II minus 389I)=19 and 594=12, but that interfering cluster in M84 has 444=11, two steps away from L540 at that 444 marker.  This is an example of why the database should be restricted to reasonably close STR samples for analysis.

           Chandler rank for these, in the same order:  47th, 49th, 35th, 60th, 67th, 45th.

           390 is tied with 594 for 1st place in my current L540 analysis file, but I suppose it will end up in 2nd place because it has a higher published mutation rate.

           Models for 390 are similar to the models for 594.

           None of the other markers in the title here are as good as 389II for C type or as good as 594 or 390 for L540.  Each clearly has confounding mutations in the data.  For example, 444 has one L540+ sample with the ancestral 12 indicating that L540 is older, but there is also one L540- sample with the signature 13 indicating L540 is younger;  one of those must be an independent mutation (or an error).  More data will help this get sorted out.

           Many complicated models can be constructed combining 2 or more good markers.  Complication comes from figuring out the age order of the markers.  More data might point to a compelling model.

 

Signature C4

           Update 15 July 2011:

           An excellent signature for C type is (389I, 389II, 594, 444) = (13, 32, 12, 13).  Seven of 9 C type samples with 67 markers have this signature, and the two that miss are at step 1;  no other samples in the neighborhood have step 1;  Gebert is the only one at step 2.  In the vast E-M35 Haplozone database there is only one confounding sample at step 1, but that one is from E1b1b1c1a;  all others differ from this signature by 2 or more steps.  In other words, this 4 marker signature, cutoff 2, extracts all the C type samples and none others from V13 data.  Eventually, of course, exceptions will turn up.

           There are better markers than 389I.  I included that one because it enables C4 in the search function at the Haplozone site.

 

Friedman Signature

           Update 15 July 2011:

           In my 25 marker analysis (updated 12 Feb 2011), the markers ranked 1, 2, and 3 are 389II, 390, and 447.  These are the same as the 3 markers that Friedman had been calling the “characteristic marker values” for cluster C at the Haplozone site before I started working on this.

           The signature is (390, 389-2, 447) = (25, 32, 25).

           This original Friedman signature works surprisingly well by itself for samples with only 25 of the standard markers, but not with high confidence.

           67 marker data;  correlation to 25 marker signature:  One sample that matches perfectly those 3 of 25 does not fit C type or L540 at 67 and has indeed come up L540-;  3 samples (actually only 1 independent sample - me - the other two I recruited) that miss by 1 step at 3 of 25 do fit at 67 and are L540+;  2 samples that miss by 1 step at 3 of 25 do not fit at 67 and are indeed L540-.  Summary:  67 marker prediction is working perfectly so far, but 25 marker predictions on 67 marker data miss by a few.

           In early 2011 Friedman added 594=12 to the “characteristic marker values”, for 67 marker samples.

           DYS447 typically ranks about 8th or 9th out of 67 for C type or L540.

           Friedman uses a more complicated analysis than just this simple signature in her C type assignments.  I do not know what her method is exactly, but most definitions (not all) that I tried selecting well ranked markers extracted the same samples that she did.  My table has a few others that look good with 25 markers, but there is not high statistical confidence for extraction using only 25.

 

CDYb = 33;  Another Good Marker

           Update 16 July 2011:

           The marker CDYb is very unusual in L540;  10 of the 11 samples have the value 33.  See my 67 marker analysis file.  It ranks tied for 6th.  Just beyond L540 in STR step from the definition, less than half the samples have the 33 value, and some of them might turn out to be L540+ because most are not tested.

           The CDY pair is the most rapid mutator of the 67.  I have never seen a cluster or type that is so uniform for one of the CDY markers.

           It is possible but very unlikely this is a coincidence.  If it is just a lucky coincidence, then as data accumulates over the months lots of C type samples should show up with values other than CDYb = 33.  I originated this CDYb topic in April 2010, and the accumulated data since then has strengthened the evidence that CDYb=33 dominates what is now L540.

           Of the 5 tested among the 31 samples at steps 7 to 10 beyond L540:  2 are 33 and 3 are 34.

           My hypothesis:  There is a mutation within the CDYb chain.  Either a point mutation, or a foreign insertion, or a deletion (a deletion that is not a simple STR chain deletion of a motif, but a removal of only part of an STR motif).  It is known that a mutation within an STR that spoils the motif effectively splits that STR into two small STRs.  Smaller STRs have lower mutation rates.  A spoiler mutation near the middle of the main CDYb chain would turn that marker into a slower STR mutator.

           (By the way, 447 is known to have two such defects, so 447, which looks like a long STR, is really 3 short STRs, so 447 is not very rapid, and in fact 447 is a good marker, discussed above.)

           CDY, also called DYS724, is a compound marker, so that means one of the pair often copies onto the other, providing equal values, CDYa = CDYb.  This is called recLOH;  my publication has references if you want to read more about recLOH.  Or check Wiki, where there is an explanation that 459, 464, and CDY are all on the same “palindrome” P1, where P1 has two arms that are mirror copies.  Sometimes all three of these markers get the values copied from one arm of P1 onto the other arm.  A mutation at CDYb, making it unlike CDYa, would make copy mutations less common.

           Such seems to be the case, providing more evidence for a CDYb spoiler mutation.  There are no recLOH mutations in CDY or in 464 in the L540 data.  Of course there is not enough data yet to be compelling.  The 459 marker has both values = 9, so an recLOH would not be noticed there.

           It is not clear if the ancestral CDYb value is 33, or 34.  Both are common in the neighborhood.

           Model A:  More than a millennium ago, for a man in the L540 male line, not very long before or after the L540 mutation, a mutation destroyed the middle of the CDYb STR chain, turning CDYb into a unique marker, distinct from CDYa, and a much slower STR mutator than CDYa, and less likely to combine with CDYa in an recLOH event.  This rare mutation happened in a man who ended up with the equivalent net value of 33 at CDYb after the mutation.  A descendant of this man would be the most recent common ancestor (MRCA) for the clade corresponding to this mutation.  There were some normal STR mutations at CDYb in the descendants of that MRCA, and some of those mutated CDYb men were lucky enough to have male descendants living today, and one of them shows up in our C type data today with the 34 value.

           Although more data will add evidence to this model, I doubt the data will be good enough to determine if 33 or 34 was ancestral.  If 33, samples from old nodes might show up, which I doubt will be distinguished older or younger than that spoiler mutation, based on STR analysis.  If 34 is ancestral, I doubt the data will distinguish older branches from more recent back mutations.

           More models can be constructed along the lines of the models discussed in previous topics above.

           In a previous version of this web document I speculated that this CDYb mutation might define a new haplogroup.  I asked Thomas Krahn at FTDNA about sequencing my CDYb to prove there is an SNP in there.  Krahn explained that the P1 palindrome is very difficult to sequence with standard methods because the data is a mixture from the two arms of P1.  Krahn also pointed out that even if the SNP were proven it would not be accepted as a haplogroup division because an recLOH can still happen, and if CDYa gets copied onto CDYb that wipes out the SNP in the clade descending from that recLOH.

           My 2010 versions of this topic pointed out that the CDYb mutation is definitely older than C type, because the 33 value predominated in the very near neighborhood of C type.  That prediction has been validated.  The two L540+ from outside C type both have the 33 value.  With my new L540 definition, the 33 is not dominant beyond the L540 data, but of course a few of those that are there might end up in L540 when they are tested.

 

DYS636 = 12;  A New, Excellent Signature Marker for L540

           New topic 3 July 2011:

           DYS636 is not one of the standard 67 STR marker set, but has been available this year as part of the extended 111 STR marker set.  I have been encouraging men in the L540 neighborhood to purchase the extension to 111, helping out with the cost where necessary.  I was hoping that there would be a slowly mutating marker among those extra 44 with a mutation unique to the L540 haplogroup.  Sure enough, DYS636 provides such a marker.

           Of the 64 samples (2 Jul 2011) in the E-M35 database with all 111 markers, 9 are L540, and all of them have DYS636 = 12.  The others are DYS636=11 with only two exceptions at DYS636=12, but those two both have many STR mutation differences from L540 and V13, and are not predicted to belong to the V13 parent, so those are obviously independent rare mutations.
           The two L540 samples that are not in C type have all 111 markers now, and indeed carry the DYS636=12 value (Gebert & Fredeen).
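
           The counts in this topic can be turned into rough rates for DYS636=12 as an L540 predictor.  The Python sketch below is only arithmetic on the numbers quoted above.  It assumes the 55 other samples are all truly outside L540 (a prediction for any that are not SNP tested), and the variable names are illustrative, not taken from the analysis files.

```python
# Counts quoted above for the 111 marker data (2 Jul 2011).
total_samples = 64        # samples with all 111 markers
l540_samples = 9          # L540 samples, all with DYS636 = 12
other_with_12 = 2         # samples outside L540 that also show DYS636 = 12

other_samples = total_samples - l540_samples       # 55
other_with_11 = other_samples - other_with_12      # 53

# Fraction of L540 samples carrying the signature value 12.
sensitivity = l540_samples / l540_samples
# Fraction of the other samples carrying the ancestral value 11.
specificity = other_with_11 / other_samples
# Fraction of all DYS636 = 12 samples that are L540.
share_of_12_in_l540 = l540_samples / (l540_samples + other_with_12)

print(f"{sensitivity:.3f} {specificity:.3f} {share_of_12_in_l540:.3f}")
```

           These are whole-database rates.  Neither of the two exceptions is predicted V13, so within the V13 neighborhood the marker separates L540 more cleanly than the last figure suggests.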

           The one nearest neighbor to C type that came out L540- has all 111 markers now, and indeed carries the ancestral DYS636=11 value (Butman).

           It is not a coincidence that the critical samples (all the samples from the 67 marker data near the L540 cutoff) have all 111 markers so soon - I recruited the data and paid for it as needed.

           13 of the 111 marker samples are confirmed V13 (in addition to the 9 from L540), plus a few more are predicted V13.  (2 Jul data - will increase quickly because there are several more with panels beyond 67 obviously in process for 111).

           Actually, my main motive for encouraging 111 markers was to better subdivide L540.  No luck yet.  There is no slowly mutating marker among those new 44 that obviously mutated during the history of L540, like 389II for C type.  Recall that DYS389II=32 is the best marker for C type, distinguishing C from the parent L540.  None of the new 44 does this.  With more data I might find a reasonable way to further subdivide C type;  there are several hints in the data on how to do this with combined markers, but none of them are compelling.

           Recall that DYS594=12 is also unique to L540, also with ancestral 11.  This pair of STRs provide a firm foundation from which to notice any new clusters in the L540 neighborhood as data accumulates.

 

P Cluster:  (385a,439,447,464c,445) = (16,13,26,17,11)

           Update 30 Sep 2011:

           C type includes “P cluster”, a Polish cluster defined by that 5 marker signature.  The cluster is not convincing, because there are only 3 samples, me and two others that I recruited.  I call it a cluster because I reserve the word type for clusters with statistical significance.

           DYS445 is one of the markers recently available in the extension to 111 markers.  The other 4 markers in the signature are part of the standard 25 marker set.  As luck would have it, there are no other markers that correlate with these in the 67 marker set, so only 25 markers are required to match a sample to P cluster within the L540 haplogroup, using 4 of those markers.

           Each of these markers individually is variable.  There are quite a few E1b1b1a1b(V13) samples that match all 4 of those out of 25.  There are no matches in the STR neighborhood of L540 and C type, although eventually a match should show up just due to the luck of random mutations.  Within L540 and C type, only one sample so far matches the signature at two of these markers, and 4 other samples match at only one.
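
           Counting how many signature markers a sample matches can be sketched in Python as follows.  The signature values come from the heading of this topic;  the function name and the example sample are hypothetical, invented only for illustration.

```python
# P cluster signature from the heading of this topic.
P_SIGNATURE = {"DYS385a": 16, "DYS439": 13, "DYS447": 26, "DYS464c": 17, "DYS445": 11}

def signature_matches(sample, signature=P_SIGNATURE):
    """Return (matched, compared): how many signature markers the sample
    matches, out of those the sample actually has values for."""
    compared = [marker for marker in signature if marker in sample]
    matched = sum(1 for marker in compared if sample[marker] == signature[marker])
    return matched, len(compared)

# Hypothetical 25 marker sample (DYS445 not tested); values invented for illustration.
example = {"DYS385a": 16, "DYS439": 12, "DYS447": 25, "DYS464c": 17}
print(signature_matches(example))
```

           A sample with only 25 markers is compared on 4 markers, and a sample with 111 markers on all 5.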

           DYS385a=16 is ancestral to L540, so most samples in the neighborhood outside L540 match this signature.  This marker apparently mutated to 17 near the time of the origin of L540, then mutated back to 16 near the time of origin of C type.  Fredeen has the 16 value;  Gebert has the 17 value;  this is an inconclusive hint that Fredeen belongs to an older branch.  I say “inconclusive” because Fredeen also has the ancestral value at DYS444, but Gebert has the ancestral value at DYS406.  Both have numerous mutations so they are probably members of two separate old branches of L540.

           The signature can only be used for L540+ samples, where the signature has no other matches in the data to date.  Fredeen matches at one of the two;  more such single matches are expected eventually with enough data, due to the luck of recent mutations.

           Because both values of the signature are ancestral, it is reasonable to wonder if cluster P corresponds to the oldest clade in the C type data.  However, Gebert has the mutated L540 values (17,25) for this pair;  Gebert is L540+ but does not match C type, so he seems to belong to a branch with node older than C.  It is not convincing to speculate that these two markers both mutated in C type close to the node for cluster P, and that Gebert by luck mutated to the same C type values in an older branch.  It is more convincing to assume that cluster P is not necessarily the oldest clade, and that these are the only 2 out of 111 STR markers where this young clade has unique markers, and by luck those 2 markers are both back mutations to the values ancestral to L540.  To add confusion, Fredeen, from another L540 branch older than C, has (16,25) for this pair, matching cluster P on the first but ancestral on the second;  it is reasonable to assume Fredeen has one independent mutation matching cluster P, because neither marker is slow;  both have variation of value in the neighborhood.

           Another unconvincing speculative model:  cluster P might be older than all the L540 data, with an independent 389II mutation of 2 steps matching C by coincidence.  This is unreasonable because cluster P matches C type at many markers;  the step count matches C even with 2 steps added.  Samples with older nodes than C each mismatch C at different markers, for higher step counts.

           Most C type men have German ancestry.  It makes sense that a C man moved to Poland and founded cluster P a few centuries ago.  Of course, future Polish samples need not match cluster P because another C type man may have also moved to Poland (or multiple male line ancestors may have each moved a short distance, diffusing into Poland).

 

L540 (V13C) Neighborhood

           Partial updates during March, most recent 24 Mar 2013, not finished

           Previous full update 19 May 2012

           L540 is small enough that I can insert a complete table here, including neighbors just beyond in STR values

           My modal haplotype for L540 uses 45 of the 67 standard STR markers

           My modal haplotype for C type uses 46 of the 67 standard STR markers

           Also shown are columns for full modals

           The columns show step using each modal

           + vs --- means confirmed positive vs negative by the L540 test

                       L241 and L99 mean positive for those haplogroup markers, implying negative for L540

                       There are more negative L540 results from outside this neighborhood (higher step)

           Red step numbers are predicted C type and violet are predicted L540 not C type, on the basis of STR values using my Mountain Method

           Boldface C are members of E-M35;  C without boldface are from other projects, or from Ysearch

 

 

 

 

 

 

 

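|           |          |         |              |          |      | Modal>   | L45(67) | C46(67) | C67 | 33(37) | 37 | 23(25) | 25 | 12 |
|           |          |         |              |          |      | Cutoff > | 6       | 5       | 12  | 8      | 9  | 4      | 5  | 3  |
| Kit       | Ysearch  | L540    | Name         | Origin   | Type | Markers  |         |         |     |        |    |        |    |    |
|-----------|----------|---------|--------------|----------|------|----------|---------|---------|-----|--------|----|--------|----|----|
| N16800    | KFKGM    | +(WTY)  | Gwozdz       | Poland   | C    | 111      | 1       | 0       | 7   | 3      | 6  | 3      | 4  | 2  |
| 155155    |          | +       | Svercl       | Czech    | C    | 111      | 1       | 1       | 8   | 3      | 4  | 1      | 2  | 2  |
| N81304    |          |         | Gwozdz       | Poland   | C    | 111      | 1       | 1       | 9   | 4      | 7  | 3      | 5  | 3  |
| N45041    | UQR4B    | +       | Hochreiter   | Germany  | C    | 111      | 2       | 0       | 4   | 1      | 5  | 1      | 3  | 1  |
| 199446    | TK98K    | +       | Kargol       | Poland   | C    | 111      | 2       | 1       | 6   | 3      | 5  | 3      | 4  | 2  |
| 140927    | 9JM9U    | +       | Donovan      | Prussia  | C    | 111      | 2       | 3       | 6   | 3      | 2  | 1      | 1  | 1  |
| 51282     |          | +       | Wion         | Germany  | C    | 111      | 3       | 3       | 10  | 5      | 8  | 3      | 4  | 1  |
| 225596    | 6S4J6    | +       | Nowak        | Poland   | C    | 111      | 4       | 4       | 9   | 5      | 5  | 0      | 0  | 0  |
| 171456    | 79QF7    |         | Glasser      | Germany  | C    | 67       | 1       | 0       | 3   | 2      | 2  | 0      | 1  | 0  |
| 262750    |          | ordered | Svercel      | Slovakia | C    | 67       | 2       | 1       | 8   | 2      | 3  | 0      | 1  | 1  |
| 174240    |          |         |              | Unknown  | C    | 67       | 2       | 2       | 3   |        |    |        |    |    |
| 229581    |          |         | Zinin        | Unknown  | C    | 67       | 2       | 2       | 4   |        |    |        |    |    |
|           | 5XP46    |         | Stelz        |          | C    | 67       | 2       | 2       |     |        |    |        |    |    |
| 243901    |          |         | Stubblefield | Unknown  | C    | 67       | 2       | 3       | 10  |        |    |        |    |    |
| E10751    |          |         | Schulz       | Germany  | C    | 67       | 3       | 2       | 7   |        |    |        |    |    |
| 6104      | 4HJ3D    |         | Boyd         | Unknown  | C    | 67       | 3       | 2       | 10  | 3      | 8  | 0      | 1  | 0  |
| 207878    |          |         | Frind        | Germany  | C    | 67       | 3       | 3       | 10  | 6      | 7  | 4      | 4  | 2  |
| 70482     | 6HMRD    | +       | Ostholm      | Sweden   | C    | 67       | 4       | 4       | 10  | 5      | 6  | 2      | 2  | 1  |
| 226416    |          | +       | Sabieka      | Belarus  | C    | 67       | 5       | 4       | 11  |        |    |        |    |    |
| 158091    | QHU8Y    | +       | Kline        | Germany  | C    | 37       |         |         |     | 4      | 4  | 1      | 2  | 2  |
| 141863    | W5JHS    |         | Pohl         | Germany  | C    | 37       |         |         |     | 7      | 7  | 3      | 3  | 3  |
| N109412   |          |         | Howe         | Unknown  | C    | 37       |         |         |     |        |    |        |    |    |
| S10193    |          |         | Engel        | Germany  | C    | 34       |         |         |     |        |    | 1      | 1  | 1  |
| S10194    |          |         | Kochtitizky  | Hungary  | C    | 34       |         |         |     |        |    | 3      | 3  | 1  |
| S10195    |          |         | Jordan       | Germany  | C    | 34       |         |         |     |        |    | 3      | 4  | 1  |
| A10196451 |          |         | Stavbon      | Sweden   | C    | 34       |         |         |     |        |    |        |    |    |
| N26163    | R38X2    |         | Fritsch      | Czech    | C    | 12       |         |         |     |        |    |        |    | 0  |
| N39377    |          |         | Obendorf     | Germany  | C    | 12       |         |         |     |        |    |        |    | 0  |
| N57225    | XKCE3    |         | Livingston   | Germany  | C    | 12       |         |         |     |        |    |        |    | 0  |
| 42790     | RAQXW    |         | Brenneman    |          | C    | 37       |         |         |     | 6      | 12 | 3      | 6  | 2  |
| B0003     |          |         | Krak         | Czech    | C    | 25       |         |         |     |        |    | 1      | 3  | 3  |
| N74113    |          |         | Kiewat       |          | C    | 25       |         |         |     |        |    | 2      | 2  | 2  |
| A2983     |          |         | Undisclosed  | Austria  | C    | 33       |         |         |     |        |    | 2      | 4  | 1  |
| S10290    |          |         | Layer        | Germany  | C    | 34       |         |         |     |        |    | 3      | 4  | 4  |
| S10231    |          |         | Karozewski   | Hungary  | C    | 34       |         |         |     |        |    | 3      | 5  | 4  |
|           | A9FVE    |         | Weiand       | Germany  | C    | 37       |         |         |     | 4      | 7  | 3      | 4  |    |
|           | 5XP46    |         | Ludwig       | Germany  | C    | 37       |         |         |     | 5      | 5  | 2      | 2  | 0  |
|           | 4Q933    |         | Kephart      | PA       | C    | 37       |         |         |     | 5      | 5  |        |    |    |
|           | 2N3UM    |         | Oppitz       | Germany  | C    | 37       |         |         |     | 6      | 6  |        |    |    |
|           | 3K4Y2    |         | Lintner      | Germany  | C    | 37       |         |         |     | 6      | 7  | 2      | 4  |    |
|           | 7R3RV    |         | Mowers       | Canada   | C    | 25       |         |         |     |        |    | 2      | 3  |    |
|           | K48RR    |         | Mowers       | Canada   | C    | 25       |         |         |     |        |    | 2      | 3  |    |
|           | CQUMV    |         | Muise        | France   | C    | 25       |         |         |     |        |    | 3      | 4  |    |
|           | K8VAB    |         | d’Entremont  | France   | C    | 25       |         |         |     |        |    | 3      | 4  |    |
|           | Ysearch  |         | 6 more       |          | C    |          |         |         |     |        |    |        |    | 0  |
| 166692    | 8FTXT    | +       | Gebert       | Germany  | L    | 111      | 5       | 7       | 10  | 5      | 5  | 3      | 4  | 3  |
| 162917    |          | +       | Fredeen      | Sweden   | L    | 111      | 5       | 8       | 18  | 9      | 12 | 4      | 6  | 4  |
|           | WHFQB    |         | Froetscher   | Germany  |      | 67       | 6       | 5       | 14  | 9      | 11 | 2      | 4  |    |
| N91348    |          | ---     | Butman       | England  |      | 111      | 8       | 8       | 11  | 4      | 7  | 2      | 2  | 2  |
| E8272     |          | L241    | Abdurrah     | Kosovo   |      | 67       | 9       | 7       | 15  |        |    |        |    |    |
| 97005     | CBF87    |         | Strejc       | Unknown  |      | 67       | 9       | 8       | 16  |        |    |        |    |    |
| N81087    |          |         | Folia        | Romania  |      | 67       | 10      | 10      | 18  |        |    |        |    |    |
| 256345    | UW9R2    | ---     | Hunt         | Ireland  |      | 67       | 10      | 11      | 15  |        |    |        |    |    |
| 143479    |          |         | Mastel       | United K |      | 67       | 10      | 9       | 17  |        |    |        |    |    |
| 199300    | EJ4B6    | ---     | McKrell      | Unknown  |      | 67       | 10      | 9       | 18  |        |    |        |    |    |
| 44601     |          |         | Ogles        | Scotland |      | 67       | 10      | 10      | 10  |        |    |        |    |    |
| 31263     |          |         | Steubinger   | Germany  |      | 67       | 10      | 12      | 19  |        |    |        |    |    |
|           | YHTGH    |         | Hooper       | USA      |      | 67       | 10      | 9       |     |        |    |        |    |    |
|           | 2A786    |         | Frye         | Unknown  |      | 67       | 10      | 10      |     |        |    |        |    |    |
| 98212     |          | L241    | Baber        | England  |      | 111      | 11      | 11      | 22  |        |    |        |    |    |
| 39430     | BAW43    | L241    | Johnson      | United K |      | 67       | 11      | 9       | 11  | 7      | 10 | 5      | 6  | 3  |
| 56091     |          | L99     | Hardy        | United K |      | 67       | 11      | 13      | 19  |        |    |        |    |    |
| E-M35     | Database |         | 11 More      |          |      | >66      | 11      | 7 to 11 |     |        |    |        |    |    |
| E-M35     |          | ---     | 7 Samples    |          |      | >66      | 12      | 7 - 10  |     |        |    |        |    |    |
| E-M35     |          | L241    | 3 Samples    |          |      | >66      | 12      | 10 - 11 |     |        |    |        |    |    |
| E-M35     | Database |         | 20 More      |          |      | >66      | 12      | 8 - 14  |     |        |    |        |    |    |
|           |          |         | V22 Modal    |          |      | 67       |         | 23      | 33  |        |    |        |    |    |
|           |          |         | V13 Modal    |          |      | 67       | 13      | 11      | 16  |        |    |        |    |    |
|           |          |         | V78 Modal    |          |      | 67       |         | 12      | 18  |        |    |        |    |    |
| N13674    |          |         | Straub       | Germany  |      | 37       |         |         |     | 9      | 12 | 5      | 7  | 3  |
| 144860    |          |         | Giles        | Unknown  |      | 37       |         |         |     | 9      | 13 | 6      | 9  | 5  |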

 

           Comment 23 Mar 2013:  this discussion needs update for my new definitions.

           Update 9 Mar 2013:  Two samples, E10751 and 174240, are not in the E-M35 Project;  these two were brought to my attention by Paul Svercl (in the table), who noticed them on the Heinila site.

           The following discussion was updated 21 Jan 2013:

           The 24 cluster C samples are at the top of the table, with “C”.

           There are only two confirmed L540 outside C type, with “L”.

           Those numbers are step, which is the same as what Ysearch calls genetic distance - total mutations compared to the modal haplotype.

           Those headers are modal haplotypes.  For example C49(67), the definition of C type, uses 49 of the standard 67 markers, with cutoff 7, meaning all samples at step less than 7 are likely C type.

           Similarly, L45(67) cutoff 6 is the definition of L540 including samples outside C type.
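
           As a rough illustration of step and cutoff, here is a minimal Python sketch.  It assumes step is a plain sum of absolute differences in repeat count, and does not try to reproduce the Ysearch or Haplozone methods exactly.  The toy modal uses only the three L540 values quoted in the Fredeen topic, not the real 45 marker definition, and the function names are hypothetical.

```python
# Toy modal built from the three L540 values quoted in the Fredeen topic:
# (385a, 406, 444) = (17, 11, 13).  The real L45(67) definition uses 45 markers.
TOY_MODAL = {"DYS385a": 17, "DYS406": 11, "DYS444": 13}

def step(sample, modal):
    """Total repeat-count difference from the modal, over the modal's markers only."""
    return sum(abs(sample[marker] - value) for marker, value in modal.items())

def within_cutoff(sample, modal, cutoff):
    """True when step is less than the cutoff, the rule used for the table."""
    return step(sample, modal) < cutoff

gebert = {"DYS385a": 17, "DYS406": 12, "DYS444": 13}     # values quoted for Gebert
neighbor = {"DYS385a": 16, "DYS406": 12, "DYS444": 12}   # ancestral neighborhood values

print(step(gebert, TOY_MODAL), step(neighbor, TOY_MODAL))
```

           Markers outside the modal are ignored, which is why a definition such as L45(67) can give a smaller step than the full 67 marker modal for the same sample.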

           C type is a hypothetical subdivision of L540, so samples predicted C type are of course predicted L540+.

           Of the 1204 samples at 67 markers in the E-M35 Project (on 18 Mar 2013) these are all the samples that match L45(67) at less than step 11.  None of the E-M35 samples have step 6 or 7.  Steps 6 and 7 are the gap.  This demonstrates how well isolated L type is in STR matching.  There are only 3 samples at steps 8 and 9, and 2 of them are confirmed negative for L540.

           My recommendations for L540 testing:  Any future samples in the gap should be tested for the L540 SNP.  Just below the gap, samples at steps 4 and 5 have roughly 85% and 80% probability of being L540 members (my estimate), so those should also be SNP tested for assurance.  Samples below step 3 are more than 90% likely to be L540+, so testing is useful but less necessary.

           I have low confidence in prediction above step 7 in that L45(67) column, because there is little data.  It is misleading to observe that there are no L540+ above step 5, because I have adjusted the definition for new SNP results.  My very rough guess for L540+ result probabilities:  40% for each new untested sample at step 7, 20% at step 8, 10% at step 9, 5% at step 10.

           That L99 sample is “private”, with only one known member, at step 11 in the table.  Many L540+ samples have tested L99-, but that sample has not been tested for L540, so there is a remote (perhaps <1%) probability that L99 is a branch of L540.

           There is no SNP test yet for C type.  C type has evidence of a subtype:  the gap at step 3 in C27(67).

           Prediction has lower confidence at 37 or fewer markers.  Those columns for fewer markers are my old step numbers using my older definition for C type.  I plan to update this table soon, using a new C type definition.

           I use the Ysearch method for calculating step, which gives a result slightly different than the Haplozone method.  There is also a one marker discrepancy mentioned in the Ysearch topic.

           The column 33(37) uses the 33 markers from that 61 marker modal that fall among the standard 37 markers.  Similarly for 23(25).  For comparison, there are columns using full standard marker sets.

           The column 12 is for samples with only 12 markers, where prediction is highly unreliable.

           Topics above explain my Excel analysis files, with more details about the definitions L45(67), C49(67), and others, and about L540 and C type.

           If you are not in the table, see the Ysearch topic below for a way to evaluate your step separation from C type and L540.

           If you are a neighbor and wish to be added to this table, please let me know.

 

Gwozdz

 

           My sample is kit N16800.  N81304 is my 3rd cousin Gwozdz.

 

Kargul

 

           Aloysius Kargul (Kargol) is my closest STR match available on the web.  Kit 199446.  In May 2010, his daughter noticed, on ancestry.com, that he and I are perfect matches at 12 markers.  I studied the LDS microfilms and located his 1820’s Kargul ancestor living in a village in Poland only 20 miles away from the village of my Gwozdz ancestor.  I paid for his FTDNA sample.  Kargul is in the table above.  His L540 test came out positive, placing him in that new haplogroup.  We are 5 steps apart at 67 markers;  9 at 111 (4 Jul 2011 update).

           For estimating the size of C type or L540, my cousin and Kargul should not be included, because I recruited them, paying for their tests.  Family sets such as these distort size estimates, compared to random data.  In other words, C type really has only 16 samples, not 18 (on 4 July 2011) if compared to other clusters (which should also be adjusted for family sets).

 

Butman

 

           New topic 13 May 2011.  Update 17 Jul 2011:  Butman’s L540 SNP test just came out negative;  that means he is not a member of the new L540 haplogroup.

           Raymond Butman, kit N91348, is right on the edge of the predicted C type using the old 61 marker definition.  This sample is a recent addition to the M35 database.  His step using the 61 marker definition is 8.  My cutoff for the definition before this sample showed up was 7 because of the gap - no samples from step 7 to 12 at that time.  When I wrote this topic after Butman’s sample showed up, I changed my cutoff to 9 and I pointed out here that this sample might land in V13C (L540), but it was a close call, not a confident prediction.

           Although this sample matches most of the markers of the definition, it misses at the two best signature markers, DYS389-2 and DYS594.

           I subsequently developed a new L540 definition that excludes Butman.

           What does this mean?  The simplest explanation:  Butman’s node in the Y-DNA tree (his male line common ancestor - branch joint) seems slightly older than the nodes of L540 members (in the database so far today).  His node is older than the L540 mutation, and also seems older than the DYS389-2 mutation, and also older than the DYS594 mutation.  Most of his other STR values match the L540 definition because his node is not much older, so there has not been much time for more mutations.

           This simplest explanation is a good statistical prediction, not a proof.  Other less likely explanations are possible.  For example Butman might be an outlier from a clade very distant from L540, where most men have different STR values, where he might have many matching STR values to L540 due to the luck of random mutations in his male line.  Yet another possible explanation:  Butman might belong to a very small clade with a much older node with L540, even older than the node for the men at steps 7 and 8, but the ancestor at that node might just happen to have STR values very close to the values for the L540 ancestor, due to the luck of random mutations, and now Butman is the only sample available from that very small clade.

 

Gebert

           Update 30 Sep 2011:

           I noticed Gebert’s sample on Ysearch and encouraged him to join the E-M35 project, which he did, kit 166692 in the table above.  I helped pay for the orders for the L540 test and for the 111 extension.  This data is important because his STR values place him near the predicted cutoff of C type.

 

Fredeen

           Update 30 Sep 2011:

           Kit 162917, Fredeen, came up L540+ in May 2011.  This was significant because this was the first L540+ sample outside C type, later joined by Gebert.

           Logically, Fredeen and Gebert might be outliers from cluster C, with back mutations in 389II and multiple other mutations just due to the luck of random mutations.  This is very unlikely.

           These two samples most likely represent two old branches of L540, with older nodes than the C type node.  They do not match each other well at 67 markers.  Their closest matches at 67 markers are each other and C type samples.  There are no close STR matches to either in the 67 marker data.  Any future close STR matches would of course be predicted L540+ in the corresponding branch.

           Three markers are of interest here:  (385a,406,444) = (17,11,13) are the values for L540 and C type.  Most samples in the STR neighborhood just outside L540 have the ancestral values (16,12,12).  Fredeen is ancestral for two of these, (16,11,12), while Gebert is ancestral for one of them, (17,12,13).  This is evidence that Fredeen’s node is older, but the evidence is statistically unconvincing.  We realize that these 3 markers may have experienced mutations after the two nodes of interest.  Both Fredeen and Gebert have 111 marker data, which does not help out for this question.

 

Hohnloser

 

           Hohnloser (kit N39989) fell into C type at 37 markers in 2010 (marginally), but not at 67 markers (not particularly close).  He is not a member of cluster C because his sample does not match the Friedman signature (originally at 25 markers).  He provides an interesting example of how statistics works - in this case, due to the luck of random STR mutations, the sample is close to C type only at 37 markers.

           Hohnloser does not belong to the L540 haplogroup because his SNP test came out negative.

           Hohnloser has extensive family tree research results.  He administers a Hohnloser project at FTDNA.  He exchanged helpful email discussions with me.

 

Structure of the L540 Haplogroup

           Complete rewrite 16 Jul 2011.

           C type is a hypothetical haplogroup within L540.  The evidence is presented throughout this web document, particularly in topics about my 67 marker analysis files and about DYS389II.  My estimate is about 99% confidence that my C type definition corresponds to a clade that will be proven to be a haplogroup by a newly discovered SNP - someday when tests for new SNPs are more comprehensive and lower cost - continuation of the current trend.  My estimate is that 98% (80% confidence range more than 90%) of the samples predicted C type by my definition will end up in such a haplogroup.

           This confidence is based on a combination of statistical calculation, plus judgment where calculation is not possible;  for more discussion see my confidence topic.

           C type seems to be about 80% of L540, based on only 2 L540+ not C type, compared to 7 independent C type samples, for 77.8%, in the 67 marker data.  Confidence in this 80% is not high because it is based on only two samples.  There may be more samples in the STR neighborhood that will test out L540+ in the future, with STR values quite different than my current L540 definition.  Also, there may be L540- samples in the future that match my current L540 definition.
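
           The 77.8% figure, and how loosely 9 samples pin it down, can be checked with a short Python sketch.  The Wilson score interval used here is a standard textbook formula and is not taken from the analysis files.  It assumes the 9 samples behave like a random draw from L540, and it covers sampling error only, not the definition issues mentioned above.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

c_type = 7          # independent C type samples at 67 markers
l540_not_c = 2      # L540+ samples outside C type
n = c_type + l540_not_c

fraction = c_type / n
low, high = wilson_interval(c_type, n)
print(f"{fraction:.1%}  ({low:.0%} to {high:.0%})")
```

           With these counts the interval runs from roughly 45% to 94%, which is consistent with the low confidence stated for the 80% estimate.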

           In other words, I have high confidence that C type predictions are L540+, but not high confidence in the L540 predictions outside C type.

           I expect to update my L540 definition as data accumulates, thereby improving the confidence for L540 STR predictions outside C.

           For now, I recommend the L540 test to everyone in the neighborhood table, with particular emphasis on those with closest step to my current L540 definition, and with the caveat that there may be more L540 beyond that table.

           Cluster P probably corresponds to a small Polish clade within C type.  It may take some time to find an SNP to define such a small haplogroup.

           C type is close to the same as a clade defined by a particular mutation(s) at the DYS389II marker, from value 30 to 32.  We can even speculate that clade is identical to C type data.

           A particular mutation at the DYS594 marker, from 11 to 12, seems to define a “father” clade that is slightly older than L540.  The same might be true for a particular mutation at the DYS636 marker, although more 111 marker data is needed to estimate if 636 is younger or older than 594 and / or L540.  With more data, these two markers will provide a foundation from which the age of other mutations can be estimated, because independent mutations in other markers are unlikely to also have mutations in these two foundation markers.

           The L540 data is bimodal in a number of markers.  Each of these bimodal markers is evidence that there is a significant subclade division that might be determined for L540 structure.  However, these various suggestions point to different divisions;  so far no two of them are strongly correlated.  With more data, it may be possible to split L540 based on statistical correlation of STR markers, using my mountain method.

 

Call for 111 Marker Data

           Update 30 Sep 2011:

           FTDNA provides a 67 marker standard set of STR markers.  I have been using this 67 set for analysis.

           A standard set of 111 markers is now available.  For existing samples with 67 or fewer, an extension to 111 can be purchased from FTDNA.

           I am hopeful that 111 marker data will enable me to construct a high confidence family tree for the L540 haplogroup, as more data accumulates.  Indeed, the marker DYS636, one of the 111 extension markers, has already provided an additional signature marker for L540.

           Additional markers make it more likely that haplogroups and types can be subdivided with confidence.  Indeed, marker DYS445, one of the 111 extension markers, has already provided an additional signature marker for the P cluster division of C type.

           If you are in my neighborhood table above, please consider ordering the FTDNA panels of additional markers.  Some of us have already ordered, as indicated in the table.

           It is helpful for me to include the neighborhood just beyond L540 in this request, for better determination of ancestral STR values.  In addition, the marker CDYb may provide a definition for a “father” clade of L540 if additional markers correlate with CDYb.

           The current 67 marker data includes a few L540+ samples where I paid for the extra data, but it also includes one sample that was in the neighborhood table but has been removed because I paid for the extra data, which showed it is not an L540 neighbor.  So my help is causing a bias in that table above, but not a significant bias yet in my Excel analysis files.

 

Ysearch

 

           Update 17 Jul 2011:

           479H7 is a direct link to my modal haplotype for L540.

           QAZ7P is a direct link to my modal haplotype for C type.

           If you are not listed in the table above you can compare your data on Ysearch.  You can compare your step genetic distance to these modal haplotypes if you have the standard 12, 25, 37, or 67 markers.  The comparison may not work if you have a non standard marker set.  The cutoff for each marker set is given in the legend in the table above.

           Brief description of Ysearch.  Link to the site home:  http://www.ysearch.org.

           To join Ysearch, click on the Create A New User tab, where you can upload your Y-DNA STR data from a number of testing services.  Or, you can type in your data.  You end up with a “User ID”.

 

           Instructions for comparison to V13C at Ysearch:

           Click here:  Research Tools (or click on the tab with that name)

           Copy the following line into the “UserIDs” bar at the Research Tools page:

                                  USEID, 479H7, QAZ7P

           Change USEID to your User ID.

           You need to type the Captcha puzzle for access.

           Click on “Show genetic distance report” to see your step genetic distance from C type and from L540.

 

Ancestry.com

 

           www.Ancestry.com  is the web page for a commercial DNA testing company.  Men with Y-DNA test results can choose to make results available for matching to others.  Kargul originally matched with me at this site.

           I last checked for matches 16 May 2011.  There are 9 close matches of Y-DNA to Kargul & me, but these are not close enough to include in my Neighborhood Table.

 

Age of C Type

Age of L540

 

           Comment 25 Mar 2013:  this topic needs update modification using the latest data.  The new on-line Excel file versions do not yet have the ASD sheets.  Coming soon.

           Topic update 11 Jul 2011:

           The discussion in this topic is based on the sheet “ASD” in the two 67 marker analysis files.

           Average Squared Distance (ASD) is equivalent to variance of STR values.  Most people use ASD to calculate age in genetic genealogy, as I explain in my publications.  The ASD method has large known systematic uncertainties, discussed in my publications, which make age calculation uncertain.  It is not possible to calculate a confidence range because the systematic errors might be larger than the statistical errors, even for small samples of data.

           C type is quite young.  The age using all 67 markers comes out 805 years, cell N12 on the “ASD” sheet in C67xls.

           Although I do not have high confidence in estimating the exact age of C type, there is additional evidence that C type and L540 are young.  The fact that C37, using the first 37 markers, provides a reasonable definition of C type is evidence of youth.  Old haplogroups do not provide reasonable modal definitions using all 37 markers, because of the wide variation in the rapidly mutating markers.  Another way of saying this:  using all 37 markers, there is a lot of overlap of old haplogroups.  Another way of saying it:  Isolation in STR values is evidence of youth.  My publication elaborates on this.

           The fact that C type samples can be extracted from V13 using only a 4 marker signature is also evidence of isolation and youth.
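
           The idea of extracting samples with a short marker signature, and then measuring step distance from a modal haplotype, can be sketched as follows.  Everything here is hypothetical:  the marker names M1 to M6 are placeholders, the repeat values are invented, and the step distance is a plain sum of absolute differences, which may not match how the analysis files or the testing companies count steps on every marker.

```python
# Placeholder modal haplotype: marker name -> repeat count (invented values).
modal = {"M1": 13, "M2": 24, "M3": 13, "M4": 10, "M5": 16, "M6": 11}

# A hypothetical 4-marker signature: the values a sample must match exactly.
signature = {"M1": 13, "M2": 24, "M3": 13, "M4": 10}

# Invented samples, keyed by a made-up sample name.
samples = {
    "s1": {"M1": 13, "M2": 24, "M3": 13, "M4": 10, "M5": 16, "M6": 11},
    "s2": {"M1": 13, "M2": 24, "M3": 13, "M4": 10, "M5": 17, "M6": 12},
    "s3": {"M1": 13, "M2": 25, "M3": 13, "M4": 10, "M5": 16, "M6": 11},
}


def matches_signature(sample, sig):
    """True when the sample has exactly the signature value on each marker."""
    return all(sample.get(marker) == value for marker, value in sig.items())


def step_distance(sample, reference):
    """Sum of absolute repeat-count differences over the reference markers."""
    return sum(abs(sample[marker] - reference[marker]) for marker in reference)


extracted = [name for name, s in samples.items() if matches_signature(s, signature)]
distances = {name: step_distance(s, modal) for name, s in samples.items()}

print(extracted)
print(distances)
```

           In this toy data s3 is only one step from the modal yet fails the signature, while s2 passes the signature at two steps.  A signature and a step cutoff are different filters, and a clade that is well isolated tends to pass both cleanly.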

           Low SBP is evidence that C type and L540 are well isolated;  see the analysis for SBP.

           I expect the age to creep up somewhat as new data is discovered.  The standard statistical way to correct for this expectation is to divide by N-1 instead of N when figuring ASD.  My files do not use N-1 because that is not the common practice in genetic genealogy.  Using N-1 the age is 939 years instead of 805.  This estimate includes a best guess for future samples;  eventually samples will show up that fit C type but have more mutations than the samples so far - either because of true older nodes within the Y-DNA tree or just due to bad luck in random STR mutations.
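
           The arithmetic behind the N-1 figure can be checked with a short sketch.  It assumes only that the age is proportional to ASD, so that changing the divisor from N to N-1 multiplies the age by N/(N-1).  The function name is made up for this illustration;  the 805 years and the 7 samples are the numbers used in this topic.

```python
def n_minus_1_age(age_with_n, n_samples):
    """Rescale an age computed with divisor N to the divisor N-1."""
    return age_with_n * n_samples / (n_samples - 1)


corrected = n_minus_1_age(805, 7)
print(round(corrected))  # 939
```

           The correction shrinks as N grows:  with 7 samples it adds about 17 percent, while with 50 samples it would add only about 2 percent.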

           Conclusion:  My best guess for the age of C type is about 1,000 years. 

           Of course, this result has very low confidence, because it is based on only 7 samples.  The true age might be up to a factor of 4 older.  Or it might be a lot younger.  Although it is not possible to calculate a confidence range that includes systematic errors in age estimation, I suppose a reasonable high confidence range estimate would be 500 to 2000 years old.

           Discussion of C age estimation.  My calculation excludes two samples, Gwozdz and Kargul, whom I recruited, because a random selection is required.  Including them yields 806 years, corrected to 907, not significantly different.  All 67 markers are valid for use in a young clade like this, because there are no recLOH issues in the C data.  Issues in recLOH usually cause problems using all 67.  Those analysis files include comparison to a classic “Thomas” method, which I explain in my publication, and which yields 292 (vs 805) years, but I consider that just a demonstration that the old method uses too few markers, which by luck do not vary much in the C data.  My files allow the user to easily vary the markers and easily vary the samples.  Boyd seems to be the oldest sample;  removing him lowers 805 to 712, corresponding to the age without his hypothetically older clade.  The “oldest” two markers (far right of the ASD sheet has a sort by age) are DYS460 and DYS385a, with 3975 and 3058 years respectively;  this is evidence that those markers might correspond to clades within C type, but the evidence is not compelling yet because some markers are bound to come out old just by luck.  My definition of C type of course excludes such old markers, which is a proper procedure for a good definition.  But it is not fair to exclude old markers in age estimates because the zeros balance the old ones statistically;  C type has 45 markers with zero age - the lucky ones with no mutations.  Many zeros is evidence of youth but also evidence of few samples - only 7.  With more samples there will be fewer zeros.  Most published ASD age calculations include a correction making the result older, but the reason for that correction is population bottlenecks, which reduce ASD;  since C type is large for a very young clade, I’m guessing the age corresponds to a rapid population expansion after which there were no significant bottlenecks.  If I am right, no such correction would be appropriate.  Also, most published ASD ages use N instead of N-1, so a published correction method should be applied to my first number, 805.
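
           Three of the checks described in this discussion (ranking markers by apparent age, counting markers with zero variation, and removing one sample at a time) can be sketched on toy data.  This is an illustration only:  the sample names, marker names and repeat values are invented, variance stands in for per-marker age, and the layout of the ASD sheet is not reproduced.

```python
from statistics import mean

# Invented data: sample name -> repeat counts for placeholder markers M1..M4.
markers = ["M1", "M2", "M3", "M4"]
samples = {
    "s1": [13, 24, 13, 10],
    "s2": [13, 24, 14, 10],
    "s3": [13, 25, 13, 10],
    "s4": [13, 24, 15, 10],
}


def variances(rows):
    """Per-marker variance about the mean, dividing by N."""
    out = []
    for column in zip(*rows):
        center = mean(column)
        out.append(sum((v - center) ** 2 for v in column) / len(column))
    return out


def mean_asd(rows):
    """Average of the per-marker variances; proportional to the age estimate."""
    return mean(variances(rows))


all_rows = list(samples.values())
per_marker = dict(zip(markers, variances(all_rows)))

# Markers with no mutations at all: the "zero age" markers.
zero_markers = [m for m, v in per_marker.items() if v == 0]

# Markers sorted from "oldest" (largest variance) to "youngest".
ranked = sorted(per_marker, key=per_marker.get, reverse=True)

# Leave-one-out: mean ASD with each sample removed in turn.
leave_one_out = {
    name: mean_asd([row for other, row in samples.items() if other != name])
    for name in samples
}
oldest_sample = min(leave_one_out, key=leave_one_out.get)

print(zero_markers, ranked, oldest_sample)
```

           The sample whose removal lowers the mean ASD the most plays the role that Boyd plays above.  Dropping the top-ranked markers from the age estimate, while keeping the zero markers, would bias the result young, which is the point about the zeros balancing the old ones.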

           An age calculation for L540 is not appropriate, although L67.xls does it, getting 971 years in cell N12 of the ASD sheet.  That’s not fair.  C type dominates the data, but C type is really only one man, the MRCA.  Our data for L540 is really only 3 men:  the hypothetical C type MRCA, Fredeen, and Gebert.  It is well known that estimating the TMRCA of two men is highly uncertain, just due to the luck of random mutations.  For 3 men it is not much better.  The L540 age (TMRCA) is surely older than C type, just a bit older based on the meager data here of 2 men with STR values different than C type - but not very different at 67 markers than the C type samples with highest step.

           Age of a mutation is of course older than the TMRCA because there should almost always be multiple generations between nodes.  We know the L540 mutation happened after the node for Butman and before the nodes for Fredeen and Gebert.  That assessment will get better with more data.

           I said in the Abstract here that L540 might be twice as old as C type.  That is just a guess.

 

Origin of L540 and C type

           Update 17 Jul 2011:

           The neighborhood table shows that 9 of the 18 cluster C men indicated “Germany” in the “Origin” field of their data.  One of the two L540 men outside C indicates “Germany”.  That is very good (although not convincing) evidence that the ancestors (MRCA) of C type and L540 lived in what is now Germany.  A caveat:  men of German origin are more likely to purchase a DNA test and submit data to web databases.  It is obvious from data searches, in Ysearch for example, that men of east European ancestry are underrepresented.  This sample bias is difficult to measure, but I doubt the bias is sufficient to rule out a German origin as our best guess.

           The parent V13 haplogroup is concentrated in the Balkans, according to density maps on the web.  I’m guessing that our L540 MRCA lived in Central Europe, but I do not know that.  This cannot be checked with data available today, because the published Balkan Y-DNA data has too few markers to distinguish L540 or C type.  I look forward to the near future when data with more STR markers become available from the Balkans to verify my guess.  L540 SNP tests of Balkan samples would also help;  I expect almost all of them to be negative.  On the other hand, if L540 is common in the Balkans, that would imply a probable MRCA origin in the Balkans, with subsequent expansion into Central Europe.

           Bird published evidence for a hypothesis that E1b1b1a1b (V13) appeared in England, concentrated at the two locations of ancient Roman garrisons, because of men from Moesia Superior who joined the Roman Legions when the Romans conquered the Balkans.

           Speculation:

           Model L540A:  V13 in Europe springs largely from Roman Legionnaires from the Balkans.  Due to the statistics of Y-DNA, most men do not form lasting clades, but many Balkan Legionnaires were lucky enough in their male line descendants so that many small V13 clades in Europe today correspond to individual Roman Legionnaires.  Because enlistment in Balkan armies, and subsequent enlistment in the Roman army, is largely random from the point of view of Y-DNA, these clades are a random selection from a much larger population, so the Legionnaire founders had very variable STR values.  The clades today have STRs quite different from each other.  Most clades are small enough that no samples, or only 1 or 2 samples, are present from each in the databases today.  The two L540 samples from outside C type, and the one sample just outside L540, and others not yet tested for the L540 SNP, represent such small clades.  C type is an exception, with 18 samples available today.  The MRCA of C type was a descendant of one of these Legionnaires, but that MRCA lived about 1,000 years later, in what is now Germany.

           Model L540A1:  C type is larger just by luck.  Statistically, some clades are necessarily larger than others.  It is not very surprising that one clade is unusually large.  We do not notice small clades, so of course we are now studying C type because it stands out.

           Model L540A2:  C type is larger because a descendant who lived about 1,000 years ago was a king or otherwise very prominent man, so his family grew much faster than others.  His ancestors were not prominent, so C type is isolated, but not more isolated than those other small clades from Roman times.

           Model L540A3:  C type is larger because of a local population expansion during Medieval times.  The C type MRCA was one of many who participated in these good times, but the others were from other haplogroups.  For example, the population expansion might have been associated with a Germanic R1b tribe, where the C type MRCA was an outsider who joined the tribe before the expansion, along with outsiders from other haplogroups.

           Model L540A4:  C type is larger for another population expansion reason.

           Model L540Aa:  The relatives of the Roman Legionnaires stayed in the Balkans.  In the near future L540+ samples will show up from the Balkans.  C type will show up, and it will not be particularly isolated in STR values, because the ancestors will not be the same as the C type MRCA.

           Model L540Ab:  There was a population bottleneck (or 2 or more bottlenecks) in the Balkans during the past 2,000 years.  War, famine, whatever.  There was a later population expansion in other haplogroups, so very few if any L540+ samples will show up in the Balkans.

           Model L540Ac:  That parent population did not get entirely wiped out.  It survives, in a remote area.  Maybe a group of villages in the Balkan mountains.  The population has not grown much over the centuries.  If we go there and test for Y-DNA we’ll find lots of L540 men, some of them C type.

           Model L540B:  Not Roman.  A tribe of barbarians showed up in Germany about 1,000 years ago.  Prior to that, they passed through another region, not necessarily the Balkans, where a lone L540 individual joined them.  This is similar to Model L540A3.  There are Ba, Bb, Bc variations similar to Aa, Ab, Ac.

           Model L540C:  He didn’t join a Roman army.  He was a medieval trader.  A very charming traveling salesman.  He fathered children all over central Europe, mostly in what is now Germany.  This model has similar variations to models A and B.

           Model L540D:  I can think of other speculative scenarios.  I’m sure you can, too.

           The point of these examples:  we don’t know the history, but C type is unusual in that it is young, small, and well isolated in STR values (a small mountain in haplospace).  It is not closely related to the rest of L540 or V13.  The migration history of the very large V13 parent haplogroup may or may not be relevant to the history of the relatively small C type hypothetical haplogroup.

 

Validity of C Type

 

           Update 10 Jul 2010.  Quite frankly, I was surprised by cluster C.  Friedman did a good job finding this one.  I admit I dismissed it when I first saw cluster C in 2007 because it was so small that statistical significance did not seem possible to me.  I postponed analysis until Jan 2010, independently verifying cluster C as C type.

           By “valid” I mean a cluster whereby most of the samples belong to a single clade, and whereby very few other samples in the database belong to that clade.  In other words, a valid cluster should eventually have a corresponding SNP discovered.  Throughout 2010 I confidently predicted such an SNP here in this topic, although I doubted it would be discovered soon.  L540 turned out to be almost the same as C type, although slightly larger and quite a bit older, as discussed elsewhere in this web page.  As samples predicted C type test L540+, this adds evidence that C type corresponds to a clade.

 

My WTY Analysis

 

           Update 23 Feb 2012:  Fifteen new SNPs were discovered in my “Walk Through the Y” (WTY):  L535 through L547, L614, and L618.  All 15 are available as commercial SNP tests from FTDNA.

           My WTY test read about 200,000 base pairs in Feb 2011.  As of Feb 2012 the test has expanded to twice that many.  For details, see the “WTY” commercial product page at FTDNA.

           I announced 8 new SNPs here on 29 Mar 2011.  The count on 30 Mar was 13 new SNPs in my WTY.  L614 was added in June.  L618 was added in August.  That was a lot more than I expected.  I now realize that’s because FTDNA expanded the number of DNA bases included in WTY just before my test.  Also, I seem to have been the first WTY from E-M78 in quite some time.  Since then, a few others from M78 and V13 have tested, so there are quite a few more new SNPs of interest recently discovered.

           I tracked the status on these 15 SNPs right here on this web page for a year, in detail.  Recently the positions in the Y-DNA tree have been determined for most of these SNPs of interest to me.  I recently removed most of the detail from this page.  I’m leaving the Summary, below, for a while because other people have links to that Summary.  I’ll remove most of this, including the Summary, later in 2012.

           In late 2011 the SNP Tracker was set up, as part of the E-M35 Project, to track all new SNPs of interest.  That’s another reason for me to drop my details here.  That SNP Tracker merges data from WTY, from the 1000 Genomes, and from SNP tests by members of the E-M35 Project.

 

SNP Summary

 

           Update 18 Jan 2013.  For a detailed SNP tree of the E-M35 haplogroup, see the SNP Tracker.  This topic used to have a summary of the SNPs found in my WTY, but the SNP Tracker is now a better place to find an update.  Only L540 defined a new haplogroup.  L542 is equivalent to V13.  The others are all equivalent to known haplogroup SNPs older than V13.

 

SNP Test Orders

 

           SNP tests cost $29 each from FTDNA if your sample is already there from previous testing.  Click on “Order an Upgrade” from your FTDNA home page (top right), then click “Order an Advanced Test” (do not click on “Order Advanced SNP Test”).  In the box “Test Type” select “SNP”.  Type the SNP code (for example L540) into the “Find” box to search for it.

 

References & Sources

 

           E-M35, a project at FTDNA, is my main source of data.  Previously called E3b.  Link:  http://www.familytreedna.com/public/E3b.  The official name today would be E1b1b1.  ISOGG changes the name when new defining SNPs are discovered, so the name may change again in the future.  M35.1 is the name of the SNP that defines E1b1b1 within haplogroup E.

           Haplozone is a web site for analysis of data from the E-M35 project.  Link:  http://www.haplozone.net/e3b/project.  It holds the data from E-M35 plus some data added from sources other than FTDNA, so this database is larger than that of the E-M35 project itself.  Page with a listing of proposed clusters:  http://www.haplozone.net/e3b/project/cluster/.  Discussion Forum:  http://www.haplozone.net.

           SNP Tracker is a web page added to the E-M35 project in late 2011, to keep track of all the new SNP branches in M35.  http://tinyurl.com/e-m35-snps

           The V13 data:  http://www.haplozone.net/e3b/project/cluster/10.  V13 is the defining SNP for E1b1b1a1b, the largest haplogroup in E.  This page of data does not have the data for samples that have been assigned to clusters as subdivisions of V13, just the data that does not fit any downstream proposed cluster.  The number code for other clusters can be typed over that “10” to quickly get to other cluster data.

           Cluster C Data:  http://www.haplozone.net/e3b/project/cluster/42.

           Victor Villarreal is an administrator for the E-M35 (E3b) Project.

           Andrew Lancaster is an administrator for the E-M35 (E3b) Project.  Andrew has been particularly patient with me with long helpful email discussions.

           Elise Friedman is a co-administrator for the E-M35 (E3b) Project and is administrator for the Jewish E3b project.

           Marko Heinila provides Y-DNA trees for E-M35 samples, including samples not in the E-M35 Project.  Links:

https://dl.dropbox.com/u/50201824/old/jstrees/e67tree.html

https://dl.dropbox.com/u/50201824/old/jstrees/etree.html

           Peter Gwozdz.  That’s me.  pete2g2@comcast.net.

 

Revision History

2010 Jan 14 original draft version

2010 13 updates

2011 Feb - Jun, 12 updates

2011 Jul - 10 updates

2011 Aug 6 minor update of SNP Summary, 2 more L540 results, other minor edits

2011 Aug 24 major update of SNP Summary;  M78 WTY results

2011 Sep 30 update of 111 markers, P cluster, Fredeen

2011 Oct 3 SNP Summary update

2011 Nov 19 L618 paragroup announcement

2011 Nov 20 slight edit of L618 news

2012 Jan 1 update of my WTY and status of the SNPs - not finished

2012 Jan 3 update of L542 status

2012 Jan 4 update of the five M78 SNP candidates

2012 Jan 5 update of the seven M34 SNP candidates;  finished SNP update

2012 Feb 23 update of SNP Summary;  remove most of the details of SNP tracking

2012 Mar 16 update Neighborhood Table

2013 Jan 21 update Neighborhood Table, drop SNP Summary details

2013 Mar 9 add Svercl cousin to Table

2013 Mar 21 new L45(67) definition;  more update of the Neighborhood Table;  update not finished

2013 Mar 23 new C46(67) definition;  more update of the Neighborhood Table;  update not finished

2013 Mar 24 continue update of table;  rewrite of first 6 topics

2013 Mar 25 continue update of table;  edit several topics