A New Y-DNA Haplogroup:
E1b1b1a1b8 (L540)
25 Mar 2013
Peter Gwozdz
pete2g2@comcast.net
News
24 Mar 2013 update of the L540 Neighborhood table using new definitions
for L540 and C type
17 July 2011 complete rewrite of this
web page
10 Jul 2011 ISOGG added E1b1b1a1b8 (L540) to the official
Y-DNA tree
Rewrite 25 Mar 2013:
This web document is a summary of my
information on a small haplogroup
of Y-DNA based on an SNP mutation named L540. The subject is
genetic genealogy.
There is a Neighborhood
table below with a list of samples (men)
predicted to belong to the L540 haplogroup, and also samples predicted to be in
the Neighborhood just outside L540. The
samples near the cutoff (borderline fit)
are the ones that should be tested for
the L540 SNP to see if they belong to the L540 haplogroup; probability of belonging decreases with the step number, as explained in the discussion
below that table.
This Abstract is for people
reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you might
prefer to first read an Introduction
that I wrote for another of my web documents.
L540 was discovered in my Walk
Through the Y (My WTY).
I purchased WTY, a commercial product that reads more than 200,000 base
pairs of the Y chromosome.
The clade
that we now call L540 was originally called cluster C,
a hypothetical cluster proposed as a
division of haplogroup E1b1b1a1b, which is defined by the SNP named V13. Hence I coined the name V13C.html for this
document about cluster C in early 2010.
I rewrote this document and renamed it L540.html on 30 Apr 2011. E1b1b1a1b (V13) is the largest haplogroup
division of haplogroup E, but cluster C is small. There are 50 samples predicted C type in that Neighborhood table.
I am not planning a separate L540
project, because it is more convenient to run this informally through the E-M35 project. Villarreal and Lancaster are
very helpful.
My equivalent name for V13C was “C type”, or just “C”.
I independently verified C type on 9 Jan
2010 as a good candidate type. I use the word “type” to mean an STR cluster
with statistical validity as established by my Mountain Method. I published my methods in the Fall 2009 issue of JoGG.
C type is quite young,
maybe only about 1,000 years old. The
L540 mutation is older, maybe more than twice the age of C. Such age estimates are uncertain,
particularly with so little data in this case.
L540 seems to be roughly 90% C type
plus 10% other, older branches, as explained in the next topic about L540.
Watch this document. I’ll add more information as data
accumulates.
Rewrite 25 Mar 2013:
L540 is the code name for an SNP that was discovered in my WTY,
announced 29 March 2011. On 27 Apr 2011
it was demonstrated that L540 defines a new haplogroup within E1b1b1a1b (V13).
I use the code name L540 for the SNP,
for the associated haplogroup, and for the samples
(men) in that haplogroup.
The L540 haplogroup includes C type, and C type is most of L540. The C that have been tested are all L540+.
The table
below has my predictions for L540 and for C type.
My sample has been tested negative
for all the 7 confirmed branches of V13, so L540 is an 8th branch of V13, not a
branch of one of those prior 7 branches.
Three of the 7 branches are too small
- few or no samples available on-line for testing: M35.2, V27, and P65.
Samples from the other 4 main
branches of V13 are available; I
recruited and paid for testing as needed:
L143, L250, L17, and L241. All
are L540-. That means none of these are
branches of L540. There are L241
samples in the Neighborhood table; this is evidence (not proof) that L540 and
L241 might be brothers, with a common undiscovered SNP branch of V13.
Anyone in the Neighborhood would benefit from ordering the L540 test.
In the discussion following the table, I give my recommendations for which samples would
most benefit from the L540 test.
ISOGG
names change as new SNP divisions are discovered. L540 was officially added 10 Jul 2011, code name E1b1b1a1b8. Code names change as new branches are
discovered. For example in early 2011
V13, the father of L540, was changed to E1b1b1a1b from E1b1b1a2. The official FTDNA tree does not have L540 yet (Mar
2013), but the unofficial
FTDNA tree does have L540 listed.
Update 24 Mar 2013: More results added to the Neighborhood.
Total 50 Cluster C. Two L540+
that are not C type. Total of 12 L540+
test results, 23 L540-. Most negative
results are not in the table, not close in STR step
value. Many V13 samples in WTY have had L540 results during the past two years; mine is
positive (included in the count of 12) and the others are all negative (not
included in the count of 23, and not in the table).
Rewrite 25 Mar 2013:
Clusters
are based on STR correlation. There are 50 samples predicted C type in the
Neighborhood table, labeled “C”. Some of these are marginal, with fewer than
67 markers, with only 80% confidence of belonging. 26 of those are listed at the Haplozone
site, in the V13 + L540 branch.
Cluster C includes me and my 3rd cousin (Gwozdz).
Friedman
proposed cluster C more than 4 years ago, based on STR and SNP
correlations, when the data was less than what is available today.
New samples appear when Friedman updates cluster C, or when I
update the Neighborhood.
Rewrite 24 Mar 2013:
I use the word type for an STR
cluster with statistical validity as
established by my Mountain Method. “Type” is my own term. I chose the word “type” because it is not
generally used in genetic genealogy and I wish to distinguish my types from
haplogroups and from other clusters. By “type” I mean the cluster data, the
hypothetical clade, the modal haplotype,
and the set of all possible haplotypes, at any number of markers. Accordingly, by “C type” I mean any or all
of these 4 things. I sometimes use just
“C” as short for “C type”. I also have
a previous C type identified in R1a; unrelated;
please don’t get confused.
My analysis files define C type. Sorry, it can be a bit confusing because I
have multiple STR definitions for C type, for various marker
sets. The number of markers in my
definitions changes slightly when new samples show up with unusual STR
values. I hope the meanings are clear
from the context of my discussions in this web document.
I also provide STR definitions for
L540, discussed below, treating L540 as L type.
C type is roughly 90% of L540. My evidence, considering only the independent samples with 67 or more STR
markers in the Neighborhood: there are 17 type L samples, and only 2 of them do
not fit C type but have tested positive (L540+) for the L540
SNP. I say “independent”
because some C type have been recruited based on known genetic relationship to
C type men (my Gwozdz cousin, Kargol, Svercel); these should not be counted in this estimate. I do not include samples with fewer than 67
markers in this estimate, because type prediction has lower confidence with
fewer markers, so I have been actively encouraging the L540 test primarily for
samples with 67 markers. Notice in the
table that most borderline samples (near the L type cutoff)
have L540 test results. I say “roughly”
for this estimate because I have been recruiting samples for this table at 67
markers with equal emphasis on borderline samples, in order to properly sample
the STR borderline, but it is difficult to prove no bias toward samples that
fit well; also there is a chance of outliers showing up in the future: either L540- among those samples that fit C
type very well, or L540+ among those beyond
the borderline. Also, with only 2 L540+
outside C type, the sampling confidence is not good.
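To make the "roughly 90%" arithmetic concrete, here is a minimal Python sketch. It uses only the two counts quoted above (17 type L samples, 2 of them outside C type). The Wilson score interval is my own choice for illustrating how wide the uncertainty is with so few samples; it is an assumption of this sketch and not part of the Mountain Method or the analysis files.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (illustrative choice)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

type_l = 17            # independent 67-marker samples predicted type L (from the text)
not_c = 2              # L540+ samples among them that do not fit C type (from the text)
c_type = type_l - not_c

fraction = c_type / type_l
low, high = wilson_interval(c_type, type_l)

print(f"C type fraction of L540: {fraction:.0%}")
print(f"approximate 95% interval: {low:.0%} to {high:.0%}")
```

The interval comes out wide, which is consistent with the remark that the sampling confidence is not good with only 2 L540+ outside C type.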
The evidence today indicates that
L540 is much older than C type - details below.
Rewrite 24 Mar 2013:
I coined the name V13C in 2010 to
represent C type, cluster C, the hypothetical haplogroup, and the samples (men)
in the hypothetical haplogroup. This
web document used to be named V13C.html.
Now that C type is a subdivision of L540 I am editing away the name “V13C”, but I’ll continue to
use “C type” for the hypothetical clade that is part of (most of) L540.
V13 is the defining SNP for
E1b1b1a1b, so I similarly use “V13” to mean the “father” haplogroup - the large
branch in the Y-DNA tree from which L540 is a small twig. I also use “V13” to mean the associated database of V13 samples at E-M35
or at Haplozone, or at other databases.
www.gwozdz.org/L67.xls is analysis of L540 STRs.
www.gwozdz.org/C67.xls is analysis of C type STRs.
L45(67) means a modal haplotype for the L540 haplogroup using 45 of the 67 standard markers. My definition
for L540 is L45(67), all samples less than the cutoff (genetic distance, or step) 6.
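As a rough illustration of how a definition of this kind can be applied, here is a minimal Python sketch. It assumes the step is the sum of absolute STR differences from the modal values, counted over the masked markers only, and that a sample fits when its step is less than the cutoff. The three markers, values, and cutoff in the example are made up for illustration; the real modal haplotype and mask are in the Excel files.

```python
def step(sample, modal, mask):
    """Sum of absolute STR differences over the masked markers only (assumed metric)."""
    return sum(abs(sample[m] - modal[m]) for m in mask)

def fits(sample, modal, mask, cutoff):
    """A sample fits the definition when its step is less than the cutoff."""
    return step(sample, modal, mask) < cutoff

# Hypothetical three-marker example, not the real L45 modal or mask.
modal = {"DYS390": 25, "DYS594": 12, "DYS444": 13}
mask = ["DYS390", "DYS594", "DYS444"]
sample_a = {"DYS390": 25, "DYS594": 12, "DYS444": 12}
sample_b = {"DYS390": 24, "DYS594": 11, "DYS444": 11}

print(step(sample_a, modal, mask), fits(sample_a, modal, mask, cutoff=2))
print(step(sample_b, modal, mask), fits(sample_b, modal, mask, cutoff=2))
```

Markers left out of the mask simply do not contribute to the step, which is why a 45-marker definition can be applied to 67-marker data.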
L45(67) is new on 21 Mar. I
also typed L45 into Ysearch as 479H7.
Similarly, C46(67) with cutoff 5 is
my definition of C type, introduced 23 Mar 2013, Ysearch QAZ7P.
My previous definitions are available
in those files, sheet “Haplotypes & Masks”.
Comment: This topic is currently being rewritten; coming soon. The remainder of this topic was written 10 Jul 2011, so it needs
modification.
Both those files have a sheet named
“Comments” with technical discussion.
Those files have “ASD” sheets,
discussed below in the Age topic.
The C49 definition is slightly better
than the older C61 because of the new data that accumulated over a year. Any reasonable C type definition using 30 to
55 markers does a great job extracting C type, because C type is very well
isolated in STR values - a mountain in
haplospace.
The reason for a separate L540 definition: Three new samples showed up, closest to C type but not C type; Butman, Gebert, and Fredeen. They are in the table
below. Butman tested L540- and the
other two are L540+, members of the L540 haplogroup. An optimized C type definition does not distinguish Butman from
the other two. The L540 definition
does.
The simplest explanation: Butman has
a male line with node in the Y-DNA tree slightly
older than the L540 mutation. Gebert
and Fredeen each have male lines with nodes in the Y-DNA tree younger than the
L540 mutation. The node for C type is
much younger. More complex explanations
are possible; for more discussion see
the Structure topic.
L540 Summary. L540 is
isolated: All samples predicted L540
have step less than the cutoff 5. The gap is
steps 5 and 6 with no samples.
SBP
= 12.4% for the definition L27(67)
with cutoff 5. However, two samples
should be excluded from SBP calculation because I recruited them - Kargol and my 3rd cousin Gwozdz; recruitment artificially increases the size
of the clade (as % of database), lowering SBP.
Adjusting for this SBP = 15.9%.
This is good. See the Excel
file for details. In other words, I
have confidence that less than 16% of samples predicted as L540 by this
definition will not come out positive for the L540 SNP test, most of them at
step 4.
Two of the 5 samples at step 7 have
been tested L540-; I suppose the other
3 should be tested, since I do not have high confidence they will all come out
negative.
The table
below has the samples for which I recommend ordering
the L540 SNP test. The discussion
below the table provides my estimate of probability for L540+ vs step.
C type Summary. C type is
very well isolated: All C type samples
have step less than cutoff 3. There is
a gap with no samples from step 3 to
6. Steps 7 and 8 have only one sample
each. Step 9 has 2. 10 has 3.
11 has 12. Step 12 has 26
samples.
SBP
= 7.9% for the definition C49(67)
with cutoff 3. This is very good. See the Excel file for details. Adjusting for those two recruited samples
SBP = 10.9%. In other words, I have
confidence that less than 10% of samples identified as C type by this definition
will not belong to the C type haplogroup if and when an SNP marker is
discovered for C type in the future.
Again, most of those outsiders will be at step 2.
See the validity
topic below.
History. My quality factor is Statistical Background
Percent (SBP). My SBP is very sensitive to the confidence interval due to small sample
sizes; a valid
cluster usually starts out with high SBP when few samples are available and
then gets better, with lower SBP, as more samples accumulate over time for
better statistics. I introduced this
web page (as V13C.html) on 14 Jan 2010 using 37 markers with SBP = 37%, which
was not very good. By April 2010, C
type using C61(67) markers came out with SBP = 11.4%, which is very good. On 7 Feb 2011, with more data, C61(67) improved
to SBP = 7.3%, excellent. That C61
definition now has SBP = 7.0% (before correction) showing steady improvement as
data accumulated. However, that old C61
definition does not discriminate the new samples well; see column BZ in that C67.xls file. The new C49 definition has higher SBP but
discriminates better. It is likely as
more data accumulates an even better definition will become available, with a
slightly different number of markers.
17 July 2011 comment: This and the following 2 topics are based on
my Feb 2011 analysis. I update less
often at fewer than 37 markers. These 3
topics are C type only; data is not
good enough yet for L540 analysis below 67 markers.
I also have a file using only 37
markers for analysis: www.gwozdz.org/C37.xls. 189 samples.
The SBP using all 37 markers is
28%. That 37 marker column, cutoff 9,
captures the 10 samples assigned to cluster C and none
others. However, the other columns,
using the best markers, consistently capture only 8 of those samples, casting
some doubt on the 2 marginal ones. It
seems 67 markers are statistically required for marginal cases.
The C67.xls file does a
correlation; the data is copied here below: The
samples with all 67 markers are evaluated using only the first 37.
I also have a file using only 25
markers for analysis: www.gwozdz.org/C25.xls. 228 samples.
The standard 25 marker STR set is
used by a number of Y-DNA testing companies.
The Haplozone data include Sorenson data that is not in the E3b data, and 3 of these land in cluster C. Those samples have kit numbers starting with
“S” in the Neighborhood table below.
At 25 markers there is no valid C
type. SBP comes out greater than 100%
for any combination of markers. The SBP
formula gives a result regardless, but the result is meaningless far above 50%.
The modal haplotype C25 (all 25
markers) captures all 13 cluster C samples plus 7 others, cutoff 5.
C18(25) is the best definition I
found from the 25 marker set, but even that definition is not satisfactory,
because it does not correlate with the results at 37 and 67 markers; see the Excel analysis file. Just about any definition using 25 markers
captures the samples that fit well at 67 markers, but at 25 markers different
definitions capture different marginal samples.
I also have a file using only 12
markers for analysis: www.gwozdz.org/C12.xls. I used the full database
but truncated that analysis to the closest 33 samples to keep the file
small. The 3 Sorenson samples in the table below are not in this database.
Best STR Markers
New topic 14 July 2011:
STR
markers that mutate relatively slowly are statistical indicators for clades in
which they are recently mutated, but they are not perfect because of subsequent
independent mutations. When a clade has
a few such good STR markers those provide a signature set of STR markers. A signature is statistically expected to be
a more probable indicator of a clade than just one marker. Indeed cluster C is
characterized by the Friedman Signature. The definitions
of C type and L540 use other helpful
markers, not just the signature.
For example DYS389II is the best STR
indicator for cluster C and C type because all but three
of the cluster C and C type samples identified so far at the Haplozone site have the 32 value and very few other samples
in the STR neighborhood have the 32. The ancestral 30 value is most common in the
neighborhood. Those three exceptions
have a 31 value, which is not common in the neighborhood. We expect that subsequent mutations from 32
to 31 to 30 must occur rarely within C type, so eventually C type samples with
30 should show up, as data accumulates with time.
My analysis files
automatically rank markers using a method that I published. The exact ranking of markers varies slightly
from month to month due to the random nature of mutation values in new samples,
and due to the somewhat arbitrary cutoff that I use to restrict the database to
the neighborhood (using too many samples provides a ranking of the father clade
instead of the clade of interest). For
example a marker that ranks 6th one month might come out 4th or 7th the next
month. For example 389II always comes
out 1st for C type but ranks 5th (11 Jul 2011 analysis) for
the father haplogroup L540.
An SNP
that defines a haplogroup is very
unlikely to have happened exactly at the time of the most recent common
ancestor (TMRCA) of a haplogroup. Most likely the SNP is somewhat older,
because usually there are many generations between nodes.
By definition an SNP cannot be younger than the haplogroup. Similarly, we can consider a clade defined
by a particular STR mutation, which is likely somewhat older than the TMRCA of
that clade. However, for clusters
defined by signatures, and for types defined by definitions, one rare STR
mutation that contributes to the signature might have happened shortly before
or after the TMRCA of that cluster or type.
Very slow mutators should make the
best markers. However the slowest are
rarely mutated, so those with intermediate rank show up more often as signature
markers. My “Haplotypes & Mask”
sheet in my analysis files has the mutation rate rank (slowest is 1st) for the
67 standard markers. My publication has the Chandler reference.
Usually it is silly to speculate
about clusters defined by a single STR value.
In this case, however, we have a hypothetical haplogroup, C type, which
seems quite young, with relatively little STR
variation, so some speculation is in order:
DYS389II = 32 (389II minus 389I = 19); Best Marker for C type
Update 15 July 2011:
DYS389II=32 is one of the Friedman markers for the C cluster. It always ranks 1st in my C type analysis files. The
standard 12 marker set, used by all samples at most DNA companies, includes
389II.
[Technical detail: DYS389 is a compound marker, where 389I is
the first STR chain and (389II minus 389I) is the second STR chain. So the marker of interest here is really
delta = 19 (389II minus 389I = 19).
However, 389I mutates more slowly and has the value 13 for all C samples
and for most samples in the neighborhood.
At Ysearch or Haplozone, both 389 markers need to be used together; if one is omitted both are ignored. My analysis file allows 389-2 to be used
alone, using 389-1 only to calculate the delta for comparison; this is signaled by using a negative number
in the “mask” in the analysis file. In this
discussion topic, by “32” I really mean 19 for the delta value.]
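The delta calculation can be sketched in a few lines of Python. The values 13, 32, and 30 are the ones quoted in this topic; the 14/33 pair is a made-up example added only to show why the delta, and not the reported 389II value, is the quantity of interest.

```python
def dys389_delta(dys389i, dys389ii):
    """Length of the second STR chain: reported 389II minus 389I."""
    return dys389ii - dys389i

c_value = dys389_delta(13, 32)    # C type values quoted above
ancestral = dys389_delta(13, 30)  # ancestral 389II = 30 with 389I = 13
shifted = dys389_delta(14, 33)    # hypothetical sample with a mutation in 389I only

print(c_value, ancestral, shifted)
```

In the hypothetical 14/33 case the reported 389II differs from 32, yet the second chain is unchanged, so comparing deltas avoids counting a single 389I mutation twice.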
The two L540+ samples that do not fit
cluster C, Fredeen and Gebert,
have the ancestral value 30. Butman, the closest STR match with L540-, also has 30. On this basis, it seems likely that the
mutation(s) from 30 to 32 happened close to the TMRCA for C type, and some time after the
L540 mutation.
DYS389II is the only signature marker
that distinguishes C type from the 2 known L540+ samples that are not C
type. None of the other 110 markers in
the standard set at FTDNA
does this. The other signature markers
work well for both C type and L540. It
is possible with more data that another marker might statistically distinguish
(weak correlation due to relatively rapid mutations). On this basis, it seems likely (not certain) that the clade of
descendants of the initial 389II mutation is the same as the C type clade.
Those two non C L540 samples differ
from C type by other markers that are not signature markers. C type has only this one very good
marker. Actually, a type does not need
any very good markers if it is very young and very isolated, so that
neighborhood samples all differ at a significant number of STR markers, even if
not the same markers for each sample.
The 32 value is rare throughout V13
but shows up in E-M35 branches outside V13.
DYS389II (actually the delta value)
ranks 43rd in Chandler mutation rate. Near the middle. So exceptions are expected, due to recent mutations.
Speculation:
Model A: A mutation from 30 to 31 happened close to the TMRCA for C
type. A little later in the history of
C type another mutation happened from 31 to 32. Most C type samples with 31 represent the oldest nodes, and only a minority are back mutations
from 32 to 31. The samples in the STR
neighborhood with 31 that do not match C are independent mutations. This seems to me the simplest model, so I
favor it, but only tentatively.
Model B: There was a double mutation from 30 to 32 in one man close to the
TMRCA. Or two single mutations too
close in time to be distinguished.
Almost all 31 in C are back mutations, most of them from a single
subclade.
Model C: The 32’s that do not match C belong to the same 389II=32 clade,
but there was a population bottleneck. C is only one of two or more nodes, from
MRCA’s who survived to produce descendants with corresponding STR clusters
today. Only C is large enough to be
noticed so far, due to a population expansion for C. This model predicts at least one other small 389II=32 clade will
be discovered as L540+ branches with STR values different than C.
Model D: I can think of more complicated models.
The data is not good enough to
distinguish these models. Maybe more
data in the future will show correlation with other markers to distinguish a
model like A through D.
Model I: The initial mutation to 31 (or double to 32) is very close to the
same age as C type, so the mutation(s) defines C type.
Model II: Mutation(s) younger than C type.
Eventually samples with the 30 value will show up, isolated in
haplospace together with the C samples.
Model III: Mutation(s) older than C type.
Eventually L540+ samples with 31 or 32 will show up that are too old to
fit C type.
Models I vs II vs III cannot be
distinguished from back mutations and outliers
until a new SNP is discovered to distinguish them.
DYS594 = 12; Best Marker for L540
Update 15 July 2011:
In my analysis,
DYS594=12 is the best marker for L540, and is also a good marker for C
type. 594 is not in the 37 marker
set. 594 helps a lot in defining C type
and L540 using the 67 marker set, but does not distinguish C from the rest of
L540.
The 11 L540 samples with 67 markers,
including 2 that are not C type, all have the 594=12 value. Butman, the closest
STR match with L540-, has the ancestral 11.
Two samples in the neighborhood have
594=12 but are L540-. These are not a
random sample; I recruited them based
on STR matches closest to, but beyond the 10 closest matches to, C type at 67
markers. Other 12 values have not been
tested for L540.
The 594=12 value is more common in
the L540 neighborhood than in the rest of the V13 data.
DYS594 ranks 12th in Chandler mutation rate.
Quite slow, so independent recent mutations should be rare.
Speculation:
Model A: The 11 to 12 mutation in DYS594 is significantly older than L540.
Model A1: Quite a few branches, both younger and older than the 11 to 12
mutation, survived the population
bottlenecks. Only C had a
significant subsequent population expansion, so only C stands out today. The other 11 vs 12 branches will not be
distinguished by STR values because they are too small and too old to be
isolated in STR haplospace.
Model A2: The 11 vs 12 branches in the neighborhood will be distinguished
by STR values when they are all evaluated for the L540 SNP, and when enough STR
data is available to identify the signatures.
Model B: The 11 to 12 mutation is not much older than L540. There is only one other significant
independent 12 mutation in the neighborhood outside L540. By luck.
That clade is the reason there are more 12s in the neighborhood.
Future data will probably eliminate
one or more of these models, and perhaps suggest other models.
Model C: The 11 to 12 is younger than L540. An old 11 branch with L540+ will be found as data
accumulates. If this happens, STR data
will not likely have enough correlation to distinguish if such a branch is
really due to a back mutation. A new
SNP would probably need to be discovered.
Other Good Markers: DYS390=25, DYS444=13, DYS406=11; DYS456=15, CDYb=33, DYS447=25
Update 15 July 2011:
These typically rank among the best
in my analysis files, usually in about the order listed in
the title here. Good for both C type
and L540 prediction. The exact ranking
is very sensitive to the choice of database.
In the close neighborhood of L540, 390=25 and 406=11 do very well.
Using the entire E-M35 database, 444=13 does better; by luck 444 does not have any major clades
with the 11 value; there is a cluster
in E1b1b1c1a (M84) that has samples with the L540 signature (389II minus
389I)=19 and 594=12, but that interfering cluster in M84 has 444=11, two steps
away from L540 at that 444 marker. This
is an example of why the database should be restricted to reasonably close STR
samples for analysis.
Chandler rank
for these, in the same order: 47th,
49th, 35th, 60th, 67th, 45th.
390 is tied with 594 for 1st place in
my current L540 analysis file, but I suppose it will end up in 2nd place because
it has a higher published mutation rate.
Models for 390 are similar to the
models for 594.
None of the other markers in the
title here are as good as 389II for C type or as good as 594 or 390 for
L540. Each clearly has confounding
mutation in the data. For example, 444
has one L540+ sample with the ancestral 12 indicating that L540 is older, but
there is also one L540- sample with the signature 13 indicating L540 is
younger; one of those must be an
independent mutation (or an error).
More data will help this get sorted out.
Many complicated models can be
constructed combining 2 or more good markers.
Complication comes from figuring out the age order of the markers. More data might point to a compelling model.
Update 15 July 2011:
An excellent signature for C type is (389I, 389II,
594, 444) = (13, 32, 12, 13). Seven of
9 C type samples with 67 markers have this signature, and the two that miss are
at step 1; no other samples in the neighborhood
have step 1; Gebert
is the only one at step 2. In the vast
E-M35 Haplozone database there is only one confounding
sample at step 1, but that one is from E1b1b1c1a; all others differ from this signature by 2 or more steps. In other words, this 3 marker signature,
cutoff 2, extracts all the C type samples and none others from V13 data. Eventually, of course, exceptions will turn
up.
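A signature search of this kind can be sketched as follows. The signature values and the cutoff of 2 are taken from the paragraph above; the four sample haplotypes are hypothetical and do not correspond to any kit in the Neighborhood table. The sketch assumes the step is counted as the sum of absolute differences on the reported marker values.

```python
# Signature from the text: (389I, 389II, 594, 444) = (13, 32, 12, 13)
SIGNATURE = {"DYS389I": 13, "DYS389II": 32, "DYS594": 12, "DYS444": 13}
CUTOFF = 2  # a sample is extracted when its step is less than the cutoff

def signature_step(sample):
    """Sum of absolute differences from the signature (assumed step metric)."""
    return sum(abs(sample[marker] - value) for marker, value in SIGNATURE.items())

# Hypothetical samples for illustration only.
samples = {
    "exact":   {"DYS389I": 13, "DYS389II": 32, "DYS594": 12, "DYS444": 13},
    "one_off": {"DYS389I": 13, "DYS389II": 31, "DYS594": 12, "DYS444": 13},
    "two_off": {"DYS389I": 13, "DYS389II": 30, "DYS594": 12, "DYS444": 13},
    "far":     {"DYS389I": 13, "DYS389II": 30, "DYS594": 11, "DYS444": 12},
}

matches = [name for name, s in samples.items() if signature_step(s) < CUTOFF]
print(matches)
```

With these made-up values, only the samples at step 0 and step 1 are extracted; the sample at step 2 falls just outside the cutoff.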
There are better markers than 389I. I included that one because it enables C4 in
the search function at the Haplozone site.
Update 15 July 2011:
In my 25 marker analysis (updated 12
Feb 2011), the markers ranked 1, 2, and 3 are 389II, 390, and 447. These are the same as the 3 markers that Friedman had been calling the “characteristic marker
values” for cluster C at the Haplozone
site before I started working on this.
The signature is (390, 389-2, 447) =
(25, 32, 25).
This original Friedman signature
works surprisingly well by itself for samples with only 25 of the standard markers,
but not with high confidence.
67 marker data; correlation to 25 marker signature: One sample that matches perfectly those 3 of
25 does not fit C type or L540 at 67 and has indeed come up L540-; 3 samples (actually only 1 independent
sample - me - the other two I recruited) that miss by 1 step at 3 of 25 do fit
at 67 and are L540+; 2 samples that
miss by 1 step at 3 of 25 do not fit at 67 and are indeed L540-. Summary:
67 marker prediction is working perfectly so far, but 25 marker
predictions on 67 marker data miss by a few.
In early 2011 Friedman added 594=12
to the “characteristic marker values”, for 67 marker samples.
DYS447 typically ranks about 8th or
9th out of 67 for C type or L540.
Friedman uses a more complicated
analysis than just this simple signature in her C type assignments. I do not know what her method is exactly,
but most definitions (not all) that I tried selecting well ranked markers
extracted the same samples that she did.
My table has a few others that look good
with 25 markers, but there is not high statistical confidence for extraction
using only 25.
CDYb = 33; Another Good Marker
Update 16 July 2011:
The marker CDYb is very unusual in
L540; 10 of the 11 samples have the
value 33. See my 67 marker analysis file. It ranks
tied for 6th. Just beyond L540 in STR
step from the definition, less than half the samples have the 33 value, and
some of them might turn out to be L540+ because most are not tested.
The CDY pair is the most rapid mutator of the 67. I have never seen a cluster or type that is so uniform for one of
the CDY markers.
It is possible but very unlikely this
is a coincidence. If it is just a lucky
coincidence, then as data accumulates over the months lots of C type samples
should show up with values other than CDYb = 33. I originated this CDYb topic in April 2010, and the accumulated
data since then has strengthened the evidence that CDYb=33 dominates what is
now L540.
Of the 5 tested among the 31 samples
at steps 7 to 10 beyond L540: 2 are 33
and 3 are 34.
My hypothesis: There is a mutation within the CDYb
chain. Either a point mutation, or a
foreign insertion, or a deletion (a deletion that is not a simple STR chain
deletion of a motif, but a removal of only part of an STR motif). It is known that a mutation within an STR
that spoils the motif effectively splits that STR into two small STRs. Smaller STRs have lower mutation rates. A spoiler mutation near the middle of the
main CDYb chain would turn that marker into a slower STR mutator.
(By the way, 447 is known to have two
such defects, so 447, which looks like a long STR, is really 3 short STRs, so
447 is not very rapid, and in fact 447 is a good marker, discussed above.)
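The idea of a spoiler mutation can be illustrated with a toy Python sketch. The motif "GATA", the repeat counts, and the changed base are all hypothetical; this is not the actual CDYb sequence, which has not been read. The sketch only shows how a single point mutation inside a long chain leaves two shorter uninterrupted chains.

```python
def repeat_runs(sequence, motif):
    """Lengths of uninterrupted motif runs, scanning the sequence in motif-sized frames."""
    runs, current = [], 0
    for i in range(0, len(sequence), len(motif)):
        if sequence[i:i + len(motif)] == motif:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

motif = "GATA"                                # hypothetical motif
intact = motif * 12                           # one long chain of 12 repeats
spoiled = motif * 5 + "GACA" + motif * 6      # point mutation in the 6th repeat

print(repeat_runs(intact, motif))   # [12]
print(repeat_runs(spoiled, motif))  # [5, 6]
```

Under the hypothesis in this topic, each of the two shorter chains would mutate more slowly than the original long chain.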
CDY, also called DYS724, is a
compound marker, so that means one of the pair often copies onto the other,
providing equal values, CDYa = CDYb.
This is called recLOH; my publication
has references if you want to read more about recLOH. Or check Wiki,
where there is an explanation that 459, 464, and CDY are all on the same
“palindrome” P1, where P1 has two arms that are mirror copies. Sometimes all three of these markers get the
values copied from one arm of P1 onto the other arm. A mutation at CDYb, making it unlike CDYa, would make copy
mutations less common.
Such seems to be the case, providing
more evidence for a CDYb spoiler mutation.
There are no recLOH mutations in CDY or in 464 in the L540 data. Of course there is not enough data yet to be
compelling. The 459 marker has both
values = 9, so an recLOH would not be noticed there.
It is not clear if the ancestral CDYb
value is 33, or 34. Both are common in
the neighborhood.
Model A: More than a millennium ago, for a man in the L540 male line, not
very long before or after the L540 mutation, a mutation destroyed the middle of
the CDYb STR chain, turning CDYb into a unique marker, distinct from CDYa, and
a much slower STR mutator than CDYa, and less likely to combine with CDYa in an
recLOH event. This rare mutation
happened in a man who ended up with the equivalent net value of 33 at CDYb after
the mutation. A descendant of this man
would be the most recent common ancestor (MRCA)
for the clade corresponding to this mutation.
There were some normal STR mutations at CDYb in the descendants of that
MRCA, and some of those mutated CDYb men were lucky enough to have male
descendants living today, and one of them shows up in our C type data today
with the 34 value.
Although more data will add evidence
to this model, I doubt the data will be good enough to determine if 33 or 34
was ancestral. If 33, samples from old
nodes might show up, which I doubt will be distinguished older or younger than
that spoiler mutation, based on STR analysis.
If 34 is ancestral, I doubt the data will distinguish older branches
from more recent back mutations.
More models can be constructed along
the lines of the models discussed in previous topics above.
In a previous version of this web
document I speculated that this CDYb mutation might define a new
haplogroup. I asked Thomas Krahn at
FTDNA about sequencing my CDYb to prove there is an SNP in there. Krahn explained that the P1 palindrome is
very difficult to sequence with standard methods because the data is a mixture
from the two arms of P1. Krahn also
pointed out that even if the SNP were proven it would not be accepted as a
haplogroup division because an recLOH can still happen, and if CDYa gets copied
onto CDYb that wipes out the SNP in the clade descending from that recLOH.
My 2010 versions of this topic pointed
out that the CDYb mutation is definitely older than C type, because the 33
value predominated in the very near neighborhood of C type. That prediction has been validated. The two L540+ from outside C type both have
the 33 value. With my new L540
definition, the 33 is not dominant beyond the L540 data, but of course a few of
those that are there might end up in L540 when they are tested.
DYS636 = 12; A New, Excellent Signature Marker for L540
New topic 3 July 2011:
DYS636 is not in the standard 67 STR marker set,
but has been available this year as part of the extended 111 STR marker
set. I have been encouraging men in the
L540 neighborhood to purchase the extension to 111, helping out with the cost
where necessary. I was hoping that
there would be a slowly mutating marker among those extra 44 with a mutation
unique to the L540 haplogroup. Sure
enough, DYS636 provides such a marker.
Of the 64 samples (2 Jul 2011) in the
E-M35 database with all 111 markers, 9 are L540, and all of them have DYS636 =
12. The others are DYS636=11 with only
two exceptions at DYS636=12, but those two both have many STR mutation
differences from L540 and V13, and are not predicted to belong to the V13 parent,
so those are obviously independent rare mutations.
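The counts in the paragraph above can be tallied in a few lines. This is only an arithmetic sketch of the numbers quoted there; the variable names are mine, and it assumes "the others" means the 64 samples minus the 9 L540 samples.

```python
# Counts quoted in the paragraph above (E-M35 database, 2 Jul 2011).
total_111 = 64          # samples with all 111 markers
l540_with_12 = 9        # L540 samples, all with DYS636 = 12
others = total_111 - l540_with_12          # samples outside L540 (assumed reading)
others_with_12 = 2      # the two exceptions, not predicted V13
others_with_11 = others - others_with_12

# Share of DYS636 = 12 carriers that are L540, before any STR screening.
share_12_in_l540 = l540_with_12 / (l540_with_12 + others_with_12)

print(others, others_with_11)          # 55 53
print(round(share_12_in_l540, 3))      # 0.818
```

Once the two exceptions are set aside on the basis of their STR differences, every remaining DYS636 = 12 sample in this data is L540.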
The two L540 samples that are
not in C type have all 111 markers now, and indeed carry the DYS636=12 value
(Gebert & Fredeen).
The one nearest neighbor to C type
that came out L540- has all 111 markers now, and indeed carries the ancestral
DYS636=11 value (Butman).
It is not a coincidence that the
critical samples (all the samples from the 67 marker data near the L540 cutoff)
have all 111 markers so soon - I recruited the data and paid for it as needed.
13 of the 111 marker samples are
confirmed V13 (in addition to the 9 from L540), plus a few more are predicted
V13. (2 Jul data - will increase
quickly because there are several more with panels beyond 67 obviously in
process for 111).
Actually, my main motive for
encouraging 111 markers was to better subdivide L540. No luck yet. There is no
slowly mutating marker among those new 44 that obviously mutated during the history
of L540, like 389II for C type. Recall
that DYS389II=32 is the best marker for C type,
distinguishing C from the parent L540.
None of the new 44 does this.
With more data I might find a reasonable way to further subdivide C
type; there are several hints in the
data on how to do this with combined markers, but none of them are compelling.
Recall that DYS594=12
is also unique to L540, also with ancestral 11. This pair of STRs provides a firm foundation from which to notice
any new clusters in the L540 neighborhood as data accumulates.
P Cluster: (385a,439,447,464c,445) = (16,13,26,17,11)
Update 30 Sep 2011:
C type includes “P cluster”, a Polish
cluster defined by that 5 marker signature. The cluster is not convincing, because there
are only 3 samples, me and two others that I recruited. I call it a cluster because I reserve the word type for clusters with statistical
significance.
DYS445 is one of the markers recently
available in the extension to 111 markers. The other 4 markers in the signature are
part of the standard 25 marker set. As
luck would have it, there are no other markers that correlate with these in the
67 marker set, so only 25 markers are required to match a sample to P cluster
within the L540 haplogroup, using 4 of those markers.
Each of these markers individually is
variable. There are quite a few
E1b1b1a1b(V13) samples that match all 4 of those out of 25. There are no matches in the STR neighborhood
of L540 and C type, although eventually a match should show up just due to the
luck of random mutations. Within L540
and C type, only one sample so far matches the signature at two of these
markers, and 4 other samples match at only one.
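Counting signature matches, as done in the paragraph above, can be sketched as follows. The signature values come from the P cluster heading; the "DYS" prefixes on the marker names, the function name, and the sample values are my own assumptions for illustration and do not describe any real kit.

```python
# P cluster signature from the heading above: marker -> value.
P_SIGNATURE = {"DYS385a": 16, "DYS439": 13, "DYS447": 26, "DYS464c": 17, "DYS445": 11}

def signature_matches(haplotype, signature=P_SIGNATURE):
    """Return the signature markers whose value the haplotype shares.

    Markers missing from the haplotype (not tested) never count as a match.
    """
    return [m for m, v in signature.items() if haplotype.get(m) == v]

# Hypothetical 25 marker sample (values invented); DYS445 is untested here.
sample = {"DYS385a": 16, "DYS439": 12, "DYS447": 26, "DYS464c": 17}

matched = signature_matches(sample)
print(len(matched), matched)   # 3 ['DYS385a', 'DYS447', 'DYS464c']
```

A sample with only the standard 25 markers can match at most 4 of the 5 signature markers, because DYS445 is part of the extension to 111.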
DYS385a=16 is ancestral to L540, so
most samples in the neighborhood outside L540 match this signature value. This marker apparently mutated to 17 near
the time of the origin of L540, then mutated back to 16 near the time of origin of
C type. Fredeen has the 16 value; Gebert has the 17 value; this is an inconclusive hint that Fredeen
belongs to an older branch. I say
“inconclusive” because Fredeen also has the ancestral value at DYS444, but
Gebert has the ancestral value at DYS406.
Both have numerous mutations, so they are probably members of two
separate old branches of L540.
The
signature can only be used for L540+ samples, where the signature has no other
matches in the data to date. Fredeen
matches at one of the two; more such
single matches are expected eventually with enough data, due to the luck of
recent mutations.
Because both values of the signature
are ancestral, it is reasonable to wonder if cluster P corresponds to the
oldest clade in the C type data.
However, Gebert has the mutated L540 values
(17,25) for this pair; Gebert is L540+
but does not match C type, so he seems to belong to a branch with node older
than C. It is not convincing to
speculate that these two markers both mutated in C type close to the node for
cluster P, and that Gebert by luck mutated to the same C type values in an
older branch. It is more convincing to
assume that cluster P is not necessarily the oldest clade, and that these are
the only 2 out of 111 STR markers where this young clade has unique markers,
and by luck those 2 markers are both back mutations to the values ancestral to
L540. To add confusion, Fredeen, from another L540 branch older than C, has (16,25)
for this pair, matching cluster P on the first but ancestral on the
second; it is reasonable to assume
Fredeen has one independent mutation matching cluster P, because neither marker
is slow; both have variation of value
in the neighborhood.
Another unconvincing speculative
model: cluster P might be older than
all the L540 data, with an independent 389II mutation of 2 steps matching C by
coincidence. This is unreasonable
because cluster P matches C type at many markers; the step count matches C even with 2 steps added. Samples with older nodes than C each
mismatch C at different markers, for higher step counts.
Most C type men have German
ancestry. It makes sense that a C man
moved to Poland and founded cluster P a few centuries ago. Of course, future Polish samples need not
match cluster P because another C type man may have also moved to Poland (or
multiple male line ancestors may have each moved a short distance, diffusing
into Poland).
Partial updates during March, most
recent 24 Mar 2013, not finished
Previous full update 19 May 2012
L540 is small
enough that I can insert a complete table here, including neighbors just beyond
in STR values
My modal haplotype for L540 uses 45 of
the 67 standard STR markers
My modal haplotype for C
type uses 46 of the 67 standard
STR markers
Also shown are columns for full
modals
The columns show step using each modal
+ vs --- means confirmed positive vs negative by the L540 test
L241
and L99 mean positive for those haplogroup
markers, implying negative for L540
There are more negative
L540 results from outside this neighborhood (higher step)
Red step
numbers are predicted C type and violet are
predicted L540 not C type, on the basis of STR values using my Mountain Method
Boldface C are members of E-M35; C without
boldface are from other projects, or from Ysearch
|  |  |  |  |  |  | Modal> | L45(67) | C46(67) | C67 | 33(37) | 37 | 23(25) | 25 | 12 |
|  |  |  |  |  |  | Cutoff > | 6 | 5 | 12 | 8 | 9 | 4 | 5 | 3 |
| Kit | Ysearch | L540 | Name | Origin | Type | Markers |  |  |  |  |  |  |  |  |
| N16800 | KFKGM | +(WTY) | Gwozdz | Poland | C | 111 | 1 | 0 | 7 | 3 | 6 | 3 | 4 | 2 |
| 155155 |  | + | Svercl | Czech | C | 111 | 1 | 1 | 8 | 3 | 4 | 1 | 2 | 2 |
| N81304 |  |  | Gwozdz | Poland | C | 111 | 1 | 1 | 9 | 4 | 7 | 3 | 5 | 3 |
| N45041 | UQR4B | + | Hochreiter | Germany | C | 111 | 2 | 0 | 4 | 1 | 5 | 1 | 3 | 1 |
| 199446 | TK98K | + | Kargol | Poland | C | 111 | 2 | 1 | 6 | 3 | 5 | 3 | 4 | 2 |
| 140927 | 9JM9U | + | Donovan | Prussia | C | 111 | 2 | 3 | 6 | 3 | 2 | 1 | 1 | 1 |
| 51282 |  | + | Wion | Germany | C | 111 | 3 | 3 | 10 | 5 | 8 | 3 | 4 | 1 |
| 225596 | 6S4J6 | + | Nowak | Poland | C | 111 | 4 | 4 | 9 | 5 | 5 | 0 | 0 | 0 |
| 171456 | 79QF7 |  | Glasser | Germany | C | 67 | 1 | 0 | 3 | 2 | 2 | 0 | 1 | 0 |
| 262750 |  | ordered | Svercel | Slovakia | C | 67 | 2 | 1 | 8 | 2 | 3 | 0 | 1 | 1 |
| 174240 |  |  |  | Unknown | C | 67 | 2 | 2 | 3 |  |  |  |  |  |
| 229581 |  |  | Zinin | Unknown | C | 67 | 2 | 2 | 4 |  |  |  |  |  |
|  | 5XP46 |  | Stelz |  | C | 67 | 2 | 2 |  |  |  |  |  |  |
| 243901 |  |  | Stubblefield | Unknown | C | 67 | 2 | 3 | 10 |  |  |  |  |  |
| E10751 |  |  | Schulz | Germany | C | 67 | 3 | 2 | 7 |  |  |  |  |  |
| 6104 | 4HJ3D |  | Boyd | Unknown | C | 67 | 3 | 2 | 10 | 3 | 8 | 0 | 1 | 0 |
| 207878 |  |  | Frind | Germany | C | 67 | 3 | 3 | 10 | 6 | 7 | 4 | 4 | 2 |
| 70482 | 6HMRD | + | Ostholm | Sweden | C | 67 | 4 | 4 | 10 | 5 | 6 | 2 | 2 | 1 |
| 226416 |  | + | Sabieka | Belarus | C | 67 | 5 | 4 | 11 |  |  |  |  |  |
| 158091 | QHU8Y | + | Kline | Germany | C | 37 |  |  |  | 4 | 4 | 1 | 2 | 2 |
| 141863 | W5JHS |  | Pohl | Germany | C | 37 |  |  |  | 7 | 7 | 3 | 3 | 3 |
| N109412 |  |  | Howe | Unknown | C | 37 |  |  |  |  |  |  |  |  |
| S10193 |  |  | Engel | Germany | C | 34 |  |  |  |  |  | 1 | 1 | 1 |
| S10194 |  |  | Kochtitizky | Hungary | C | 34 |  |  |  |  |  | 3 | 3 | 1 |
| S10195 |  |  | Jordan | Germany | C | 34 |  |  |  |  |  | 3 | 4 | 1 |
| A10196451 |  |  | Stavbon | Sweden | C | 34 |  |  |  |  |  |  |  |  |
| N26163 | R38X2 |  | Fritsch | Czech | C | 12 |  |  |  |  |  |  |  | 0 |
| N39377 |  |  | Obendorf | Germany | C | 12 |  |  |  |  |  |  |  | 0 |
| N57225 | XKCE3 |  | Livingston | Germany | C | 12 |  |  |  |  |  |  |  | 0 |
| 42790 | RAQXW |  | Brenneman |  | C | 37 |  |  |  | 6 | 12 | 3 | 6 | 2 |
| B0003 |  |  | Krak | Czech | C | 25 |  |  |  |  |  | 1 | 3 | 3 |
| N74113 |  |  | Kiewat |  | C | 25 |  |  |  |  |  | 2 | 2 | 2 |
| A2983 |  |  | Undisclosed | Austria | C | 33 |  |  |  |  |  | 2 | 4 | 1 |
| S10290 |  |  | Layer | Germany | C | 34 |  |  |  |  |  | 3 | 4 | 4 |
| S10231 |  |  | Karozewski | Hungary | C | 34 |  |  |  |  |  | 3 | 5 | 4 |
|  | A9FVE |  | Weiand | Germany | C | 37 |  |  |  | 4 | 7 | 3 | 4 |  |
|  | 5XP46 |  | Ludwig | Germany | C | 37 |  |  |  | 5 | 5 | 2 | 2 | 0 |
|  | 4Q933 |  | Kephart | PA | C | 37 |  |  |  | 5 | 5 |  |  |  |
|  | 2N3UM |  | Oppitz | Germany | C | 37 |  |  |  | 6 | 6 |  |  |  |
|  | 3K4Y2 |  | Lintner | Germany | C | 37 |  |  |  | 6 | 7 | 2 | 4 |  |
|  | 7R3RV |  | Mowers | Canada | C | 25 |  |  |  |  |  | 2 | 3 |  |
|  | K48RR |  | Mowers | Canada | C | 25 |  |  |  |  |  | 2 | 3 |  |
|  | CQUMV |  | Muise | France | C | 25 |  |  |  |  |  | 3 | 4 |  |
|  | K8VAB |  | d’Entremont | France | C | 25 |  |  |  |  |  | 3 | 4 |  |
|  | Ysearch |  | 6 more |  | C |  |  |  |  |  |  |  |  | 0 |
| 166692 | 8FTXT | + | Gebert | Germany | L | 111 | 5 | 7 | 10 | 5 | 5 | 3 | 4 | 3 |
| 162917 |  | + | Fredeen | Sweden | L | 111 | 5 | 8 | 18 | 9 | 12 | 4 | 6 | 4 |
|  | WHFQB |  | Froetscher | Germany |  | 67 | 6 | 5 | 14 | 9 | 11 | 2 | 4 |  |
| N91348 |  | --- | Butman | England |  | 111 | 8 | 8 | 11 | 4 | 7 | 2 | 2 | 2 |
| E8272 |  | L241 | Abdurrah | Kosovo |  | 67 | 9 | 7 | 15 |  |  |  |  |  |
| 97005 | CBF87 |  | Strejc | Unknown |  | 67 | 9 | 8 | 16 |  |  |  |  |  |
| N81087 |  |  | Folia | Romania |  | 67 | 10 | 10 | 18 |  |  |  |  |  |
| 256345 | UW9R2 | --- | Hunt | Ireland |  | 67 | 10 | 11 | 15 |  |  |  |  |  |
| 143479 |  |  | Mastel | United K |  | 67 | 10 | 9 | 17 |  |  |  |  |  |
| 199300 | EJ4B6 | --- | McKrell | Unknown |  | 67 | 10 | 9 | 18 |  |  |  |  |  |
| 44601 |  |  | Ogles | Scotland |  | 67 | 10 | 10 | 10 |  |  |  |  |  |
| 31263 |  |  | Steubinger | Germany |  | 67 | 10 | 12 | 19 |  |  |  |  |  |
|  | YHTGH |  | Hooper | USA |  | 67 | 10 | 9 |  |  |  |  |  |  |
|  | 2A786 |  | Frye | Unknown |  | 67 | 10 | 10 |  |  |  |  |  |  |
| 98212 |  | L241 | Baber | England |  | 111 | 11 | 11 | 22 |  |  |  |  |  |
| 39430 | BAW43 | L241 | Johnson | United K |  | 67 | 11 | 9 | 11 | 7 | 10 | 5 | 6 | 3 |
| 56091 |  | L99 | Hardy | United K |  | 67 | 11 | 13 | 19 |  |  |  |  |  |
| Database |  |  | 11 More |  |  | >66 | 11 | 7 to 11 |  |  |  |  |  |  |
| E-M35 |  | --- | 7 Samples |  |  | >66 | 12 | 7 - 10 |  |  |  |  |  |  |
| E-M35 |  | L241 | 3 Samples |  |  | >66 | 12 | 10 - 11 |  |  |  |  |  |  |
| Database |  |  | 20 More |  |  | >66 | 12 | 8 - 14 |  |  |  |  |  |  |
|  |  |  | V22 Modal |  |  | 67 |  | 23 | 33 |  |  |  |  |  |
|  |  |  | V13 Modal |  |  | 67 | 13 | 11 | 16 |  |  |  |  |  |
|  |  |  | V78 Modal |  |  | 67 |  | 12 | 18 |  |  |  |  |  |
| N13674 |  |  | Straub | Germany |  | 37 |  |  |  | 9 | 12 | 5 | 7 | 3 |
| 144860 |  |  | Giles | Unknown |  | 37 |  |  |  | 9 | 13 | 6 | 9 | 5 |
Comment 23 Mar 2013: this discussion needs update for my new definitions.
Update 9 Mar 2013: Two samples, E10751 and 174240, are not in
the E-M35 Project;
these two were brought to my attention by Paul Svercl (in the table),
who noticed them on the Heinila site.
The following discussion was updated
21 Jan 2013:
The 24 cluster C
samples are at the top of the table, with “C”.
There are only two confirmed L540
outside C type, with “L”.
Those numbers are step, which is the same as what Ysearch calls genetic distance - total mutations compared
to the modal haplotype.
Those headers are modal haplotypes, for example C49(67),
the definition of C type, uses 49 of the standard 67 markers, with cutoff 7,
meaning all samples at step less than 7 are likely C type.
Similarly, L45(67) cutoff 6 is the
definition of L540 including samples outside C type.
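The step and cutoff idea can be sketched in a few lines. This is a simplified illustration, not my Excel method and not the exact Ysearch calculation: it assumes step is the plain sum of absolute repeat differences over the markers used by the modal, which may treat some markers differently than real calculators do. The function names, the toy modal, the toy cutoff, and the sample values are all hypothetical.

```python
def step(sample, modal):
    """Total repeat-count difference between a sample and a modal haplotype.

    Only markers present in both are compared, so a partial modal
    (for example 45 of the 67 standard markers) works unchanged.
    """
    return sum(abs(sample[m] - v) for m, v in modal.items() if m in sample)

def predicted_member(sample, modal, cutoff):
    """True when the sample's step is less than the cutoff."""
    return step(sample, modal) < cutoff

# Toy modal: a few values the text gives for L540 and C type,
# not the real 45 or 49 marker definition.
toy_modal = {"DYS389II": 32, "DYS594": 12, "DYS385a": 17, "DYS406": 11, "DYS444": 13}

# Hypothetical sample, values invented for illustration.
sample = {"DYS389II": 30, "DYS594": 11, "DYS385a": 16, "DYS406": 11, "DYS444": 13}

print(step(sample, toy_modal))                 # 4
print(predicted_member(sample, toy_modal, 3))  # False
```

With a real definition, the modal would hold all of its chosen markers and the cutoff would be the one given for that column in the table.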
C type is a hypothetical subdivision
of L540, so samples predicted C type are of course also predicted L540+.
Of the 1204 samples at 67 markers in
the E-M35 Project (on 18 Mar 2013) these are all the samples
that match L45(67) at less than step 11.
None of the E-M35 samples have step 6 or 7. Steps 6 and 7 are the gap. This demonstrates how well isolated L type
is in STR matching. There are only 3
samples at steps 8 and 9, and 2 of them are confirmed negative for L540.
My recommendations for
L540 testing: Any future samples in the gap should be tested for the L540 SNP. Just below the gap, samples at steps 4 and 5
have roughly 85% and 80% probability of being L540 members (my estimate), so those should also be SNP
tested for assurance. Samples below
step 3 are more than 90% likely to be L540+, so testing is useful but less
necessary.
I have low confidence in prediction
above step 7 in that L45(67) column, because there is little data. It is misleading to observe that there are
no L540+ above step 5, because I have adjusted the definition for new SNP
results. My very rough guess for L540+
result probabilities: 40% for each new
untested sample at step 7, 20% at step 8, 10% at step 9, 5% at step 10.
That L99 sample is “private”, with
only one known member, at step 11 in the table. Many L540+ samples have tested L99-, but that sample has not been
tested for L540, so there is a remote (perhaps <1%) probability that L99 is
a branch of L540.
There is no SNP test yet for C
type. C type has evidence of a
subtype: the gap at step 3 in C27(67).
Prediction has lower confidence at 37
or fewer markers. Those columns for
fewer markers are my old step numbers using my older definition for C
type. I plan to update this table soon,
using a new C type definition.
I use the Ysearch
method for calculating step, which gives a result slightly different than the Haplozone method.
There is also a one marker discrepancy mentioned in the Ysearch topic.
The column 33(37) uses the 33 markers
from that 61 marker modal that fall among the standard 37 markers. Similarly for 23(25). For comparison, there are columns using full
standard marker sets.
The column 12 is for samples with
only 12 markers, where prediction is highly unreliable.
The topics above on my Excel analysis files give more details about the definitions L45(67),
C49(67), and others, and also more detail
about L540 and C type.
If you are not in the table, see the Ysearch topic below for a way to evaluate your step
separation from C type and L540.
If you are a neighbor and wish to be added
to this table, please let me know.
My sample
is kit N16800. N81304 is my 3rd cousin
Gwozdz.
Aloysius Kargul (Kargol) is my closest STR match available on the web. Kit 199446.
In May 2010, his daughter noticed, on ancestry.com, that he and I are
perfect matches at 12 markers. I
studied the LDS
microfilms and located his 1820s Kargul ancestor living in a village in
Poland only 20 miles away from the village of my Gwozdz ancestor. I paid for his FTDNA sample. Kargul is in the table
above. His L540 test came out
positive, placing him in that new haplogroup.
We are 5 steps apart at 67 markers;
9 at 111 (4 Jul 2011 update).
For estimating the size of C type or
L540, my cousin and Kargul should not be included, because I recruited them,
paying for their tests. Family sets
such as these distort size estimates, compared to random data. In other words, C type really has only 16
samples, not 18 (on 4 July 2011) if compared to other clusters (which should
also be adjusted for family sets).
New topic 13 May 2011. Update 17 Jul 2011: Butman’s L540 SNP test just came out
negative; that means he is not a member
of the new L540 haplogroup.
Raymond Butman, kit N91348, is right
on the edge of the predicted C type using the old 61 marker definition. This sample is a recent addition to the M35 database. His step using the 61 marker definition
is 8. My cutoff for the definition
before this sample showed up was 7 because of the gap - no samples from step 7 to 12 at that
time. When I wrote this topic after
Butman’s sample showed up, I changed my cutoff to 9 and I pointed out here that
this sample might land in V13C (L540), but it was a close call, not a confident
prediction.
Although this sample matches most of
the markers of the definition, it misses at the two best signature markers, DYS389-2 and DYS594.
I subsequently developed a new L540 definition that excludes Butman.
What does this mean? The simplest explanation: Butman’s node in the Y-DNA tree (his male
line common ancestor - branch joint) seems slightly older than the nodes of
L540 members (in the database so far today).
His node is older than the L540 mutation, and also seems older than the
DYS389-2 mutation, and also older than the DYS594 mutation. Most of his other STR values match the L540
definition because his node is not much older, so there has not been much time
for more mutations.
This simplest explanation is a good
statistical prediction, not a proof.
Other less likely explanations are possible. For example Butman might be an outlier
from a clade very distant from L540, where most men have different STR values,
where he might have many matching STR values to L540 due to the luck of random
mutations in his male line. Yet another
possible explanation: Butman might
belong to a very small clade whose node with L540 is much older, even older than
the node for the men at steps 7 and 8, but the ancestor at that node might just
have happened to have STR values very close to the values for the L540 ancestor, due
to the luck of random mutations, and now Butman is the only sample available
from that very small clade.
Update 30 Sep 2011:
I noticed Gebert’s sample on Ysearch
and encouraged him to join the E-M35 project, which he did, kit 166692 in the table below.
I helped pay for the orders for the L540 test and for the 111 extension.
This data is important because his STR values place him near the
predicted cutoff of C type.
Update 30 Sep 2011:
Kit 162917, Fredeen, came up L540+ in
May 2011. This was significant because
this was the first L540+ sample outside C type, later joined by Gebert.
Logically, Fredeen and Gebert might
be outliers from cluster C, with back
mutations in 389II and multiple other mutations just due to the luck of random
mutations. This is very unlikely.
These two samples most likely
represent two old branches of L540, with older nodes
than the C type node. They do not match
each other well at 67 markers. Their
closest matches at 67 markers are each other and C type samples. There are no close STR matches to either in
the 67 marker data. Any future close
STR matches would of course be predicted L540+ in the corresponding branch.
Three markers are of interest
here: (385a,406,444) = (17,11,13) are
the values for L540 and C type. Most
samples in the STR neighborhood just outside L540 have the ancestral values
(16,12,12). Fredeen is ancestral for
two of these, (16,11,12), while Gebert is ancestral for one of them
(17,12,13). This is evidence that
Fredeen’s node is older, but the evidence is statistically unconvincing. We realize that these 3 markers may have
experienced mutations after the two nodes of interest. Both Fredeen and Gebert have 111 marker data, which does not help out for this
question.
Hohnloser (kit N39989) fell into C
type at 37 markers in 2010 (marginally), but not at 67 markers (not particularly close). He is not a member of cluster C because
his sample does not match the Friedman signature
(originally at 25 markers). He provides
an interesting example of how statistics works - in this case, due to the luck
of random STR mutations, the sample is close to C type only at 37 markers.
Hohnloser does not belong to the L540
haplogroup because his SNP test came out negative.
Hohnloser has extensive family tree
research results. He administers a
Hohnloser project at FTDNA. He
exchanged helpful email discussions with me.
Structure of the
L540 Haplogroup
Complete rewrite 16 Jul 2011.
C type is a
hypothetical haplogroup within L540. The evidence is presented throughout this
web document, particularly in topics about my 67 marker analysis
files and about DYS389II. My estimate is about 99% confidence that my C type
definition corresponds to a clade that will be proven to be a haplogroup by
a newly discovered SNP - someday when tests for new SNPs are more comprehensive
and lower cost - continuation of the current trend. My estimate is that 98% (80% confidence range more than 90%) of
the samples predicted C type by my definition will end up in such a haplogroup.
This confidence is based on a
combination of statistical calculation, plus judgment where calculation is not
possible; for more discussion see my confidence topic.
C type seems to be about 80% of L540,
based on only 2 L540+ not C type, compared to 7 independent C type samples, for
77.8%, in the 67 marker data. Confidence in this 80% is not high because it is based on only
two samples. There may be more samples
in the STR neighborhood that will test out L540+ in
the future, with STR values quite different than my current L540
definition. Also, there may be
L540- samples in the future that match my current L540 definition.
In other words, I have high
confidence that C type predictions are L540+, but not high confidence in the
L540 predictions outside C type.
I expect to update my L540 definition
as data accumulates, thereby improving the confidence for L540 STR predictions
outside C.
For now, I recommend the L540 test to
everyone in the neighborhood table, with particular
emphasis on those with closest step to my current L540 definition, and with the
caveat that there may be more L540 beyond that table.
Cluster P
probably corresponds to a small Polish clade within C type. It may take some time to find an SNP to
define such a small haplogroup.
C type is close to the same as a
clade defined by a particular mutation(s) at the DYS389II
marker, from value 30 to 32. We can
even speculate that clade is identical to C type data.
A particular mutation at the DYS594 marker, from 11 to 12, seems to define a “father” clade
that is slightly older than L540. The
same might be true for a particular mutation at the DYS636
marker, although more 111 marker data is needed to estimate if 636 is younger
or older than 594 and / or L540. With
more data, these two markers will provide a foundation from which the age of
other mutations can be estimated, because independent mutations in other
markers are unlikely to also have mutations in these two foundation markers.
The L540 data is bimodal in a number
of markers. Each of these bimodal
markers is evidence that there is a significant subclade division that might be
determined for L540 structure. However,
these various suggestions point to different divisions; so far no two of them are strongly correlated. With more data, it may be possible to split
L540 based on statistical correlation of STR markers, using my mountain method.
Update 30 Sep 2011:
FTDNA
provides a 67 marker standard
set of STR markers. I have been
using this 67 set for analysis.
A standard set of 111 markers is now
available. For existing samples with 67
or fewer, an extension to 111 can be purchased from FTDNA.
I am hopeful that 111 marker data
will enable me to construct a high confidence family tree for the L540
haplogroup, as more data accumulates.
Indeed, the marker DYS636, one of the 111
extension markers, has already provided an additional signature marker for
L540.
Additional markers make it more
likely that haplogroups and types can be subdivided with
confidence. Indeed, marker DYS445, one of the 111 extension markers, has already
provided an additional signature marker for the P cluster
division of C type.
If you are in my neighborhood table above, please consider ordering
the FTDNA panels of additional markers.
Some of us have already ordered, as indicated in the table.
It is helpful for me to include the
neighborhood just beyond L540 in this request, for better determination of
ancestral STR values. In addition, the
marker CDYb may provide a definition for a “father” clade
of L540 if additional markers correlate with CDYb.
The current 67 marker
data includes a few L540+ samples where I paid for the extra data, but it
also includes one sample that was in the neighborhood table but has been
removed because I paid for the extra data, which showed it is not an L540
neighbor. So my help is causing a bias
in that table above, but not a significant bias yet in my Excel analysis
files.
Update 17 Jul 2011:
479H7
is a direct link to my modal haplotype for L540.
QAZ7P
is a direct link to my modal haplotype for C type.
If you are not listed in the table above you can compare your data on Ysearch. You can compare your step genetic distance
to these modal haplotypes if you have the standard 12, 25, 37, or 67
markers. The comparison may not work if
you have a non-standard marker set. The
cutoff for each marker set is given in the legend in the table above.
Brief description of Ysearch.
Link to the site home: http://www.ysearch.org.
To join Ysearch, click on the Create A New User tab,
where you can upload your Y-DNA STR data from a number of testing
services. Or, you can type in your
data. You end up with a “User ID”.
Instructions for comparison to V13C
at Ysearch:
Click here: Research
Tools (or click on the tab with that name)
Copy the following line into the
“UserIDs” bar at the Research Tools page:
USEID, 479H7, QAZ7P
Change USEID to your User ID.
You need to type the Captcha puzzle
for access.
Click on “Show genetic distance
report” to see your step genetic distance from C type and from L540.
www.Ancestry.com is the web page for a commercial DNA testing
company. Men with Y-DNA test results
can choose to make results available for matching to others. Kargul originally
matched with me at this site.
I last checked for matches 16 May
2011. There are 9 close matches of
Y-DNA to Kargul & me, but these are not close enough to include in my Neighborhood Table.
Comment 25 Mar 2013: this topic needs update modification using
the latest data. The new on-line Excel
file versions do not yet have the ASD sheets.
Coming soon.
Topic update 11 Jul 2011:
The discussion in this topic is based
on the sheet “ASD” in the two 67 marker analysis files.
Average Squared Distance (ASD) is
equivalent to variance of STR values.
Most people use ASD to calculate age in genetic genealogy, as I explain
in my publications. The ASD method has large known systematic uncertainties,
discussed in my publications, which make age calculation uncertain. It is not possible to calculate a confidence
range because the systematic errors might be larger than the statistical
errors, even for small samples of data.
C type is quite young. The age using all 67 markers comes out 805
years, cell N12 on the “ASD” sheet in C67.xls.
Although I do not have high
confidence in estimating the exact age of C type, there is additional evidence
that C type and L540 are young. The
fact that C37, using the first 37 markers, provides a reasonable definition of
C type is evidence of youth. Old
haplogroups do not provide reasonable modal definitions using all 37 markers,
because of the wide variation in the rapidly mutating markers. Another way of saying this: using all 37 markers, there is a lot of
overlap of old haplogroups. Another way
of saying it: Isolation in STR values
is evidence of youth. My publication
elaborates on this.
The fact that C type samples can be
extracted from V13 using only a 4 marker signature is
also evidence of isolation and youth.
Low SBP is evidence that C type and
L540 are well isolated; see the analysis for SBP.
I expect the age to creep up somewhat
as new data is discovered. The standard
statistical way to correct for this expectation is to divide by N-1 instead of N when figuring
ASD. My files do not use N-1 because
that is not the common practice in genetic genealogy. Using N-1 the age is 939 years instead of 805. This estimate includes a best guess for
future samples; eventually samples will
show up that fit C type but have more mutations than the samples so far,
either because of true older nodes in
the Y-DNA tree or just due to bad luck in random STR mutations.
Conclusion: My best guess for the age of C type is about 1,000 years.
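The N-1 correction above can be checked with a short sketch. It assumes, as this topic says, that ASD is treated as the variance of STR values and that the age estimate scales linearly with ASD, so switching the divisor from N to N-1 multiplies the age by N/(N-1). The marker values and the helper name correct_age are invented for illustration; mutation rates and the rest of the age calculation are not modelled.

```python
from statistics import pvariance, variance

# Hypothetical repeat counts at one marker for N = 7 samples (invented values).
values = [13, 13, 14, 13, 12, 13, 13]
n = len(values)

asd_n = pvariance(values)         # variance with divisor N
asd_n_minus_1 = variance(values)  # variance with divisor N-1

def correct_age(age_years, n):
    """Rescale an age computed with divisor N to divisor N-1."""
    return age_years * n / (n - 1)

print(asd_n_minus_1 / asd_n)          # ratio is N/(N-1), about 1.1667 for N = 7
print(round(correct_age(805, 7)))     # 939
```

With the 7 independent samples, the factor 7/6 turns 805 years into 939, matching the figure above.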
Of course, this result has very low
confidence, because it is based on only 7 samples. The true age might be up to a factor of 4 older. Or it might be a lot younger. Although it is not possible to calculate a
confidence range that includes systematic errors in age estimation, I suppose a
reasonable high confidence range estimate would be 500 to 2000 years old.
Discussion of C age estimation. My calculation excludes two samples, Gwozdz and Kargul, whom I
recruited, because a random selection is required. Including them yields 806 years, corrected to 907, not
significantly different. All 67 markers
are valid for use in a young clade like
this, because there are no recLOH issues
in the C data. Issues in recLOH usually
cause problems using all 67. Those
analysis files include comparison to a classic “Thomas” method, which I explain
in my publication, and which yields 292 (vs 805) years, but I consider that
just a demonstration that the old method uses too few markers, which by luck do
not vary much in the C data. My files
allow the user to easily vary the markers and easily vary the samples. Boyd seems to be the oldest sample; removing him lowers 805 to 712,
corresponding to the age without his hypothetically older clade. The “oldest” two markers (far right of the
ASD sheet has a sort by age) are DYS460 and DYS385a, with 3975 and 3058 years
respectively; this is evidence that
those markers might correspond to clades within C type, but the evidence is not
compelling yet because some markers are bound to come out old just by
luck. My definition of C type of course excludes
such old markers, which is a proper procedure for a good definition. But it is not fair to exclude old markers in
age estimates because the zeros balance the old ones statistically; C type has 45 markers with zero age - the
lucky ones with no mutations. Many
zeros are evidence of youth but also evidence of few samples - only 7. With more samples there will be fewer
zeros. Most published ASD age
calculations include a correction making the result older, but the reason for
that correction is population
bottlenecks, which reduce ASD; since
C type is large for a very young clade, I’m guessing the age corresponds to a
rapid population expansion after which there were no significant
bottlenecks. If I am right, no such
correction would be appropriate. Also,
most published ASD ages use N instead of N-1, so a published correction method
should be applied to my first number, 805.
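To illustrate why the zero-age markers and the old markers have to stay together in the average, here is a minimal sketch of a per-marker ASD age calculation. Everything in it is an assumption for illustration only: the STR values are made up, a single mutation rate of 0.002 per marker per generation is used for all markers, a generation is taken as 25 years, and the modal value stands in for the ancestral value. It is not the calculation in the analysis files.

```python
from statistics import mean, mode

YEARS_PER_GENERATION = 25   # assumed
MUTATION_RATE = 0.002       # assumed, per marker per generation

# Hypothetical STR repeat counts: rows are samples, columns are markers.
samples = [
    [13, 24, 10, 11],
    [13, 24, 10, 12],
    [13, 25, 10, 11],
    [13, 24, 10, 11],
    [13, 24, 10, 13],
]


def marker_ages(haplotypes, divide_by_n_minus_1=False):
    """Return one ASD age (in years) per marker.

    ASD for a marker is the average squared difference between each
    sample's value and the assumed ancestral (modal) value.
    """
    n = len(haplotypes)
    denom = n - 1 if divide_by_n_minus_1 else n
    ages = []
    for values in zip(*haplotypes):          # iterate over marker columns
        ancestral = mode(values)
        asd = sum((v - ancestral) ** 2 for v in values) / denom
        ages.append(asd / MUTATION_RATE * YEARS_PER_GENERATION)
    return ages


ages = marker_ages(samples)
all_markers = mean(ages)                  # zeros and old markers together
without_oldest = mean(sorted(ages)[:-1])  # drop the single oldest marker
zero_count = sum(1 for a in ages if a == 0)

print("per-marker ages:", ages)
print("markers with zero age:", zero_count)
print("average over all markers:", all_markers)
print("average without the oldest marker:", without_oldest)
```

In this made-up data, dropping the oldest marker while keeping the zeros pulls the average down, which is the bias described above. Passing `divide_by_n_minus_1=True` gives the N-1 variant of the same calculation.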
An age calculation for L540 is not
appropriate, although L67.xls does it, getting 971 years in cell N12 of the ASD
sheet. That’s not fair. C type dominates the data, but C type is
really only one man, the MRCA. Our data
for L540 is really only 3 men: the
hypothetical C type MRCA, Fredeen, and Gebert. It is well known that estimating the TMRCA
of two men is highly uncertain, just due to the luck of random mutations. For 3 men it is not much better. The L540 age (TMRCA) is surely older than C
type, just a bit older based on the meager data here of 2 men with STR values
different than C type - but not very different at 67 markers than the C type
samples with highest step.
Age of a mutation is of course older
than the TMRCA because there should almost always be multiple generations
between nodes. We know the L540
mutation happened after the node for Butman and before
the nodes for Fredeen and Gebert. That
assessment will get better with more data.
I said in the Abstract here that L540
might be twice as old as C type. That
is just a guess.
Update 17 Jul 2011:
The neighborhood
table shows that 9 of the 18 cluster C men indicated “Germany” in the
“Origin” field of their data. One of
the two L540 men outside C indicates “Germany”.
That is very good (although not convincing) evidence that the ancestors
(MRCA) of C type and L540 lived in what is
now Germany. A caveat: men of German origin are more likely to
purchase a DNA test and submit data to web databases. It is obvious from data searches, in Ysearch
for example, that men of east European ancestry are underrepresented. This sample bias is difficult to measure,
but I doubt the bias is sufficient to rule out a German origin as our best guess.
The parent V13 haplogroup is
concentrated in the Balkans, according to density maps on the web. I’m guessing that our L540 MRCA lived in
Central Europe, but I do not know that.
This cannot be checked with data available today, because the published
Balkan Y-DNA data has too few markers to distinguish L540 or C type. I look forward to the near future when data
with more STR markers become available from the Balkans to verify my guess. L540 SNP data would help, where I expect
almost all to be negative from the Balkans.
On the other hand, if L540 is common in the Balkans, that would imply a
probable MRCA origin in the Balkans, with subsequent expansion into Central
Europe.
Bird published evidence for a hypothesis
that E1b1b1a1b (V13) appeared in England, concentrated at the two locations of
ancient Roman garrisons, because of men from Moesia Superior who joined the
Roman Legions when the Romans conquered the Balkans.
Speculation:
Model L540A: V13 in Europe springs largely from Roman
Legionnaires from the Balkans. Due to
the statistics of Y-DNA, most men do not form lasting clades, but many Balkan
Legionnaires were lucky enough in their male line descendants so that many
small V13 clades in Europe today correspond to individual Roman
Legionnaires. Because enlistment in
Balkan armies, and subsequent enlistment in the Roman army, is largely random
from the point of view of Y-DNA, these clades are a random selection from a
much larger population, so the Legionnaire founders had very variable STR
values. The clades today have STRs
quite different from each other. Most
clades are small enough that no samples, or only 1 or 2 samples, are present
from each in the databases today. The two L540 samples from outside C type, and the one sample just
outside L540, and others not yet tested for the L540 SNP, represent such small
clades. C type is an exception, with 18
samples available today. The MRCA of C
type was a descendant of one of these Legionnaires, but that MRCA lived about
1,000 years later, in what is now Germany.
Model L540A1: C type is larger just by luck. Statistically, some clades are necessarily
larger than others. It is not very
surprising that one clade is unusually large.
We do not notice small clades, so of course we are now studying C type
because it stands out.
Model L540A2: C type is larger because a descendant who
lived about 1,000 years ago was a king or otherwise very prominent man, so his
family grew much faster than others.
His ancestors were not prominent, so C type is isolated, but not more
isolated than those other small clades from Roman times.
Model L540A3: C type is larger because of a local
population expansion during Medieval times.
The C type MRCA was one of many who participated in these good times,
but the others were from other haplogroups.
For example, the population expansion might have been associated with a
Germanic R1b tribe, where the C type MRCA was an outsider who joined the tribe before
the expansion, along with outsiders from other haplogroups.
Model L540A4: C type is larger for another population
expansion reason.
Model L540Aa: The relatives of the Roman Legionnaires
stayed in the Balkans. In the near
future L540+ samples will show up from the Balkans. C type will show up, and it will not be particularly isolated in
STR values, because the ancestors will not be the same as the C type MRCA.
Model L540Ab: There was a population bottleneck (or 2 or more
bottlenecks) in the Balkans during the past 2,000 years. War, famine, whatever. There was a later population expansion in
other haplogroups, so very few if any L540+ samples will show up in the
Balkans.
Model L540Ac: That parent population did not get entirely
wiped out. It survives, in a remote
area. Maybe a group of villages in the
Balkan mountains. The population has
not grown much over the centuries. If
we go there and test for Y-DNA we’ll find lots of L540 men, some of them C
type.
Model L540B: Not Roman.
A tribe of barbarians showed up in Germany about 1,000 years ago. Prior to that, they passed through another region,
not necessarily the Balkans, where a lone L540 individual joined them. This is similar to Model L540A3. There are Ba, Bb, Bc variations similar to
Aa, Ab, Ac.
Model L540C: He didn’t join a Roman army. He was a medieval trader. A very charming traveling salesman. He fathered children all over central
Europe, mostly in what is now Germany.
This model has similar variations to models A and B.
Model L540D: I can think of other speculative
scenarios. I’m sure you can, too.
The point of these examples: we don’t know the history, but C type is
unusual in that it is young, small, and well isolated in STR values (a small mountain in haplospace). It is not closely related to the rest of
L540 or V13. The migration history of
the very large V13 parent haplogroup may or may not be relevant to the history
of the relatively small C type hypothetical haplogroup.
Update 10 Jul 2010. Quite frankly, I was surprised by cluster C. Friedman did a good job finding this one. I admit I dismissed it when I first saw
cluster C in 2007 because it was so small that statistical significance did not
seem possible to me. I postponed
analysis until Jan 2010, independently verifying cluster C as C
type.
By “valid” I mean a cluster whereby most of the samples belong to a single clade, and whereby very few other samples in
the database belong to that clade. In
other words, a valid cluster should eventually have a corresponding SNP discovered. Throughout 2010 I confidently predicted such an SNP here in this
topic, although I doubted it would be discovered soon. L540 turned out to be almost the same as C
type, although slightly larger and quite a bit older, as discussed elsewhere in
this web page. As samples predicted C
type test L540+, this adds evidence that C type corresponds to a clade.
Update 23 Feb 2012: Fifteen new SNPs were discovered in my “Walk Through the Y” (WTY): L535 through L547, L614, and L618. All 15 are available
as commercial SNP tests from FTDNA.
My WTY test read about 200,000 base
pairs in Feb 2011. As of Feb 2012 the test
has expanded to twice that many. For
details, here is a link for this "WTY"
commercial product from FTDNA.
I announced 8 new SNPs here on 29 Mar
2011. The count on 30 Mar was 13 new
SNPs in my WTY. L614 was added in
June. L618 was added in August. That was a lot more than I expected. I now realize that’s because FTDNA expanded
the number of DNA bases included in WTY just before my test. Also, I seem to have been the first WTY from
E-M78 in quite some time. Since then, a
few others from M78 and V13 have tested, so there are quite a few more new SNPs
of interest recently discovered.
I tracked the status on these 15 SNPs
right here on this web page for a year, in detail. Recently the positions in the Y-DNA tree have been determined for
most of these SNPs of interest to me. I
recently removed most of the detail from this page. I’m leaving the Summary, below, for a while because other people
have links to that Summary. I’ll remove
most of this, including the Summary, later in 2012.
In late 2011 the SNP
Tracker was set up, as part of the E-M35 Project, to
track all new SNPs of interest. That’s
another reason for me to drop my details here.
That SNP Tracker merges data from WTY, from the 1000 Genomes, and from
SNP tests by members of the E-M35 Project.
Update 18 Jan 2013. For a detailed SNP tree of the E-M35
haplogroup, see the SNP Tracker. This topic used to have a summary of the
SNPs found in my WTY, but the SNP Tracker is now a better place to find an
update. Only L540 defined a new
haplogroup. L542 is equivalent to
V13. The others are all equivalent to
known haplogroup SNPs older than V13.
SNP tests cost $29 each from FTDNA if your sample is already there from
previous testing. Click on “Order an
Upgrade” from your FTDNA home page (top right), then click “Order an Advanced
Test” (do not click on “Order Advanced SNP Test”). In the box “Test Type” select “SNP”. Type the SNP code (for example L540) into the “Find” box to search
for it.
References & Sources
E-M35, a
project at FTDNA, is my main source of
data. Previously called E3b. Link:
http://www.familytreedna.com/public/E3b. The official name today would be
E1b1b1. ISOGG
changes the name when new defining SNPs are
discovered, so the name may change again in the future. M35.1 is the name of the SNP that defines
E1b1b1 within haplogroup E.
Haplozone
is a web site for analysis of data from the E-M35 project. Link:
http://www.haplozone.net/e3b/project. It holds the data from E-M35, plus some data added from sources
other than FTDNA, so this database is larger than that of the E-M35 project. Page with a listing of proposed
clusters: http://www.haplozone.net/e3b/project/cluster/. Discussion Forum: http://www.haplozone.net.
SNP Tracker
is a web page added to the E-M35 project in late 2011, to keep track of all the
new SNP branches in M35. http://tinyurl.com/e-m35-snps
The V13
data: http://www.haplozone.net/e3b/project/cluster/10. V13 is the defining SNP for E1b1b1a1b, the
largest haplogroup in E. This page of
data does not have the data for samples
that have been assigned to clusters as subdivisions of V13, just the data that
does not fit any downstream proposed cluster.
The number code for other clusters can be typed over that “10” to
quickly get to other cluster data.
Cluster C Data: http://www.haplozone.net/e3b/project/cluster/42.
Victor Villarreal
is an administrator for the E-M35 (E3b) Project.
Andrew Lancaster
is an administrator for the E-M35 (E3b) Project. Andrew has been particularly patient with me with long helpful
email discussions.
Elise Friedman
is a co-administrator for the E-M35 (E3b) Project and is administrator for the
Jewish E3b project.
Marko Heinila
provides Y-DNA trees for E-M35 samples, including samples not in the E-M35
Project. Links:
https://dl.dropbox.com/u/50201824/old/jstrees/e67tree.html
https://dl.dropbox.com/u/50201824/old/jstrees/etree.html
Peter Gwozdz. That’s me. pete2g2@comcast.net.
Revision History
2010
Jan 14 original draft version
2010
13 updates
2011
Feb - Jun, 12 updates
2011
Jul - 10 updates
2011
Aug 6 minor update of SNP Summary, 2 more L540 results, other minor edits
2011
Aug 24 major update of SNP Summary; M78
WTY results
2011
Sep 30 update of 111 markers, P cluster, Fredeen
2011
Oct 3 SNP Summary update
2011
Nov 19 L618 paragroup announcement
2011
Nov 20 slight edit of L618 news
2012
Jan 1 update of my WTY and status of the SNPs - not finished
2012
Jan 3 update of L542 status
2012
Jan 4 update of the five M78 SNP candidates
2012
Jan 5 update of the seven M34 SNP candidates;
finished SNP update
2012
Feb 23 update of SNP Summary; remove
most of the details of SNP tracking
2012
Mar 16 update Neighborhood Table
2013
Jan 21 update Neighborhood Table, drop SNP Summary details
2013
Mar 9 add Svercl cousin to Table
2013
Mar 21 new L45(67) definition; more
update of the Neighborhood Table;
update not finished
2013
Mar 23 new C46(67) definition; more
update of the Neighborhood Table;
update not finished
2013
Mar 24 continue update of table;
rewrite of first 6 topics
2013
Mar 25 continue update of table; edit several
topics