Polish Y-DNA Clades
20-Oct-12
Peter Gwozdz
pete2g2@comcast.net
News
17 Oct 2012 news: See the L1029 and
Np News topic. L1029 was a new SNP last March. L1029 is
almost equivalent to N type, which is almost equivalent to M458+
L260-. The N type samples that are
coming out L1029- seem to be concentrated in Poland. Most of these fall into what we are calling the “Np cluster”. It is too early to say for sure, but Np
might be a new item for my list of Y-DNA clades that are
concentrated in Poland.
18 Jul 2012 news: See the L260 and
M458 News topic.
Results
Np added 17 Oct 2012.
Update 15 Feb based on 7 Feb 2012
data from the Polish Project, taken to be
representative of Historical Poland.
The % column shows the percentage for each clade in the Polish Project.
Haplogroup codes are from ISOGG (early Feb 2012).
In the Short Code Name column, click
on the link to jump down to a discussion of that clade.
| Haplogroup | SNP | Proposed Sub-Clade Short Code Name | % | Concentration |
| I1 | M253 | I1 P type | 0.6 | Poland |
| N1c1d1 | L551 | | 0.6 | Lithuania |
| N1c1d2 | L591 | | 0.6 | Lithuania |
| R1a1a1g1* | M458+ L260- | N type | 8.8 | Eastern Europe |
| R1a1a1g1* | M458+ L260- | Np cluster | 1.1 | Poland |
| R1a1a1g1b | | | 8.9 | Poland |
| R1a1a1g2 | Z280 | B type | 2.7 | Eastern Europe |
| R1a1a1g2 | Z280 | I type | 2.8 | Eastern Europe |
| R1a1a1g2 | Z280 | K type | 3.6 | Eastern Europe |
| R1a1a1g2a | P278.2 | H type | 0.9 | Eastern Europe |
| R1a1a1g2b | L365 | G type | 1.8 | Pomerania |
| R1a1a1g2d | Z92 | E type | 2.1 | Eastern Europe |
| R1a1a1h1 | L342.2 | A type | 1.7 | Ashkenazi |
| R1b1a2a | L23 | R1b EE type | 1.3 | Eastern Europe |
| R1b1a2a1a1a5b2a | L47 | R1b A type | 0.9 | Ashkenazi |
| R1b1a2a1a1a5b2a | L47 | R1b P type | 0.6 | Poland |
The table above is a brief summary, with
some interesting results, and some recent results. For more results, please click on the following links:
Short overview of all major haplogroups in the Polish Project
All proposed clades; many with Ysearch links
Discussion of my methods
Instructions and links for comparison of your Ysearch data to these proposed clades
Table of STR Modal Haplotypes, Definitions, SBP, etc.
Abstract rewrite 20 Oct 2012.
The Polish
Project has assignments of men (samples) to haplogroups and to proposed subdivision clades
based on their Y-DNA data. Lawrence Mayka, administrator of the Polish Project, provides
data for this web site of mine. I help Mayka with statistical methods for assignment of
samples. This web document is for
explanation, details, and update news.
The Results Table has a summary of assignment
statistics. Some assignment categories
have a link to more detailed discussion.
If you know your assignment, you can click on the link in the Short Code Name
column of the Results Table to read more about your assignment category.
Many of the assignments are to well
established haplogroups, based on SNPs. Some assignments are to hypothetical
haplogroup branches, based on STRs. Such branches are proposed by many people,
including Mayka and me. In addition, I
hypothetically subdivide haplogroups into types when
division can be done with 80% confidence. With less than 80% confidence, my assignment
categories are tentative, not called types, usually called clusters.
About half of Polish men belong to
haplogroup R1a. Most of my work has been on R1a. The R1a Project
has lots of additional information.
This Abstract is for people
reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you
might prefer to read the Introduction first.
This web document has three
purposes: 1. More detailed explanations
for the sample assignments in the Polish
Project. 2. Summary of my published results.
3. Update with recent results.
The topic is common Polish Y-DNA
clades - identification of male line Y-DNA clades that are concentrated in the
region of Historical Poland.
I use the word type
to mean an STR cluster with statistical confidence as
established by my Mountain Method. Many of my types have been validated by discovery of new SNPs that
qualified the corresponding clades as official haplogroups. I expect more than 80% of my types to be
validated some day, but my method is intended to be slightly aggressive, so I
do not expect 90% validity. I chose the
word “type” because it is not generally used in genetic genealogy and I wish to
distinguish my types from haplogroups and from other clusters. All types have associated clusters but not
all clusters qualify as types. In my publications and web pages I make it clear which
types I have discovered in web data and which types were suggested to me by
others, with references. Often when I
discover a type I later find out someone else had mentioned it earlier on the
web; let me know if you the reader have
more clues and references for me.
Most types that I discuss seem to be
1,000 to 5,000 years old, so all the men in each type seem to be descended in
direct male lines from one man (MRCA) who lived that long
ago (TMRCA). A few
of my types might be younger or older than that range.
I use phrases like “seem to be” over
and over because the methods are statistical.
Click here for a slightly longer
abstract-style summary of Polish Project assignments.
The Polish Project is considered
representative of Historical Poland, with caveats explained in my Publication.
I am interested in Polish
origins. This web document, however, is
not for historical analysis and conclusions, except for occasional comments to
remind us of the goal. This document is
dedicated to identifying haplogroups and types and clusters concentrated in
Poland, with detailed explanations. I
am aware that some people object
to the use of Y-DNA for historical analysis, so I try to mention caveats along
with my comments.
Abstract rewrite 20 Oct 2012.
About half of Polish men belong to
haplogroup R1a. The R1a Project
has lots of additional information about that haplogroup.
When I originally posted this web
page in December 2007, no significant haplogroup subdivision of R1a was
available, so this page started with hypothetical subdivisions of R1a. A major division,
roughly 50-50, between R1a1a* and R1a1a7 (M458), became available in November
2009. Now, 2012, there are many
haplogroup branches known in R1a, and this page continues with proposed further
division of Y-DNA clades common in the region of Historical Poland.
In 2010 I expanded this page to include
clades from other haplogroups.
More than 99% of Polish R1a are R1a1a1 (M17).
See the Results Table for
breakout.
Actually, the largest category in the
Polish Project is the R1a - U category, for
“Unassigned” samples without sufficient data for confident assignment. The Results Table is based upon the samples
with sufficient data. If you are in
this R1a-U category, you can promote yourself out by purchasing the full 67
marker STR set, since all R1a samples with 67 markers get a detailed assignment.
There are two large categories in the
Polish R1a data. Since 2007, I have
been calling them P type and N type. P type is now known to be more than 98%
equivalent to the haplogroup R1a1a1b1a1a (L260).
N type is more than 95% equivalent to the paragroup
M458+L260-. P type is concentrated in
Poland, rare with increasing distance from Poland. N type seems to be mostly Slavic, widespread in eastern Europe.
Since 2007, I had been calling
another major R1a Polish category K type. Over the years I had subdivided K into
several smaller types and clusters, although I did not have high confidence that all of them in fact belonged to a single
unique clade, as discussed at this web page over the
years. My K group is now known to be a
mix of independent haplogroups, so the Polish Project no longer uses K as a
category, although quite a few small clusters with names such as Kx and Kz are
still predicted, because the confidence in the clusters has always been higher
than the confidence in K. The various K
categories are now clusters, types, and confirmed haplogroups within the two
major haplogroup branches R1a1a1b1a2 (Z280) and R1a1a1b2 (Z93).
Another large R1a1a clade, the one I
call L type, is very rare in the Polish Project. It is common in Scandinavia, and now known
as R1a1a1b1a3 (Z284).
Thanks go to Lawrence
Mayka, Polish Project administrator, for extensive email information and
assistance.
You can compare data to my types by
clicking this link to instructions for Ysearch.
Reminder: I am concentrating on Poland.
The statistics of STR clusters depend a lot on the database. For example, P type stands out dramatically
in Polish data. In other countries far
from Poland P type is rare. If you
belong to an R1a1 cluster that is rare in Poland, I’m sorry, but I’m not
covering you.
Recent updated graphical
representations of the full R1a tree are available at the R1a
Project and at Russian sites, for example Semargl.
This Introduction is for people
unfamiliar with the jargon of genetic genealogy.
There are quite a few web sites with
a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. Back issues of JOGG
are good general references. The Y Chromosome Wikipedia article
is about male line DNA, also called Y-DNA.
The following several paragraphs are
a brief introduction to genetic genealogy for Y-DNA, providing some definitions
of jargon needed to read my web pages.
The definition words are boldface.
I often use links to those definitions when I use a jargon word for the
first time in a topic. There are more
boldface definitions in the summary of my Methods.
The Y chromosome gets passed from
father to son, so it works just like a male family name. Men are divided into haplogroups based on known rare mutations (most of them
are called single nucleotide polymorphisms, SNPs) in the Y
chromosome. Division into haplogroups
is done in a manner that has virtually 100% confidence. I say “virtually” because your confidence in
your DNA result from your DNA testing company might be 98% or 99% or
99.9%; the confidence for haplogroups
is better than that. We can be
virtually certain that all the men in a haplogroup descend in direct male lines
from one man, called the “Most Recent Common Ancestor” (MRCA)
for that haplogroup. The MRCA
corresponds to a node, or branching point, in the Y-DNA
tree of male line ancestry. Time of the
Most Recent Common Ancestor (TMRCA) is an estimate of
how long ago he lived - the age of the node.
Lots of people, including me, are
working to discover more SNPs on the Y chromosome so that the haplogroups can
be divided further into smaller haplogroups.
Haplogroups have alphanumeric codes,
like R1a1a. A paragroup
is a haplogroup considered without its known haplogroup branches. An asterisk is often used in paragroup
codes, like R1a1a*. When a new branch
is discovered within a paragroup, it gets removed from the definition; that changes the meaning of that
paragroup. The meaning of a paragroup
varies at different web sites, depending upon which branches are used in the
associated database.
Many people, including me, try to
“stay ahead” of the haplogroups by analyzing other mutations that are not so
rare (called STR) on the Y chromosome. Men submit their Y-DNA data to various web
sites. There are lots of STR data
available on the web. Men are divided
into STR clusters as hypothetical subdivisions of the haplogroups. All such clusters are hypothetical. Some will be validated in the future by new
SNP discoveries. There are various
statistical methods for estimating the confidence of STR clusters. I recently published
a method that I developed. That
publication has references to other methods.
There is a brief summary of my method
below.
A few STR clusters are small family
clusters, with the same family name.
Y-DNA is biologically accurate, so some men discover that their Y-DNA
does not match the DNA of their male line cousins identified by genealogy
research, due to secret adoptions, illegitimacies, etc. This is one of the reasons some people
prefer to avoid genetic genealogy. The
male line associated with the Y-chromosome is only one ancestral line. Humans have 23 pairs of chromosomes. Anyone who tries to make a family tree going
back 300 years has more than a thousand root tips to be filled by names of
ancestors who lived back then; the one
man at the tip of the male line root is only one of those thousand. That is another reason some genealogists
avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of
many. That said, many people enjoy the
challenging hobby of figuring out to which ancient extended male line they
belong.
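The "more than a thousand root tips" figure follows from the number of ancestor slots doubling with each generation. A minimal sketch, assuming about 30 years per generation (that figure is an illustrative assumption, not a number from this page):

```python
# Count the ancestor slots ("root tips") in one generation of a family tree.
# ASSUMPTION: about 30 years per generation (illustrative, not from this page).
YEARS_PER_GENERATION = 30

def root_tips(years_back, years_per_generation=YEARS_PER_GENERATION):
    """Ancestor slots in the generation that lived `years_back` years ago."""
    generations = years_back // years_per_generation
    return 2 ** generations  # each person has two parents, so slots double

tips = root_tips(300)
print(tips)        # 1024 slots, ten generations back
print(1 / tips)    # fraction of those slots on the direct male line
```

Under that assumption, the direct male line is one slot out of 1,024 at a depth of 300 years.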
Most STR based clusters have an MRCA
who lived thousands of years ago, before family names were common, so most men
assigned to a typical cluster do not have the same family name.
Many SNP based haplogroups have an
MRCA who lived more than ten thousand years ago, so these span multiple ethnic
groups and nationalities. For example,
the R1a
haplogroup is of interest to me. R1a is
most common in Slavic countries but calling R1a Slavic is misleading because it
is found throughout Europe and west Asia.
The MRCA lived so long ago that he may have spoken a language that we
would not consider Slavic if we could hear it.
It is possible that he did not even live in what is now the Slavic
region of Europe; maybe his descendants
moved there in a massive migration from the Asian steppes, or from India. No one knows for sure. Even if he was proto-Slavic in language and
culture, by now some of his descendants long ago moved to other parts of Europe
and Asia. One of the appeals of genetic
genealogy is trying to figure out ethnic descent and migration from the
statistics of haplogroups. Some people
object, pointing out that ethnicity cannot be defined genetically because of
all the moving and mixing of people over the millennia, and because the Y
chromosome is only one of many. True
enough. Some individuals and some web
sites go too far with genetic claims.
That said, statistical analysis of haplogroup data provides many clues
on human origins.
Again, some people try to stay ahead
of haplogroups, using statistical analysis of STR based clusters to gain
insight into more recent human origins.
I am one of those people. My
interest is Polish origins. This web
document, however, is not for the historical analysis and conclusions, except
for occasional comments to remind us of the goal. This document is dedicated to STR data and analysis, identifying
clusters concentrated in Poland, with detailed explanations.
The bottom of my Method section has
more definitions for a number of genetic genealogy
terms.
There are a number of organizations
and commercial companies on the web where you can order a cheek swab kit to
mail in for genetic genealogy analysis, for example FTDNA. I am not associated with the company
FTDNA; I mention them because I make
extensive use of their data; check
Google for competitors. At FTDNA, click
on Products for cheek swab kits. DNA
results are confidential unless you register the data at a database; at FTDNA, click on Projects to register your
data into one of the many databases;
for example, most of my analysis is from the data in the FTDNA Polish Project.
I use the FTDNA standard set of 67 STR markers (plus a few
non-standard ones occasionally). I do
some analysis using the standard FTDNA 12, 25, 37, or 111 STR marker sets. Other companies use standard marker sets
that may not overlap with all the FTDNA markers.
Ysearch is the
largest web database for Y-DNA, run by FTDNA, open to all men, including men
who also register with projects and including men with data from other testing
services. I use Ysearch often for
analysis so of course I encourage you to register your Y-DNA data at
Ysearch. From the FTDNA site, you can
register your data with Ysearch. Or you
can type your Y-STR data into Ysearch.
New topic 17 Oct 2012.
L1029 was a new SNP
last March. L1029 provides a branch of
M458, added to the ISOGG tree this year. The other branch is L260 (update next topic).
L260 was discovered in 2010.
Most M458+ L260- samples are coming out L1029+. I have been calling M458+ L260- samples N
type (very few exceptions - next topic).
It is now clear that L1029 is a major branch, capturing more than 90% of
N type (more than 90% of M458+ L260-).
In the Polish
Project, most of the N type L1029- results are samples with Poland given as
the ancestral country. This spring, Mayka started classifying these as the “Np” cluster.
In this topic I present preliminary
evidence that Np corresponds to a Y-DNA clade concentrated in Poland. I also explain why all Polish N type
samples (tested or predicted M458 and not L260) would benefit from the L1029
test, because Np cannot be predicted precisely, and because there is a low
fraction of L1029- outlier samples, not fitting Np.
So far (10 Oct data) there are 20
results L1029- (including a few samples that are not M458+) and 42 results
L1029+. N type requires 67 or more of
the standard markers for confident
assignment. Using samples with those 67
markers the numbers are 114 N type, of which there are 12 L1029- and 41
L1029+. Of the 61 remaining N type
samples (at 67 in the Polish Project) not tested for L1029, I estimate only
about 5 might come out L1029-, because testing has been concentrated on STR predictions, discussed below in this topic.
One M458+ L260- L1029- sample is not
counted as N type, as discussed in the next topic as Ry type. This seems to be a very small outlier clade with an old node in M458.
Two of the 12 L1029- N type samples differ
significantly from the other 10, so I am predicting these two as outliers, with
M458 nodes older than the main Np hypothetical clade.
Np Cluster Definition: I constructed an STR definition
for the remaining 10 samples with similar STR values and L1029- result. The definition uses 37 of the 67
markers. The cutoff
is step 2 (samples at a step less than 2 are considered matches). I uploaded this definition to Ysearch, code CHFXB.
My analysis file is L1029Study.xls
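A minimal sketch of how a definition with a cutoff might be applied to a sample. It assumes that "step" is the sum of absolute repeat-count differences from the modal values, which may differ from the exact counting rules in the Methods discussion. DYS460 = 10 and CDYa = 33 are the signature values discussed on this page; MARKER_X and MARKER_Y and their values are hypothetical placeholders, and the real definition uses 37 markers rather than 4.

```python
# Toy illustration of applying an STR cluster definition with a cutoff.
# ASSUMPTIONS: "step" is taken as the sum of absolute repeat-count
# differences from the modal values. MARKER_X and MARKER_Y and their values
# are hypothetical placeholders; the real definition uses 37 markers.
NP_MODAL = {"DYS460": 10, "CDYa": 33, "MARKER_X": 15, "MARKER_Y": 12}
CUTOFF = 2  # samples with step below the cutoff count as matches

def step(sample, modal=NP_MODAL):
    """Total mutation steps between a sample and the modal haplotype."""
    return sum(abs(sample[marker] - value) for marker, value in modal.items())

def classify(sample, modal=NP_MODAL, cutoff=CUTOFF):
    """Return 'match', 'marginal' (exactly at the cutoff), or 'no match'."""
    s = step(sample, modal)
    if s < cutoff:
        return "match"
    if s == cutoff:
        return "marginal"
    return "no match"

# Hypothetical sample that differs from the modal by one step at CDYa.
sample = {"DYS460": 10, "CDYa": 34, "MARKER_X": 15, "MARKER_Y": 12}
print(step(sample), classify(sample))  # 1 match
```

Samples that land exactly on the cutoff are reported separately as marginal, which mirrors how marginal samples are treated in this topic.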
On this basis, 3 of the untested N
type samples fit the definition and are predicted L1029- members of the
hypothetical Np clade. Two more are
marginal, so perhaps there are 14 Np samples among the 114 N type. N type is 8.8% of the Polish Project, so
that means 14 / 114 * 8.8% = 1.1% Np samples in the Polish Project. The statistical uncertainty is wide, so my
estimated 80% confidence range is 0.5% to 2%.
Insofar as the Polish Project is representative of Historical Poland, it
seems the Np hypothetical clade has roughly 1% frequency in the region of
Historical Poland. Of the 10 confirmed
Np samples, 8 provide “Poland” as origin, one “Russian Federation” and one
“Lithuania”. The 3 predicted Np samples
have two “Poland” and one “Belarus”.
There is no need to subtract the samples without “Poland” because the Polish
Project as a whole has a similar frequency of samples not “Poland”; such samples come from men with evidence of
male ancestry from Historical Poland.
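The frequency arithmetic in this paragraph can be reproduced in a few lines. This is only a sketch of the point estimate using the counts quoted above; it does not attempt the 80% confidence range, whose calculation is not described here.

```python
# Point estimate of the Np share of the Polish Project, using counts from the text.
n_type_samples = 114     # N type samples at 67 markers in the Polish Project
np_samples = 14          # rough count: 10 confirmed, 3 predicted, 2 marginal
n_type_percent = 8.8     # N type share of the Polish Project, in percent

np_percent = np_samples / n_type_samples * n_type_percent
print(round(np_percent, 1))   # 1.1

# Stated origins of the 13 confirmed and predicted Np samples.
origins = {"Poland": 8 + 2, "Russian Federation": 1, "Lithuania": 1, "Belarus": 1}
poland_share = origins["Poland"] / sum(origins.values())
print(round(100 * poland_share))  # 77
```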
DYS460 = 10 is a very strong signature marker for Np.
All 13 of the confirmed and predicted Np samples have this value. Those two outlier samples also have this
value. Among those 41 L1029+ samples,
only 6 have this 10 value; 3 have 12
and the 32 others all have the N type modal 11 value. The statistics of this paragraph are
misleading because DYS460=10 was used to encourage L1029 testing in the Polish
Project. I would expect a few Np to
show up in the future with 460 value other than 10 (mutated from the Np
ancestral value), and I would expect in the long run a lower fraction (less
than 6 / 41) L1029+ to have the 10 value (independent mutations). Among the 49 N type samples not confidently
assigned to sub-categories, only 5 have the 10 value, and 1 of these is a
marginal Np sample mentioned above.
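The contrast between the groups is easier to see as percentages. The sketch below only restates the counts given in this paragraph; as noted above, the sampling was biased toward DYS460 = 10, so the percentages are descriptive and not estimates for the wider population.

```python
# DYS460 values among the 41 L1029+ samples at 67 markers (counts from the text).
l1029_plus = {10: 6, 11: 32, 12: 3}
total_plus = sum(l1029_plus.values())
share_plus = l1029_plus[10] / total_plus

np_with_10, np_total = 13, 13                   # confirmed plus predicted Np
share_np = np_with_10 / np_total

unassigned_with_10, unassigned_total = 5, 49    # N type not assigned to sub-categories
share_unassigned = unassigned_with_10 / unassigned_total

print(total_plus)                         # 41
print(round(100 * share_plus, 1))         # 14.6
print(round(100 * share_np, 1))           # 100.0
print(round(100 * share_unassigned, 1))   # 10.2
```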
CDYa = 33 is another good
signature. These two markers alone with
cutoff 1 (that means both markers match) capture 9 of the 13 Np samples (Np
defined as 13 captured by 37 markers cutoff 2). These two markers also capture 2 marginal samples (at the step 2
cutoff of Np at 37), plus only one other N type, plus a few D
type (D are not members of the M458 clade, but DYS460=10 is modal in
D). CDY is a fast mutator, so it is
unusual for it to serve as a signature marker.
I ran into this on one other occasion, where I postulated that a mutation
disabled CDYb; see my discussion at http://www.gwozdz.org/L540.html#CDYb. Actually, another reasonable explanation is
that this CDYa=33 signature is just luck, because using only 10 samples we
should not be too surprised that one of the rapid mutators looks like a
signature, by the luck of random mutations.
Yet a third explanation: Np
might really be 2 or more clades where the ancestors (MRCAs)
of each clade had the CDYa=33 value by luck, but those ancestors differed at
other markers; this explanation is
discussed more below.
There are no more good Np signature
markers. Np modal values differ from N
modal values at only 4 of the 67 markers.
There are only two Np samples at 111 markers, and they do not seem to
differ from N at those additional 44 markers.
On this basis, I am not confident that my definition is very precise,
because it takes as little as 2 mutations in the male line history for a sample
to be incorrectly predicted, using any STR definition.
There is another reason for my
uncertainty about my 37 marker Np definition:
I worked harder than usual to construct this definition, so there is
selection bias. Markers that just
happen to have no mutations in those 10 samples are all in the definition. Any marker got dropped if it produced 2 or
more mutations in any sample of those 10.
Surely as more samples show up I’ll need to modify my definition. Those 37 markers are only a “good bet”
definition for Np prediction today.
I
published my SBP
method of quantifying confidence in clade predictions
based on Y-DNA STRs. Lower SBP means
higher confidence. I reserve the word type for clusters with SBP < 20%. I consider SBP meaningless for SBP > 50%. Np comes out with SBP = 64%. This does not necessarily mean that Np is
invalid as a clade prediction. My SBP
method gives larger values for SBP with few samples, so valid clades improve
with more data (SBP becomes smaller). A
clade with modal STR values close to the father clade (N is the father of Np)
necessarily comes out with large SBP.
Concentration in Poland is evidence of validity for Np. That 460=10 is also evidence of
validity. In my estimation, Np has about
80% confidence of validity, all evidence considered, but only 50% confidence of
being a unique clade. Np might be primarily
one clade with interference from other independent small clades with similar
STR values. Or, Np might be 2 or more
clades, about the same size, all concentrated in Poland, but distantly related. Clarification: two clades with very close nodes to the father branch might be
considered a single clade; here I mean
that Np might be 2 clades with nodes that are not close in the tree, perhaps
with other small clade nodes between them that do not fit Np STRs (by the luck
of random mutations in the ancestor).
More discussion below on this idea.
In the R1a
Project, my 37 marker definition captures 11 samples with SBP = 95% (data
at 67 markers, download 14 Oct). Eight
of the 11 have L1029- result and the others are not tested yet. Seven of the 11 are of “Poland” origin. Two L1029-
are N type that do not match Np.
There are 38 L1029+ that do not match Np. Summary: L1029- are rarer
in the R1a Project (compared to the Polish Project) and the L1029-
predominantly match Np. SBP is worse
(higher) because of interference at the cutoff by more R1a samples from outside
Poland. This paragraph is not
conclusive, however, because the administrators of both projects work
together; many of the samples come from
men who joined both projects. Both
projects worked hard on getting L1029 results this year, using 460=10 fit as a
guide for emphasis.
As an independent test, I checked (11
Oct) the “RussiaDNA” Project (another FTDNA project). Of 260 R1a total, only 12 have been tested
for L1029, and only 2 of these 12 came out L1029-: one Poland and one Russian Federation. This is preliminary evidence that Np is rare in the Russian
Federation, although N is common in all Slavic countries.
More projects checked (14-15 Oct):
Russian_impire: 4 L1029 tests, one negative, not Poland
LituaniaPropria: 4 L1029 tests, two negative, both “Lithuania” origin, one L1029- also in the Polish Project, both also in the R1a Project; in addition, both L1029+ are also in the Polish Project, and one in the R1a Project, so these are not independent data
Scottishdna: no L1029 tests
Finland: 1 L1029 positive
BritishIsles: 1 L1029 positive
Other projects are not concentrating
on L1029 tests. I hesitate to encourage
them, because M458+ L1029- seem to be mostly from Poland.
I have an R1a database at 67 markers
with 1816 samples from 15 FTDNA projects.
I collected this 20 June, when there were fewer L1029 results. My 37 marker definition captures 13 samples,
but 12 of these are in the Polish Project, and the other is in the R1a
Project. No additional samples fit
Np. There are 10 marginal samples at
the cutoff step 2, but only 2 of them are in the Polish Project and only one is from Poland. This is my strongest evidence that the Np
cluster is concentrated in Poland.
Ysearch: 9 samples are captured by my Np definition CHFXB. Only 2 are from Poland. Only 2 of the 13 Polish Project Np joined
Ysearch (one Poland and one Lithuania).
SBP is poor for Np at Ysearch because there are 6 samples at the step 2
cutoff, none from Poland. In addition,
2 “Central European” modals fall at step 2 (37 markers used), emphasizing how
hard it is to separate Np. A simple
explanation for these Ysearch results is that there are 1 or more other clades
concentrated outside Poland, which might be L1029- or L1029+.
At the top of this topic, I reported
“more than 90% of N type” (M458+ L260-) are L1029+. Since L1029- are concentrated in Poland, it may actually be more
than 95% worldwide. However, there is a
reasonable possibility of one or more small clades showing up L1029- from
outside Poland when more samples are tested.
Age of Np: It is too soon to estimate the age (TMRCA)
of L1029, and age based on STR variation is uncertain because of known caveats.
However, L1029 is probably not much younger than N type because L1029
includes almost all of N type. N type
is surely older than 2,000 years.
Indeed, variation of L1029 STRs is looking similar to N type
variation. The L1029- node is
necessarily the same or older than the L1029 node, so Np has an old node. However, the age of the node is almost
always older than the age of the clade (TMRCA). Np seems very young, as evidenced by the unique 460=10 value
discussed above. On the other hand,
other markers have significant variation within Np; that may mean Np is not so young; or, that may mean Np is composed of 2 or more clades, each of
which is young.
Speculation: Np reminds me of P type (L260 update, next
topic). In my 2009 publication, and at
this web page, I have speculated that L260 may have a very old node, but the P
type ancestor (MRCA) may have lived more recently, perhaps not long before
formation of the tribes that led to the Polish nation. It seems to me that M458 is quite old, but
not many M458 individuals survived over the millennia, and a few of the M458
survivors were lucky enough to found clades during the population expansion of
the last 3 millennia. Perhaps the Np
ancestor, with L1029- and 460=10, also lived long ago and left few
survivors; most of those few formed
what are today very small clades, and one was (or perhaps 2 or more, all with
460=10, were) lucky enough to found the medium sized cluster today apparent as
Np. I find it interesting to consider
the men who lived 1,000 to 2,000 years ago in the region that is now Poland
(and / or maybe in another region from which there was a migration to Poland). Due to the statistics of Y-DNA inheritance,
most men do not form clades that last long, and very few men form large
clades. Human behavior may perhaps
broaden the statistical spread of clade size, allowing rare men to produce
relatively larger clades. I speculate
that among those proto-Polish men who founded clades that survive today, most
were R1a, and many of those were M458, and one or a few of those were Np, and
one was P.
For background explanation of L260
and M458, see the topic L260 and M458.
Comment 17 Oct 2012: This topic needs an update and rewrite. L1029 has been updated in the previous
topic.
Update 18 Jul 2012:
The new SNP L1029
includes most but not all of N type. A few M458+ samples
have turned up that are neither N type nor P type. One L260+ sample has turned up that is
clearly not P type.
For this update, I collected in June
2012 a database of 3,586 unique R1a samples, from 15
large FTDNA Projects, each with significant R1a data.
At this web page, I have been saying
that L260 seems almost equivalent to P type, and that M458 seems almost equivalent
to N type plus P type. I say “almost”
because there have always been borderline
samples. Recent data continues to
confirm this general summary, as discussed in the following paragraphs:
Ry type: There is a family set (five samples with the
same family name, very close STR match to each other) where
one of them recently (Mar 2012) tested M458+ L260- L1029-. These five are clearly not N or P. Not even close in STRs. These 5 samples are now categorized in the Polish Project as “Ry type”. These were independently noticed by Lapinski, the administrator of the R1a
Project, with a new category for these in that project, also.
L260+: In that 3,586 database, there are 79 with an L260+ result. 53 of these give Poland as “Origin” of
ancestor. There are 6 from Germany, 4
each from Czech Republic and Russian Federation, 3 each from Hungary, Slovakia,
Ukraine and Unknown. L260 is clearly
concentrated in Poland. Many men
(samples) join multiple projects; 67 of
the 79 joined the Polish Project and 12 did not. 71 of the L260+ samples have the 67
marker standard set, 61 of them are in the Polish Project.
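As a quick check, the origin counts quoted in this paragraph can be tallied to confirm that they add up to 79 and to express the concentration in Poland as a percentage. The sketch uses only the numbers given above.

```python
# Stated "Origin" of the 79 L260+ samples in the 3,586 sample database.
l260_origins = {
    "Poland": 53, "Germany": 6,
    "Czech Republic": 4, "Russian Federation": 4,
    "Hungary": 3, "Slovakia": 3, "Ukraine": 3, "Unknown": 3,
}
total = sum(l260_origins.values())
poland_share = l260_origins["Poland"] / total

in_polish_project, not_in_polish_project = 67, 12
at_67_markers, at_67_in_polish_project = 71, 61

print(total)                                                   # 79
print(round(100 * poland_share, 1))                            # 67.1
print(in_polish_project + not_in_polish_project)               # 79
print(round(100 * at_67_in_polish_project / at_67_markers, 1)) # 85.9
```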
The following analysis uses all 1,816
samples that have the 67 marker standard set.
There is one sample recently tested
L260+ that does not match P type. The P
type cutoff is step 7 using the P43 definition (samples are predicted P type if mutation
count is less than 7, from the P type modal haplotype,
using 43 of the 67 standard STR markers).
That one recent outlier is step 11, not even
close to P. There are six other L260+
outliers: one at step 9, two at 8 and
three at 7.
By comparison, those Ry samples vary
from step 11 to 14 from P type. The N46
definition has cutoff 8. That P type
step 11 outlier is at step 16 from N type, and the Ry samples vary from step 10 to
13 from N type.
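The cutoff comparisons above can be sketched as a small classifier. Two assumptions are made here: that the N46 cutoff works like the P43 cutoff (a sample is predicted when its step is below the cutoff), and that P type is checked before N type. The page does not say how a sample below both cutoffs would be handled, so that ordering is purely illustrative.

```python
P_CUTOFF = 7  # P43 definition: predicted P type if step < 7
N_CUTOFF = 8  # N46 definition; ASSUMED to work the same way (step < 8)

def predict(step_from_p, step_from_n):
    """Predict a type from step distances to the P and N modal haplotypes."""
    # ASSUMPTION: P is checked first; the tie-breaking order is illustrative.
    if step_from_p < P_CUTOFF:
        return "P type"
    if step_from_n < N_CUTOFF:
        return "N type"
    return "neither"

# The recent L260+ outlier: step 11 from P type, step 16 from N type.
print(predict(11, 16))  # neither

# Smallest Ry distances quoted above (11 from P, 10 from N); pairing them in
# one sample is a simplification, since only the ranges are given.
print(predict(11, 10))  # neither

# Steps from P type for the seven L260+ outliers: none is below the cutoff.
outlier_steps = [11, 9, 8, 8, 7, 7, 7]
print(sum(s < P_CUTOFF for s in outlier_steps))   # 0
print(sum(s == P_CUTOFF for s in outlier_steps))  # 3
```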
For now, we are categorizing those P
type outliers as P type for convenience, although I suppose these samples are
evidence (not proof) that the L260 SNP is somewhat older than P type (the
hypothetical clade with strongly correlated STR values - see L260
and M458 for clarification of these words). I expect future updates of this web page will introduce new
categories for some of these outliers.
P type has no foreign outliers. No samples predicted P type (step <7)
have so far come out L260-.
There are 14 samples at the cutoff
step 7, compared to 152 P type samples <7, for SBP =
13.6%. Of those 14, 5 have been L260
tested: 3 positive and 2 negative.
That’s for the large 1,816
database. The Polish Project, part of
this large database at 67, includes 114 of those 152 P type, and 8 of those 14
at step 7, for SBP = 11.7%. The lower
SBP is a demonstration that P type is concentrated in Poland. That 11.7% is an upper limit estimate of
borderline P type samples at 67 markers, but P type is actually much more
isolated than that in the Polish Project:
Five of those 8 at step 7 are confidently assigned to other types and
haplogroups. Of the remaining three,
two have tested L260+ and the third remains classified as P Borderline (L260
test needed), along with a couple samples at step 6.
This is all evidence (not proof) that
P type is likely younger than L260, and that there are probably at most only
very few small branches (twigs) older than P type within the L260 haplogroup
(see L260 and M458 for clarification).
The new L1029 haplogroup: Only N type samples are coming out L1029+,
but many N type are coming out L1029-.
L1029 is clearly splitting N type, including more than half but less
than 90% of N type. Watch this topic
for an update of the percent as more data comes in. The N type samples that come out L1029- are categorized as “Np type” at the Polish Project; there are 14 of them so far (13 July
update): 7 with M458+ L1029- test
results, 5 with only an L1029- result and an STR match to N type, and 2 with
neither SNP test but a close STR match to the other Np samples. I need to add a new topic for Np to this web
page.
In that 3,586 database there are only
14 samples tested L1029- (late June):
one was that Ry sample discussed above and one was P type. The other 12 match N type: 9
were Np samples in the Polish Project (in June) and 3 were “L1029- N
type” in the R1a project; 8 of the 12 are confirmed M458+.
Np seems to be concentrated in
Poland, but it is too soon to be sure, because the Polish Project administrator has been actively encouraging testing of
these samples. Of those 12 Np samples in that
3,586 database, 8 give Poland as
origin, 2 Russia, 1 Lithuania, and 1 Germany.
N type has relatively more samples at
or just beyond the cutoff, plus a few foreign outliers. Next, I need to write, here in this topic, a
similar summary for N type.
I last updated the P type definition
16 Aug 2011. I last updated the N type
definition 13 Sep 2011. Previous
updates used only the Polish Project data.
Next, I plan to update both definitions using the larger June
database. Because of selection bias,
definitions improve as more data accumulates.
Watch this topic for updates.
For full definitions see Haplotypes.xls.
See also L260
and M458 Signatures.
Polish Project R1a Assignment News
This topic was updated 5 Jul 2010.
If you are R1a
but not a Polish Project member, the Ysearch instructions topic has my method for matching to my types. The news in this topic applies to you if you
know your assignment.
If you are P type
or N type you would likely come out positive in the SNP test for M458 (M458+). If you are P type you are likely L260+. N type is
likely L260-. If you have not already
tested, you can pay the small fee for these SNP tests to confirm that
you belong to the corresponding haplogroup.
If you are assigned to P borderline
or to N borderline you would benefit more from the M458 and L260 tests, because
that would provide for you a definite assignment within R1a.
The assignment rules are done with
high probability, so if you are unassigned (category U) there is a low probability
that you would test positive for M458, with probability that decreases with
your step (genetic mutation distance) from P or N.
If you have fewer than the standard 67 STR markers it
is generally better to purchase the remaining markers. That way, you are more likely to get an
assignment, because the statistics for STRs improve with more markers. Nevertheless, if you are not many steps from
P or N you might consider doing the M458 test even with fewer than 67 markers.
There is a slight chance that you might test
positive for L260 or M458 even if you do not match P or N. The haplogroup corresponding to M458 is old enough that there may be small clades with STR markers very different than P or N. I have not seen one yet, but there is no way
to estimate this probability. I
hesitate to recommend the M458 SNP test for men whose samples are distant from
both P and N in STR values. I admit you
can just wait to see if anyone with STR values similar to yours matches an SNP,
then test for that SNP. However, we all
benefit when some men test for all the new SNPs within an established
haplogroup, because that way we find out the size and rough age of the
corresponding new haplogroup branches. FTDNA offers “deep clade” test packages to test for all possible
haplogroup branches, but my understanding is that L260 and M458 are not yet
included in the R1a deep clade test.
You need to purchase them separately from the advanced markers menu. No doubt FTDNA will add them soon to the
deep clade package.
The Fall 2009 issue of the Journal of Genetic Genealogy has my
publication, which is split into two parts:
Part I is my “mountains in
haplospace” method for evidence that certain “types” of STR clusters correspond
to clades.
Part II is the application
of that method to Common Polish Clades.
That article has a lot more detail than this web page, but that article
was last updated in September 2009, so this web page is an update.
PolishCladesUpdate is my
folder for future updates of the Excel analysis files for those two articles.
This web page will continue as an
introduction and summary, without as much jargon and detail as the articles and
update folder.
The Fall 2010 issue has my
publication announcing the L260 SNP.
R1a Worldwide
Wikipedia has a nice R1a entry with primary
contribution by Andrew Lancaster.
11 Jan 2011 update: There is a lot of activity these days in the
discovery of new SNPs for dividing R1a into branch haplogroups.
You can follow the activity at the R page of the ISOGG
Y-DNA tree, and also at the FTDNA Draft tree.
The new SNP named L365 includes what
I have been calling G type, based on preliminary
data. It is too early to say if other
samples in addition to G type are positive for this new SNP.
The new SNP named M417 excludes what
I have been calling C type, based on preliminary
data. So far very few R1a samples are
negative for this new SNP, but it is too early to estimate the rarity of M417-.
In early 2011 FTDNA
released some new SNPs for commercial testing, including the following for
R1a: L365, M417, L366, L291, and
others. To order new SNP tests, go to
your home page at FTDNA, on the left under “My Account” click on “Order Tests
& Upgrades”, then click on “Go To Advanced Orders” and check “SNP”. Use your browser search to find the SNP of
interest. If you wish to publish your
results, join one of the projects (click on “Projects”) and the administrator
will analyze your data.
L260 and M458
are discussed below.
There are other new experimental SNPs
discussed on the web. I’m not trying to
list everything here, just the ones that are of interest for discriminating new
R1a haplogroup branches.
29 Feb 2012 comments: See the next topic, L260 and M458, for an
update. I’ll not update this
“Confusion” topic because the confusion issue is explained by example in this
2010 discussion. The confusion applies
to all haplogroups, not just R1a.
25 Oct 2010 update: The new SNPs cause
confusion in the alphanumeric notation for the haplogroups
and paragroups.
In my fall
2009 publication I used the notation that was well known at the time, where
more than 95% of R1a was known to be paragroup R1a1. The R1a1 samples with one of four very rare SNPs that have been
known for a few years were called haplogroups R1a1a through R1a1d. Ysearch still (25 Oct) uses the notation
described in this paragraph. FTDNA Projects still use this notation for automatic
assignment of samples. Individual
samples are not actually assigned to a paragroup because most have not been
tested for all SNPs. Most R1a samples
are listed as R1a1. Many samples are
listed as just R1a but almost all of those would come out R1a1 if tested for
the appropriate SNP (the well known M17 or M198, or one of the new ones that
all seem to be equivalent). I mentioned
in my publication that all Polish Project R1a were coming out R1a1. Since then only one sample (out of 1441 R1a
total in the Polish Project) has come out M198-.
New SNPs were discovered equivalent
to SRY10831.2, the original R1a SNP.
Subsequently, rare samples were found positive for some of these new
SNPs but negative for SRY10831.2. I’ll
use L62 to represent these; there are
others that seem to be equivalent.
Those define two small paragroups, R1a(L62, SRY10831.2-) and
R1a1(SRY10831.2, M198-). That previous
R1a1 paragroup becomes R1a1a(M198).
Accordingly, when Underhill announced the M458
SNP, he called that haplogroup R1a1a7.
L260 was called R1a1a7b when first discovered. Last spring I rewrote this entire web page using the notation
described in this paragraph.
The recent new SNPs change the
notation again. I shall not attempt to
rewrite this entire web page. As I
update topics, I’ll use the current notation.
For clarity, I’ll add the defining SNP in parenthesis when I do updates.
For example, what I have been calling
P type is equivalent to the haplogroup now called
R1a1a1g2(L260). What I have been
calling N type is equivalent to the paragroup
R1a1a1g(M458, L260-).
The choice of which SNP to put in
parenthesis is arbitrary for haplogroup notation. For example, R1a1a1(M17), R1a1a1(M198), and a few others, all
seem to be equivalent. But any day now
someone might announce a few samples that test negative for one of those SNPs
and positive for all the others, which would define a new paragroup and force
the renaming of all branches beyond that new node in the tree.
There is ambiguity in assignment of
samples. For example, a sample that
tests negative for M198 might be called R1a(M198-), but it is not clear if this
sample belongs to the paragroup R1a(L62) or to the paragroup R1a1(SRY10831.2)
if it has not been tested for the latter.
My types have an uncertainty similar
to SNPs. For example, I said N type is
equivalent to R1a1a1g(M458, L260-).
Recently two samples showed up in the Polish Project that are M458,
L260- but just beyond N type as defined by STR fit. We can think of these two as a new “paratype”, although I’ll not
use that word. We classify these two in
the Polish Project as “M458+R”, the Remainder in M458 excluding N type and P
type. Actually, as I discuss in the N type topic, it is not statistically certain where to place
the cutoff for N type, so you could argue that the M458+R category has more
than two samples in the Polish Project.
See the L260
and M458 News topic for updates and details.
This topic updated 18 Jul 2012.
L260 is an SNP
that I published in the Fall 2010 issue of JOGG. It
has been available as an SNP test since early April 2010 at FTDNA.
M458 was published by Underhill. It has
been available as an SNP test since early November 2009 at FTDNA. L260 is a branch of M458.
A new SNP, L1029, has been available
as an SNP test since March 2012, also a branch of M458.
Because of the confusion of recent SNP
discoveries ISOGG now uses the haplogroup
nomenclature R1a1a1g1 for M458 but FTDNA still uses R1a1a1g. Similarly, L260 is R1a1a1g1b at ISOGG but
R1a1a1g2 at FTDNA. ISOGG has the new
L1029 as R1a1a1c. The FTDNA draft tree has some more recent SNP
discoveries listed in the R1a branches.
The R1a Project home page has a nice recent
diagram of the proven R1a SNPs.
Both P type and N type are code names published
by me before these SNPs were discovered.
At this web page, I have been saying that L260 seems almost equivalent
to P type, and that M458 seems almost equivalent to N type plus P type. I say “almost” because there have always
been borderline samples. In Feb 2012, a few samples turned up that
are exceptions: M458+ samples that fit
neither P nor N. Not even close with STR values.
Reminder: There is a logical distinction between an SNP haplogroup and an
STR type: I use the
word “type” for clusters of samples where I have 80% or higher confidence that the type corresponds to a unique clade. I use the word
“borderline” for samples that seem to have 50% to 80% confidence of belonging
to that clade. P type and N type are
very well isolated in haplospace, with relatively few
borderline samples. P type is
particularly well isolated. The age of
a type is the age of its MRCA (the most recent common ancestor, at the hypothetical node for the type). A
new SNP may be younger than a type, capturing only part of the type. A new SNP may be older than a type,
capturing all of the type plus additional samples that are not predicted into
the type. L1029 is a branch of N type,
clearly younger than the type.
Even if a new SNP captures all the
samples of a type, future samples may show up that are positive for the SNP but
do not fit the type, not even Borderline.
These might be members of older branches (branches with nodes older than
the type), or they might be statistical outliers
(members of branches within the type, where these particular samples have
significantly more mutations than statistically expected due to luck). Also, there may be some samples that fit the
type with STR values but do not test positive for that new SNP. These might be members of the oldest
branches of the type, older than the new SNP, or they might be outliers from
other clades with distant nodes.
Such exceptions are showing up. Rare for N type. Even rarer for P type.
MRCAs for N and P must have lived
long after the node for these two branches in the Y-DNA tree, because the STR definitions for N and P are very different (compared to
definitions of other haplogroups, with yet older nodes). I say “must have” because this is a statistical
conclusion; it is possible but very
unlikely that N and P have a node not much older than the two MRCAs, and that
those two men had unusually divergent STR values due to the random luck of
mutations.
Reminder: My definitions use selected STR markers for optimum statistical prediction of which samples
belong to a type. Because of selection
bias, the definitions change slightly (one or more markers added or subtracted)
as more data accumulates. See L260 and M458 News for status.
For full definitions see Haplotypes.xls.
See also L260
and M458 Signatures.
Update 29 Feb 2012; see the previous topic, L260 and M458, for
R1a update.
See also R1a
Confusion, two topics above.
I’m leaving my 2010 version here for
history reference:
22 June 2010 update:
Almost all of R1a divides into R1a1a* (M17, M198), R1a1a7 (M458), and R1a1a7b (L260). These correspond to my original predicted division.
R1a also has several known rare groups: R1a*, R1a1*, R1a1aN, where N = 1 to 6 and
8. There is also a very rare R1a1a7a. That asterisk is used for paragroups; R1a1a* means haplogroup R1a1a without any
of those 8 known branches.
The rare R1a groups are not in my R1a Table. It’s a shame the corresponding STRs are
generally not published in SNP announcements.
I don’t know if the rare groups all together add up to 0.1% or 1% of
R1a. Surely they are less than 3%. My percentage calculations in my R1a Table
do not need adjustment because any Ysearch samples that might belong to these rare clades would
probably have unusual STR values, not falling into one of my types, but still
be counted in the totals. In my R1a Table, rare samples are included in row
R. That row R might have a few percent
from these rare groups, but I don’t know exactly how many.
Underhill mentions 7 samples
(men) from R1a*, 9 from R1a1*, 14 from R1a1a6, and 1 from R1a1a7a.
Lawrence Mayka,
the administrator of the Polish Project, had been assuring me by email that all
the Polish Project member tests within R1a had been coming out negative for all
the rare SNP subgroups. So if you are a
Polish R1a, you are almost surely R1a1a, the same haplogroup as about half the
men from Poland. About half of these -
about 1/4 of men from Poland - are R1a1a7.
These two “about” estimates are approximate; my data on these SNPs are not
random samples, so my population estimates are derived from the types in my
table, which are STR based.
On 17 June Mayka informed me of the
first R1a1* (SRY10831.2) (R1a* in the older nomenclature) member in the Polish
Project. My table does not show this
single exception because the table is for samples with 67 markers, which that
one exception does not have. On 19 June
Mayka informed me of evidence that C type might define a
new rare subdivision of R1a slightly older than R1a1a; if this turns out correct it will be less
than 1% of R1a.
An article was published online, 4
Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.
Abstract; STR Data; see www.gwozdz.org/R1a.html for more discussion.
I call this article “Underhill” for
short, because his is the lead name in the list of 34 authors for this major
work.
This web page about Polish Clades was
completely rewritten using this new information. Recent L260 and M458 test results
are consistent with (albeit not full proof of) my previous R1a subdivision into
“types” here on this web page about Polish Clades.
Briefly, most of R1a1a is split by
this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a*
(M458-). See R1a Subdivision for a brief summary of other
groups, and for a clarification of what R1a1a* means.
R1a1a7 is the new M458
haplogroup. R1a1a7 includes what I have
been calling P type and N type here on this web page, even before M458 was
available.
R1a1a* is a new paragroup. This is M458 negative. It includes all my other types, particularly
K type.
This Underhill article has data for
158 “Poland” samples (Table 2):
R1a1a*: 71 samples 44.9%
R1a1a7: 87 samples 55.1%
The 70% confidence interval for
R1a1a7 is about 50% to 60% in the Underhill Poland data.
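The interval quoted above can be checked with a few lines of Python. This is a sketch under an assumption: the method is not stated here, so the code uses the ordinary normal approximation for a binomial proportion, with the 87 of 158 counts from Underhill Table 2. The function name is illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, total, confidence):
    """Two-sided interval for a proportion (normal approximation, an assumption)."""
    p = successes / total
    # z value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / total)
    return p - margin, p + margin


# R1a1a7 (M458+) among the 158 "Poland" samples
low, high = proportion_interval(87, 158, 0.70)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Under this approximation the interval comes out at roughly 51% to 59%, consistent with the "about 50% to 60%" stated above.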
Worldwide 77% of the Underhill data
is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
M458
Results are coming in now for this new SNP test and the Polish Project R1a is splitting about evenly, with a
few percent more R1a1a7 than R1a1a*, although the latter is more common
worldwide.
Format
Up to here, I have tried to write
this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity
with genetic genealogy jargon has understood.
If you read this top to bottom, it gets progressively more detailed,
with more and more jargon. I’m sorry
about that, but the audience is also readers with genetic genealogy experience
who want to know how I came to my conclusions.
If you cannot follow some of this, it is written in a manner that you
can jump around and pick out what you do understand, then come back after you
have read more about genetic genealogy.
If you open this html document with
Word, all the link targets (bookmarks) can be viewed alphabetically or by
location.
This topic was updated 29 Dec 2010.
Lawrence Mayka
is the administrator of the Polish Project.
Click on the Polish Project web link to see
how Larry assigns samples (men) to categories. The Polish Project has sections for mtDNA
and for Y-DNA. This web document of mine is restricted to Y-DNA, with emphasis on R1a. I help
Larry with assignments to types and clusters.
Haplogroups are defined by SNP mutations. STR mutations are easier to test, so many samples have STR data
without SNP data. Predicted assignments are based on STR
correlations when SNP data is not sufficient.
I mentioned above that FTDNA automatic haplogroup
predictions (red text means STR predicted vs green text SNP measured) have
about 99% probability. We use minimum 80% estimated probability for each individual sample in
the Polish Project that gets an extended assignment - a subdivision of its
FTDNA assignment. At 80%, many more
assignments are possible. Most extended
assignments are better than 80% probability.
Many are better than 95%.
Many samples do not have extended
assignments, but they still have their FTDNA green measured haplogroup (100%
probability) or their FTDNA red predicted haplogroup (99% probability). These bring up the average for the Polish
Project as a whole.
We are confident that the average is
better than 95%, which is to say that more than 95% of the Polish Project
samples would test positive for the SNP corresponding to their assigned
haplogroup. Excluding R1a the average
is likely more than 97%.
Example: E1b1b1a2 (V13) is a haplogroup category with some
extended assignments: Larry has me in
this category, which is 100% probable because I tested positive for the V13 SNP
along with 14 other men in the Polish Project (data in this example is from 25
May 2010). However, Larry’s listing
includes 48 men in this category, based on his analysis of STR correlations:
15 green E1b1b1a2. These are of course certain.
28 red E1b1b1 because FTDNA does not
predict beyond that, but these would likely be E1b1b1a2 if tested, because they
have STR values close to those samples that have tested V13+, and unlike the
samples that have tested positive for other branches of E1b1b1. Each has at least 80% probability, and many
are even more probably correct.
2 green E1b1b1 tested for that
previous SNP but not for the current V13, but matching in STR values.
3 green E1b1b1a tested for that
previous SNP but not for the current V13, but matching in STR values.
Note that other E1b1b1 men, both
green and red, fall into other categories at the Polish Project, because they
do not match V13+ samples closely in STR values.
End of E1b1b1a2 example.
R1a is unique because almost half the
Polish Project samples are placed together by FTDNA into R1a1 (M198), which is elsewhere called R1a1a (M198). Many of our R1a assignments are to types,
which are hypothetical, without known SNP definitions. The minimum 80%
estimated probability still applies to each sample and again most are much
better than 80%. For type definitions
we are confident that the average is about 90%, which is to say that about 90%
of the Polish Project R1a samples assigned to a type would test positive
someday for an SNP, unique to that type, not yet
discovered.
“Cluster” and “Borderline” and
“Unassigned” category probabilities are discussed below.
I have been active helping Larry with
R1a assignments to types since late 2007.
Explanation of the Results Table
Update 15 Feb 2012. The Results
Table is based on data from the Polish Project. The data was downloaded on 7 Feb 2012, at
which time there were 1903 Y-DNA samples (men). 1071 of these have data from 67 or more STR markers. Data was edited for family sets, 57 samples,
as explained in my publication. Net 1846 samples.
Polish Project Assignments are taken
as representative of Poland, with caveats explained in my publication.
I did the editing and tabulation in
an Excel file, which is available:
ResultsTable.xls
For haplogroups
I1, R1a, and R1b, assignments to clusters and types are made using 67 markers. Samples with fewer markers cannot be assigned with
confidence. For this reason, the Results
Table uses only the data at 67 markers for these three haplogroups. The totals are indicated in the Results
Table. That Excel file has
analysis sheets for each of these three.
As indicated in those sheets, some Unassigned samples with 67 markers
but insufficient SNP data were assigned to SNP haplogroups
based on my estimates of how many of the Unassigned samples would fall into
various haplogroups if SNP tested.
Column Haplogroup has the
conventional ISOGG haplogroup codes.
Column Haplogroup or Type has
labels determined by Mayka. Most of these are branch haplogroup (or paragroup)
codes, with the defining SNP in parenthesis.
Some of these are types as defined by me. A few of these are clusters. A few of these categories are for borderline samples, or for unassigned samples as
explained in the corresponding sections of this web page.
Column Short Code is my own
code for use in this web page. Some of
these have links to jump to a description of that particular clade. Some have a Ysearch
link in the far right column. Most do
not have links because I have not found the time to work on them; my priority is clades that seem to be
concentrated in Poland.
The Num and % columns
are the number of samples for each category, and percent of the total. The number of samples mentioned in those detailed
descriptions (below) may not correspond to the numbers in the table because the
particular description updates may have been done at different times than the
table update. The description section
has descriptions of some experimental subtypes that are not listed in the
Results Table.
ISOGG names change often due to new
SNP discoveries. See R1a Confusion for examples.
Those types and subtypes and clusters
are my own code letters, for brevity.
Please do not confuse these code letters with official haplogroups. I have been using such code letters for R1a
assignments in the Polish Project since 2007.
Because half of Polish samples are R1a, I do not use “R” for R1a
codes; all other short codes start with
the haplogroup letter.
My Update
Folder has an Excel analysis file for most types, plus many more
files.
The Ysearch
links provide the modal haplotype definitions, using a selected subset of the standard
FTDNA set of 67 markers. I entered
these data into Ysearch for our convenience.
All my modal haplotype definitions are available in the Excel
file Haplotypes.xls,
which also has experimental types not mentioned here. Below are Ysearch instructions
for quickly comparing your haplotype to many
of my types at once.
Assignment to types is with at least 80% estimated probability.
Column % provides a good
estimate of the frequencies in Historical Poland, insofar as the Polish Project
is representative of Historical Poland, as discussed in my publication.
With just under 2000 samples, each
sample represents just over 0.05%. The
Results Table rounds to the nearest 0.1%, so one sample or two samples both get
rounded to 0.1%. The statistical
uncertainty is very high for clades with few samples. Most worldwide haplogroups are not present in the Polish Project,
but it is statistically very likely that many haplogroups present in the Polish
population at 0.5% are not represented in the Polish Project just by the luck
of sampling statistics. At 67 markers, with just over 1000 samples, each sample
represents just under 0.1%.
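The per-sample percentages above follow from simple division. A small Python check, using the counts given in the Explanation of the Results Table topic (1903 samples, 57 removed as family sets, 1071 with 67 or more markers); taking 1071 as the 67 marker total is an approximation, since that count is quoted before the family-set editing.

```python
def percent_per_sample(total_samples):
    """Share of the total, in percent, that a single sample represents."""
    return 100.0 / total_samples


net_samples = 1903 - 57   # all Y-DNA samples after family-set editing
samples_at_67 = 1071      # samples with 67 or more STR markers (before editing)

print(f"all samples: {percent_per_sample(net_samples):.3f}% each")
print(f"67 markers:  {percent_per_sample(samples_at_67):.3f}% each")
```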
Updated 20 Oct 2012. “Unassigned”,
or short code name “U”, is not a cluster, but a holding category for samples
with insufficient data, so assignment cannot be made with 80%
probability. U is a subcategory for
multiple main branch haplogroups, for samples that obviously belong to that
main branch but lack the data that, if obtained, should help assign them to a sub
branch.
Many samples with incomplete data are
assigned with 80% probability, but this is not generally possible. For example, some categories have rare STR signatures, so fewer STRs may be
sufficient for some samples. If a
sample has close STR matches (often obvious relatives with the same ancestor
named in the data) all such samples are assigned to the same category. In other words, your assignment to a
category other than U means the Polish Project judges no further data to be
required at this time. However, confidence
may often be increased beyond 80% probability by purchasing more STR markers
and / or recommended SNP tests.
Many R1b samples, even with 67 or
more markers, are in two different R1b - U categories, because R1b has had many
SNPs for confident assignments for more than two years.
All R1a samples with all 67 standard STR markers are assigned to the
best fit category, not to “Unassigned”.
In the future, when identifying SNPs become more than two years old,
some 67 marker R1a samples may be moved to U.
A standard
111 STR set has been available for some time, and many samples in the
Polish Project have all 111. In the
future, but not yet, samples at 67 may be placed in the U category if more STR
data is needed for 80% probability assignment.
Description of the R1a Categories
There are separate topics below for descriptions
of selected categories in Haplogroups I, N, and R1b.
Comment added 20 Oct 2012: This is a long topic with many short
subsections, each for a category. Many
of these subsections are out of date and need to be rewritten. The subsections without a date on the first line
may be a few years old.
This large topic has descriptions for
many of the Y-DNA categories at the Polish Project.
Some of these are haplogroups, some are types, some are clusters. Types and clusters are high confidence hypothetical haplogroups. Borderline categories are lower
confidence. There is also the Unassigned category for uncertain samples.
Click the Ysearch web links in the Results Table for modal haplotypes, which
are my best fits of web data to groups of men with similar STR data.
Please don’t get confused. The following capital letter names are my
codes for R1a categories.
Capital letters are also used for the large official haplogroups, but that’s different.
Some of the following categories are
discussed in my November 2009 publication, and
may have archive copies of my 2009 Excel analysis files stored in the Supplementary
folder. Many of the following types
have my update Excel analysis at PolishCladesUpdate.
A.
Ashkenazi. Update 8 Mar 2012.
A type is a hypothetical clade of L342,
which is a branch of Z93. A type does
not correspond to a haplogroup yet, because there are L342+ samples that do not
match the A type definition.
This type is discussed in my publication, Part II. The definition, using 67 markers, has been available since 2008
at Ysearch, as FCUFG.
I have consistently expressed more
than 98% confidence that A is a valid clade, not just
because of my work, but because the modal haplotype closely matches the various
versions of the most common Ashkenazi haplotype, which has been widely studied
and reported on the web. It should be
emphasized that not all Ashkenazi match this type, and some men in this type
may not be descended from Ashkenazi.
This type is not restricted to Poland.
Levy-Coffman wrote an
article about Ashkenazi genetic genealogy;
I noticed discussion in a recent Science
article. I expect an SNP
to show up someday equivalent to what I have been calling A type.
Between 2008 and 2011 I predicted
that A type was a subtype of K type, but I never had more than 80% confidence
in that prediction, which is now seen to be wrong, because K type is in Z283,
a brother SNP to Z93. See the R1a Project
for a recent SNP tree. The match of A
type to K type at the first standard set of 12
markers is now seen to be a coincidence.
Older publications call that 12 marker haplotype, very common in Eastern
Europe, the “Ashkenazi” haplotype, but we now know that only a small fraction
of men who match at 12 markers are Ashkenazim.
B.
Update 8 Mar 2012. A
hypothetical subtype of K type, identified by Mayka. Concentrated in Poland. I have more than 90% confidence
that B type represents a clade that will be verified some day with a new SNP
discovery. My confidence is only about
80% that it is a subtype of K; the node
for B type in the R1a tree might be slightly younger or slightly older than the
K definition node. Individual
assignments to B type have 80% or higher confidence, depending on how closely
each fits.
C.
Update 10 Mar 2012. This type
code name was dropped from the Polish Project in early 2011. The two C type samples are both now listed
as R1a1a (M198+,M417-), and they are the only samples in this paragroup, so
that is a better label. These are the
only two R samples in the Polish Project with the signature (385a,455) =
(13,10). C was added to the Polish Project
in Dec 2009 by Mayka, who noted that Didier Vernade
originally pointed out the unusual DYS392=13 value in 2007. DYS392=11 is almost universal in R1a. C type is very small. There are only 2 Polish Project samples in C
type, only 1 at 67 markers, but this type is well isolated on Ysearch, with 4
different samples with 67 markers. I
calculated SBP = 7% using only 37 markers with Ysearch data (in early
2010). None on Ysearch are identified
as “Poland”. C type differs very much
in STR values from the rest of R1a1.
That is because C type has a very old node in the R1a tree.
D.
Update 12 Nov 2011: Based
on 1 Nov 2011 Polish Project data. Analysis file: DType.xls. 59 marker definition,
cutoff = 9, no samples in the gap at 9
to 11; SBP =
5.3%.
Concentrated in Poland: Ysearch K49NZ; 34% have origin Poland.
This type was added to the Polish Project
in Jan 2010. The cluster was brought to
my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some
time ago, based on DYS462=12.
Signature
(460,481,462,560) = (10,<22,12,18).
Any one of these four markers by itself can distinguish D type with high
probability from other R1a1a1i (Z280) samples, but those values can be found
individually as independent mutations in other R1a clades. D type cannot be distinguished using the 25 FTDNA standard markers. At 37 markers, only 460 is available.
At 67 markers, 481<22 is an effective signature: 16 total D type: 13 D have 481=21, and only one other R1a sample has the 21 value.
2 D have <21, with no other R1a samples.
One D has the 22 value along with several other R1a. 481=25 is modal for R1a.
DYS462 is a standard STR marker at Sorenson, and has been
available for years at Ysearch; 462 is now available at FTDNA with the 111
marker set. In Nov 2011 I noticed that
DYS560=18 is another marker for D type from the 111 set, but that is not
available at Ysearch (Nov 2011).
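The signature check described above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name, the dictionary layout, and the example values are my own assumptions, not something taken from DType.xls. Markers that were not tested are skipped, since 462 and 560 are not part of the 67 marker set.

```python
# Signature for D type from the text: (460,481,462,560) = (10,<22,12,18).
# Each entry maps a DYS marker number to a test on its repeat value.
D_SIGNATURE = {
    460: lambda v: v == 10,
    481: lambda v: v < 22,
    462: lambda v: v == 12,
    560: lambda v: v == 18,
}


def signature_matches(sample):
    """Return (matched, tested) counts for the D type signature markers.

    `sample` is a dict of marker number -> repeat value. Markers missing
    from the dict (not tested) are skipped rather than counted against
    the sample.
    """
    tested = [m for m in D_SIGNATURE if m in sample]
    matched = [m for m in tested if D_SIGNATURE[m](sample[m])]
    return len(matched), len(tested)


# Hypothetical 67-marker sample: only 460 and 481 are available.
example = {460: 10, 481: 21}
print(signature_matches(example))
```

A sample that matches on every tested marker is only a candidate; as noted above, each value can also arise as an independent mutation in other R1a clades.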
That DType.xls analysis file provides
SBP = 5.3%, although I did manual editing of the definition to improve SBP,
providing some selection bias. On the
other hand, isolation of D type is even better than indicated by SBP for two
reasons: Samples just beyond D type, steps 12 and 13, all have solid assignments to other
types. Most of the D samples have
462=12 and a few have 560=18, and those samples beyond step 11 with data have
other values at those 2 markers, so a future definition using all 111 markers
should provide even better (lower) SBP.
Only 3 D type have 111 markers;
most of the DYS462 data was obtained some time ago by purchasing that
marker separately.
D type seems to be Z280+ Z92-, based
on only 1 sample (10 Nov 2011 - columns BW and BX in that analysis file). Z92 is a new SNP, so not
much data is available; confirmation
should be available soon. D is a
subtype of what I had been calling K type; I’m now using K as a code for the paragroup defined by Z280*.
D type is clearly a Polish type: In the Polish Project 10 of the 16 D type at
67 markers indicate “Poland” ancestry;
the exceptions are 2 “Unknown” (one with an obvious Polish name and one
with a name that might be Polish), 2 Slovakia, 1 Germany, and 1 Czech Republic.
On Ysearch, there are 32 samples
below the D type cutoff, and 11 of them (34.4%) indicate Poland Origin, which
is quite high for Ysearch. SBP is 15%
on Ysearch, implying there are clades near the cutoff that
are rare in Poland; indeed none of the
5 samples in the gap at steps 9 and 10 indicate Poland. For details see the “Ysearch” sheet in
DType.xls.
Age (ASD sheet cell N12) comes out
1,385 years using all 67 markers. Old
human Y-DNA clades have age older than the raw ASD calculation because of
population bottlenecks and because of other statistical adjustments. However, D type is not very old, so this
correction may not be needed. On the
far right of that ASD sheet I sorted markers by age, and I added notes about
problem values, and suggested four markers that should be masked out, but the
age with these 4 masked out (ASD sheet cell N29) is not much different, 1,216
years. I see evidence of subclades, so
D type might be composed of younger subclades that might be identified with
more data.
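For readers who want to see the general shape of an ASD (average squared distance) age estimate, here is a minimal sketch. It is not the calculation in the ASD sheet of DType.xls: the mutation rates, the 25 years per generation, and the sample data are all placeholder assumptions, and no bottleneck adjustment is attempted. Masking a marker simply means leaving it out of the modal.

```python
def asd_age_years(samples, modal, rates, years_per_generation=25.0):
    """Rough ASD age: summed per-marker average squared distance from the
    modal, divided by the summed mutation rates, converted to years."""
    total_asd = 0.0
    total_rate = 0.0
    for marker, modal_value in modal.items():
        squared = [(s[marker] - modal_value) ** 2 for s in samples]
        total_asd += sum(squared) / len(samples)
        total_rate += rates[marker]
    generations = total_asd / total_rate
    return generations * years_per_generation


# Made-up data: four samples, two markers, placeholder mutation rates.
modal = {"A": 10, "B": 14}
rates = {"A": 0.004, "B": 0.001}   # mutations per generation (assumed)
samples = [
    {"A": 10, "B": 14},
    {"A": 10, "B": 14},
    {"A": 11, "B": 14},
    {"A": 10, "B": 14},
]
age = asd_age_years(samples, modal, rates)

# Masking marker "A" removes the only variation in this toy data set.
masked_age = asd_age_years(samples, {"B": 14}, rates)
print(age, masked_age)
```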
I noted three markers (on the far
right of the ASD sheet) that I consider hints for subclades. Last year in this topic I mentioned Da, with
the signature (458,576,444)= (16,20,14) and that still looks promising, but not
convincing. One of the three D samples
with 111 markers fits Da, and provides a hint that markers 463 and 715 from the
111 extension might help to resolve Da, so it will be interesting to see what
happens as more D men order the 111 extension.
E.
Update 8 Mar 2012. V. Rudich
entered a modal for this cluster into Ysearch as ID MW7DP,
named “North Eurasian”. Mayka modified
it slightly for the modal used here by me, GNYBG,
named “Belarus”. 67 markers. It’s an excellent type; on 25 May 2010 it had 16 samples at 67
markers in the Polish Project, with SBP = 14%.
In late 2011 E type samples tested positive for the new Z92 SNP, corresponding to the R1a1a1g2d
haplogroup (ISOGG early 2012). However,
not all Z92+ samples fall into types.
FH Clade. F and H types were suggested by Mayka. They have the signature
(439,511,452) = (11,11,28). They differ
from each other, so I could not make a combined FH type. I can make a reasonable FH cluster,
but it is not necessary, since the FH clade can be better defined as the
combination of the three types Fa, Fb, and H.
The original F type (introduced Jun 2010) was split into Fa and Fb in
Dec 2010. DYS452 is not one of the FTDNA standard markers, so not many Polish
Project members have this marker evaluated.
Mayka and I helped most of the Polish Project members in FH, and members
just beyond FH, to get 452 evaluated.
Samples beyond FH have 452=30.
My analysis files do not use 452 for
determination of SBP.
452 would not significantly lower SBP because most of the background near the cutoff for
each type are samples from the other two.
In other words, Fa, Fb, and H are very well isolated from the rest of
R1a, but not so well isolated from each other.
These three FH types do not seem to be specifically concentrated in
Poland (per Ysearch) although they are concentrated in Slavic countries
including Poland. All three types seem
quite young, with relatively low STR variance (see the ASD sheets in the
analysis files).
FH Borderline. The borderline samples from Fa, Fb, and H are combined into
a single FH Borderline category in the Polish Project, because these clearly
belong to the FH clade but have less than 80% probability of belonging to any
one of the 3 types.
Fa.
Ysearch YQ6D2. 66 markers, cutoff 9, gap 2. SBP = 27%.
See FH clade, above.
Fb.
Ysearch EFQM7. 56 markers, cutoff 5, gap 4. SBP = 23%.
These samples were the original F type, before Fa was split off. See FH clade, above.
H.
Ysearch 559EE. 58 markers, cutoff 7, gap 3. SBP = 14.5%. See FH clade, above.
G. This
type was suggested to me by Mayka, who calls it the
Pomeranian cluster. Pomerania is the name of the
region on the south shore of the Baltic Sea including regions of both Germany
and Poland. Marcin
Wozniak found the G modal haplotype (at 12 markers) to be very common among
Kashubians. Kashubians consider
themselves an ethnic group or nationality within Poland. It will be interesting to determine if
Kashubians in Poland have a higher % concentration of G type than German
Pomeranians. Meanwhile, “Pomeranian” is
a convenient neutral name, suggests Mayka.
G type is mentioned only briefly in
my publication because not much data was
available to me at that time. My GType.xls update
analysis file with June 2010 data has excellent results: There are 12 samples in a nice type with SBP
= 11.2%. There is preliminary evidence
of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in
Ysearch; see Haplotypes.xls
for a list including hypothetical working modals.
11 Jan 2011 news: Mayka informs me that one of the new SNPs,
L365, is positive for all of 5 G type samples that were tested so far. A few samples from other types all tested
negative for L365. It seems like G type
is included in the new haplogroup defined by L365. One of those 5 is in that tentative Ga subtype.
Of course, this is very
preliminary. It is possible, if
unlikely, that some of the G type samples still might turn out negative for
L365. It is quite possible other
samples not matching G type might be found L365 positive. I’ll provide updates here.
Those 5 samples are positive for
M417, negative for M458, and negative for a few other new SNPs.
L365 is one of a few new SNPs that
look like they will receive the notation R1a1a1x, where x = i, j, k, etc.
This type should not be confused with another G type in the N haplogroup.
14 May 2011 comment:
Sorry I have not taken the time to update this G type topic. Recent data continues to verify that G type
seems the same as the haplogroup defined by L365, now called R1a1a1i.
I.
Minor edits 5 Aug 2011. Complete
rewrite 4 Aug 2011. Based on 2 Aug 2011
Polish Project data. Three analysis
files: IType.xls; IaType.xls; IbType.xls.
I type is discussed in my publication, Part II, page 178.
On Ysearch, I type is concentrated in
Poland and in other Eastern European countries.
On 28 Jun 2011 Lukasz Lapinski suggested two small clusters based on recent I
Borderline samples. These are currently
called Ia and Ib types in
the Polish Project. Ia and Ib are
probably not really subtypes of I, as discussed in the following paragraphs.
I type seems to have structure. Some of the 67 STR markers are bimodal, which hints at subtypes. The bimodal markers are not correlated with each other, so I have
not been able to identify subtypes with confidence.
My published 2009 definition for I type, I59, uses 59 of the 67 STR
markers, cutoff 8. That definition
still works quite well, with SBP 17.8% (Aug 2011). I consider SBP <20% sufficient to use the term type. I found a better definition, I62, cutoff
9, SBP 12.3%. The two definitions
are compared in the file IType.xls.
That 2009 definition had 22.4% SBP in 2009, so it did not quite qualify
as a type back then. (Background means foreign samples with matching STRs that
do not belong to the hypothetical I type clade; SBP is a high confidence
statistical limit estimate.) Six of the
24 using that old definition are excluded by the new definition; if the latter is exactly valid that means
background was actually 25%, which is close.
The new SBP with the old definition is 17.8%, which is lower than 25%,
but I’m comfortable with this because most of my published SBP’s have been
shown to be larger than subsequent new data, as intended. The new definition also captures two samples
that were previously borderline, one of which was classified I type anyway
because that sample has close matches in I type. The new definition captures an A type
sample; that sample is a good fit to A
type; this false call is not
incompatible with the 12.3% SBP which predicts less than 3 samples background
(12.3% of 20). More about A type in a
paragraph below.
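The arithmetic in this paragraph can be checked with a short Python sketch. The variable names are mine; the numbers are the ones quoted above (24 samples under the old definition, 6 of them excluded by the new one, and the 12.3% SBP for the 20 samples captured by the new definition).

```python
old_count = 24          # samples captured by the old I59 definition
excluded_by_new = 6     # of those, excluded by the new I62 definition
implied_background = excluded_by_new / old_count

new_sbp = 0.123         # SBP quoted for the new I62 definition
new_count = 20          # samples captured by the new definition
expected_background = new_sbp * new_count

print(f"implied background, old definition: {implied_background:.1%}")
print(f"background limit, new definition: {expected_background:.2f} samples")
```

The first figure is the 25% mentioned above; the second stays below 3 samples, so one false call from A type is within the limit.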
The new I type definition lacks
breadth - changing the number of markers increases SBP. This is displayed in IType.xls as columns
for different marker sets. For such
analysis, the database needs to be restricted to the samples with step not too
far beyond the cutoff. For I type the
ranking of markers is sensitive to exactly where the database is truncated, so
the automatic definition comes out differently for different truncation of the
database. For the database in the
Calculator sheet I truncated the database by removing samples at step > 13,
except I left in two samples at steps 14 and 15 that had been classified Ib and
IB (discussed below). The definition
for I type is also sensitive to exactly which markers are assumed for the first
iteration as the type. The TypeRank
sheet in IType.xls uses the 19 I type samples, excluding only that one A
type sample. I tried quite a few other
database truncations, and various assumed sets; those yielded different definitions with higher SBP. My published SBP formula is defined in a way
that provides a larger number to compensate in part for such selection bias.
On the other hand, for the dozen or
so samples that fit I type best, step < 7, the database and the number of
markers do not matter; the same dozen
or so samples are captured as I type for any reasonable definition using a wide
breadth of markers. We can be confident
that there is a valid clade corresponding to those dozen
best I type samples that will some day be captured as a haplogroup by a new SNP. Beyond those best
dozen samples, steps 7 to 8, there are another dozen or so samples that seem to
be I type but at lower confidence; the
background might be significantly more than the best fit SBP. In my publication I explain why background
increases very rapidly with step. I
suppose the actual percent of background might vary from maybe about 1% at step
2, to maybe about 40% at step 8.
What does this mean? The simplest explanation: There was a “father” haplogroup thousands of
years ago. Due to population bottlenecks, only a small
number of the males from that father haplogroup are MRCA’s
(ancestors of clades that exist today).
The descendants of the I type MRCA participated in a significant
population expansion. I type is the
only large clade from that haplospace neighborhood
showing up today in the Polish Project.
Other smaller “brother” clades show up, and because there are many more
haplotypes at larger step values, those brothers are randomly distributed at
large steps in my I type analysis. This
is a simple explanation; more complex
explanations are possible - for example involving migration of tribes from
distant lands.
IB are
Borderline, at step just beyond the cutoff for I
type, not fitting any other known type, with only about 50% confidence that
they will someday end up in a haplogroup corresponding to I type. Samples are also assigned to I Borderline
when the nearest matches at 67 markers are I type. There are two samples at step 10 (new definition) now changed from
I type (old definition) to IB using the new definition. There are 4 more prior IB samples at steps
12 to 15 now changed to K and KB. The
next update of the Results Table will
show slightly smaller totals in I and IB.
As 67 marker data accumulates in the
near future, it is likely a slightly better definition may turn up with even
lower SBP, and I type may separate into subtypes with <20% SBP. The 111 marker data is promising (discussed
in a following paragraph).
A clade that is very well isolated
(<5% SBP) has a high chance of soon being defined by a newly discovered SNP
haplogroup. For I type with 12.3% SBP,
a new SNP might be older, including some small older clades, or a new SNP might
be younger, leaving out some marginal I type small clades. For example, I recently discovered a new SNP
in my own Y-DNA that is slightly older than my predicted type - see L540.
My maternal
Iwanowicz grandfather was I type. This
explains my extra effort analyzing I type.
The two Iwanowicz samples are my maternal first cousin and a man that I
found in Poland who seems to be my 4th or 5th cousin. Technically, one of those should be removed for slightly higher
SBP because I recruited that data, but the bias for 20 samples is small (SBP
becomes 13.0%).
One of the Iwanowicz samples was
removed for the Results Table, along
with editing of family sets in other categories.
SBP for Ia and Ib
are 11.9% and 17.0%. The definitions
have breadth. These are good results,
providing better than 80% confidence of validity for
each. However, these all fall outside I
type with my new definition. Even with
my old definition, only 4 of these were I type at high step, the rest were
IB. Using an I code was a bit
arbitrary. Now is not a good time to
change their code names, because quite a few new SNPs will soon be
available. With more SNP data small
types such as these can soon be renamed with more confidence.
Back in 2009, and still today, A type
overlaps with I type at the margin. So
does the newer D type.
However, A type is coming out positive for the new haplogroup based on
the L342 mutation, which seems to be rare in Poland. Mayka informs me that a WTY
for one I type sample has come up L342-, as have two D type samples. In the past, I have always speculated that A
type and I type are both subtypes of a larger K type. It now seems A type is really in a distantly
related branch (L342) of the Y-DNA tree with similar STR values by
coincidence. My prediction that I type
is a subtype of K type is still a low confidence speculation.
The best ranked marker for I type is DYS578=9. DYS578 has the second slowest mutation rate
of the 67 standard markers per the Chandler rates. The ancestral value is 8. The 9’s are colored orange in that analysis
file IType.xls. From the 450 Polish
Project samples at 67 markers, only 6 samples outside I type have the 9 value,
one sample has a 7, the remainder are all 8, consistent with very few
independent mutations. In the analysis
file, notice that all the predicted I type samples have the 9 value with one
exception, that A type (discussed above) at the last step of I has the
ancestral 578=8. There are two A type
with 578=9 at steps 11 and 12; the
former has been tested L342+ (coded SNP results are in column BX of the
file). All the other A’s have 578=8, so
the obvious interpretation is an independent mutation to 9 within the A type
clade. The only other 9 in that
analysis file is an IB sample at step 12;
that one might be another independent mutation; on the other hand, perhaps the mutation to 9
is much older than the TMRCA for I type, with that one sample representing a
very small clade with an older node. The Ia and Ib samples all have the ancestral
value 8; that’s evidence that Ia and Ib
have old nodes with I - older than the 8 to 9 mutation.
The second best marker is DYS458=14,
again orange in the file. This is a
rapid mutator, so there is more variance.
All but 2 of the I type samples with 578=9 have this 14 value. This is evidence of youth for I type. Those two, at 15 and 16, are probably
independent mutations, although we cannot rule out the speculation that the 15
is the ancestral value telling us that the 458 mutation to 14 came after the
578 mutation.
Only 8 I type samples have 111 STR
marker data and 2 of those are my Iwanowicz samples, so analysis at 111 is
premature. That said, all but 1 of the
8 have DYS532=12; that one
exception has 11. Value 11 also shows
up for the one Ia sample, and for the two IB samples at 111 markers. DYS532 seems slow, but there are quite a few
11’s and 12’s in the 71 R1a samples at 111, so 532 will not displace 578 as the
best marker for I type. Lapinski
pointed out to me that a couple other markers also show promise at 111 markers
for I type.
[Note inserted on 14 Sep 2011: There are now 9 I type samples and 7 of them
have the signature (532,504) = (12,14).
All other R1a samples have the modal (532,504) = (11,>14). This is evidence that the I type node in the
R1a tree is not much older than the M458 mutation. DYS532 and DYS504 are two of the new 44 markers in the extension
from 67 to 111 markers. I'll call this
pair of values the signature for a hypothetical IPN clade. This is not strong evidence, because there
is a small chance those 2 mutations happened twice independently - in the M458
clade and in the I type clade. The two
exception samples were previously classified Ia and IB, so they might be from
branches older than the signature mutations.
I need to update my analysis to include these 2 markers, and update this
I type topic. I’ll be busy with other
things for a few months, so I added this note.]
I modified the Ysearch I type
definition, EKVHX
for the new I62. I type has no samples
at the step 9 cutoff in the Polish Project;
on Ysearch there is only one Russian sample at step 9 (plus a couple
modals), so I type is also well isolated on Ysearch, not just in Poland.
All 67 markers can be used for
estimating the age of I type, because there are no significant recLOH
problems with the compound markers in the I type data. Age comes out 1,208 years. See the ASD sheet in IType.xls. Raw ASD age is usually adjusted older due to
population bottlenecks, as explained in my publication, but the adjustment
should be small for I type because it is not very old and because I type
obviously went through a population expansion.
ASD age is highly uncertain due to caveats.
End
of 5 Aug 2011 rewrite of I Type.
Reminder: most of this web page
has not been updated for quite a few months.
J. This type was suggested by Mayka. Only 6 members in the Polish Project, but
this type is well isolated at SBP= 13%.
K. Update
20 Oct 2012. Since 2007, I had been using
the name “K type” for a large R1a Polish category. Over the years I had subdivided K into several smaller types and
clusters, although I did not have high confidence
that all of them in fact belonged to a single unique clade,
as discussed at this web page over the years.
As types were subdivided, the remaining samples that fit the general K definition did not form a type. My K group is now known to be a mix of
independent haplogroups, so the Polish Project stopped using K as a category in
Oct 2012, although quite a few small clusters with names such as Kx and Kz are
still predicted, because the confidence in the clusters has always been higher
than the confidence in K. The various K
categories are now clusters, types, and confirmed haplogroups within the two
major haplogroup branches R1a1a1b1a2 (Z280) and R1a1a1b2 (Z93).
Most of the samples originally
classified as K are now in B type, D type, E type (now part of Z92), H type (now
equivalent to P278), I type, and J type, all significant branches of
Z280. If an SNP shows up that captures many
of these branches, I’ll be inclined to use the short code name K to discuss
that branch.
The Kurgans are the ones
who domesticated the horse more than 6,000 years ago. Many scientists think that one pre-Kurgan man is the male line
ancestor of all R1a1a men who live today.
The Kurgan hypothesis is controversial, and not necessary for this web
page. You may have noticed that I used
the letters of “Kurgan” for my original types and categories during 2008. I know of no compelling evidence associating
the Kurgans with what I call K, the
largest part of R1a1a1b1a2 (Z280), but it’s fun to speculate that K became
widespread during a Kurgan population expansion.
Kv, Kx. Small
clusters, need documentation.
I have been using the subscripts “z”,
“y”, “x”, etc., backwards through the alphabet because I am running out of
letters for new clusters and types.
These small hypothetical clades seem to be subclades of K, although I do
not have high confidence about the subclade status.
Kt, Ku, Ky. Clusters
with STRs similar to K type. These came
up Z92+, so their match to K type is a coincidence. Need documentation as a new topic at this web page.
Kw. These came up equivalent to
the new SNP L366. Needs documentation
here.
Ky. Update 8
Mar 2012: This is now called Z92y type.
Update 7 Oct 2011: Based on 1 Oct 2011 Polish Project data.
Analysis file: KyType.xls. Ysearch BBB9T.
Ky type was suggested to me by Mayka on 21 Dec 2010. There were only 3 samples in Ky last
year; now there are 5.
That KyType.xls file demonstrates
that the same 5 samples are extracted using any number of markers from 11 to
67, although at some of those definitions one or two other samples are also
extracted. The full 67 markers work
best, SBP=23%.
Ky was more isolated last year; a few samples showed up in the gap,
reducing SBP.
I’m using a hand edited definition,
Ky63, using 63 markers, for the following reasons:
Ky is unusual in that 4 of the 5
samples have an unusual value for at least one marker. I highlighted these values in red in that
file. Notice also the high step values
for those four, 8 through 11, using all 67 markers (column BX), although SBP came
out 23%, which is an excellent low result for 67 markers. The obvious (but speculative)
interpretation: each of the 5 samples
seems to be a representative of a branch of this hypothetical clade, where each
of the 5 branches has a node not much younger than the TMRCA.
Hand editing like this does introduce
some selection bias, so the calculated SBP=13.6% for Ky63 is misleading. Countering the selection bias, some if not
all of those 4 markers that I masked out might represent small tribal sized
subclades, so future prediction of new Ky samples should work better using Ky63
with those 4 removed.
The far right of the “ASD” sheet has
the markers sorted by apparent age, with “M” indicating the markers that I
masked out. You can see that my
selection is a bit arbitrary; I could
have masked less than 4, or more than 4.
ASD age using all 67 markers comes
out 917 years, cell N12. ASD age using
the 63 markers not masked out comes out 878 years, cell N29, not much
less. ASD age has a number of caveats, and 5 samples are not significant, so
this age is highly uncertain. Ky seems
young, as haplogroups go.
Ky does not have a prominent signature.
Kz. Update 5 Oct 2011: Based on 1 Oct 2011 Polish
Project data. Analysis file: KzType.xls. Ysearch 9QJFQ.
Kz type was suggested to me by Mayka on 6 Oct 2010. Mayka speculates this might be a clade of
Kazakh origin. There were only 3 samples
in Kz last year; now there are 6.
That KzType.xls file demonstrates
that the same 6 samples are extracted using any number of markers from 2 to 67,
so the definition is not critical for this well isolated type.
Kz is effectively more isolated than
the SBP values (row 12 in that file) indicate, because the samples just beyond
Kz are all confidently assigned to other clades and types. For this reason, those SBP values are moot.
I’m using a hand edited definition,
Kz59, using 59 markers, for the following reasons:
Kz is unusual in that 5 of the 6
samples have an unusual value for at least 2 markers. I highlighted these values in red in that file. Notice also the high step values for those
6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%,
which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 6 samples seems to be a
representative of a branch of this hypothetical clade, where each of the 6
branches has a node not much younger than the TMRCA.
Hand editing like this does introduce
some selection bias, so the calculated SBP=10.7% for Kz59 is misleading (but
moot). Countering the selection bias,
many if not most of those 8 markers that I masked out might represent small
tribal sized subclades, so future prediction of new Kz samples should work
better using Kz59 with those 8 removed.
Again, this is moot, because any number of markers extract the same
samples.
The far right of the “ASD” sheet has
the markers sorted by apparent age, with “M” indicating the markers that I
masked out. You can see that my
selection is a bit arbitrary; I could
have masked less than 8, or more than 8.
ASD age using all 67 markers comes
out 724 years, cell N12. ASD age using
the 59 markers not masked out comes out 704 years, cell N29, not much
less. ASD age has a number of caveats, and 6 samples are not significant, so
this age is highly uncertain. Kz is
clearly young, as haplogroups go.
Additional information supplied to me
by Mayka: Three of the Kz type samples
are from non-Polish men who suspect they have Polish male line ancestry, so it
is not certain Kz type is Polish. Kit
number 152824 in Kz is from a man who purchased WTY and
found the new SNP L399, but that SNP appears to be private, restricted to his
family. Insofar as that man recruited 3
more Kz samples into the Polish Project, Kz seems proportionally twice as large. My next edit of the Results Table will reduce the percent size
of Kz.
Kz has the prominent signature DYS459b=18.
Mayka points out the additional signature DYS461=12, not one of the 67
marker set; most of the samples in Kz
have been verified with this 12 value.
Since the Polish Project neighbors (step at or
beyond cutoff of Kz) are all assigned to other
hypothetical clades, we do not know if the signature markers define a larger
father clade.
L. This
cluster is highly hypothetical. It is
rare in Poland, but second in size to K in European R1a1. Larry Mayka suggested
this cluster to me. It is a well known
Scandinavian cluster. I checked
it briefly, and it seems to be a “type” by my definition. However, no Polish Project sample matches at
80% probability yet, so I am not yet using it for classification here. More documentation about L will be available
here when I find time to study it.
L1029. See L1029 And Np News.
See also L260 and M458. See also N
type.
L1080. New SNP needs documentation
here.
L260. See L260
and M458. See also P type.
L342.2. New topic 30 Oct 2011. This SNP was recognized
as a new haplogroup by ISOGG during the summer of
2011. This was an L342 haplogroup
category at the Polish Project for a short time in the summer and fall of 2011,
but it has been replaced by Z93, because it seems all the
L342.2+ samples are also Z93+ in the Polish Project. Apparently there are very few men elsewhere in the world found to
be Z93+ L342.2-.
Z93 is a more reliable SNP than
L342.2, so it is recommended that men first test for Z93. L342.1 is the same mutation as L342.2,
discovered earlier in the E haplogroup.
L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests
together are more reliable. These 4
mutations are in the same segment, which is apparently a segment that mutates
relatively rapidly. Z93 is recommended
as the better test for R1a samples that do not fit STR definitions of other R1a
haplogroups; the Z93+ samples can do
the L342.2 test. This information about
L342.2 was supplied to me by Mayka.
The Z93 category has the samples that
do not fit the two known subdivisions: A type and L342T cluster (next topic).
L342T. New
topic 30 Oct 2011. Based on 26 Oct 2011
Polish Project data. Analysis
file: L542TCluster.xls. I just noticed this cluster.
L342T is not a type,
because SBP did not come out low enough. However, I included this cluster discussion
here for the following reasons:
Seven samples at 67 markers fit my
new 48 marker definition for L342T.
There are 19 A type samples, which should all be in
the same L342.2 (Z93) haplogroup, but those A samples do not fit L342T; the closest A’s are at step
8, where the cutoff is 6. There are 5 more L342.2 (Z93) samples at 67 markers, and those 5
also do not fit L342T, falling at steps 11 through 21. In other words, L342T is well isolated from
the other L342.2 (Z93) samples, including the A type branch. The one background sample (STR values fit
the L342T definition) and the four samples beyond the cutoff, are assigned to K
type and to subtypes of K; Z280 has
recently become available for K type;
as those background samples get tested in the future for Z280, my L342T
cluster will start looking better. Let
me say that another way: a cluster
should be analyzed with data from its own haplogroup, so L342T should be
compared only to L342.2 (Z93) data. But
there is very little L342.2 (Z93) data available, so I used the full R1a
database in that xls file. That means
L342T is likely more isolated than it seems right now, so it is more likely to
correspond to a valid haplogroup.
Mayka pointed
out to me that some of the L342T samples have Tatar ancestors. That’s why I used the “T” in the code
name. Of course, Tatars may belong to
only a branch of L342T; I have no idea
what fraction of L342T in Poland are Tatar.
And of course Tatars are expected to be a mix of multiple haplogroups.
Three of the L342T samples, with the
name Muchla, are apparently a family set, so they count statistically as only
one sample, reducing the current count from 6 to 4, so SBP as calculated in
that xls file should be increased (not as good). This is evidence against L342T being valid.
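The family set adjustment can be sketched as follows. This is a minimal illustration, assuming samples are identified by surname and that a family set is collapsed to a single statistical sample; the function name is mine, and apart from Muchla the labels are placeholders, not names from the Polish Project.

```python
from collections import Counter


def independent_count(surnames, family_sets):
    """Count samples, treating each listed family set as a single sample."""
    counts = Counter(surnames)
    total = 0
    for name, n in counts.items():
        total += 1 if name in family_sets else n
    return total


# Six cluster samples; three of them form the Muchla family set.
cluster = ["Muchla", "Muchla", "Muchla", "sample_a", "sample_b", "sample_c"]
print(independent_count(cluster, family_sets={"Muchla"}))
```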
M458. See L260
and M458. See also N type.
N.
Comment 29 Feb 2012: See the M458 topic for discussion of a new SNP, L1029, that seems
to be equivalent to N type.
Complete rewrite of this topic 25 Sep
2011. Based on 5 Aug 2011 Polish Project data.
Analysis file: NType.xls
N type is concentrated in Slavic
countries. N type is discussed in my publication, page 179.
According to Ysearch
and Yhrd N type seems to be spread all around the Slavic
lands and central Europe, common from East Germany to Russia. Within Poland N type seems to be about the
same size as P type, both about 9% of men. Worldwide, N is much larger than P. N type should be properly studied in a
database that is not restricted to Poland.
However, there seem to be subtypes of N that are concentrated in Poland. See the discussions on N subtypes
below. I’ll continue to watch the
Polish Project, because it will be interesting if more data provide more Polish
subtypes within N.
During review of my publication in
2009, the SNP called M458 was
published. I added notes about this to
my publication on page 184. The
corresponding haplogroup is now called R1a1a1g. This
haplogroup seems to be equivalent to what I have been calling P type (M458+ L260+) plus N type (M458+ L260-). M458+ samples may turn up someday that do
not fit either N type or P type, but I have not noticed any yet.
My current definition
for N type, N46, is a modal haplotype using 46 of
the 67 standard markers. The cutoff is 8, which
means all samples less than step (genetic distance) 8 from
N46 are predicted N type (predicted M458+ L260-). That definition is available in the NType.xls analysis
file, in my Haplotypes.xls
file, and at Ysearch as 3SEJK.
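As a rough illustration of how a modal-haplotype definition with a cutoff works, here is a minimal Python sketch. The marker names, repeat values, and cutoff are hypothetical placeholders (the real N46 definition uses 46 markers and is in NType.xls). Step is counted here as the simple sum of absolute differences, which is an assumption; compound markers such as 464 need special handling.

```python
# Hypothetical modal haplotype: marker -> repeat count (NOT the real N46 values).
MODAL = {"marker_a": 13, "marker_b": 25, "marker_c": 16}
CUTOFF = 2  # hypothetical; samples with step < CUTOFF are predicted to be in the type


def step(sample, modal):
    """Genetic distance: sum of absolute repeat differences over the modal's markers."""
    return sum(abs(sample[m] - value) for m, value in modal.items())


def predicted_in_type(sample, modal=MODAL, cutoff=CUTOFF):
    """True when the sample falls below the cutoff step."""
    return step(sample, modal) < cutoff


# Hypothetical samples; markers outside the definition (marker_d) are ignored.
close = {"marker_a": 13, "marker_b": 24, "marker_c": 16, "marker_d": 11}
far = {"marker_a": 14, "marker_b": 23, "marker_c": 16, "marker_d": 11}

print(step(close, MODAL), predicted_in_type(close))  # 1 True
print(step(far, MODAL), predicted_in_type(far))      # 3 False
```

Only the markers named in the definition contribute to the step, which is why a definition can use 46 of the 67 standard markers.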
N type age (age means TMRCA)
is about 2,000 years. That’s highly
uncertain, but I’m 80% confident that age of 2,000
years is not off by more than a factor of 2 - age 1,000 to 4,000 years. The M458 mutation is likely much older than
the age of N type.
I’m suspicious that N type includes
many younger clades that just happen to have similar STR values,
difficult to resolve into clusters or types. I offer some speculation along these lines
in the hypothetical subclade topics below.
I highly recommend that someone from
N type purchase WTY, a commercial product for discovering
SNPs. No sample from N type has been
submitted for WTY. That means there is
a good chance that the first N man to submit his sample to WTY will discover
one or more SNPs - perhaps an SNP that captures all of N type - or perhaps an
SNP that captures about half of N type - or perhaps an SNP that captures a
small subclade - or perhaps multiple such SNPs. My WTY was the first in a long time
in my haplogroup, so I found 14 new SNPs.
It’s interesting to wonder why
R1a1a1g seems to be composed of two types that differ
substantially in STR values (N and P are separated in haplospace). I speculate about this in the P type
topic. Much of my P type discussion is
also related to N type, so I avoided repeating all the details here; please read my P type
discussion if you are interested in more about N type.
N seems to be older than P. I wonder if there are subtypes of N about
the same age as P. I avoid too much
speculation in this web page - just enough to indicate my motivation. I’m wondering if there are clades in various
haplogroups, mostly P and N, associated with the origin of the Polish nation -
a few centuries more than a millennium ago.
I have only identified 4 small
subclades of N so far: I am quite
confident of Ng type, but less confident of N-Ashk type. The Nt and Ns clusters are hypothetical; I have about 70% confidence
in them. These 4 are used for
assignments at the Polish Project web page.
I also identified a few clusters with roughly 50% confidence; these are too speculative for formal
assignments. All are discussed below. I made speculative assignments based on all these types and
clusters within N type, in column CD of that file NType.xls,
Calculator sheet. My file NClusterAssignments.xls
has lots of details. If you are N type,
you can find your row with your kit number, and see your speculative
assignment. For the “clusters”, I
estimate a 50-50 chance an assignment will need to be changed in the next year
or so, as more data becomes available.
In addition, N type has many bimodal markers, hints at yet more subclades not discussed
here. This is evidence that N type
experienced population expansion when it was young (not long after the TMRCA). More
discussion below.
The paragraphs up to here are a brief
summary. The rest of this topic is a
detailed discussion about N type and hypothetical subclades:
This Sep 2011 analysis includes only
data from the Polish Project. I’ll wait
a few months before reviewing data outside the Polish Project. My last analysis including data from outside
the Polish Project for P type, N type, L260, and M458 was Jan 2011. For those last results, see the following
topics, which have not been updated for several months:
For the size of N type, please see
the table at the top of this page, where N has only 4 more samples than P (87
vs 83 - 5 Aug 2011 data). In my 2009
publication N had one less than P (28 vs 29, Table 6 page 169). The 70% confidence interval for 87 samples
is 77 to 98 (8.4% to 10.6%) so N and P are equal in the Polish Project (and by
implication in Poland) within statistical sampling accuracy, at about 9%.
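The sampling interval quoted above can be approximated with a short calculation. This is only a sketch: it treats the count of 87 as a Poisson variable and uses a normal approximation, which may not be the method behind the numbers above.

```python
from math import sqrt
from statistics import NormalDist


def count_interval(count, confidence=0.70):
    """Approximate two-sided interval for a Poisson count (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.04 for 70%
    half_width = z * sqrt(count)
    return count - half_width, count + half_width


low, high = count_interval(87)
print(f"{low:.1f} to {high:.1f}")  # roughly 77 to 97 samples
```

The approximation is symmetric, so its upper end comes out a little below the 98 quoted above; an exact Poisson interval is asymmetric, which likely accounts for the difference.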
My 2009 published definition for N
type, N45, still works very well. I did
not change that definition at my Jan 2011 update and analysis here in this
topic. This Sep 2011 N46 update is just
a tweak, adding and subtracting a few markers to better fit the M458+ L260- SNP
data that has accumulated over the past year.
Both definitions are compared in that analysis file NType.xls,
Calculator sheet, columns BZ to CC.
Tweaking the definition like this, to
better fit SNP data, introduces some selection bias. I discuss this issue in the P type topic,
where I did a similar tweak; please read
that topic if you are interested in the statistical justification. The justification is not as good for N type,
so I’ll return to this issue in the “old branches” paragraph below.
This new N46 definition fails to
capture only one M458+ sample, which falls at the cutoff step 8. This new N46 definition captures only one
foreigner, L540+, at step 7, the last step of the type. The other samples at step 8 have tested
either M458- or L260+, except one that fits D type well, so they are all
confirmed as not N type. Similarly, 7
of the 20 samples at step 9 have been SNP tested, 11 of the 20 are good fits to
other types, with only 2 that are Borderline fits to other types. In other words, the N46 definition captures
the M458+ L260- samples with apparent 98% accuracy. However, my confidence is about 80% for
step 7, about 90% for step 6, and 95% or better for step <6. Again, please see the P type discussion
about confidence for a general explanation.
P and N are similar in this regard.
I have related discussion about N type confidence in the “old branches”
paragraph below.
Almost all the samples near the
cutoff for the previous N45 definition have been SNP tested. This high testing rate is not a
coincidence; Mayka
and I have been encouraging men with marginal
samples to do the M458 and L260 tests.
(We paid if cost was a problem.)
The NType.xls analysis file has 10
columns (CF to CO in the Calculator sheet) using from 2 to 67 markers as
tentative N type definitions, with automatic selection of the best
markers. For each column, I colored the
step count violet for samples captured by that definition. You can see at a glance that any definition
using 2 to 67 markers captures more than 80% of the N type (M458+ L260-)
samples, and not many foreigners, so just about any definition works
surprisingly well. In other words, N
type is very well isolated in haplospace.
For the two best automatic
definitions, I used boldface to highlight the N type samples missed by that
definition, and also boldface to highlight the foreign samples captured by that
definition. I used boldface similarly
for my prior N45 definition, using 3 columns (BZ to CB) to demonstrate the
effect of 3 different cutoff choices.
You might try resorting the sheet by
column (select everything from cell A14 to the end) to better compare the
results.
The issue of SBP
is moot for N type now that the SNPs M458 and L260 are available, but an
analysis is instructive: That NType.xls
file has automatic marker selection of N type, and automatic calculation of
SBP, disregarding the SNP data. The
best automatic definition, N61, has SBP=13.2%, vs N46 with SBP=14.1%. However, N46 is a better definition because
N61 captures only 80 of the 87 N type plus that same one foreigner. But still, 8 errors out of 87 (7 misses plus the one foreigner) is not bad for
N61, better than the 13.2% SBP (SBP is a high estimate for statistical
confidence).
I considered calling N46 a definition for M458+ L260-, with a
different definition for N type as a slightly smaller subtype, leaving out some
samples that do not fit the N type definition with lowest SBP. I could not come up with a convincing definition
for such a smaller subtype. So at least
for now, I am considering N type as the same as M458+ L260-, with the
understanding that may change in the future.
The summary conclusion for all those
columns of trial definitions: My
preferred N46 definition (column CC) does the best job of capturing N type
(M458+ L260-). Most of the other
columns are trying to define N type as slightly smaller, leaving out a few of
the samples (not always the same samples).
Most definitions for N type have many samples at or near the
cutoff. My explanation is in the next
paragraph:
Old branches: A type is a hypothetical
unique clade. Of
course, every clade is composed of subclades - branches in the Y-DNA tree. Here is a simple explanation for the
previous few paragraphs of discussion:
N type seems to have a few small old sub-clades, where the ancestors (MRCAs) of those small clades differed from the main N type
MRCA at a few STR values from the standard 67 set. Those old branches have many younger branches (twigs) that differ
at yet more STRs. In other words: the N tree might have a few small branches
near the ground. Those small old clades
provide samples in the database with large step, but each
sample is from a different twig, so these do not correlate into obvious
clusters. Any clade has statistical outliers with large step;
a few small old branches would provide more outliers for N.
Those old branches may not be small
worldwide. One possibility - a large
subclade of N concentrated outside Poland might have one small branch in
Poland, corresponding to a man or tribe that moved to Poland long ago. I am watching for evidence along these
lines, but so far this paragraph is speculative.
In addition, there might be
additional large old subclades that seem young. I consider this possibility in the discussions below. The age of a clade can be much younger than the
node. I discuss this in another topic,
where I call such clades smooth branches. The N tree might have a number of small
smooth trunks with nodes near the ground - that would not necessarily be
evident as STR correlations. On the
other hand, the N tree might have only one main trunk, almost smooth, with only
a few small branches near the ground. The
actual situation might be more complicated, with multiple trunks of various
sizes, at various distances from the ground.
I can’t tell yet from the STR data.
Perhaps another year of additional STR data may help.
Why am I speculating about N type
smooth branches? I see plenty of hints
for more branches in the N type data, but little statistical confirmation. In the discussion below for subclades, I
offer evidence (not definitive proof) for many more significant subclades
within N type.
This discussion is personal. It is my opinion, based on my statistical
analysis. Someone might send me an
email any day now pointing out a convincing cluster or type in N that I
missed. Someone else might disagree
with my analysis about particular hypothetical N subclades.
Reminder: This discussion is limited to Poland, as represented by the
Polish Project. Outside Poland there is
additional probability of M458 branches showing up someday that fit neither N
type nor P type. Outside Poland I
expect yet more N type branches.
Regarding concentration in Poland, I
use percent of samples in Ysearch with “Origin” Poland as an objective
measure. This is discussed in my
publication, where Table 1 shows P12 (the P type modal haplotype using only the
original standard 12 markers from the Polish Project) with 42%, while N12 has
only 14%. Those numbers 42% vs 14% are
not calibrated (because of the unknown concentration of men with Poland origin
in Ysearch) but those numbers are a relative indication of concentrated in
Poland vs not particularly concentrated in Poland. My file NYsearch.xls
has an update with data from 5 Aug 2011, with N12 at 17%, a reasonable drift
due to more data. That same file has
the N46 definition at 24%. This is
evidence that N type, defined using 46 of 67 markers, is only slightly more
concentrated in Poland than the 12 marker equivalent. The simplest explanation:
There are probably large M458 clades outside Poland that match N12 and
also match N46 at less than the cutoff, but the Polish samples are only twigs
on those branches, descended from one man or family or tribe that moved to
Poland a millennium or so ago. It makes
sense that clades within M458 might be regionally concentrated. That 24% concentration for N46 is of course
an average; there are subclades of N
with higher and lower concentration. I
found a few, discussed below; that file
NYsearch.xls has a sheet for each subclade analysis.
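The concentration measure described above is simply the share of matching Ysearch samples whose listed origin is Poland. A minimal sketch follows, with made-up records; the `origin` field name and the sample list are hypothetical, not the actual Ysearch export format. Whether samples with unknown origin belong in the denominator is a choice not spelled out above; here they are counted.

```python
def percent_from(matches, country="Poland"):
    """Percent of matching samples whose listed origin equals `country`."""
    if not matches:
        raise ValueError("no matching samples")
    hits = sum(1 for sample in matches if sample["origin"] == country)
    return 100.0 * hits / len(matches)


# Hypothetical samples already selected as falling below a definition's cutoff.
matches = [
    {"id": "s1", "origin": "Poland"},
    {"id": "s2", "origin": "Germany"},
    {"id": "s3", "origin": "Unknown"},
    {"id": "s4", "origin": "Poland"},
]
print(percent_from(matches))  # 50.0
```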
Age:
N type comes out 2,340 years old using all 67 markers. See cell N12 in the ASD sheet in NType.xls.
Because of recLOH
issues, the compound markers 464, YCA, and CDY present difficulties estimating
age in the N type data. Other compound
markers are OK. The ASD sheet allows a
mask, row 21, where I masked out the 8 markers for these recLOH
difficulties. The result, using 59
markers, cell N29, is 2,010 years.
That’s my best guess for the age.
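For readers unfamiliar with ASD (average squared distance) age estimates, the idea can be sketched in a few lines. This is not the calculation in the ASD sheet of NType.xls. It assumes the common form in which the summed squared distance from a founder (modal) haplotype is divided by the summed mutation rate, and the marker names, rates, samples, and 25-year generation are all hypothetical. The `mask` argument plays the role of masking out troublesome markers.

```python
def asd_age_years(samples, founder, rates, mask=(), years_per_generation=25.0):
    """ASD age: mean squared distance from the founder over the summed mutation rate."""
    markers = [m for m in founder if m not in mask]
    squared = sum((s[m] - founder[m]) ** 2 for s in samples for m in markers)
    asd = squared / len(samples)  # average squared distance per sample
    generations = asd / sum(rates[m] for m in markers)
    return generations * years_per_generation


# Hypothetical two-marker data set (rates are per generation, made up).
founder = {"m1": 10, "m2": 12}
rates = {"m1": 0.002, "m2": 0.004}
samples = [
    {"m1": 10, "m2": 12},
    {"m1": 11, "m2": 12},
    {"m1": 10, "m2": 13},
    {"m1": 10, "m2": 12},
]

print(round(asd_age_years(samples, founder, rates)))               # 2083
print(round(asd_age_years(samples, founder, rates, mask={"m2"})))  # 3125
```

In this form a slowly mutating marker contributes little to either sum, so masking it out changes the overall age only slightly.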
On the far right of the ASD sheet I
sorted the markers by apparent age.
YCAb comes out 20,704 years, demonstrating the recLOH problem.
The second (apparently) oldest marker
is DYS454, at 18,744 years. This old
age is due to only 5 mutations in this slow mutator. DYS454 is clearly bimodal. In my notes, I use the Nj code for the 2nd
mode with these 5 samples, defined by 454>11. This is evidence of a subclade, but the statistics are not
convincing yet. Maybe with more data in
the near future I might call some of these samples the Nj cluster. It’s not fair to exclude this “old” marker,
DYS454, because there are 7 markers with zero age (no mutations in the 87 samples)
and there are 7 more markers with less than 1,000 years apparent age. The reason for averaging markers is that
apparently old markers should be averaged out with apparently young markers. Anyway, you can go ahead and mask out DYS454
by deleting the mask number at cell AE21, and the new age (58 markers) without
454 is 1,990 years, only a 20 year decrease.
I offer this paragraph of discussion as one example of preliminary
evidence of an N type subclade, based on 454>11.
The third oldest marker is DYS531, at
14,319 years; at this bimodal marker I
use the code Np for the 2nd mode value.
Again, I’m waiting for more statistical evidence for a subclade.
That far right side of the ASD sheet
has more notes about markers with old apparent age.
Age estimation from STR variance is
highly uncertain. At another of my web
pages, I use M458 as an example of age caveats. I have more discussion about age estimation
methods in the P type topic; please read those two topics if you would like more
discussion; N is similar to P in this
regard.
I’m not too concerned about getting
the age of N type correct in Polish data because I suspect in less than a year
there will be enough evidence to subdivide N - new SNPs and / or more STR data
for better statistical significance. I
suspect there will be younger subclades.
Furthermore, M458+ L260- is not really a tree; it seems to be a branch of the Y-DNA tree that is well isolated - a
long smooth segment near the node; but
I mentioned above my suspicion that the main branch might not be really smooth
- there might be significant old branches concentrated outside Poland; if this is true I’ll need to soon redefine N
type as younger, excluding any such significant branches. I’ll leave it for someone else to estimate
the age of M458+ L260- from worldwide data;
I’ll concentrate on N type, and hypothetical subclades in Poland.
There are 12 samples from N type
available with the new 111 STR marker set (18 Jul 2010). Only DYS532=12 is an obvious signature
marker for N type from the 44 new markers;
10 of the 12 have this value.
Modal for R1a is 532=11. P type
also has the 532=12 value, also 10 of 12 samples, so this marker also provides
a signature for M458 with good statistical significance. I type also has the 532=12 value; see the I type
discussion below.
The following topics are my proposed
subclades for N type in the Polish Project.
Please consider reading the section P Type Bimodal
Markers, if you would like more discussion of how I use bimodal markers as
hints for subclades; that same
discussion applies here for N type. If you
are curious about my code names, like Na, Nb, etc, check out Haplotypes.xls. Near the bottom of the “Haplotypes” sheet is
a list of 70 code names for signatures that I
considered for N type subdivision. I
discuss only a few of these here. I
spent a lot of time studying tentative subclades of N because I’m anxious to
find significant subtypes that are concentrated in Poland. I uploaded a total of 17 Excel
analysis files associated with N and tentative subclades, all discussed above
and below.
Ng. Rewrite finished 22 Sep 2011.
Based on 5 Aug 2011 Polish Project data. Analysis file: NgType.xls. Ng is a small subtype of N type, but it has
highest confidence.
This is a very small subtype, only 3
samples, but it is very well isolated.
The definition uses 56 markers, cutoff 4, gap 9. There are no samples in the gap, from step
4 to 12. SBP = 15.8%.
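The cutoff and gap quoted here can be read off a sorted list of step counts. A small sketch, using a hypothetical list of steps chosen only to mimic the pattern described above (three samples inside the type, nothing from step 4 to 12):

```python
def gap_after_cutoff(steps, cutoff):
    """Number of consecutive empty steps starting at the cutoff."""
    beyond = [s for s in steps if s >= cutoff]
    if not beyond:
        return None  # nothing beyond the cutoff, so the gap is unbounded
    return min(beyond) - cutoff


steps = [0, 1, 2, 13, 13, 14, 15]  # hypothetical step counts, not the NgType.xls data
in_type = [s for s in steps if s < 4]
print(len(in_type), gap_after_cutoff(steps, 4))  # 3 9
```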
These same 3 samples are present in
Ysearch, where the gap with no samples is from 4 to 11. Two samples at step 12 are from Germany and
Unknown. There are none at step 13 and
11 samples at step 14. It seems Ng is
concentrated in Poland.
The signature is (537, 492) = (10,
14). These are the only 3 Polish
Project samples in N type that have any mutation from the 12 value at 492, and
they have a 2-step mutation. 492 is
ranked 18th of 67 in the extended Chandler mutation
rates. The 10 value at 537 is also rare
- only these 3 plus 2 other samples have it in N type in the Polish
Project. The same 3 Ng samples are
extracted from N type using 1 to 67 markers.
They are well isolated using as few as 7 markers because they have
little variation from each other in the rapidly mutating markers, so those rank
well for the Ng definition. ASD age
comes out 619 years using all 67 markers but of course that is a very rough estimate.
The simplest explanation is that the MRCA of Ng type lived in Poland less than a millennium ago and
passed on those 2 unusual mutations.
The 3 Ng samples fall at steps 4, 5,
6 with the N45 definition of N type, a hint that the Ng node is near the center
of the N type branch, not one of those old branches I speculated about, but
this is just a preliminary hint.
I introduced Ng type in Oct
2010; there have been no new 67 marker
data in the STR neighborhood of Ng type, so SBP has been 15.8% since, with the
same definition.
Ng also has what I call the Na
signature, discussed below.
The “g” is only my arbitrary code
name that I have been using for the DYS492=14 signature.
N-Ashk. Rewrite finished 25 Sep 2011.
Based on 5 Aug 2011 Polish Project data. Analysis file: NashkType.xls. N-Ashk is a small subtype of N type. Only 4 samples.
These seem to be Ashkenazi
samples. Mayka
pointed out to me that the names seem Ashkenazi, per his experience. The samples beyond the cutoff are apparently
not Ashkenazi.
Signature (19,385a,594) = (15,12,11)
I introduced this type in Jan 2011,
with SBP 23%, slightly more than my stated 20% limit for using
the word type. Two
reasons: First, the Ashkenazi names are
independent evidence of a clade.
Second, the N-Ashk modal haplotype differs
from the N modal at 6 markers, which is evidence of a fairly old node in the N
branch of the Y-DNA tree.
I introduced this type as Nca type,
because of what I have been calling the Nc signature, DYS19=15. The “a” meant Ashkenazi, but that was
confusing because the samples do not match what I have been calling the Na
marker. Nc is large; I doubt N-Ashk is a twig in a large Nc
branch; the Nc mutation more likely
arose independently in the N-Ashk hypothetical clade.
This Sep 2011 reanalysis makes a
cleaner cluster of data, although still small with only 4 samples. The 594=11 marker is very clean; these 4 samples are the only R1a samples in
the Polish Project with this value. SBP
increased to 47%, so it is a stretch to call this a type, but the Ashkenazi
connection is improved now and the 594=11 marker seems to be strong
evidence. Also, I avoid making changes
in classification names without significantly more data, so I’ll continue to
call this a “type” for now. There are no
longer any N-Ashk Borderline samples at 67 markers; the Borderline category is used for apparent Ashkenazi samples
that match well with only 37 markers.
The improved definition uses 58
markers, cutoff 3, no samples in the gap at steps 3 and 4. (The previous definition used 59, cutoff
5.) The improvement: I masked out CDY. The previous definition used CDYb, missing an Ashkenazi sample
that fits the type well, but has recLOH, providing a
misleading step of 5 at this one marker. With that new sample the ranking of markers
came out slightly differently, so a few other markers were added or removed
from the definition. The old and new
definitions are available in NashkType.xls.
The new definition is also available at Ysearch as 2TZKF,
and in my Haplotypes.xls
file.
The ASD age comes out only 668 years,
cell N29 in the ASD sheet in NashkType.xls.
Age calculated from only 4 samples is highly speculative, but N-Ashk
seems young because of little variation in marker values. The ASD should use (4-1) in the denominator
instead of the total 4 samples (although most genetic genealogists do not do this
for small sample sizes); with that adjustment
the age comes out 890 years, but that is still highly speculative. That cell N29 is using 61 markers; CDY and 464 are masked out. (The mask is row 21, which you can easily
edit.) All 67 markers yield 1,024
years, cell N12, because of CDY. DYS464
has no mutations in the set of 4, so including those 4 markers reduces the age, but I
left 464 out because most people routinely exclude the 464 set from ASD.
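The (4-1) adjustment mentioned above is just a rescaling, because the ASD age is proportional to a sum of squared differences divided by the sample count. A quick check of the arithmetic, taking the 668-year figure as given:

```python
def rescale_age(age_years, n_samples):
    """Rescale an age computed with n in the denominator to use n - 1 instead."""
    if n_samples < 2:
        raise ValueError("need at least 2 samples")
    return age_years * n_samples / (n_samples - 1)


print(round(rescale_age(668, 4)))  # 891, i.e. about 890 given rounding of the 668 input
```

The correction matters most for very small sets; for 87 samples it would change the age by only about 1%.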
N-Ashk is quite young, but the node
seems old because of the 6 marker distinction from N type. The simplest
explanation: N-Ashk has a long smooth branch, having an old node with N, but no
further branching near that main node.
The samples in the Polish Project all seem to come from twigs with young
nodes. I speculate that there may
actually be some branches of N-Ashk outside Poland. Perhaps the Ashkenazi ancestor of N-Ashk moved to Poland somewhat
less than a millennium ago. More data
will eventually confirm or refute this speculation.
2TZKF
is the modal haplotype at Ysearch, where
only two of these samples are present, and where there are 2 additional samples
in the gap, from Russia and Belarus;
the simple explanation is that N-Ashk is concentrated in Poland,
although there is too little data for confidence. See NYsearch.xls
for my Ysearch analysis.
N-Ashk has what I call the Nb
signature, discussed below.
Nt. Edited 25
Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: NtCluster.xls.
With 17 samples, Nt cluster is my
largest speculative subclade of N type identified so far.
SBP = 27%; this cluster is close to the 20% maximum SBP for Polish Project
assignments as a type.
I am suspicious of this Nt cluster due to selection bias: I considered 70 signatures for N type during
the summer of 2011, and carefully analyzed more than 30 of them. With that many attempts, a false positive is
likely. One of the clusters I analyze
will necessarily have the lowest SBP, but that might be just the luck of the
data. No one knows how to calculate the
statistical confidence in such a case.
I discovered Nt at the end of this major effort. If SBP improves with more data for Nt I’ll
upgrade it to a type, but if SBP gets worse (bigger) as data accumulates I’ll
lose interest in Nt.
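The selection-bias worry can be made concrete with a textbook calculation, though it does not solve the problem described above. The sketch assumes, unrealistically, that each trial signature is an independent test with the same false-positive rate. The 5% rate is a hypothetical value; only the count of 30 analyses comes from the text.

```python
def chance_of_any_false_positive(per_trial_rate, trials):
    """Probability that at least one of `trials` independent tests is a false positive."""
    return 1.0 - (1.0 - per_trial_rate) ** trials


print(f"{chance_of_any_false_positive(0.05, 30):.2f}")  # 0.79
```

Real trial clusters share samples and markers, so they are not independent, which is one reason a proper confidence figure is hard to compute.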
If Nt is valid, it is probably
concentrated in Poland. See NYsearch.xls. See my Ysearch method discussed above. I consider this additional evidence that Nt
corresponds to a clade, boosting my estimated confidence to about 70%. We don’t always use 70% confidence for
assignments, but everyone is anxious for more subdivision of N type in the
Polish Project, so we started using Nt in Sep 2011.
The Nt definition
uses 48 markers, cutoff 4, one sample in the gap at step 4. The definition is available at Ysearch as 2544E.
Nt is based on the signature
DYS442<14. However, there are 29
samples with that signature, and 5 of the 17 Nt cluster samples have the N
modal 14 value at this marker. My
simple speculative explanation: the 442
mutation from 14 to 13 occurred independently in the Nt clade after the node
with the main N type branch. Other
speculative explanations are possible - those 14’s might be a back mutation
within a much larger “father” clade that carries the Nt signature on most but
not all samples.
One Nt cluster sample has the 12
value at 442, which could be another mutation or an independent double
mutation.
If we subtract the 12 Nt signature
samples with <14, that leaves 17 more samples (not included in my Nt
cluster) with this second modal value at 442.
There are only 3 samples at 15 in all of N, and we expect step up to be
more common than step down for a slow mutator (see my publication
for references), so that still leaves an excess of samples with <14,
implying yet another hypothetical clade with an independent mutation, or a
larger “father clade” but this paragraph is getting highly speculative. I have more speculation like this about
independent clades vs large clades in the Na, Nb, and Nc topics below; similar
speculation applies to Nt.
Thirteen Nt samples match what I call
the Na signature, discussed below, but two samples match the alternate mode
Nb; the last two samples are one step
away from Na. This is evidence of an
even larger Na father clade, but as discussed below the Na vs Nb signatures may
have arisen multiple times independently, so I’m not confident to speculate
further along these lines.
See also NclusterAssignments.xls.
Ns. Edited 23
Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: NsCluster.xls. Ns cluster is a speculative subclade of Nt
cluster.
With 6 samples and SBP = 27%, this
cluster is close to the 20% maximum SBP for Polish Project assignments as a type. I am suspicious
of this Ns cluster for the same reasons given above for Nt. On the other hand, Ns looks like a credible
subclade of Nt, which adds credibility to both of them.
If Ns is valid, it is probably
concentrated in Poland. See NYsearch.xls. The 67% concentration is the best I have
seen so far, but this % is highly uncertain because it is based on only 2 Ns
samples at Ysearch. Such as it is, I
consider this additional evidence that Ns corresponds to a clade, same as my
confidence for Nt.
The definition uses 47 markers,
cutoff 2, no samples in the gap at steps 2 and 3. The definition is available at Ysearch as A5NSG.
Ns is based on two signatures. Ns is my code for DYS446=12, 9 samples, vs
446=13 modal for N type. Nt is my code
for DYS442=13, 5 samples, vs 442=14 modal for N type. The 6 Ns samples are all at steps 0 and 1 with the 47 marker
definition; the other 3 with that
signature are at steps 9 and 10, so it is reasonable to suppose the Ns mutation
happened twice independently in the N type clade. Five of the 6 Ns samples have the Nt signature, but that 6th one
has the value 12, two steps from the N modal 14, so it should be considered Nt
also.
See also NclusterAssignments.xls.
All 6 Ns have what I call the Na
signature, discussed below.
Nd. Edited 24 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: Nd53Cluster.xls.
Based on the signature DYS389I = 14,
vs N modal 389 = (13,29). Nine samples
have the Nd signature. Only 3 of these
9 fit Nd53. My confidence is only about
50% that these 3 samples really belong to the same clade; I included this analysis as an example of an
uncertain clade, and for discussion below in the Na topic.
DYS389II has the value 30 for Nd but
this is not a mutation at 389II. See compound markers for an explanation.
I call this Nd53 because the 53
marker definition is somewhat arbitrary - there is no very likely
definition. It is likely I’ll need to
change the definition soon, when more STR data becomes available. Also, “Nd53” makes it clear that this is not
the same as the cluster formed using only the Nd signature.
Nd53 is not used for assignments in
the Polish Project; see NclusterAssignments.xls
for speculative assignments.
The 3 samples do not have Poland as
origin, although I suppose those men have suspicion of Polish ancestry, because
that is usually the case for Polish Project samples. On the other hand, Nd53 might be representative of a clade that
is concentrated outside Poland.
Ne. Edited 24 Sep 2011. New topic 23 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: Ne40Cluster.xls.
Based on the signature DYS390 = 24,
vs N modal 25. Twelve samples have the
Ne signature. Only 3 of these 12 fit
the Ne40 cluster. My confidence is only
50% that these 3 samples really belong to the same clade; I included this analysis as an example of an
uncertain clade, and for discussion below in the Na topic. Nd and Ne have similar status.
I call this Ne40 because it is likely
I’ll need to change the 40 marker definition soon, when more STR data becomes
available.
Ne40 is not used for assignments in
the Polish Project; see NclusterAssignments.xls.
Only one of the 3 samples has Poland
as origin, although I suppose the other two Ne men have suspicion of Polish
ancestry, because that is usually the case for Polish Project samples. On the other hand, Ne40 might be
representative of a clade that is concentrated outside Poland.
Na and Nb. I have been rewriting this topic throughout
the late summer of 2011. Finished 24
Sep 2011. Based on 5 Aug 2011 Polish
Project data.
Clusters based on DYS464, a marker set
that is multimodal in N type. Analysis files: Na45Cluster.xls
and Nb32Cluster.xls.
I introduced Na and Nb in my publication,
page 179 and Table 3. I have been
updating the discussion for Na and Nb here at this web page. I consistently emphasize that these are
speculative subclades. In retrospect, I
should have avoided the word “type” for these because more data over the years
has convinced me that the explanation for what is going on is not two subtypes
of N. It will take me a few paragraphs
to explain the issue of Na and Nb:
One way to split the N type data,
obvious at a glance, is by the number of markers for 464. Some samples have 4 values, some have 6,
just a few have 5 or 7.
I understand that the 464 set is the
most prone to genetic testing evaluation errors, so this or any categorization
using 464 will have uncertainties. If
464 is taken in combination with other markers that means some statistical
uncertainty due to possible evaluation errors at 464. Specifically, a sample in a database with 4 values at 464 might
really have 5 or more values, and vice versa.
Follow my links if you wish to read
more about compound markers and recLOH
issues, which introduce confusion for the 464 marker set. Briefly, copy mutations can increase the
number of 464 markers, but recLOH mutations might reduce the number. A single copy mutation can change more than
one value in the set. Copy mutations
and recLOH mutations are rare, about the same frequency as very slowly mutating
STR markers. Net mutations in the 464
set are common, with frequency among the fastest in the standard 67 set. For the Chandler rates, each of the four
markers 464a to 464d is assigned a rate 1/4th the net rate for single
mutations for the set of 4.
I use Na as my code for the signature
464 = (12,12,15,15,15,16) - the most common value set for 464. 28 of the 87 samples. My Nb signature is the next most common, 464
= (12,15,15,16). 16 samples. I say 464 is multimodal because there are
also two sets with 4 samples each;
that’s why I’m using Na as a signature even though it is the modal value
for N type as a whole. This is for the
87 N type samples in my 5 Aug 2011 download of the Polish
Project; the proportions change
every few months as data accumulates due to the statistics of small sample
sizes.
Na and Nb differ by 2 steps following
the Ysearch method, but that is misleading because Na
can turn into Nb in a single recLOH mutation, which might have happened more
than once in the past in this N type database.
Nb can turn into Na with a single copy mutation. I may not be exactly correct in this
paragraph if my assumption of the structure of 464 in N type is incorrect, but
this paragraph is certainly a brief example of the kind of confusion that
arises with 464.
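The 2 step count for Na vs Nb can be reproduced with a simple multiset comparison. The following is a minimal sketch in Python, assuming that step is the size of the larger 464 value set minus the number of values the two sets share. The function name step_464 is hypothetical, and this counting rule is an assumption that happens to match the figures quoted on this page; it is not confirmed to be the exact Ysearch algorithm.

```python
from collections import Counter

def step_464(a, b):
    """Assumed multiset step count between two DYS464 value sets."""
    # Counter & Counter keeps, for each value, the smaller of the two counts.
    shared = sum((Counter(a) & Counter(b)).values())
    return max(len(a), len(b)) - shared

NA = (12, 12, 15, 15, 15, 16)  # Na signature from the text
NB = (12, 15, 15, 16)          # Nb signature from the text

print(step_464(NA, NB))  # 2
```

Under the same assumption, a single recLOH or copy event that adds or removes two values at once is still counted as 2 steps, which is the source of the misleading distance described above.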
It is easy to construct clusters
using 464 in N type. Too easy. Too many choices for clusters, as I discuss
in the following. I could not come up
with clusters with good statistical confidence. My Excel analysis files allow setting
maximum step, so I also tried using maximum 1 for the 464 set - 1 step for any
variation of a sample from a trial definition;
still I found no clusters with confidence.
My analysis files allow an alternate
method, treating the 464 markers as individual markers. This is the method I used in my 2009
publication, still no clusters with confidence.
My default is to follow the Ysearch method for counting step at 464, although this
method is obviously less than perfect.
My list of code names is available in
Haplotypes.xls.
When trying individual markers,
DYS464b is best. In my notes I use Na1
- 464b<14, and Nb1 - 464b>13;
these two signatures neatly split all the N type data. Na1 captures all the Na samples plus
mostly samples with more than 4
markers; Nb1 captures all the Nb plus
mostly samples with 4 markers; there are
exceptions. Using Na1 vs Nb1 I come to
the same conclusions as using Na and Nb, discussed below.
DYS464e provides another way to split
the data. In my notes I use Nx - any
value for 464e, and Ny - no value for 464e.
Nx captures all the samples with more than 4 markers including the Na
samples; Ny captures all the samples
with 4 markers including the Nb samples.
Using Nx vs Ny I come to the same conclusions as using Na and Nb,
discussed below.
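The two alternate splits above (Na1 vs Nb1 on 464b, Nx vs Ny on 464e) can be written as a small rule. This is a sketch only, assuming the 464 values are reported in ascending order so that 464b is the second value and 464e exists only when more than 4 values are reported; the function name split_464 is hypothetical.

```python
def split_464(values):
    """Return (Na1/Nb1 label, Nx/Ny label) for an ordered DYS464 value list."""
    b = values[1]                             # assumed: 464b is the second value
    by_b = "Na1" if b < 14 else "Nb1"         # Na1: 464b<14, Nb1: 464b>13
    by_e = "Nx" if len(values) > 4 else "Ny"  # Nx: any value reported for 464e
    return by_b, by_e

print(split_464([12, 12, 15, 15, 15, 16]))  # ('Na1', 'Nx')  Na signature
print(split_464([12, 15, 15, 16]))          # ('Nb1', 'Ny')  Nb signature
```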
Consider my definitions Na45 and
Nb32, with 45 and 32 markers. See those
two Excel files for details. My
choices for 45 and 32 are arbitrary.
Those files show columns with trial definitions using a wide range of
markers, automatically chosen by rank. A wide range of marker counts seems
roughly equivalent. It is remarkable
how many samples fit very well using up to 50 markers for trial definitions: Na has 16 samples at step zero using 11
markers, and 15 samples at step less than 2 using 45 markers; Nb has the same 14 samples at step zero
using from 11 to 32 markers. When the
464 set is excluded from the definition, some Na samples fit the Nb definition,
and some Nb samples fit the Na definition.
One simple explanation: Na45 and
Nb32 might correspond to two very young clades. However, there is an alternate explanation: Na45 might correspond to two or more young
clades, and Nb32 might correspond to two or more young clades, and they may be
a “bushy” set of branches where some Na45 clades are connected by nodes to some
Nb32 clades. I see no way to be
confident that most of the Na samples are in a branch distinct from a branch
with the Nb samples. I suppose if your
sample matches Na45 at step zero or one, there might be better than a 50-50
chance that you and others who match at <2 belong to a unique clade that may
someday have an SNP definition, but such a clade will surely exclude some of
the step <2 samples, and include some samples from steps 2 and 3, so Na45
does not provide a definition. The same
can be said if you match Nb.
Some samples that fit the Na
signature at 464 = (12,12,15,15,15,16) come out at high step using more
markers. Similarly, some samples that
fit the Nb modal at 464 = (12,15,15,16) come out at high Nb step using more
markers. You can see this at a glance
in those two files. Two opposite simple
explanations come to mind: Na and Nb
may have independently arisen more than once, followed by population expansion
- multiple branches in the N tree. The
opposite explanation: Na and Nb sets
might be signatures for two old clades that each have a few old subclades - two
main N branches that have a few old branches and where both Na and Nb have a
bushy clump of branches at the ends.
More complicated explanations also come to mind. That second explanation, two main branches,
is attractive, but I see no proof that is true, or even highly likely.
In the file NclusterAssignments.xls,
I make speculative assignments. Most of
the Na45 and Nb32 samples fit other more believable types and clusters. I went ahead and assigned the few leftovers
to Na and Nb, but these are just speculative assignments, meant to show you
which of my clusters you best fit.
Summary: There is not enough evidence to consider Na and Nb to be two
unique subclades of N. Maybe Na45 and
Nb32 do correspond to the top of two main branches of the N tree, with most of
the samples that fit Na45 or Nb32 belonging to the corresponding clades. Maybe not.
I see no way of ruling out multiple independent clades (branches far
apart in the tree) for both Na45 and Nb32, or for any other definitions based
on the 464 set. Perhaps in a year or so
more STR data will provide convincing subclades along these lines. Perhaps in a few years SNPs will be
discovered to subdivide N type.
I have more discussion along these
lines below, in the Nc topic.
At all 67
standard markers, the Na and Nb modal haplotypes are essentially the same
for STR markers other than 464. I say
“essentially” because the rapid mutators, particularly the CDY pair and DYS576,
typically vary modally from month to month due to the statistics of small
samples. At CDYb, Na type signatures
with multiple markers are typically modal 40, while Nb are typically modal 39,
but this marker always ranks poorly for definitions because of the wide range
of values. In Nb less than 1/3 of the
samples typically have the modal value at CDYb.
The Russian
site independently came up with this same haplotype
distinction. Two modal haplotypes are available on Ysearch, from the
Russians. Each uses 78 markers and each
matches my Na and Nb types at 67 markers, including that 39 value for CDYb in
Nb. Central European-1 Modal GTAVR
corresponds to my Nb, using only 4 values, 464a-d. Central European-2 Modal 495M5
corresponds to my Na, using 6 values, 464a-f.
Nc. New topic 25 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis files Nc32Cluster.xls
My Nc code is for the signature DYS19
= 15, compared to the modal value of 16.
Similar to Na and Nb, my publication and previous versions of this web
page proposed Nc as a tentative subdivision cluster of Nb. The samples with the 15 value last year were
mostly Nb samples, but this year that correlation is not significant.
My opinion of Nc is very similar to
my opinion of Na vs Nb: No confident
conclusion. Nc might correspond to a
single large clade. Then again, Nc
might correspond to independent unrelated clades where the Nc mutation arose
independently.
My Nc analysis complements my Na and
Nb analysis: If you look at
Nc32Cluster.xls, you see at a glance that the best fit samples are a mix of Na
and Nb. If you look at Na45Cluster.xls,
you see at a glance that the best fit samples are a mix of Nc and modal
DYS19=16. If you look at
Nb32Cluster.xls, you see at a glance that the best fit samples are a mix of Nc
and modal DYS19=16. If Nc32 vs modal 16
is a valid division of N type, then Na vs Nb cannot be valid. If Na vs Nb is valid, Nc vs modal 16 cannot
be valid. All three files have, at the
bottom, at large step, some Na, Nb, and Nc samples.
Next, let me consider the 4
combinations using DYS464 and DYS19:
Nbc42Cluster.xls
is my analysis file using both the Nb and Nc signatures together.
Nac32Cluster.xls
is my analysis file using both the Na and Nc signatures together. This is very different from Nc32; the latter has a mix of Na and Nb; the former is a new analysis using the
additional restriction to Na match.
They both have 32 markers by coincidence. As in Na45 and Nb32, the number of markers is my arbitrary
choice; there is no obvious best
choice; the number of markers will
likely change as data accumulates for all these definitions where I specify the
number of markers in the code name.
Nb5_37Cluster.xls
is my analysis file using my Nb5 signature, which is the 4 Nb DYS 464 markers
plus the modal value at DYS19.
Na7_26Cluster.xls
is my analysis file using my Na7 signature, which is the 6 Na DYS 464 markers
plus the modal value at DYS19.
In the file NclusterAssignments.xls,
I make speculative assignments to these 4 clusters, but samples that fit one of
the more confident types (Ng and N-Ashk) and clusters (Ns and Nt) get that more
confident assignment if they also fit these 4 combinations.
The 3 Ng samples
are all Na, but they are a mix of values at DYS19. The neighborhood (just beyond the Ng cutoff) is all Na. This is a tantalizing hint of a “father”
clade with the Na signature.
The 4 N-Ashk
samples are all Nb, but in this case the neighborhood is a mix of Na and
Nb. This is a hint of an independent
mutation to Na somewhat older than N-Ashk.
Three of the 4 N-Ashk are Nc, as are most of the neighborhood. The other has the modal DYS19=16 value. This is a hint of a father clade with the Nc
signature, DYS19=15, plus recent back mutations to the modal value.
The 6 Ns samples
are all Na, with a neighborhood mostly Na but some Nb. The Ng, N-Ashk, and Ns samples are all very
far from each other. You can see this
in the file NclusterAssignments.xls, where each type and cluster has a column,
with step value for each sample. I
consider this strong evidence against a large Na clade; it seems more likely that the Na
(464=12,12,15,15,15,16) set arose independently by copy mutation 3 times in
these three hypothetical clades.
Nt, the purported
father of Ns, has 17 samples; 13 Na
signature, 2 Nb, 2 one step away from Na.
It is reasonable to speculate that those 2 Nb are due to an independent
recLOH in Nt, and that the father clade has the Na signature. Unfortunately, it is also reasonable to
speculate that there were multiple mutations to the Na signature within Nt,
making the 464 set irrelevant.
The 3 Nd samples match Nb but again
the immediate neighborhood is a mix of Na and Nb, again evidence for
independent mutations at 464.
Ne is another example of a mixed Na
Nb neighborhood. In this example, 2 of
the 3 match Na. That third one,
464=(12,13,14,14,15,16) is 3 steps away from Na but those two 14 values are a
hint at another copy mutation.
NYsearch.xls
has a sheet with Ysearch data analysis for each type or cluster. The Polish percent, in boldface, is my
important result. Although this
analysis is based on very little data for each of those 4 combination clusters,
here is the tentative finding: Nbc42 is
not concentrated in Poland. The other 3
seem to be concentrated in Poland; that
is evidence that each of those 3 clusters (Nac32, Nb5_37, and Na7_26) harbors
one or more clades that are concentrated in Poland.
Ns seems related to Na7_26, because
4 of the 6 Ns samples match at step zero, but the other 2 are at steps 2 and 3,
so this technique of 4-way combination is good for hints, but not conclusive.
Summary; Na, Nb,
and Nc clusters: 25 Sep 2011. That was a lot of analysis to justify my
opinion that Na, Nb, and Nc, although tantalizing, cannot be trusted without
correlation to more markers. N type
probably experienced population expansion not long after the TMRCA,
whereby the main N branches come out today with similar STR distributions. DYS464 is multimodal; DYS19 is bimodal; the 4 main combination modes based on 464 and
19 provide evidence of twigs that are concentrated in
Poland. I bet there are many more small
Polish clades based on Na, Nb, and Nc waiting to be discovered in N type. I’ll continue to watch the STR data. New SNP markers within N type someday will
be even better.
Np. New cluster. See L1029 And Np News. See also L260 and M458
P.
Complete rewrite finished 16 Aug 2011.
Based on 5 Aug 2011 Polish Project data. Analysis file: PType.xls
P type is the main topic in my publication, Part II. P type is significantly concentrated in Poland, and in the Czech
Republic. It is found at lower
frequency in other Eastern European countries, and in eastern Germany. About 9% of Polish males carry P type Y-DNA.
After my publication, an SNP called L260 was discovered, found
to be equivalent to P type, confirming my prediction that P type corresponds to
a haplogroup, R1a1a1g2.
The “father” haplogroup R1a1a1g (M458) is composed of
what I have been calling N type (L260-) and P type
(L260+).
My current definition
for P type, P43, is a modal haplotype using 43 of
the 67 standard markers. The cutoff is 7, which
means all samples less than step (genetic distance) 7 are
predicted P type (predicted L260+).
That definition is available in the PType.xls
analysis file, in my Haplotypes.xls
file, and at Ysearch as 8U92G.
P type age (age means TMRCA)
is about 1,600 years. That’s highly
uncertain, but I’m 80% confident that age of 1,600
years is not off by more than a factor of 1.5 - age 1,100 to 2,400 years. The L260 mutation is likely quite a bit
older than the age of P type.
It’s interesting to wonder if the age
of P type is associated with the historical appearance of Poland somewhat more
than 1,000 years ago. It’s also
interesting to wonder why P type is so isolated in haplospace
- why there are so few men alive today with STR values slightly different than
P type. I added a bit of speculation
along these lines to my publication, but frankly, no one knows the
answers. I offer a little more
speculation at the end of this topic.
That was a quick summary. Next comes detailed discussion:
My published 2009 definition for P
type, P36, still works very well. My
prior update definition, Sep 2010, P46, still works very well. I updated the definition Aug 2011; P43.
All 3 definitions are compared in that analysis file PType.xls,
Calculator sheet, columns BZ to CB.
The August change is only a slight
tweak; I dropped 3 slowly mutating
markers that are mutated in two samples recently found L260+; these two were at steps 7 and 8 using the
prior P46 definition; they are now at
steps 5 and 6 with the new P43. More
discussion about this below.
There is only one L260+ sample not
captured by P43. This sample is at step
9 using any of my 3 definitions. The
problem is DYS464, where this sample obviously had a serious copy
mutation, expanding the number of 464 markers from 4 to 6, yielding step 4 for
only that compound marker. The net step
9 would become step 5 without 464.
Nevertheless, I cannot drop 464 from my definition, because this marker
helps a lot to discriminate P type from N type. I have more discussion below about this
outlier sample.
P43 captures only one sample not P
type, an NB sample, which means N Borderline.
Although this sample fits N better than P, hence the NB prediction, it
has not been tested for L260 or M458, so its status is uncertain.
There are 10 samples at step 6 (5 Aug
2011), the last step of the type, where uncertainty is
highest. Seven of these have been
tested L260+, confirming membership in this haplogroup. This high testing rate is not a
coincidence; Mayka
and I have been encouraging men with marginal
samples to do the L260 test since it became available in Apr 2010. (We paid if cost was a problem.) One of the step 6 samples not L260 tested is
the NB sample of the previous paragraph.
Another is M458+ and not a fit for N type, so it
can be confidently predicted L260+ (although the L260 test would be nice). The 10th step 6 sample has neither SNP test, and is not a fit for N type, so it is assigned
PB, a Borderline assignment intended to encourage SNP testing. There are two other PB samples that were
step 6 using the prior definition;
these are now step 5. We will
probably expand the PB category, so the next assignment update should have a
few more PB samples, again to highlight the ones most likely to benefit from
SNP testing. I estimate the PB samples
have about 75% probability of being proven L260+.
P43 summary: The P43 definition, cutoff 7, captures 90
samples as P type. One L260+ sample is
not captured because of DYS464. One
captured sample at step 6 is probably N type.
So the predicted P type is 90 samples and the predicted (some actual)
L260+ is also 90 samples (5 Aug 2011).
The statistical accuracy of my P type
definition may seem like about 98% - 100% below step 6. However, my confidence
is more like 90% - I’m 90% confident that more than 90% of future samples that
match P43 below the cutoff step 7 will be L260+ if tested - - 95% confidence
below step 6. That confidence is not
calculated - it’s my estimate to account for two issues: First, I have removed from the definition
markers that are mutated only for L260+ samples at high step (mentioned above
and discussed further below) but more such mutated markers are bound to show up
for future samples, so future predictions are not quite as good as the adjusted
fit implies. Second, there may still be
a very small L260- clade that just happens to have STR
values close to P43 due to the luck of random STR mutations. For samples without Polish ancestry the probability
is higher for these two issues; this
confidence discussion is limited to Poland, as represented by the Polish
Project.
According to Pawlowski,
along with further evidence in my publication, P type (L260+) is concentrated
in Poland. I verified this and other
Polish types using both Yhrd and Ysearch. P has fewer mutations than N and K, so it
must be younger. In my publication I
estimated that about 8% of Polish men have male line ancestry of this
type; my current estimate, from the Results Table, is 9.0% (calculated from the edited
data 28 Jul 2011) -- calculated 70% confidence interval 8.0% to 10.0% -- 95%
confidence interval 7.1% to 11.0%.
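For readers who want to see how a confidence interval of this kind can be computed, here is a minimal sketch using the normal approximation for a proportion. This is a textbook method chosen for illustration and is not necessarily the calculation behind the intervals quoted above. The counts k and n are hypothetical, chosen only so that k/n equals 9.0%; the function name proportion_interval is also hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(k, n, confidence):
    """Normal-approximation interval for a proportion k/n."""
    p = k / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

# Hypothetical counts chosen only so that k/n = 9.0%.
k, n = 90, 1000
for conf in (0.70, 0.95):
    lo, hi = proportion_interval(k, n, conf)
    print(f"{conf:.0%}: {lo:.1%} to {hi:.1%}")
```

The interval narrows as n grows, which is why the percentages in the Results Table shift less as data accumulates.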
Ludvik Urban pointed out to me that P
type is common in the Czech Y-DNA
Database. FTDNA also has a Czech Y-DNA
Project. There is not enough data
yet to calculate if the frequency in the Czech Republic is greater or smaller
than the approximate 9% frequency in Poland (as represented by the respective
projects).
Karen Melis, administrator of the FTDNA
Zamagurie
Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on
the border of Slovakia with Poland. I’m
not sure of the concentration in Slovakia.
It will be interesting if more data
in the future allows resolution of subtypes of P type by region.
I added a “Ysearch” sheet to that
PType.xls analysis file, with update analysis from Ysearch. That file has a copy of the 123 matches at
step < 9 (12 Aug 2011) from my P43 definition, 8U92G. Seven of those matches are modals, segregated
to the bottom of the sheet and not used for analysis. The cutoff is 7, same as in the Polish Project, but SBP is 19%,
not very good. The reason is 10 samples
at step 7. Only two of these at 7
indicate “Poland” for Origin, 3 Germany, 2 Scotland, 2 Unknown, and 1 USA. This may be a sign of a clade outside Poland
with STR values close to the P type cutoff;
I doubt that; more likely, these
are outliers from more distant clades, because there are a huge number of
samples at step >9 so of course some samples from those clades will fall at
step 7 just due to the luck of random mutations. In other words, P type is a relatively small haplogroup on
Ysearch, and the background is larger on Ysearch than
in the Polish Project, so of course SBP will be larger. Still, 19% is pretty good on Ysearch.
Those Ysearch results include 11
samples with “Unknown” or “USA” for Origin, so I removed those for Origin
analysis, 105 net samples. Below the
cutoff step 7, 54% are Poland; that is
very high; the overall percent of
samples in Ysearch from Poland is very low. At steps 7 and 8, 26% are Poland, showing the expected drop off
for outliers. Germany and other Slavic
countries also have significant percent P type; there is a table with details in that Excel sheet. This updates my evidence that P type (L260+)
is concentrated in Poland.
The isolation of P type in the Polish
Project is now even more impressive than at the time of my publication. Most of the samples at steps 7 and 8 are
good fits to other newly discovered types (see PType.xls, column CB), so there
are now fewer borderline samples just beyond the edge of P type. Two of the step 7 samples are my maternal
cousins; their close match to P type is
what got me interested in this topic; if I had not noticed this someone else may
have done a similar study and those two samples would not be in the
database; statistically those two
should be edited; I edited by -1 in the
Results Table, but I do not do minor edits in the analysis files. One of those cousins is tested M458- so I
have high confidence both belong to I type, not P type.
This Aug 2011 analysis does not
include L260 data from other projects.
I’ll wait a few months before reviewing L260 data outside the Polish
Project. My last analysis including
data from outside the Polish Project for P type, N type, L260, and M458 was Jan
2011. For those last results, see the
following topics, which have not been updated for several months:
P type Age - TMRCA: My publication explains the ASD method. The ASD sheet in PType.xls provides 1,778
years using all 67 markers. However,
385b should not be used because 5 samples have recLOH
mutation from 14 to 10, providing the unreasonable ASD age of 11,007 years at
this one marker. Also, 464 has obvious
recLOH issues; my ASD sheet, treating
464a to d as independent markers, comes up with an average of 2,093 years for
these 4. Most people who figure ASD age
exclude 464. It is interesting that
385a has no recLOH (10 to 14) so far; I
do not understand why not. The other
compound markers are not issues because the P type values are such that the
apparent recLOH cause only step 1 mutations, so they might as well be included.
1,637 years is the ASD age,
cell N29 of the ASD sheet, using 62 markers; excluding 385b and excluding the
four 464. Exclusion is by typing a
blank or zero into a mask, row 21, so you the reader can easily verify that
removing compound markers other than 385b has no significant effect.
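The effect of the mask can be shown with a small sketch of an ASD style age calculation. It assumes the age at each marker is the average squared distance from the modal value divided by that marker's mutation rate, converted to years, and that the reported age is the average over unmasked markers; whether the ASD sheet combines markers exactly this way is an assumption. All data, rates, and the 25 years per generation below are hypothetical, and the function name asd_age is also hypothetical.

```python
def asd_age(samples, modal, rates, mask, years_per_generation):
    """Average per-marker ASD age in years over the unmasked markers."""
    ages = []
    for marker, keep in mask.items():
        if not keep:  # a blank or zero in the mask excludes the marker
            continue
        # average squared distance of the samples from the modal value
        asd = sum((s[marker] - modal[marker]) ** 2 for s in samples) / len(samples)
        ages.append(asd / rates[marker] * years_per_generation)
    return sum(ages) / len(ages)

# Hypothetical data for illustration only (not from PType.xls).
modal = {"A": 10, "B": 12, "C": 14}
samples = [
    {"A": 10, "B": 12, "C": 14},
    {"A": 10, "B": 12, "C": 14},
    {"A": 10, "B": 12, "C": 14},
    {"A": 11, "B": 12, "C": 10},  # C mimics a recLOH jump from 14 to 10
]
rates = {"A": 0.002, "B": 0.002, "C": 0.002}  # assumed mutations per generation
mask_all = {"A": 1, "B": 1, "C": 1}
mask_no_c = {"A": 1, "B": 1, "C": 0}

print(round(asd_age(samples, modal, rates, mask_all, 25)))
print(round(asd_age(samples, modal, rates, mask_no_c, 25)))
```

A single 4 step jump contributes 16 to the squared distance, so one recLOH sample can dominate the average; masking that marker removes the distortion.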
The far right of the ASD sheet has
all the markers ranked by apparent age.
I added a Notes column with explanations for some of them. Other than 385b, other old markers should
not be excluded because the random luck of STR mutations is bound to produce
such anomalies, which are statistically balanced by the 9 markers with zero age
(no mutations among the 90 samples).
They should all average out. By
the way, the number of markers with apparent zero age has been declining in P
type as data accumulated during the past few years, as of course it should, but
apparent age averaging many markers has not changed more than statistically
expected due to the details of new data.
My 2009 published age was 1601 years;
my update last year on this web page was 1775 years. I have consistently written “roughly 1600
years” in my discussions.
There are a number of reasons why
“raw” ASD age should be increased, as discussed in my publication, part I. However, those reasons are mostly due to population bottlenecks in the
past. As discussed below, P type
evidently went through a rapid population expansion soon after the TMRCA, so
the raw ASD age should be used as a best estimate. Anyway, there are significant non-statistical age caveats that produce systematic
uncertainties as large as the uncertainties due to population bottlenecks, and
much larger than the statistical sampling uncertainties from 90 samples. So any age calculated from ASD (or from any
other type of STR variance) should be taken with a grain of salt. My factor of 1.5 uncertainty quoted above is
based on my 80% confidence from experience, not from
calculation.
385a=10 is the best marker for P
type. I have a separate topic for the P type signature. 385a=10 continues to be amazing.
89 of the 90 samples predicted L260+ have the 385a=10 value. Beyond P type, 385a=10 shows up in only 2
samples at step 7 (my two cousins, mentioned above, who should not both be
counted), none at step 8, only 1 at step 9, and 3 at step 11. The PType.xls database is truncated at step
<12; the full R1a data from the
Polish Project - 457 samples - has only 1 more 385a=10 sample beyond step
11. In other words, this one marker
385a=10 is about 99% effective at capturing P type (future L260+ predictions)
plus less than 3% additional falsely predicted foreign samples from the rest of
R1a. 385a=11 is ancestral (N type and most
of R1a), but so far there are no P type with the ancestral 11 value, strong
evidence that the rare mutation from 11 to 10 happened before the TMRCA. The 385a & b pair are ranked together
tied for 41st in the Chandler rates, not very
slow. However, shorter STRs mutate a
lot more slowly than longer ones, and step down is slower than step up with
stronger effect for shorter STRs.
(Chandler discussed this with me by email - his project did not take
these issues into consideration - treating compound markers together, with data
combined from all haplogroups). In
other haplogroups 385a values >14 are not uncommon. So it makes sense that the 385a mutation 11
to 10 should be very rare, explaining why it works so well for P type, although
that one P type exception (at step 4) is an even rarer 10 to 9 mutation.
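The two percentages for 385a=10 follow from the counts given above. The short calculation below is a sketch that assumes the false prediction rate is measured against the R1a samples outside P type (457 minus 90); that choice of denominator is an assumption.

```python
p_type_total = 90    # samples predicted L260+
p_type_with_10 = 89  # of those, samples with 385a=10
r1a_total = 457      # full R1a data in the Polish Project

# 385a=10 samples outside P type: steps 7, 8, 9, 11, and beyond step 11
outside_with_10 = 2 + 0 + 1 + 3 + 1

capture_rate = p_type_with_10 / p_type_total
false_rate = outside_with_10 / (r1a_total - p_type_total)  # assumed denominator

print(f"{capture_rate:.1%} captured, {false_rate:.1%} false")
# 98.9% captured, 1.9% false
```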
Column CJ of my analysis file shows
that using only the best 5 signature markers, cutoff 2, 83 P type samples are
captured an none from outside P. That’s
better than 80% accuracy using only 5 markers, which is very good and unusual
in SNP prediction. Even more unusual is
that the one best marker is even better.
DYS540=11. A new signature marker.
From the 111 marker STR set recently
available commercially. 71 Polish
Project R1a samples already have the 111 data, including 12 P type and 12 N
type (18 Jul 2011). 11 of the 12 P type
have the 540=11 value. 11 of the 12 N
type have 540=12. Since P and N are the
two parts of the R1a1a1g (M458) haplogroup, this marker nicely distinguishes
the two parts with high probability. 12
is obviously ancestral because that value dominates the R1a data. 540 already does not look as good as 385a
for P type, but it’s always nice to have another signature marker. It is too early to switch definitions to the
full 111 set. I’ll be adding 111 modal
haplotypes to my Haplotypes.xls file over the following months; P and N are already there.
That Excel analysis file is
intended for finding types - hypothetical haplogroups with
< 20% SBP. For P
type this is moot because L260 is available.
Nevertheless, I used the file to automatically come up with the best
prediction, P54, column CF, with SBP 7.6%. That SBP means 80% confidence (if L260 were
not known) that less than 7.6% of the predicted P type would not actually
belong to the predicted haplogroup.
Indeed P54 captures 89 samples, only 3 of which are not P according to
my new P43 based on L260 - that’s 3.3% foreigners captured. Since I published the SBP method in 2009,
almost all predictions have been better than SBP. But I designed SBP to be conservative (higher percent) to account
for statistical biases. I expect
eventually to have a few failed predictions (foreign background larger than
SBP, or two or more unrelated haplogroups fitting one type definition).
The main point of that PType.xls
file: Many definitions are displayed,
with various marker selections. I tried
a lot more definitions than the ones displayed in that file. The exact definition does not matter much
for P type. Any reasonable definition
of P type captures more than 90% of P type and less than 10% foreigners. Even the full 67 modal haplotype works
OK. Although that P54 has lower SBP
than my current P43 definition (9.2%) , P43 is better because I adjusted P43
using L260 results.
I identified P type and submitted my
analysis for publication before the M458 mutation was
announced by Underhill. The end of my Part I mentions M458 -- notes added during
publication. M458 (so far) is composed
of P type plus N type plus perhaps a few small clades just outside N. L260, the SNP that
defines the haplogroup corresponding to what I have been calling P type, was
discovered by a P type member of the Polish Project, inspired by my
publication. With him and other
coauthors, I published a brief letter announcing and describing L260 in the
Fall 2009 issue of www.jogg.info.
P type has obvious structure. Evidence of sub clades. Nodes in the P type
branch of the Y-DNA tree. The most
obvious evidence is bimodal markers. The bimodal markers are discussed below as clusters - hypothetical sub clades without high
confidence. The bimodal markers do not
correlate with each other, so none of the clusters qualify yet as types. Future data may provide better statistics
with a convincing subtype of P. If this
paragraph is not clear, please read the discussion below for the individual
clusters: Pa, Pc, etc.
Other evidence of structure: My two edits of the P type definition. In Sep 2010, I increased the number of STR
markers in the definition, then edited out the markers that have mutations only
in L260+ samples at high step, and not in L260- samples at or just beyond the
cutoff. In Aug 2011, I edited out 3
more such markers. Four samples involved, color coded in columns BZ to CA in
the analysis file; two do not fit my
original P36 but fit the other two definitions; two do not fit the 2010 P46 but fit the other two
definitions. These edited markers are
also evidence of structure. These are
all relatively slow mutating markers.
Those samples with such mutations are probably from old nodes in the P
branch. Of course, these cannot all be
old nodes because some markers will have mutations only at high P step just due
to the luck of random mutations. Some
samples from young nodes will come out at high step due to luck, and some
samples from old nodes will come out with low step. The point of this paragraph is that old nodes defined by rare
mutations are expected in any Y-DNA tree, and those samples are evidence of the
expected structure in P type. Another
point of this paragraph is justification for my method of editing markers. You the reader may be concerned by such
editing as selection bias to improve the apparent fit of the data. Indeed there must be such bias in some of
the markers that I edited. However, insofar
as some of those edited markers truly correspond to old nodes in the P branch,
it is appropriate to edit them; future
distant cousins with the same rare mutation will be better predicted as
L260+. The whole point of using
definitions shorter than the full 67 is to remove those markers that define sub
clades in order to come up with a proper definition that distinguishes the
branch as a whole, as explained in my publication.
Old node comment. It is possible the P type data includes
samples that really belong to an L260 branch with a node much older than the
next youngest node. In such a case it
would not be proper to combine them into the single P type. That one sample at step 9 (discussed above)
is an example of a candidate for such an old branch, but then again that sample
might just be an unlucky member of a young node (an outlier). Those 4 edited samples of the previous
paragraph are also examples. Because
there have been very few P type samples beyond my original cutoff, and because
all but one of them were easily incorporated with minor edit of the definition,
I am comfortable considering them all as a single type until there is evidence
of significant L260+ samples beyond P.
At any rate, all markers are included in the age calculation, so any old
branches contribute to the estimated age of the oldest node (oldest node means
MRCA). This paragraph would be a valid comment about any type analysis, but P
type is unusually well isolated in haplospace, so the justification is strong
to consider it a single clade.
The L260 mutation might be about the
same age as P type. Unlikely. We expect a defining SNP to be more likely
older than the TMRCA, perhaps much older.
The Western Slavic Modal haplotype,
Ysearch 28WGP,
matches P type perfectly at all 43 markers used in my new definition. That Western Slavic Modal uses 76 markers,
but many of those are highly variable due to high mutation rate. That modal is one of the Russian site modals.
The Polish Project makes some
assignments to P type for samples with < 67 markers if they match the P type
model very well. I have not updated
those assignment rules for a couple years, but I have been quite conservative below
67, so those assignments are still > 80% confidence.
Let me finish this P type topic with
brief speculation about the origin of P type:
What does P type isolation mean? One simple explanation: The M458 father haplogroup for P type and N
type seems to have experienced a severe population
bottleneck. The evidence: P type and N type are very easily separated
by STR values. Both are isolated in haplospace. No
overlap. They are so far apart that the
nearest neighbors (just beyond the cutoff) for P type include outlier
samples (from other R1a haplogroups) in addition to N type samples, and nearest
neighbors for N include samples other than P.
Apparently, the father haplogroup was quite old at the time of the
bottleneck, with lots of variation in STR values. The bottleneck wiped out most of that population, so today men in
that father haplogroup descend from just two ancestors, the MRCAs
of P type and N type.
Why is P type so large and
concentrated in Poland? One obvious
explanation is a rapid population expansion not long after the TMRCA. Evidence:
Subtypes cannot be defined with confidence. Apparently, the major bimodal markers are due to mutations that
happened early in the population expansion, so the branches of P type have
similar statistical spread of STR values.
For more discussion along these lines see the discussions of the
clusters below.
There are other explanations to these
questions: P type may represent a huge
migration of a single paternal tribe during the dark ages from far away to the
region that is now Poland. Perhaps the
related haplogroups in that far away place got wiped out by subsequent famines
and wars. Or maybe they did not get
wiped out. If people in that far away
place did not tend to migrate to North America in the past, and today do not
tend to get DNA tests, then perhaps there are isolated pockets of L260 clades
there waiting to be discovered - some with STRs very similar to P type - some
with STRs very different than either P or N.
Maybe in the mountains of western Asia.
Also, the standard “null” explanation
should be considered unless there is strong evidence otherwise. The null explanation is statistical: No significant bottleneck or expansion. Just the luck of random growth of clades in
a small human population over the millennia.
The MRCA of P & N perhaps were far apart in STR values just by luck
- both being outliers. No one knows how
to calculate the probability that a large P and a larger N clade can be sole
survivors of the statistics of clade growth in the Y-DNA tree in only a couple
thousand years. To me it seems highly
unlikely. But I don’t know how to rule
this null model out in a convincing way.
I can think of more complicated
models as explanations. I’m sure you
can, too.
Caveat: I said M458 consists of P and N.
It is possible some of the outliers from N type might represent small
old branches that have nodes older than the node for P & N. There is no evidence to support this, but
then again there is no evidence to rule this out with confidence. More data will answer this over the next
year, perhaps. Anyway, this is a small
detail in the larger picture.
P type Bimodal Markers. This sub topic was significantly edited 25 Aug 2011, when I
introduced a definition of bimodal.
The following analysis uses the 90 P
type samples (5 Aug 2011) predicted L260+, at 67 markers, discussed above. I also include some comments about the 12
samples available with 111 markers (on 18 Jul 2011). A bimodal marker is evidence of structure, but not proof - a hypothetical
clade.
In the past, I have sometimes called
these hypothetical types. I now prefer
to reserve the word type for < 20% SBP, which Mayka and I take as evidence for
80% confidence that more than 80% of the samples
belong to a clade that will someday be confirmed as a haplogroup
by a newly discovered SNP.
Sometimes we make exceptions slightly above 20%, for example when a type
is regionally concentrated.
None of the following bimodal markers
qualify as a definition of a type, although some of them might be good enough
to be called clusters.
This is not proof that a specific
bimodal marker or cluster does not correspond to a future haplogroup. It is still possible that 95% of the samples
from a particular bimodal marker belong to a unique future haplogroup. For example, if the son (or grandson, or
great great grandson) of the P type MRCA had that defining mutation, and if he
participated in the purported P type population expansion, that would explain
why his haplogroup (male descendants) have STR values so similar to P type
except at the one defining marker. He
had no other mutations that differed from his ancestor among the standard 67
that I’m using today for analysis.
It is possible as more STR data
accumulates some of the following will qualify as types. Cluster identification is a bit of an art so
it is possible I just failed to find a small P sub type and someone else will
find it.
Many of the following are probably
not unique clades, but instead represent two clades that have widely separated nodes in the P tree.
Or three or more.
One characteristic of a type: It shows up early in the data as a cluster
with 20% < SBP < 50%, and the SBP continuously decreases in value as more
data shows up, as the SBP penalty for sampling statistics becomes diluted. This is good - it means false clusters that
show up by luck will not last as more data accumulates. The P bimodal markers that I have been
following for a few years (Pa, Pb, Pc, Pd, Pe, Pg) all have increased in
SBP, which I take as evidence that they will probably not become types.
Excel files for Pc and Pg are
in the on line data with my 2009 publication;
I am not updating those or adding any others because none are good
enough to stand out. Nevertheless, some
merit discussion:
Pa Bimodal Marker. Defined by DYS389 delta = 18.
DYS389=13,31. 18 samples (among
90 P at 67). P modal values are
13,30. This is a compound marker; that 2nd number is the sum, so this mutation
is in the longer repeat chain; P modal
17, Pa value 18. All the 18’s are
13,31; there are no 14,32 or 12,30 in
the Polish Project P type data at this time;
my analysis files will capture any future such samples as Pa. That 31 value by itself does not capture the
Pa cluster because there are several 14,31 in P type, which I’m calling a
different cluster because they are not mutated at the longer repeat chain; the 14 refers to the shorter chain.
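The DYS389 arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are mine (hypothetical) and do not come from my analysis files. It assumes the reporting convention described above, where the second reported number is the sum of both repeat chains.

```python
def dys389_delta(dys389_1: int, dys389_2: int) -> int:
    """Length of the longer repeat chain: second reported value minus the first."""
    return dys389_2 - dys389_1


def is_pa(dys389_1: int, dys389_2: int) -> bool:
    """Pa is defined in the text by delta = 18 (the P modal delta is 17)."""
    return dys389_delta(dys389_1, dys389_2) == 18


# The value pairs mentioned in the text.
for pair in [(13, 30), (13, 31), (14, 31), (14, 32), (12, 30)]:
    print(pair, "delta =", dys389_delta(*pair), "Pa =", is_pa(*pair))
```

Testing on the delta rather than on the 31 value is what keeps the 14,31 samples out of Pa while still capturing any future 14,32 or 12,30 samples.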
Pa is briefly mentioned in my
publication at page 172. Pa was the
first bimodal marker to catch my attention in 2007 because that 31 value
produces the 3rd most common haplotype in Polish data that differs by only one
step from P modal values using the old standard 12 marker set; see the table in my publication at page
162. Such a common haplotype at 12 is
evidence that Pa is an old sub clade of P.
However, the evidence is not convincing yet.
Bimodal evidence: Only 4 samples (value 16) with values other
than 17 or 18 for the longer chain.
3 Pa are available at 111 markers.
I have more discussion about Pa in
the Pg topic below.
Pb Bimodal Marker.
DYS19=16. 27 samples. P modal value 17. This one is of interest because 16 is the ancestral R1a value,
modal for both N and K types. The large
size of Pb is a bit of a surprise, because Pb is only 5th largest at 12
markers, and those should be a mix of P and K because Pb differs from both P
and K by only 1 marker out of the 12.
Those 27 are not K because they have 67 markers and do not fit K type,
which differs by multiple signature markers.
The large size of Pb might mean there is one large P sub clade that
represents the oldest P node, before the mutation to 17, so it is quite old
with lots of STR variation. That makes
sense, because the proportion of Pb samples that match the Pb modal at 12
markers is not much different than the proportion of P samples that match the P
modal at 12.
On the other hand, Pb might be 2 or
more clades with unrelated nodes; only one of those might be the oldest, the
others being back mutations to 16 by coincidence. Alternatively, that 16 might be a back mutation for most or
all samples, as far as we know with the data available today.
Bimodal evidence: Only 2 samples (value 15) with values other
than 16 or 17.
5 Pb are available at 111 markers.
Pab bimodal marker pair would have
both Pa and Pb defining mutations.
There are only 2 such samples (out of 90 at 67 markers).
Pc Bimodal Marker. DYS439=11. 17
samples. P modal 10. Also discussed in my publication starting on
page 171.
Bimodal evidence: Only 2 samples (value 12) with values other
than 10 or 11.
One Pc is available at 111 markers.
The combination markers produce Pac
and Pbc clusters with 3 and 6 samples.
See also Pch below.
Pc Cluster. New topic 5 Feb 2012.
I called this Pc type in my publication, but I have since restricted my use of
“type” to those clusters in which I have 80% or more confidence.
Pc Cluster is the only significant
subdivision so far of P type, although I do not have high
confidence that I have accurately defined Pc.
P type is the most significant Polish Y-DNA clade. In my analysis file PcCluster.xls
I included a long and tedious Discussion sheet; that sheet is not
intended only for men assigned to Pc;
it is intended more as a demonstration of cluster analysis
techniques that I have developed, which may be of interest to other STR
cluster analysis enthusiasts.
Pg Bimodal Marker. DYS572=11. 25
samples. P modal 12. Also discussed in my publication page
172. Like Pb, this one is of interest
because the 11 value is ancestral; the discussion is similar to the discussion
for Pb.
572 is the 4th from the last of the standard 67 markers.
Bimodal evidence: Only 2 samples (one each at 12 and 13) with
values other than 11 or 12.
3 Pg are available at 111 markers.
The combinations Pag and Pbg each
have 8 samples. Two Pc combinations
(Pac and Pbc, above) have 3 or more samples. All
other combinations of a, b, c, g have fewer than 3 samples each.
Those two combinations with 8
samples, Pag and Pbg, are instructive. They provide a reason why Pg has not worked as a proposed type in
the past. Pg might be comprised of two
sub clusters. Pag has the P modal 17
for all 8 samples at the “b marker”.
Pbg has the P modal 17 for the long 389 chain for all 8 samples at the
“a marker”. 9 Pg samples belong to
neither Pag nor Pbg.
For most haplogroups, a cluster of 8
samples with two markers that differ from the haplogroup modal is
impressive. However, P type is large
and relatively homogeneous. In this
case I have tried many combinations;
some are bound to come up impressive just by luck; I am discussing only the impressive
ones. I suppose if your sample falls into
either Pag (or Pbg) there may be 50% confidence that you belong to a clade
including more than 5 of those samples defined by the two corresponding
mutations, but I personally do not consider the confidence anywhere near 80%.
Even if Pag and Pbg are shown in the
future to correspond to two haplogroups, it does not follow that they will be
sub clades of Pg; they may be
independent branches of the P tree that both received the DYS572=11 mutation
independently. Or one of them could be
an old node with the ancestral value.
DYS572 is ranked in the Chandler list as 40th, not very slow. In the 2010 version of this web page, I
presented evidence that 572 is indeed a slowly mutating marker, at least in
R1a. I still stand by that
prediction. That would make it
reasonable that most of the Pg samples belong to the oldest node in the P tree
(but still less than 80% confident for 80% of the samples). Also, we wonder if Pbg is the oldest node in
the Pg branch, or if Pbg is a more recent back mutation at the “b marker” DYS19
to the ancestral value? In other words,
are the apparently ancestral 572=11 and 19=16 both older than P type, or both
younger, or is one older and one younger?
We don’t know yet.
H type also has
the 572 = 12 value.
Ph Bimodal Marker.
DYS534=14. 34 samples. P modal 13.
Bimodal evidence: Only 2 samples (value 15) with values other
than 13 or 14.
One Ph is available at 111 markers.
There are several combinations; the ones with 3 or more samples: Pah, Pbh, Pch, Pgh, Pagh, Pbch, Pbgh have 4,
11, 12, 14, 3, 5, 4 samples.
My published Pc can also be
considered Pch, defined by those two markers that differ from the P modal.
The best 3: Pbh, Pch, Pgh, have 11, 12, 14 samples. These are instructive, particularly if they are viewed along with
the previous two “instructive” combinations, Pag and Pbg above. These cannot all be valid clades because the
same markers are used in different combinations. This is an explicit demonstration how interesting clusters will
always come up if enough combinations are tried. However, if we assume one particular cluster to be valid, that
means some of the others are not valid.
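How such combination codes arise can be illustrated with a short Python sketch. The defining values for Pa, Pb, Pc, Pg and Ph are the ones given in the topics above; the three sample haplotypes, the marker label "389delta" and the function name are invented for illustration and are not Polish Project data.

```python
from collections import Counter

# Letter -> (marker, value that differs from the P modal), taken from the text.
# "389delta" stands for the longer DYS389 chain (second value minus first).
BIMODAL = {
    "a": ("389delta", 18),
    "b": ("DYS19", 16),
    "c": ("DYS439", 11),
    "g": ("DYS572", 11),
    "h": ("DYS534", 14),
}


def combination_code(sample: dict) -> str:
    """Return e.g. 'Pbg' for a sample carrying the b and g defining values."""
    letters = [
        letter
        for letter, (marker, value) in BIMODAL.items()
        if sample.get(marker) == value
    ]
    return "P" + "".join(letters)


# Hypothetical samples, made up for this example.
samples = [
    {"389delta": 18, "DYS19": 17, "DYS439": 10, "DYS572": 11, "DYS534": 13},
    {"389delta": 17, "DYS19": 16, "DYS439": 10, "DYS572": 11, "DYS534": 13},
    {"389delta": 17, "DYS19": 17, "DYS439": 11, "DYS572": 12, "DYS534": 14},
]

counts = Counter(combination_code(s) for s in samples)
print(counts)
```

This sketch gives each sample only its full code. Whether a Pbgh sample should also be counted under Pbh is a bookkeeping choice that the sketch does not settle. With 5 letters there are already 26 possible combinations of two or more, which is why some of them are bound to look impressive by luck.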
Pd, Pe, Pf, Pi, …. My Haplotypes.xls
file, near the middle of the “Haplotypes” sheet, has a longer list of bimodal
markers in P type.
Plap Cluster. Includes Lapinski samples.
This cluster has 8 samples that match perfectly at 14 of the 67
markers. Two of those 8, plus two more
at step 1 out of the 14, belong to the Lapinski family set. This is an example of selection bias,
because Lapinski recruited the other 3 distant relatives, so the cluster is not
as large as it seems. The cluster does
not form a type; I mention it here as
an example of a tentative cluster.
The Plap modal differs from the P
modal at what I call the Pr marker, DYS607 = 17 for Plap vs 16 for P
modal. DYS607 is highly variable in P
type; there are more 17 samples than 15
samples -- a mildly bimodal distribution.
However, those 8 Plap samples, all with the 17, just about account for
the excess 17’s, so 607 is no longer bimodal after adjusting for Plap.
Pz Cluster.
DYS565=14. Only 4 samples. DYS565 is the last of the 67 set. There are 5 DYS565=14 samples -- these 4
plus another that does not fit. The Pz
modal differs from the P modal at 12 markers, so this one is promising for the
future. SBP comes out over 50% because
of the penalty for small sample statistical correction built into SBP. This one may improve as more data
accumulates in a year or so. On the
other hand, I studied about 20 P clusters to come up with this best example of
a new promising cluster, so the most obvious explanation is luck. If you study STR data randomly generated by
a computer you may find a good cluster if you examine enough candidates.
R. Update
20 Oct 2012. This category was
discontinued in early 2012. The original
significance was as a “Remainder” category for R1a samples that are outliers in STR values, not fitting any other
category. Outliers have always been a
significant fraction of R1a. In 2008, R
was the 2nd biggest category after U. Over the years, many “R” samples tested
positive for newly discovered SNPs, giving them a category for assignment. Many still have not been tested for new
SNPs. Most of them have been falling
into the haplogroups Z92 and Z93 and into the paragroup
Z280*. So many are still not confidently
assigned even with a Z280 positive result because Z92 is an old branch of Z280. Those with 67 markers are assigned based on
best STR fit to nearest neighbors in haplospace, but
such outlier assignments are not high confidence.
Z280. See K type.
Z93. New topic 31 Oct 2011. This new SNP was recognized earlier this
month by ISOGG as R1a1a1h.
So far, all Z93 samples in the Polish
Project are coming out L342.2+, and vice versa.
A type,
discussed here at this web page since origination, and mentioned in my 2009 publication, is a branch of Z93 (L342.2). A type samples are coming out positive for
both SNPs.
I just today added L342T
as a new cluster, a hypothetical branch of Z93 (L342.2).
The Z93 category at the Polish Project web page has the samples that are Z93+
or L342.2+ and are not predicted A type or L342T cluster. Z93 also includes samples not tested for Z93
but are close STR matches to a sample that tested Z93+.
I tried to come up with an STR definition
for Z93 (L342.2). I could not. Z93 does not have good signature
STR markers. Or, there is a better way
to say that: The signature markers for
Z93 are about the same as the signature markers for Z280 (previous topic),
which is a large new haplogroup in R1a.
Lots of Polish Project samples are now coming out Z280+. Z280 seems to be equivalent to what I have
been calling K type.
Z93 and K type have similar STR values at the slower mutating STRs. As a result, the modal haplotype for R1a as
a whole is similar to the modal haplotype for Z93 (L342.2) samples, and similar
to the modal haplotype for Z280 (K type) samples.
A simple explanation: Z280 and Z93 are “brother” haplogroups, and
neither is particularly young. The MRCAs of these two haplogroups apparently had very similar STR
values. Originally, both grew rapidly,
before significant sub clades could form with STR
mutations at slow mutating markers.
Over the years, both haplogroups diversified in STR values. As a result, many subclades in Z280 and Z93 today have
overlapping STR values. Population bottlenecks eventually produced some
sub clades with good STR signatures, such as A type for example, which is very
well isolated in haplospace. This paragraph is a simple explanation of
why it is difficult to distinguish all Z93 samples; other explanations are possible, including complicated
explanations.
Z93 is a good example of why
calculating age of haplogroups is highly uncertain. A type seems to be very young. A type dominates Z93 in the Polish
Project. Maybe A type had a particularly
vigorous population expansion; or maybe A type luckily avoided a severe
population bottleneck; or maybe the A
type ancestors moved to Central Europe from distant lands; whatever.
Age is calculated from STR variance, so the age of Z93 is dominated by
the age of A, which is misleading and too young. If A type samples are excluded, the age of Z93 still would come
out too young, because the A type samples have a unique STR signature, which
means significant STR mutations, which means the A type MRCA lived at a time
when Z93 was already quite old, so the A data needs to be considered when
estimating the age of Z93. I’ll try to
come up with an age estimate, for next time I update this topic.
End of R1a Categories.
Instructions
for Ysearch comparison are below.
On 20 July 2010 I added the following
three R1b Types to this web document (next three subtopics,
L23EE, L47P, L47A).
Mayka had
already added these three to the Polish Project
web page during the previous week, on my
recommendation, which was based on my SBP analysis.
I independently found these three by
analyzing the Polish Project R1b data, but Mayka pointed out they were
previously known as clusters. We judge that my analysis justifies adding them to our list of
types. Since I’m using 639 samples with
67 marker data as representative of Poland,
a small type clade at 1% of the Polish population would be expected to have
roughly 6 samples in the database (70% confidence interval 4 to 10). These three small types are roughly 1% each.
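The expected count above is simple arithmetic, and the spread around it can be explored with a binomial model. The sketch below is an illustration under an assumption of mine: each of the 639 samples is treated as an independent draw from a population where the clade frequency is exactly 1%. The interval quoted above may rest on a different calculation, so the coverage printed here need not equal 70%.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k clade members among n independent samples."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_samples = 639   # 67-marker samples, from the text
frequency = 0.01  # a clade at 1% of the population

expected = n_samples * frequency
coverage = sum(binomial_pmf(k, n_samples, frequency) for k in range(4, 11))

print(f"expected samples: {expected:.1f}")
print(f"P(4 <= count <= 10) = {coverage:.2f}")
```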
I’m following the current ISOGG codes for these types, which may be confusing compared to the current FTDNA codes.
The STR definitions for these are
available at Haplotypes.xls. PolishCladesUpdate has a
link to an Excel analysis file for each of these three types.
Instructions
for Ysearch comparison are below.
Here is the “UserIDs” bar for R1b comparison:
USEID, CX94E, MKM4R, 7HB9C
Change USEID to your User ID.
Reminder: These three types are calibrated to Polish data. The definition modal haplotypes may not be
optimal for other regions. If you have
Polish ancestors, and if you have all 67
markers, and if you match one of these within a step distance of 10 there
is more than 80% probability that you belong to the corresponding clade. Up to step 15 there is lower probability
that you belong. You should test the
appropriate SNPs (explained below) for higher confidence. If your ancestors are not from Eastern
Europe and you are a marginal match (step distance 5 to 15) for one of these,
it is not very probable that you belong to the corresponding Polish clade,
because each of these types has some overlap with other clades that are rare in
Poland.
L23EE. 20 Jul
2010 documentation: This type is positive for
the L23 SNP, hence this type is a hypothetical future
haplogroup within the current haplogroup R1b1b2a. This type is negative for L51, the only current known branch -
R1b1b2a1 - of L23.
Nordtvedt
pointed out the cluster for this type some
years ago, calling it R1b-EE (Eastern Europe). Mayka suggested the L23EE code to me.
There are only 6 samples in the
Polish Project in this type (13 Jul 2010). SBP = 10.7% using all 67 markers, which is excellent for such a
small type. The cutoff is 12, but if you match at step 10 through 12 I
estimate your probability of belonging at slightly better than 80%, so you
really should test for the L51 SNP - a negative result would boost the
probability to about 95%. In the Polish
Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of
the samples from step 17 to 20 are L51+.
So this type is very well isolated in
haplospace in Poland.
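The cutoff and gap numbers can be illustrated with a small Python sketch. The step distances below are invented to mimic the L23EE description (6 members, nothing from step 12 through 16, and 6 samples from step 17 to 20). The definition of step distance as a sum of absolute repeat differences, and the rule that members are the samples below the cutoff, are assumptions that seem consistent with the numbers in the text.

```python
def step_distance(sample, modal):
    """Sum of absolute repeat differences across markers (assumed definition)."""
    return sum(abs(s - m) for s, m in zip(sample, modal))


def members_and_gap(steps, cutoff):
    """Split step distances at the cutoff and measure the empty run above it."""
    members = [s for s in steps if s < cutoff]
    outside = [s for s in steps if s >= cutoff]
    # Gap = number of empty steps starting at the cutoff.
    gap = (min(outside) - cutoff) if outside else None
    return members, gap


# Hypothetical step distances from the type's modal haplotype.
steps = [2, 3, 5, 6, 8, 9, 17, 18, 18, 19, 20, 20]

members, gap = members_and_gap(steps, cutoff=12)
print("members:", len(members), "gap:", gap)
```

A wide gap relative to the cutoff is what "well isolated in haplospace" means in practice.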
On Ysearch (code CX94E) there are also 6 samples in this
type (13 Jul 2010), but 3 are the same as in the Polish Project. There are 7 samples at step 12 (vs zero in
the Polish Project) and only 2 of those 7 are East European - one each in
Germany and Russia. That means this
type is not well isolated world wide, meaning samples near the cutoff are
highly uncertain. I interpret this as
evidence that my definition of L23EE type is really a Polish subtype within a
larger L23EE cluster.
This type has evidence of structure. A number of markers are bimodal with no
obvious correlation. To me, that means
there are probably at least 3 sub-clades that may become evident as data
accumulates.
If you match this type closely at 37 markers I highly
recommend getting the full 67, because the
statistics for assignment are not convincing at 37 markers. Even at 67 markers, I recommend the L51
test; a negative result confirms
membership in this hypothetical clade, and a positive result means you are not
a member. We do not know the
probability of outsiders matching L23EE in STR values, particularly outside
Poland, so there is still a slim chance of a surprise - a close match to the
definition but with L51+.
L47P. 20 Jul 2010
documentation: This type is positive for
the L47 SNP, hence this type is a hypothetical future
haplogroup within the current haplogroup R1b1b2a1a1d1. This type is probably negative for L44, the
only current known branch - R1b1b2a1a1d1a - of L47, but that L44 negative
indication is based on only one sample so far so it is not certain.
Mayka announced the cluster
corresponding to this type on the web in
March 2009.
There are only 4 samples in the
Polish Project in this type (13 Jul 2010). SBP = 9.3% using 64 markers, which is excellent for such a small
type. The cutoff is 7 and the gap is 10.
There are no samples from step 7 to 16.
Although samples in that wide gap are expected as data accumulates, this
type is very well isolated in haplospace in
Polish data.
This type is very robust;
the same 4 samples are selected using any number of markers from 10
to 67 with SBP <25%.
Actually, this type is even better than the SBP = 9.3%
indicates, because some of the samples at step 17 and beyond have tested
negative for the SNPs in the R1b trunk leading to L47 so they clearly do not
belong to this L47P hypothetical clade.
Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but
3 of them are the same as the Polish Project.
Ysearch has 8 samples at steps 8 to 12, so the type is not as well
isolated worldwide.
The “P” in the code L47P represents
my hypothesis that this type is Polish.
Members of this type should test for
L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b,
which is equivalent to P312, an “uncle” haplogroup, that is L47 negative. That means there may be some interference in
STR matching, probably less than 10% in Polish data, but I do not know what the
exact percent interference will be until more data accumulates.
See the last paragraph of L47A, next
topic, for more comments.
L47A. 20 Jul 2010
documentation: This type is positive for
the L47 SNP, hence this type is another hypothetical future
haplogroup within the current haplogroup R1b1b2a1a1d1. I do not know yet if this type is negative
for L44, a known branch of L47.
Mayka suggested
the “A” code, since this type is obviously Ashkenazi, based on family names
(see also Ysearch results, a few paragraphs down). I presume this one is known to the administrators of Jewish DNA
projects, although I did not do the research to find a first web publication at
67 markers; I would appreciate an email
of a reference to add here, even if it does not exactly match my
definition. It’s OK if an international
modal haplotype differs by a few markers from a haplotype determined in Poland,
particularly if the difference is at markers that are bimodal, indicating
subtype structure.
There are only 5 samples in the
Polish Project in this type (13 Jul 2010). SBP = 7.6% using all 67 markers, which is excellent for such a
small type. The cutoff is 10 and the gap is 9. There are no samples from steps 10 to
18. Although samples in that wide gap
are expected as data accumulates, this type is very well isolated in haplospace.
This type is very robust;
the same 4 samples are selected using any number of markers from 30
to 67 with SBP <10%.
This type is better yet on Ysearch (code 7HB9C), with 18 samples
(13 Jul 2010)
for better statistics; SBP = 4.6%,
which is remarkable. It might be even
better with an optimized definition; I
used the modal haplotype that I extracted from the 4 Polish Project samples.
This
one does not seem as Polish as L47P, although those 18 Ysearch samples are
concentrated in "Greater Poland" including Lithuania.
So far, see ISOGG, L47 and L148 are
the only two known branch haplogroups of L48.
In the Polish Project so far (20 July), no one has tested yet for L148,
and all L48 so far at 67 markers are either L47P (previous topic) or L47A. SNP data is not posted on the web, so I do
not know the frequency (prediction probability) of L48 samples that do not
match either L47P or L47A so belong to yet other clades. I also have not searched the web for the STR
values expected for L148. (There are
two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for
statistical estimation.) All this will
quickly become visible when FTDNA updates their haplotree. As of 20 Jul 2010, L48 is a terminal branch
at FTDNA, so only administrators have visibility of SNP test results beyond
L48, including L47 and L148. Mayka provided the SNP data that I have documented here.
Update 25 Mar 2012.
At the end of July 2010 I added two types from the I haplogroup to this web document. I independently found these two by analyzing
the Polish Project I data. Mayka informed me that they were previously known as clusters, hypothetical clades,
discussed some time previously by Nordtvedt. Mayka added these two to the Polish Project web page in July 2010, based on my recommendation, based on my SBP
analysis. One is a branch of what has
previously been called I2-CE, and seems to represent a Polish collection of
M223 branches so we named it M223CE type, discussed in the next topic. The other seems to be a Polish branch of
I1-M253, so we named it M253P type, discussed in a topic below. I am now also using the short code names
I-CE and I-P for these. I am now splitting
I-CE into I-C, I-D, and I-E, topics below.
My STR
definitions for these are available at Haplotypes.xls,
in the Excel analysis files discussed below, and at Ysearch.
Instructions
for Ysearch comparison are below.
These types are calibrated to Polish Project data. The I-P definition WC8JD
forms a type in the Ysearch database, so it seems to be reasonably valid world
wide. The I-C definition SB6YK,
and the I-E definition QUXE3,
are probably not valid at Ysearch for a sample with origin remote from
Historical Poland, because of interference by other clades with similar STR
values, particularly from Russia.
I-CE.
(M223). Update 25 Mar 2012. ISOGG code is now I2a2a; last year’s code for M223 was I2b1,
still being used at FTDNA and the Polish
Project.
All the I-CE samples in the Polish
Project fall into one of the 3 branches discussed in the following topics.
The M223 clade is very well isolated
in STR haplospace.
FTDNA is able to predict I2b1(M223) with high confidence using only the
first 12 standard markers, for more than
90% of the samples. Using 67 markers, I
found that any reasonable definition does a good job of extracting M223 samples
from Y-DNA STR data. A good definition
is available on Ysearch, code 4H6C9,
using 62 of the 67 standard markers plus 8 additional markers (Mar 2012).
STR isolation in the Polish Project
is generally evidence of a single Polish clade. It is possible that two or more clades with distant nodes in the
Y-DNA tree might have similar STR values by coincidence. In the case of Polish I-CE, since the larger
I-CE world-wide clade is well isolated, my Polish I-CE type might well be a
collection of multiple clades, perhaps including some clades that are not particularly
concentrated in Poland. My original
M223CE type used 4 of the 8 I-CE samples back in 2010. There are now 12 I-CE samples, and they form
two types plus one cluster. It may seem
silly to split these into 3 branches, but there are new SNPs,
discussed below, that justify the split as valid haplogroups. These small types are interesting because
they are preliminary evidence of small Polish clades.
CE stands for Continental Europe,
including Britain.
The M223-Y-Clan project
has lots of data; I used this project
data for reference.
A good signature
is (392, 437, 450) = (12, 14, 9), which distinguishes almost all M223 samples from
others, allowing one mutation step. (594) = (11) is also an excellent signature
for M223, with the value 10 dominant outside M223, but this one is strange in
the Polish Project, where 4 of the 12 samples have value >11; this is evidence that I-C might comprise two
clades.
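A signature test of this kind is easy to express in code. The sketch below uses the three signature values given above; the sample haplotypes are invented, the function name is hypothetical, and "allowing one mutation step" is interpreted here as a total of at most one step summed over the three markers, which is my assumption.

```python
# Signature values from the text.
M223_SIGNATURE = {"DYS392": 12, "DYS437": 14, "DYS450": 9}


def matches_signature(sample: dict, signature: dict, allowed_steps: int = 1) -> bool:
    """True if the sample is within allowed_steps of the signature in total."""
    steps = sum(abs(sample[marker] - value) for marker, value in signature.items())
    return steps <= allowed_steps


# Hypothetical samples, made up for this example.
exact = {"DYS392": 12, "DYS437": 14, "DYS450": 9}
one_off = {"DYS392": 12, "DYS437": 15, "DYS450": 9}
far = {"DYS392": 11, "DYS437": 15, "DYS450": 8}

for name, sample in [("exact", exact), ("one_off", one_off), ("far", far)]:
    print(name, matches_signature(sample, M223_SIGNATURE))
```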
At Ysearch,
the percent Polish samples for I-M223 is low.
The following 3 STR definitions, my proposed Polish branches, capture a
small fraction of M223 at Ysearch.
My Excel file I-CE.xls has
analysis of this type and also analysis of the following three branches. That file has ASD analysis, but ASD age is
very misleading when calculated from samples that are a collection from
multiple large old clades. The three
branches have too few samples to attempt age estimates.
I-C.
(M223+ P78-). (I-C Type Branch). New topic 25 Mar 2012.
I-C type is a hypothetical subdivision of I-CE (M223).
I-C type includes all 4 samples
assigned to I-CE last year, plus one that was missed last year, plus 3 new
ones, for 8 total at 67 markers in the Polish Project. SBP has improved from 19% to 2.6% over the
past year, so this is a clade with high confidence due to the excellent
isolation, although there is a chance it may be two or more independent clades
as discussed above.
My Excel file I-CE.xls has
analysis of this type in column CJ, SBP=2.6%.
My definition uses 67 markers, cutoff 20, gap
14. There are no Polish Project samples
in the gap from step 20 through 33, so this type is very well isolated. This definition also isolates I-E type, 4
samples, steps 34 to 42, but there is a better definition for I-E, see the next
topic.
There are no Polish Project samples
at step 43 or 44. There is only one
I2b2 sample (not M223) at step 45. Then
there are no further samples at steps 46 through 52. So this I-C definition also captures all of the broader I-CE (M223), although surely a
better I-CE world wide definition could be constructed.
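The cutoff and gap idea can be illustrated with a short Python sketch. The step values below are made up, except that they respect the ranges quoted above (I-C below step 20, I-E at steps 34 to 42, one sample at step 45). The function names are hypothetical, and counting genetic distance as a sum of absolute repeat differences is an assumption, not necessarily what my spreadsheet does for every marker.

```python
# Hypothetical sketch of a cutoff/gap check; step values are illustrative only.

def genetic_distance(haplotype, definition):
    """Sum of absolute repeat differences over the markers in the definition
    (assumed stepwise counting)."""
    return sum(abs(haplotype[m] - v) for m, v in definition.items())


def split_by_cutoff(steps, cutoff):
    """Samples with step < cutoff are assigned to the type; the rest are not."""
    members = [s for s in steps if s < cutoff]
    others = [s for s in steps if s >= cutoff]
    return members, others


def gap_width(steps, cutoff):
    """Number of empty steps starting at the cutoff, or None if no sample
    lies at or beyond the cutoff."""
    beyond = [s for s in steps if s >= cutoff]
    if not beyond:
        return None
    return min(beyond) - cutoff


i_c_steps = [3, 7, 9, 11, 12, 15, 17, 19]   # 8 made-up I-C steps, all < 20
i_e_steps = [34, 37, 40, 42]                # 4 made-up I-E steps in 34..42
other_steps = [45]                          # the single sample at step 45
all_steps = i_c_steps + i_e_steps + other_steps

members, others = split_by_cutoff(all_steps, cutoff=20)
print(len(members))                     # 8
print(gap_width(all_steps, cutoff=20))  # 14
```

With these numbers the empty stretch runs from step 20 through step 33, which is the gap of 14 quoted for the I-C definition.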
A good signature
is (406, 487) = (10, 12), which itself distinguishes the 8 I-C samples in the
Polish Project.
Two of the I-C samples are I-D
samples, discussed below. Two other I-C
samples have the same family name, very close in STR values. The remaining 4 samples in I-C are not particularly
close to each other in STR values. The
SNP data for each sample is included in column BX of the “Calculator
sheet”; 4 of the samples tested
negative for all 4 known haplogroup branches of I-M223. So I-C seems to capture M223* plus P95
(below) in the Polish Project.
My definition is also available at
Ysearch, SB6YK. On Ysearch there are plenty of samples from
step 20 through 33, so this definition does not work world-wide. The closest fits are not concentrated in
Poland, so if I-C truly represents a Polish clade(s) my STR definition will not
find members with confidence far from the region of Historical Poland.
I-E.
(M223+ P78+). (I-E Type Branch). New topic 25 Mar 2012.
ISOGG now I2a2a3; last
year’s code for P78 was I2b1c, still being used at FTDNA
and the Polish Project.
My Excel file I-CE.xls has
analysis of this type in column CM, SBP=13%.
My definition uses 67 markers, cutoff 19, gap
7. There are no Polish Project samples
in the gap from step 19 through 25, so this type is very well isolated. Only the I-C samples are all at steps 26 to
44, so this definition also nicely separates I-C from I-E in the Polish
Project.
A good signature
is (393, 459a, 446) = (15, 9, 10), allowing one mutation step, which
distinguishes the four P78 samples in the Polish Project.
Both the P78+ Polish Project samples
are in the M223-Y-Clan
Project, and there are 13 others, but there are many more P78- in M223-Y-Clan,
so this is not a particularly large subdivision of M223.
The other two I-E samples in the
Polish Project have not been tested for SNPs, but both have P78+ close
matches on Ysearch, and no close matches from the other 3 branches of M223, so
those are likely also P78+.
There are two other known haplogroup
branches of M223: M379 has no positives
in M223-Y-Clan, and plenty of negatives, so it is very rare. M284 has plenty of positives in
M223-Y-Clan; that branch is a large
subdivision with a couple known branches of its own, but no samples in the
Polish Project.
My I-E definition is also available
at Ysearch, QUXE3. The Ysearch closest matches are I2b1c, so my
definition is good at extracting P78 samples, but I suppose a better definition
could be constructed for the world-wide P78 data. On Ysearch there are plenty of samples from step 19 through 25,
including some I2b1c beyond step 25, so this definition does not work
world-wide. The closest fits are not
concentrated in Poland, so if I-E truly represents a Polish clade(s) my STR
definition will not find members with confidence far from the region of
Historical Poland.
I-D.
(M223+ P95+). (I-D Cluster). New topic 25 Mar 2012.
ISOGG now I2a2a4; last year’s
code for P95 was I2b1d.
There are only 3 samples P95+ in the M223-Y-Clan Project,
and many P95-, so this is a small haplogroup.
Those 3 include one but not both of the Polish Project I-D. Two of those 3 have Poland listed as origin,
and the third has no origin listed, so this may be a Polish clade, but it is
too soon to tell. It is possible that
I-C has a larger subdivision Polish branch, of which this I-D may be a branch,
but this is just speculation until we get more data.
(640) = (13) seems to be a signature
for I-D, but one STR marker should not
be very reliable for prediction.
I did not enter a definition into
Ysearch. The two I-D samples are
highlighted bold blue in column CI of I-CE.xls. Only one sample is P95+ in the Polish Project - the one that is
also in the M223-Y-Clan Project, so I used that sample as the definition. There is a sample at step 10, and no
others out to step 22, so I tentatively assigned that step 10 sample to I-D,
forming a cluster of two samples, SBP=25%, well isolated from others but not a type.
I-P.
Update 29 Mar 2012. I-P is a
proposed branch of I1(M253).
On 26 July 2011, I added this Polish
type for I1 haplogroup to this web page.
This type has been known as a cluster for a few years. Mayka pointed out to me that Nordtvedt listed it on the web. Marek Skarbek Kozietulski has studied this cluster quite a bit,
since he’s a member. I mentioned this
type briefly in my publication, where I was
previously calling it Y type, considering it not high confidence
based on the data available then in 2009.
I am now very confident that I-P type corresponds to a valid clade,
concentrated in Poland, to be verified someday with a new SNP
discovery.
I have also called this M253P type,
because I-P samples test positive for I1 (M253) and negative for the branches
of I1, although new SNPs for I1 are being rapidly discovered, and the newest
have not yet been tested for I-P. So
this is a type within the paragroup I1*, although a
low fraction of samples from I1* are members of this M253P type. Marek has done the WTY,
March 2011, without finding an SNP for I1.
My analysis file is I-PType.xls. My definition for I-P
type uses 54 markers, cutoff 4, gap 5, no samples in the gap from steps 4
through 8 in the Polish Project at 67 markers.
SBP came out 6.4%
for the 9 samples in M253P in July 2011 in the Polish Project at 67
markers. There are now (Mar 2012) 11
samples with SBP = 5.0%. Marek informs
me that he had identified 4 men who matched at 12 markers and actively
recruited them to obtain all 67 markers and to join the Polish Project. That means only 7 of these 11 samples should
be used for statistical purposes. SBP
calculated on the basis of 7 samples is 8.7%, which is excellent evidence of a
clade that is isolated in haplospace.
I used all 11 samples in my analysis
file in order to best estimate the definition, which is also available at Haplotypes.xls.
A good signature
for M253P is (391, 392, 447) = (11, 12, 24), although this signature alone is not
foolproof for distinguishing I-P from all other I haplogroup samples.
Nordtvedt's I1
Tree has this I-P type as I1*-P1, with related clusters I1*-P2 and AS4.
Here is some interesting speculation
for which I do not have convincing statistical evidence: Marek points out that a sample at step 4 on
Ysearch is Danish, which adds to his evidence that there might be a related
clade in Denmark, perhaps with a node in the I1 tree slightly older than the
node for the I-P Polish clade. I do not
know where that Danish sample falls in Nordtvedt’s tree.
Ysearch provides evidence of
concentration in Poland. My definition
is WC8JD. 73% of the samples that come up in Ysearch
(8 of 11) have Poland as origin.
Although this is a small statistical sample, this is the most Polish
concentrated type I have seen so far.
SBP=22.1% on Ysearch, due to that single Danish sample at step 4, so
although statistically less confident at Ysearch, my definition can suggest
samples from Ysearch for the hypothetical I-P clade, albeit with lower
confidence than samples with Polish origin.
The age comes out
only 567 years using all 67 markers.
See cell N12 of sheet ASD in my file.
There are many caveats associated with
age calculation based on ASD, and this is a small statistical sample. Insofar as Marek may have recruited with a
bias toward close matches, the ASD age is biased low. That said, it is clear that I-P type represents a young clade.
Instructions
for Ysearch comparison are below.
N-G.
(N-L551). (N-G
Type). Update 22 Mar 2012. Introduced on 17 Oct 2010 as “N1c1(M178)-G
type”. The latest ISOGG
code is N1c1d1a
(L551).
Mayka suggested
this one, based on a suggestion by Andrzej Bajor, from his Rurikid
Dynasty Project. This type is concentrated in
Lithuania, and Andrzej suggests that at least one member might be a male line descendant
of Gediminas, the medieval Lithuanian Duke.
Hence the “G” code.
This type has 9 samples at 67 markers very well isolated in the Polish
Project with SBP = 8.9%.
See N-GType.xls. The definition is
also available at Haplotypes.xls
and at Ysearch as RGE95,
using 51 markers, cutoff 3 (samples < step
3). All but one of the N-G samples can
be extracted from the Polish Project using only the signature
(392, 607, 557) = (15, 14, 13).
This type should not be confused with another G type in the R1a haplogroup.
That new L551 SNP
verifies our prior prediction that G type corresponds to a clade. All 9 of the predicted G type samples at 67
markers have tested L551+, and samples predicted just beyond G type are coming
out L551-. Of course, there will
probably be a few exceptions as more data accumulates, but so far N-G type (STR match) is equivalent to L551 in the Polish Project.
At Ysearch, N-G type is not as well
isolated; the SBP is 22% with cutoff 4,
due to interference by what might be a Russian clade. There are many Lithuanian samples matching my N-G definition
(RGE95), including Lithuanian samples beyond the cutoff (step 3). 46% of the Ysearch samples below step 9
indicate Lithuanian origin. L551 is too
new to be included in Ysearch, so this paragraph refers to N-G type as defined
by STRs.
I do not know if the Polish Project
N-G samples are an independent Polish sub-clade of a larger Lithuanian
clade; or if the Polish Project samples
are just a random sample of individuals from a larger clade(s). I have not taken the time to search other
projects for STR matches to my N-G definition, or to search for more L551+
samples. Someone might inform me before
I get a chance to search. Watch this
topic for updates.
The age of N-G type
seems to be less than 1,000 years, perhaps only 500 years. Check the “ASD” sheet in my analysis
file. ASD age is highly uncertain, particularly for such a small
sample, but G type has little STR variance, so surely G represents a clade
younger than 2,000 years old. Isolation
is evidence of an old node, with TMRCA much younger than the node. The age of the L551 mutation can be anywhere in the time span
older than the TMRCA of G type and younger than the node. N-G type is well isolated in Lithuania and
Poland, but N-G may have a relatively young node with those other clades
world-wide with similar STR values.
Those other clades can be used to better constrain the age of the L551
mutation.
N-M.
(N-L591).
(N-M Cluster). Update 22 Mar 2012. Mayka suggested this
one also, introducing it at the Polish Project in Jan 2011, as “N1c1(M178)-M
Cluster”. The latest ISOGG
code is N1c1d1b
(L591). Includes Mickevius
(Mickewicz) descendants. Hence the “M”
code. Also concentrated in
Lithuania. These two, N-G and N-M, are
a small fraction of the M178 clade.
I call this a cluster because it does
not meet my criterion SBP<20% to be called a type. Actually, the
original proposed cluster is equivalent to what I am now calling Ma cluster,
discussed below. The recent new SNP named L591 is coming out with about
twice as many samples, so we have adopted the “M” short code name for the STR
data for L591; this larger N-M cluster
is therefore considered equivalent to N-L591.
My analysis is available, N-MCluster.xls,
10 samples at 67 markers. My best
automatic definition for N-M, column CL,
SBP=25%, is 80% accurate, missing one sample that is obviously L591 and
predicting one sample that came out L591-, out of 10 predicted. Actually, this result is a nice confirmation
of my SBP method, because although the data has only 10% background
(false positives captured by the definition), my SBP formula has an increase to
account for statistical confidence; hence 25% is a better upper confidence
estimate of the background for so little data.
I bet as more data accumulates my best N-M definition will drift below
SBP=20%, qualifying as a type. Anyway,
this is moot, because L591 is a better criterion for the clade, and there is a
logical distinction between the N-M cluster (samples with STR correlation) and
the L591 haplogroup. My definition
serves as a guide for priority for L591 testing. Testing should be concentrated near the cutoff.
Accordingly, I came up with an
improved STR definition for L591, using a mask to manually adjust marker
selection. I’ll still call it by the
short code N-M. Column CC in that
file. SBP=50%, but SBP does not matter
here, because the purpose of the definition is not to discover a hypothetical
clade, but to predict samples for a known clade. Most clades do not produce low SBP because most clades are not
well isolated. Let me elaborate with
discussion of the statistical issues for N-M:
Obvious issue: There are three N-M samples with a very rare
6 step mutation at DYS446, from 17 to 11.
Without DYS446, two of these three marginally fit the N-M cluster (based
on STRs). These three seem to represent
a subclade of L591 with modal STRs slightly
drifted since their node. I marked them as “Mb” in that Excel
file. Only one of these has actually
tested L591+. Another one of these is that
“obviously L591” sample that I mention above, the “obviously” based on this 6
step mutation, which is almost as good a marker as an SNP. That “obviously” sample is an STR outlier at other markers, which need to be excluded from
the L591 definition, assuming more samples like this will show up. This seems obvious, but it needs
verification with more data over the near future.
Speculative issue: There are two other outliers, which I
labeled Mc and Md. Tested L591+. These may represent two clades with nodes
only slightly younger than the TMRCA for L591, with
independent modal drift. Highly
uncertain. They might just be
statistical outliers, due to the luck of random mutation. Again, more data will tell. For now, I adjusted the N-M definition to
capture them, on the assumption that some future samples might come up with
similar STR values.
Another issue: That one sample, mentioned above, fitting
the M cluster very well but L591-, probably represents a clade with a node
slightly older than L591, but similar STRs by coincidence; there may be other such clades. Again, this is speculative, but I adjusted
my definition to exclude this one.
Statistical speculation summary: L591 does not seem very well isolated in haplospace, albeit more isolated than most young Y-DNA
clades. It seems the L591 tree has
nodes close to the SNP age, both younger and older.
My L591 definition is available in
that Excel file, in Haplotypes.xls,
and at Ysearch as 64RUG.
This L591 clade seems to be
concentrated in Lithuania. The evidence
is Ysearch - Lithuanian concentration of the N-M cluster. L591 test data is not available yet at
Ysearch. My Ysearch analysis (data in a
sheet in that Excel file) is similar to the G type analysis: SBP not as good because of apparent
interference from clades world-wide.
Using the N-M definition at Ysearch, there is Lithuanian concentration
at steps well beyond the cluster cutoff, so there seems to be a larger
Lithuanian clade.
In the Polish Project, I spotted
evidence of such a larger STR type, about double the size of N-M, including
all the N-M samples as a sub-clade. I
colored these samples green in column BX of N-MCluster.xls, using all 67
markers. I dubbed this one N-L
type. That 67 marker evidence is not
satisfactory because it captures a couple N-G samples. In another file, not posted on-line, I came
up with a satisfactory definition for N-L;
I provide it in the “Haplotypes & Masks” sheet, row 21, of
N-MCluster.xls. Mayka advises me that
there are two new SNPs, L1025 and L1027, that are currently candidates for a
haplogroup larger than L591. We are
waiting to see how those come out before introducing N-L. That N-L definition cutoff provides a
suggestion of where to prioritize SNP testing.
The age of N-M (L591) comes out
similar to the age of N-G type, probably less than 1,000 years; see that short
paragraph in the N-G topic above. My
comments about isolation of N-G in the Polish Project do not apply to
L591. For N-M, it is important to
exclude DYS446, because that one marker triples the age as calculated using ASD
(STR variance), due to that 6-step deletion mutation mentioned above. You can see this by editing cell BV21 in my
mask in my “ASD” sheet in that file.
Another way to edit this is to edit the 446 value, to make the mutation
count one or two, which is more representative of the age. This is a good example of one of the caveats associated with age calculation based on
STR variance.
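A toy calculation shows why one multi-step mutation distorts ASD. The sketch below is not the formula in my "ASD" sheet. It assumes the plain definition of ASD as the squared difference from the modal value, averaged over samples and markers, and it uses made-up data with three placeholder markers (m1, m2, m3) plus DYS446. Because the toy has only 4 markers, the inflation is far larger than the factor of three seen with the real marker set.

```python
import copy

# Hypothetical modal haplotype; only the DYS446 values 17 and 11 come from the text.
MODAL = {"m1": 10, "m2": 12, "m3": 14, "DYS446": 17}


def asd(samples, modal, markers):
    """Average squared distance from the modal values over samples and markers."""
    total = sum((hap[m] - modal[m]) ** 2 for hap in samples for m in markers)
    return total / (len(samples) * len(markers))


# Ten made-up samples: four carry a single one-step mutation,
# three carry the 6-step DYS446 change from 17 to 11.
samples = [dict(MODAL) for _ in range(10)]
samples[0]["m1"] += 1
samples[1]["m2"] -= 1
samples[2]["m3"] += 1
samples[3]["m1"] -= 1
for i in (7, 8, 9):
    samples[i]["DYS446"] = 11

all_markers = list(MODAL)
without_446 = [m for m in all_markers if m != "DYS446"]

# Alternative: keep DYS446 but recode the 6-step change as a single step.
recoded = copy.deepcopy(samples)
for hap in recoded:
    if hap["DYS446"] == 11:
        hap["DYS446"] = 16

asd_all = asd(samples, MODAL, all_markers)
asd_masked = asd(samples, MODAL, without_446)
asd_recoded = asd(recoded, MODAL, all_markers)

print(round(asd_all, 3))      # 2.8
print(round(asd_masked, 3))   # 0.133
print(round(asd_recoded, 3))  # 0.175
```

Each 6-step difference contributes 36 to the sum of squares, against 1 for a one-step mutation, so three such samples dominate the total. Masking the marker or recoding the change as one step brings ASD back in line with the other markers.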
N-Ma. New topic 20 Mar 2012. This is the original “N1c1(M178)-M Cluster”
cluster, explained in the previous topic.
Only 3 samples when introduced Jan 2011, SBP=36%. Now there are 5 Ma samples, SBP=30%. Although still not qualified as a type,
there is better than a 30% chance this will improve over the next couple years
as data accumulates. Lithuanian
concentration, same as N-G and N-M.
Again, I do not expect validity world-wide for N-Ma because of interference
from other clades world-wide, but this might grow into a nice small, young
Lithuanian clade. Analysis is in N-MCluster.xls,
where the 61 marker definition for Ma is in column CG.
Instructions for Use of Ysearch
Link to the site: http://www.ysearch.org. Brief description of Ysearch.
Click on the Create A New User tab,
where you can upload your Y-DNA STR data from a number of testing
services. Or, you can type in your
data. You end up with a “User ID”.
Ysearch has a Research Tools tab to
click, where you can type in other User ID’s for comparison.
Cluster Genetic Distance
Method; for Haplogroup R1a: P - Pc - Pg - N - K - A - I -
B - D - E - Fa - Fb - H - M - G:
Click here: Research
Tools
Copy the following line into the
“UserIDs” bar at the Research Tools page:
USEID, 8U92G, RQK32, 92HEK, 3SEJK, MN8R3, FCUFG, EKVHX,
RU8Z8, K49NZ, GNYBG, YQ6D2, EFQM7, 559EE, 24MB4, ZD29Z
Change USEID to your User ID.
You need to type the Captcha puzzle
for access.
Click on “Show genetic distance
report”. You get a table of results.
Results: If there is a small genetic distance result
(3 or less) for one of these types, you have a high probability of belonging to
that type. There are more detailed
rules available, see the “Polish Project Rules” sheet in the R1a Assigner.xls
file. For haplogroups I, N, R1a, and
R1b, see also Haplotypes.xls.
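For anyone who prefers to prepare the comparison line in a script, here is a minimal Python sketch. Ysearch itself is used through the web page, so nothing here calls the site. The function names, the sample User ID "ABCDE", and the example distances are all hypothetical, and the pairing of each type code with a User ID simply assumes that the two lists above are in the same order.

```python
# Hypothetical helper for preparing the "UserIDs" line and reading the report.
R1A_TYPES = ["P", "Pc", "Pg", "N", "K", "A", "I", "B",
             "D", "E", "Fa", "Fb", "H", "M", "G"]
R1A_IDS = ["8U92G", "RQK32", "92HEK", "3SEJK", "MN8R3", "FCUFG", "EKVHX",
           "RU8Z8", "K49NZ", "GNYBG", "YQ6D2", "EFQM7", "559EE", "24MB4", "ZD29Z"]


def userids_line(my_id, type_ids):
    """Build the comma-separated line to paste into the "UserIDs" bar."""
    return ", ".join([my_id] + list(type_ids))


def likely_types(distances, max_distance=3):
    """Return the types whose genetic distance is max_distance or less.

    distances maps a type code to the distance read off the Ysearch report.
    """
    return [t for t, d in distances.items() if d <= max_distance]


print(userids_line("ABCDE", R1A_IDS))

# Made-up distances typed in from a report, for illustration only.
example_report = {"P": 2, "N": 11, "K": 14}
print(likely_types(example_report))  # ['P']
```

A distance of 3 or less is only the quick rule given above; the more detailed rules in the files mentioned still apply.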
Reminder: This web page concentrates on the region of Eastern Europe
associated with Historical Poland. If
your male line is not from this region, the results of this Ysearch comparison
may be misleading if there are unrelated clades, rare in Historical Poland,
with haplotype range that overlaps one of these. Search for my discussion, in this web page, for your best match
type; in some cases I have evidence for
interference world-wide (significant matches by unrelated clades). Many men of Polish male line ancestry do not
match any of these types; this web page
is a work in progress. For non-Polish
there is a higher probability of not matching any of these types.
Cluster Genetic Distance
Method; for Haplogroup I: I-P, I-C, I-E:
Follow the R1a instructions above,
except copy the following line into the “UserIDs” bar at the Research Tools
page:
USEID, WC8JD, SB6YK, QUXE3
Cluster Genetic Distance
Method; for Haplogroup N: N-G, N-M:
Follow the R1a instructions above,
except copy the following line into the “UserIDs” bar at the Research Tools
page:
USEID, RGE95, S4RUG
Cluster Genetic Distance
Method; for Haplogroup R1b: R1b-P, R1b-EE:
Follow the R1a instructions above,
except copy the following line into the “UserIDs” bar at the Research Tools
page:
USEID, MKM4R, CX94E
Comment 18 Jul 2012: See the L260 and
M458 News topic for an update.
This topic needs a complete rewrite
with new test results.
This topic was completely rewritten
during Dec 2010 & Jan 2011; last
update edit 15 Jan 2011.
This topic provides discussion. Read L260 and M458
News first, for a summary. For
detailed numbers see the following topic.
L260 and M458 are the names of two SNP tests.
Lawrence Mayka is
the administrator of the Polish Project. SNP results are not posted on the web. Most of my SNP
data comes from Mayka. Some of my data
comes from Cyndi Rutledge, the administrator of the R1a Project.
Many men join both projects, but of course many men purchase the L260 or
M458 test and do not join either. If
you are an administrator of an FTDNA project (or a project at another database)
you may send me the L260 and M458 results for your project for merging into my
analysis, if you wish. Karen Melis, the
administrator of the Zamagurie Project, also sent me a few M458 results.
Data with the 67 standard markers is most
common in the SNP results because Mayka and I selected these for the initial
tests. In addition, men who have
purchased less than the standard 67 markers are less likely to purchase SNP
tests. This discussion is limited to
the 67 marker data with only brief comments about those with <67.
Mayka and I purchased many L260 and
M458 tests for Polish Project members, so test results available to me are
biased toward Polish data. Also, I
suppose men who notice my publication and web pages about Polish types are more
likely to purchase the L260 and M458 tests, so even data not available to me
might be biased toward Polish data. At
first we were concentrating on samples that match P type and N type very well,
so much of the data available to me are biased toward P type and N type, of
course. Later we concentrated on
borderline samples that just barely match P type and N type, in order to better
define the borders in STR haplospace. If there are clades from outside R1a1a1g
(M458) that just happen to have STR values that match P type or N type we will
discover them quickly, but not if they are concentrated far from Poland, and
particularly not if they are concentrated in any Eurasian lands where men do
not tend to get DNA tests. If there are
M458 clades with STR values very different than P type or N type it will take
some time to discover them all, because those will require “deep clade” tests
by men without an M458 prediction to do the M458 test anyway. I have many such
“wildcat” results; so far I have no
L260+ or M458+ with STR values very distant from P and N type. I have comments below in this topic about
the few outlier results a few steps beyond P and N
types.
The SNP results do not provide
estimates of population frequency because we are selecting the most interesting
samples for SNP tests. However, since
the SNP tests verify my type classification, my STR types provide credible frequency
estimates. My Results Table is still the best estimate
of frequencies in Poland: P type for
M458+ L260+; N type for M458+ L260-.
My types
are defined by STR values following my mountain method.
For samples with all 67 standard STR markers my P type definition uses 46 of those markers; N type uses 45. The cutoff for both P and N is step 7, which means samples less than 7 genetic distance (step
mutations) from the definition are predicted as belonging to the corresponding
type.
To be fair, I should point out that I
was a bit more conservative with my P and N predicted assignment rules 2 years
ago, before the M458 and L260 SNPs were available, and when there were not as
many samples with all 67 markers. Also,
there were fewer known types 2 years ago.
Half of today’s P and N outliers would be missed using my rules from 2
years ago and the others would be placed into “PK Borderline” and “NK
Borderline” categories because 2 years ago I was more concerned about
distinguishing P and N from K type, now known to be
M458-. I no longer use those PK and NK
categories. With recent data, my
current STR based assignment rules are much more accurate for P and N outliers. I changed the P type definition October
2011.
I cannot define P type as exactly
equal to L260, nor can I define N type as exactly M458 minus L260, because the
types are defined by STR correlations. The outliers may be statistical, due to the
luck of random mutations, particularly for P type with only 2 outliers so far
(15 Jan 2011). I find that unlikely for
N type, because the N branch STR distribution seems to have a non random tail
extending to many outliers. It is
possible that N branch outliers represent very small clades
(perhaps only one clade) with old nodes in the Y-DNA tree. However, any particular outlier at or beyond
the N cutoff cannot be assigned with confidence to a subclade of N. This is the reason I use the word “branch”
instead of type for outliers, because I cannot be confident they all belong to
the same young clade, as opposed to multiple young clades with old branches -
with old nodes in the Y-DNA tree.
However, those N type outliers
provide confident assignment rules. At
the N cutoff step N=7 all 4 samples in the Polish Project have been tested
M458+ confirming N branch. At the next
step N=8, 3 of the 6 in the Polish Project fit well for prediction into one of
the M458- types, and 1 of those has been tested M458-; the other 3 N=8 do not fit any of the other
types and indeed have been tested M458+ confirming N branch. This analysis is continued below in the next
topic; the result is that samples without
SNP results that have STR values at the cutoff or 1-2 steps beyond P or N type
can be predicted with 100% probability (not 100% statistical confidence) to belong to the corresponding branch, for
those samples that do not fit another type.
At 3 steps beyond the cutoff probability is still about 50% for
belonging to the branch.
P type and N type are very well
separated from each other. Within P
type, there is only 1 sample with steps N=P+5;
all others are N>P+5. N type
is more diffuse in STR values than P type.
For N<6 there are 3 with P=N+5.
The most ambiguous N type sample has N=7 (cutoff) P=8, and that one has
been evaluated M458+ L260- confirming that samples marginally N type are really
N branch. There are 3 others with N=6
or 7 and N<P<N+4; 2 of them are
confirmed M458+ L260- and the other is M458+ but not tested for L260 yet. The most distant sample has N=10 P=9 and it
is confirmed in the N branch, M458+ L260-, again providing the insight that
distant STR samples with P step about equal to N step tend to fall into the N
branch. (Again, this is for Polish
Project samples that do not fit another known type). Of course, we expect someday to see exceptions, just due to the
luck of random mutations.
There is one sample with P=9 N=11,
but that one has a recLOH mutation that scores 4 steps
at the DYS464 set. This is really only
one mutation, so I manually adjusted the step to P=6 on this one.
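The adjustment is simple arithmetic, sketched here in Python. The function name is hypothetical, and the rule of replacing the scored steps with a single step is an assumption that happens to reproduce the P=6 value quoted above.

```python
def adjust_for_recloh(raw_step, steps_scored, true_mutations=1):
    """Replace the steps scored for a recLOH event with the number of
    mutations it really represents (assumed to be one)."""
    return raw_step - steps_scored + true_mutations


# P=9 with a recLOH event scored as 4 steps at the DYS464 set.
print(adjust_for_recloh(9, 4))  # 6
```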
For P type, the closest M458- sample has
P=7 (cutoff); it fits I
type; this is the sample that
originally sparked my interest in P type. A P=8 M458- sample is assigned to K Borderline. A P=9
N=9 sample is the closest M458- sample that does not fit any known type, so is
assigned to the Remainder category.
Borderline comments: In the Polish Project we use borderline categories for samples that have 50% to 79%
confidence of belonging to a haplogroup or type. For P and N type samples with 67 markers, borderline means the
SNP test has not been performed. With SNP
results, samples are placed in the corresponding P or N type, with the understanding
that outliers may in fact belong to closely related clades, as explained above.
Remainder comments: I use remainder
categories for samples that have less than 50% estimated probability of
belonging to any known type. Until
recently we distinguished between the Rx458 category for samples not tested for
M458 (and not positive for L260) vs the R458- category for samples that have
been tested negative for M458. Today,
all samples distant from all known types have been coming out R458-, so the
Rx458 data has been merged into the R458- category.
During 2010 I used a R458+ category
for N branch outliers, to distinguish outliers, which might not be true N type
members. However, the distribution of N
STR values is continuous, with no objective cutoff for N type vs N branch, so
the R458+ distinction was dropped for now.
This discussion concentrates on
samples with 67 markers for clarity.
There are 31 with only 37 markers and 2 with only 12 that have SNP
results. I watch these for obvious
anomalies; none yet. Analysis has lower confidence with fewer
markers.
Summary of results: P type and
N type are very well isolated in STR haplospace. They are well isolated from M458- samples
and even more isolated from each other.
Roughly 90% of the M458+ samples cluster into the two STR types within
which I can make future SNP predictions based on new STR data with virtually
100% confidence.
The roughly 10% remainder have STR values near the cutoffs for the
types, mostly N type. Future STR
predictions for these can be made with more than 50% statistical confidence (up
to 100% probability based on the few data available
so far) because most of these that do not fit one of the other known types do
come out L260+ if closer to P type and M458+ otherwise. It is possible that some of these outliers
belong to small clades (perhaps only two or three) that have older nodes in the
Y-DNA tree.
Age (TMRCA) of haplogroups is uncertain due to a number of caveats.
That said, N type seems to be about 2,000 years old and P type seems to
be about 1,500 years old. Those
estimates can be up to a factor of 2 incorrect, as discussed in my caveat
topic. The ages of L260 and M458 are
particularly uncertain because the calculated ages are dominated by P and N
types, which are quite young. The SNPs
may be much older, for all we know. The
outliers in the P branch are too few to have significant effect on the
calculated age of P type. It is
possible that the N branch is really two (or more) types that are just as young
as P; the calculated N age in such a
situation would come out older. Ng type provides preliminary evidence of a hypothetical
subtype of N, but Ng is too small and too close to N to affect the calculated
age of N.
What does all this mean? There are a number of explanations. Here is the explanation that seems simplest
to me: The R1a1a1g (M458) clade seems to be thousands of years old. It may have expanded into a large population
long ago. The members of this clade diffused
into a wide distribution of STR values over the millennia. Then there was a severe population bottleneck followed by a
rapid population expansion, or multiple bottlenecks followed by multiple expansions. The living members of M458 descend from only
a few men who each lived near the beginning of the most recent population
expansion. Almost all living M458 men
descend from just two of those men: the
N type MRCA and the P type MRCA. A low
percentage of living M458 men perhaps descend from other MRCAs who lived at
roughly the same time as those two, as evidenced by the outliers in the N
branch SNP data available to me today.
L260
and M458 Test Results; Details
Comment 18 Jul 2012: See the L260 and
M458 News topic for an update.
This topic needs a complete rewrite
with new test details.
This topic was completely rewritten
during Dec 2010 & Jan 2011; last
update edit 15 Jan 2011.
Number of samples:
51 L260 Polish Project
   32 L260+
   19 L260-
8 L260 R1a Project, counting only those not in the Polish Project
   2 L260+
   6 L260-
59 L260 Total
   34 L260+
   25 L260-
154 M458 Polish Project
   83 M458+
   71 M458-
43 M458 R1a Project, counting only those not in the Polish Project
   9 M458+
   34 M458-
7 M458 Zamagurie Project, not in lists above
   7 M458+
204 M458 Total (sum from above)
   99 M458+
   105 M458-
263 Total SNP test results (sum from above)
213 unique samples (210 different men)
   180 have all 67 standard STR markers
   31 have the 37 standard set
   0 have the 25 standard set
   2 have the 12 standard set
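The totals above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the dictionary layout and the names are hypothetical, and the R1a Project L260 row is taken as 2 L260+ and 6 L260-, which is the split that the stated totals (34 L260+, 25 L260-) require.

```python
# Hypothetical tally of the SNP counts listed above: (positive, negative) per project.
l260 = {"Polish Project": (32, 19), "R1a Project": (2, 6)}
m458 = {"Polish Project": (83, 71), "R1a Project": (9, 34), "Zamagurie Project": (7, 0)}

def totals(counts):
    """Return (positive, negative, all) summed over projects."""
    pos = sum(p for p, _ in counts.values())
    neg = sum(n for _, n in counts.values())
    return pos, neg, pos + neg

l260_pos, l260_neg, l260_all = totals(l260)
m458_pos, m458_neg, m458_all = totals(m458)
all_snp_results = l260_all + m458_all

# Unique samples by number of STR markers available: 67, 37, 25, 12.
unique_samples = 180 + 31 + 0 + 2

print(l260_pos, l260_neg, l260_all)     # 34 25 59
print(m458_pos, m458_neg, m458_all)     # 99 105 204
print(all_snp_results, unique_samples)  # 263 213
```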
P type summary at 67 markers:
34 P<6 predicted P type; all confirmed
7 P=6 predicted P Borderline; all confirmed P type
1 P=9, but P=5 or 6 if corrected for recLOH, so predicted P type; counted as P<6; confirmed L260+
42 P type; so far, all samples below the cutoff 7 came out L260+, confirmed P type
1 P=7 (cutoff for the definition) P branch outlier; confirmed L260+
1 P=8 P branch outlier; confirmed L260+; this one from Czech Rep. is not in the Polish Project
2 P branch outliers; so far, all SNP data samples with P<9 are either P type or fit well to another type
so far, no L260+ with P>8
N type summary at 67 markers:
25 N<6 predicted N type; all confirmed
6 N=6 predicted N Borderline; all confirmed N type
31 so far fit N type, all confirmed
4 N=7 (cutoff) predicted N Borderline; all confirmed M458+ N branch
There are no samples from other types at N<8
So far, all samples below with N<8 came out M458+ L260-, confirmed N branch
4 N=8 N branch outliers; all confirmed M458+
8 N branch outliers N<8 100% probability predictable, >50% confidence
So far, all SNP data samples with N<9 are either N type or fit well to another type
2 N=9 N branch outliers
1 N=10 N branch outlier
3 with 50% predictability but no confidence, explained below
P type Details; samples that
have all 67 markers sorted by P step:
34 P<6 solid P type (P<6 means
<6 mutation steps from P type definition)
3 N=9 all solid P type well isolated from N
5 N=10
26 N>10
16 of the 34
are M458+, L260+; confirming P type not
N type
8 are L260+
not tested for M458, assumed to be positive, confirming P type
10 M458+ not
tested for L260 - predicted positive
7 P=6; 1 step below cutoff;
would be predicted P Borderline prior to SNP evaluation; all 7 are M458+
4 are L260+, confirming P
type
3 not yet tested for L260
probably most of these will be positive, now predicted P type
These
represent all the Polish Project samples at step 6, 1 step below the cutoff,
because these were selected for M458 evaluation soon after M458 was
discovered. So step 6 is not as common as
it seems in this SNP analysis.
2 P=7; cutoff; first step just
beyond P type predictions
1 P branch outlier
predicted P Borderline; confirmed
L260+. From Bohemia.
1 predicted I type,
verified M458-, not in the P or N branches
P=7 is very
rare. By the way, this sample is my maternal grandfather.
5 P=8; 1 step beyond cutoff; 2
steps beyond P type
1 M458+ L260+ P branch
outlier; not Polish Project; R1a project from Hostacov CR
1 predicted K Borderline;
result M458- confirms not P or N
type; still predicted K Borderline
3 N<=P considered N
branch; details below
9 P=9; 3 steps beyond P type
1 N=11; M458+ L260+ This
one has recLOH at 464, contributing 4 steps, so I consider this equivalent to
P=6, so I count it as predicted P type, not an outlier. This is marginal, since it could be argued
that the recLOH mutation may have happened after a 1 step mutation at 464 for
all we know, making 2 steps, placing this sample an outlier at the cutoff 7, so
my decision to predict him P type is arguable.
1 N=12; K=1; predicted Fa
type (subtype of K) confirmed M458-
7 N<(P+2) considered N
branch; details below
123 P>9 none are L260+; 18 are
L260-
180 total with 67 markers, sorted
here by P step
N type Details; samples that
have all 67 markers sorted by N step:
25 N<6 solid N type
all 25 are M458+
0 L260+
3 of the 34 are M458+,
L260-; confirming N type not P type
3 P=9; N<5;
match N type much better than P type
29 P>9
6 N=6; 1 step below cutoff; were predicted N Borderline a couple years
ago
all 6 are M458+
2 are L260-; the other 4 are predicted L260-
1 of the 6 is P=8 just
beyond cutoff, but P is a tighter cluster, so this would not be predicted P, and
this one came out L260- as expected
1 of the 6 is P=9, L260-
4 P>9
so all 6 are well
isolated from P type
4 N=7; cutoff. These represent
all the Polish Project samples at step 7, because these were selected for M458
evaluation soon after M458 was discovered.
So step 7 is not as common as it seems in this SNP analysis.
all 4 are M458+
2 are L260-; the other 2 are predicted L260-
These are predicted N
Borderline prior to SNP evaluation
5
N=8; 1 step above cutoff; 2 steps beyond the original N type definition
4 M458+; predicted N
Borderline, now classified N type
1
P=8,10,10,13; K>6; fit no other known type
1 is Austria,
not Polish Project
1 M458-, P=11, K=3
predicted K type, M458- result confirms K not N
9 N=9; 2 steps beyond N cutoff
3 predicted N Borderline
do not fit any known type
2 M458+ N
Branch outlier; P=11,16
1 M458-
Assigned to R Remainder category; P=9, K=7
even at N=9,
2/3=67% probability N branch for samples that do not fit other known types
3 predicted D & G
types verified M458- not M458 branch
3 P<5 P type analyzed
above all 3 are L260+
10 N=10; 3 steps beyond N cutoff
2 do not fit other types
1 M458+ L260-
N branch outlier; P=9; DYS573 null - discussed in previous topic
1 M458-
Assigned to R Remainder category; P=16, K=8
3 fit other types; D, DB,
& K, predicted M458-; all confirmed
M458-
5 P<5 P type analyzed
above 4 are L260+ 1 not L260 tested yet
121 N>10
36 P type discussed above
3 <(N-2) other types discussed
with P sort above
75 M458- predicted other
types, neither P nor N
7 N>17 M458- do not fit any type; R Remainder category
180 total with 67 markers, sorted
here by N step
For background explanation of L260
and M458, see the topic L260 and M458.
This topic was written 17 Jan 2011.
DYS385a. The single STR marker called 385a is by far the best
signature for predicting P type vs N type. All 34 samples with L260+ result so far have
the value 10. All 25 samples
with L260- result so far have the value 11.
Mayka also
independently noticed this and mentioned it to me.
Usually, a signature with more STR
markers predicts better. In this case,
discriminating P (L260+) from N (L260-), 385a=10 predicts best by itself. No signature with 2 or more markers
discriminates better. In fact, just
385a=10 works as well as the 46 marker P type definition.
This seems amazing, but is not
entirely unexpected. STR markers have
lower mutation rates at lower values, and step down mutations are less frequent
than step up. Since N type has mostly
385a=11, a step down to 10 should occur less often than a step up to 12.
The mutation rate of 385a=10 in P
type (L260+) seems very low. At another of my web pages I postulate a rare SNP in the
middle of a long STR chain to explain a low mutation rate, but such a postulate
does not seem necessary in this 385a case because of the short STR chain
value. For the lower rate at lower STR
values, I provide a reference to Whittaker (2003) in my publication.
We can predict that future M458+ samples
will be L260+ if 385a=10 and L260- otherwise.
The probability is 100%.
Exceptions are zero out of 59 L260 results so far. I figure the confidence
of this prediction at 94%: Poisson 94%
confidence interval for zero is the interval zero to 3.5; (1-3.5/59) = 94%. In other words, I am 94% confident that 3 or fewer samples out of
the next 59 L260 measurements in the Polish Project will be exceptions to this
new rule - that 385a=10 means L260+.
Exceptions will be found eventually, of course, due to rare independent
mutations from 11 to 10.
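The arithmetic in the paragraph above can be reproduced as follows. This is a minimal sketch with hypothetical function names, not the worksheet used for this page. It computes the figure quoted in the text, 1 - 3.5/59, and, for comparison, the textbook one-sided Poisson upper limit for zero observed events, -ln(1 - level). Under that textbook formula an upper limit of 3.5 corresponds to a level of about 97%, so the 94% figure is best read as a rough estimate.

```python
import math

def poisson_upper_limit_zero(level):
    """One-sided upper limit on the Poisson mean when zero events are observed."""
    return -math.log(1.0 - level)

def level_for_upper_limit(upper):
    """Inverse of the above: the level at which `upper` is the zero-event limit."""
    return 1.0 - math.exp(-upper)

def quoted_confidence(upper, n_samples):
    """The figure used in the text: 1 - upper / n_samples."""
    return 1.0 - upper / n_samples

print(round(quoted_confidence(3.5, 59), 3))      # 0.941
print(round(level_for_upper_limit(3.5), 3))      # 0.97
print(round(poisson_upper_limit_zero(0.94), 2))  # 2.81
```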
In the Polish
Project, all 96 samples assigned to P type and all 15 samples assigned to P
Borderline have the value 10 for 385a.
There are 89 samples assigned to N and N Borderline. Only 7 of these have the value 12 for
385a; the other 77 have the value
11. In this case, predicting P type
based on 385a=10, zero exceptions out of 100 samples, provides 97.8%
confidence.
I postulate that 385a has only a
slightly higher mutation rate in the N branch, at value 11. I postulate that those 7 N branch samples
with 385a=12 belong to 2 or 3 subtypes in the N branch, 2 or 3 independent
instances of a mutation from 385a=11 to 12.
Most of these belong to a hypothetical Ncm type. The data is not sufficient yet to provide
statistical evidence along these lines.
385a does not work quite that well
for discriminating P type from all of R1a.
Among the 91 M458- samples not tested for L260 there are 2 with 385a=9
and 4 with 385a=10. None of those are
expected to be L260+ because L260 is a subhaplogroup of M458. The 385a marker is still the best single
marker for extracting P type from a full R1a database, including M458- samples
from outside the M458 (P+N) haplogroup.
However, in this case, using 2 or 3 markers works better, and of course
the definitions (46 markers for P, 45 markers for N) work much better than any
short signature.
A few samples with 385=(10,10)
represent a hypothetical subtype within P.
I call this Pk. I’ll discuss it
more if and when there are enough samples for statistical significance.
Other signatures. Table 3 of
my publication provides other signature
markers. DYS572=12 continues to be 2nd
best for P type. DYS537 continues to
be best for N type.
My R1a
page has a handy 3 marker signature table.
I announced this more than a year ago, as a handy prediction signature
for the dominant types in R1a, using only the first 25 markers most common on
the internet. It still works well. That signature uses (385a, 439, 447). The values for P type (L260+) are
(10,10,23). The values for N type
(M458+ L260-) are (11,11,23). The
values for K type (M458-) are (11,10,24).
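As an illustration, the three marker signature and the single marker 385a rule can be written as a simple lookup. This is a sketch under assumptions: the function names are hypothetical, the match is exact (no step distance), and the M458+ L260- signature is labeled N type. Real assignments use the full definitions, so this is only a quick first guess.

```python
# Three-marker signatures (DYS385a, DYS439, DYS447) quoted in the text.
SIGNATURES = {
    (10, 10, 23): "P type (L260+)",
    (11, 11, 23): "N type (M458+ L260-)",
    (11, 10, 24): "K type (M458-)",
}

def predict_type(dys385a, dys439, dys447):
    """Return the type whose signature matches exactly, or None if none does."""
    return SIGNATURES.get((dys385a, dys439, dys447))

def predict_l260(dys385a):
    """Single-marker rule, meant only for a sample already known to be M458+."""
    return "L260+" if dys385a == 10 else "L260-"

print(predict_type(10, 10, 23))  # P type (L260+)
print(predict_type(12, 10, 23))  # None
print(predict_l260(11))          # L260-
```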
37 Marker Network
Lawrence Mayka
(independently, March 2007) constructed a “median joining network” Network
for the 37 marker samples of the Polish Project. This network supports the definitions of the P & N clusters,
and of the A subcluster. The P cluster
is the left side of Mayka’s network; N
is the top branch, and A is a small branch on the lower right.
29 March 2010 correspondence: I mentioned Russian sites
for R1a clusters in my publication. It’s not easy for me to figure out which of
those clusters correspond to my types. Mayka worked out a correspondence on 29 March, warning me
that the correspondence is not exact.
Some of the Russian clusters are broader than my types; some are narrower. Here are Mayka’s findings:
My Type code vs Russian cluster name:
A Ashkenazi Jewish
B Western Eurasian
C Old European
D Baltic - Carpathian
E Northern Eurasian
F Central Eurasian
G Northern European
H Western Carpathian
I Northern Carpathian
N Central European
P Western Slavic
19 Sep 2010 update: A nice tree display of the Russian
subdivision of R1a is at www.r1a.org. Robert Sliwinski brought this site to my
attention.
My opinion: R1a cannot be highly subdivided
with confidence based on STR data. This web site of mine is dedicated to estimating the confidence
of each type that I study.
I try to indicate which types are speculative. Even for the types with high confidence, the location of the
nodes in the R1a tree will be uncertain until corresponding SNPs
are discovered. These Russian clusters,
apparently by Klyosov, have plus / minus values for
accuracy of TMRCA ages that are far too small, because
there are serious caveats associated with
systematic statistical uncertainties.
Minor edit 8 Mar 2012. Rewrite 15 Sep 2011 - the bottom half of
this topic.
Here is a summary of terms (in
boldface) that I defined for my “Mountains in Haplospace” method. For more explanation, see the fall issue of JoGG. By haplospace I mean
multidimensional sets of STR values; each
haplotype is a point in haplospace.
A cluster
qualifies as a type if the graph of step frequency (number of samples at
that step) vs step looks like an isolated mountain. The step is the
genetic distance (mutation count) from the modal haplotype of the cluster. I use the method of Ysearch to calculate
step. The cutoff
is the next step just beyond the mountain.
A good type has low step frequency in a “gap” of
step values including the cutoff (only the cutoff for a gap of 1). In other words, the cluster forms a mountain
at step values less than the cutoff, separated by a gap from the rest of the
database (the parent haplogroup usually) at higher step numbers.
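The step and step frequency described above can be sketched in Python. Assumptions, stated loudly: step is taken here as the plain sum of absolute differences over the markers of the modal haplotype, which may differ from the Ysearch method for multi-copy markers; the function names are hypothetical; and the three marker data is made up purely for illustration.

```python
from collections import Counter

def step(sample, modal):
    """Genetic distance: sum of absolute differences over the modal's markers.

    Assumption: plain stepwise counting. The Ysearch method may treat
    multi-copy markers differently; that is not modeled here.
    """
    return sum(abs(sample[m] - modal[m]) for m in modal)

def step_frequency(samples, modal):
    """Number of samples at each step from the modal haplotype."""
    return Counter(step(s, modal) for s in samples)

def in_type(sample, modal, cutoff):
    """A sample is inside the type when its step is less than the cutoff."""
    return step(sample, modal) < cutoff

# Made-up three-marker data, for illustration only.
modal = {"DYS385a": 10, "DYS439": 10, "DYS447": 23}
samples = [
    {"DYS385a": 10, "DYS439": 10, "DYS447": 23},
    {"DYS385a": 10, "DYS439": 11, "DYS447": 23},
    {"DYS385a": 11, "DYS439": 11, "DYS447": 25},
]
freq = step_frequency(samples, modal)
print(sorted(freq.items()))  # [(0, 1), (1, 1), (4, 1)]
```

Plotting the step frequency against step for a real database is what shows whether a cluster looks like an isolated mountain.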
The Statistical Background Percent
(SBP) is an objective measure of the quality of the
type. Low SBP is taken as evidence that
a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP
(yet to be discovered). Larger types
with lower gaps have lower SBP. SBP is intended
as an estimate of the background percent of
samples in a type that really do not belong to the corresponding hypothetical
clade. SBP is increased to account for
the estimated probability of outliers from other clades. An outlier is a
sample that has very unusual STR values due to the luck of mutations. SBP is also increased to account for the
estimated probability of small foreign clades
that just happen to have the same STR values but are not closely related to the
type. The SBP is also increased to
provide the rough equivalent of the maximum in a confidence
interval. Small sample counts have wide
confidence intervals. So larger types
(more samples) automatically get lower SBP.
For a valid clade, SBP should decrease with time as data accumulates in
a database. A very well isolated clade
will have a low SBP even with only a few samples. SBP < 5% is very rare - a very well isolated type, very likely
to be a clade. SBP < 25% is good
enough to be published. SBP < 50% is
a type worth watching as data accumulates with time. The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look
like mountains. The number of markers
in the definition should be chosen to provide as
small an SBP as possible; my Excel
tools provide automatic rank of
markers as an aid; human judgment can
be used to include or exclude markers with obvious problems. A signature is
a small set of markers that rank best, convenient for publication of a type,
and for simple demonstration of the correlation of STR values.
I use the word “type”
to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes
that differ from the modal haplotype by step less than the cutoff. The definition
of a type is the modal haplotype plus cutoff.
The definition uses only those STR markers that provide the lowest SBP,
but the definition uses as many STR markers as possible if there is a tie. The definition of a valid type may change
slightly as data accumulates.
Here are
some common terms (in boldface) for genetic genealogy. I did not define these, although I use them
in a restricted sense: A marker (also “locus”, plural loci) is a DNA location for an
SNP or STR or other kind of mutation. A
haplotype is a set of gene values at any number of
markers, here restricted to Y-DNA STR values.
I use the word sample (plural samples or data or database) for the Y-DNA STR values from one man. A sample is also commonly called a
haplotype, but I avoid calling a sample a haplotype to make it clear that a
haplotype may or may not be present in a particular database of samples. A clade is a
general term for common descent, so an SNP haplogroup is one kind of
clade. I use the word clade in general,
meaning a Y-DNA clade that may or may not be a defined official
haplogroup. All types
have associated hypothetical clades, but most clades cannot be isolated as
types with low SBP. A cluster is a set of samples with similar STR values. All types have associated clusters but not
all clusters are associated with types.
The modal value for a marker is the most common
value in the cluster. The modal haplotype is the set of most common values,
usually the most common haplotype in a cluster. Many people use the adjective “modal” as a noun, meaning “modal
haplotype”; so do I; I tried to avoid that in this web document.
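A minimal sketch of the modal haplotype, taken marker by marker. The function name and the four sample cluster are made up for illustration. Ties are broken by whichever value is seen first, which is an arbitrary choice of this sketch.

```python
from collections import Counter

def modal_haplotype(samples):
    """Most common value at each marker, taken marker by marker."""
    markers = samples[0].keys()
    return {m: Counter(s[m] for s in samples).most_common(1)[0][0] for m in markers}

# Made-up cluster of four samples, for illustration only.
cluster = [
    {"DYS385a": 11, "DYS439": 10},
    {"DYS385a": 11, "DYS439": 11},
    {"DYS385a": 12, "DYS439": 10},
    {"DYS385a": 11, "DYS439": 10},
]
modal = modal_haplotype(cluster)
print(modal)             # {'DYS385a': 11, 'DYS439': 10}
print(modal in cluster)  # True
```

Because the modal values are found one marker at a time, the resulting haplotype is not guaranteed to be present in the database, which is the reason for keeping the words sample and haplotype separate.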
The rest of this topic provides
discussions and more definitions that are not part of my Mountain Method. These are discussions and terms that I use
often, so I provide them here for easy link reference from my web pages. Some of these terms are not common in
genetic genealogy. Some of these I do not
recall seeing used in documents at all, so they might be my inventions,
although I suppose other writers may have used these terms with similar
meaning:
A bimodal
marker has a second STR value with many samples - more than expected
statistically - in addition to the most common modal value. A multimodal
marker is possible if there are more than two common values for the marker and
if those common values are not distributed more or less symmetrically on both
sides of the most common value. (A
Bessel distribution is statistically expected for a low fraction of random
independent mutations at an STR marker.
A Bessel distribution is close to a Gaussian distribution for a high
fraction of independent mutations. A
Bessel looks like a tent; a Gaussian
looks like a bell.) Step up mutations
are more common than step down for short STRs, so for example a modal 8 plus a
few more 9 values than 7’s does not necessarily mean the 9’s are statistically
significant; experience helps to
judge. RecLOH and
other issues at compound markers also cause confusion in this regard. A bimodal marker is a hint that there may be
a clade associated with that 2nd value, so genetic genealogists study clusters
defined by one or a few such bimodal 2nd values. The main modal value also sometimes makes a good signature at a
bimodal marker. In other words, a set
of values using one or more bimodal or multimodal markers makes a good
signature for a hypothetical cluster.
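The tent and bell shapes mentioned above can be seen numerically. The sketch below rests on assumptions that are not taken from my publication: single-step mutations, equally likely up or down, so that the net change at a marker is the difference of two Poisson counts. Under that model the probability of a net change of k repeats is exp(-m) times the modified Bessel function I_k(m), where m is the expected number of mutations. The function names are hypothetical.

```python
import math

def bessel_i(k, x, terms=60):
    """Modified Bessel function of the first kind, I_k(x), by its power series."""
    k = abs(k)
    return sum(
        (x / 2.0) ** (2 * j + k) / (math.factorial(j) * math.factorial(j + k))
        for j in range(terms)
    )

def step_probability(k, m):
    """Probability of a net change of k repeats when m mutations are expected.

    Assumption: single-step mutations, equally likely up or down, so the net
    change is the difference of two Poisson counts with mean m/2 each.
    """
    return math.exp(-m) * bessel_i(k, m)

# Small m gives a sharp tent around zero; larger m spreads toward a bell.
for m in (0.2, 5.0):
    print(m, [round(step_probability(k, m), 3) for k in range(-3, 4)])
```

A marker whose observed values have a second peak well above what this kind of symmetric curve allows is a candidate bimodal marker. The sketch ignores the step up bias for short STRs noted above.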
In the past, I have sometimes called
such clusters hypothetical types. I now prefer to reserve the word type for < 20% SBP, which Mayka and I take as evidence for 80% confidence
that more than 80% of the samples belong to a clade that will someday be
confirmed as a haplogroup by a newly discovered SNP. Sometimes we make
exceptions above 20%, for example when a cluster is regionally concentrated, or
associated with an ethnic group.
I had sometimes used “bimodal marker”
for that second STR value, but I’ll try to avoid that confusion going
forward. It’s the STR marker that is
bimodal, with two common values.
There is no known way to calculate
the % confidence that a cluster corresponds to a
clade, but an experienced genetic genealogist can roughly estimate confidence
based on experience. I developed SBP so
that 100% minus SBP expresses my confidence, but only for clusters with less
than 30% SBP; SBP breaks down around
50%. I avoid publishing clusters in
which I estimate less than 50% confidence, although I may mention some as
speculative.
Not all Y-DNA STR data separates into
types because the distribution of STR values tends to be continuous. A type corresponds to a clade that
experienced a population bottleneck
- isolation or migration or very rapid population growth.
I like to use the word twig for a small young branch of the Y-DNA tree.
By the age of
a clade (haplogroup or type or hypothetical cluster) I
mean the TMRCA (time to the most recent common ancestor). By
definition, a TMRCA corresponds to a node in the Y-DNA
tree, where two clades branch. (Often
more than two clades meet at one node, but we expect future SNPs to resolve
that node into multiple nodes with two clades each, with rarely more than two.) An SNP is probably older
than the TMRCA of the haplogroup it defines, and the node for two SNPs is
probably older than either SNP, because there are usually many generations
between old nodes, due to the statistical pruning of
the Y-DNA tree (Y-DNA clades tend to die out statistically). The probability is very low that an old SNP
mutation happened in exactly the same generation as the TMRCA. (An exception is a recent private SNP found in an extended male line family.) I call the segments
between nodes smooth branches, where there
are no branches in that segment of the Y-DNA tree. A long smooth branch in the Y-DNA tree is one way to visualize
isolation in haplospace. Any type,
because it is isolated, probably has a long smooth branch older than the
type. A smooth branch is necessarily a
statistical estimate, because it is not possible to be sure a branch is
smooth; the evidence is multiple
equivalent SNPs, or less than usual STR variation. In addition, there may be small branches with living men who have
not registered Y-DNA data on the web.
So a “smooth” segment really includes the possibility of very few small
branches. The metaphor of a tree is
appropriate, because a large branch with very few twigs looks smooth from a
distance; a smooth branch in an old
tree was not smooth many years ago, but the twigs in that segment have died and
fallen off the tree over the years.
A Y-DNA branch can be smooth in
one database (like the Polish Project) and not smooth in a larger database
(like Ysearch, if significant branches in that segment
are rare or absent in Poland). All this
paragraph applies to hypothetical clusters, but with lower confidence.
Probability Minimum 80% for Polish Project R1a Assignments
This topic was updated 29 Dec 2010.
See Polish Project Assignments for a brief
overall explanation of how assignments are done. This topic provides more detailed discussion. This topic focuses on the R1a categories,
but most of this discussion obviously applies to other categories.
Each sample
(individual man) is assigned to a category.
Many categories are known haplogroups or paragroups.
Haplogroups are defined by SNPs, but not all
haplogroups are supported by FTDNA
assignments, which may cause some confusion.
Some categories are types,
which are hypothetical haplogroups.
Borderline and cluster categories are discussed near the bottom of this
topic. Click on Remainder
and Unassigned for discussion of those two categories
elsewhere.
The assignment guideline is at least
80% probability for each individual sample. Using an 80% minimum, most
assignments are better than 80%, of course.
So the average probability for a category is higher than 80%, and the
average varies by category depending upon how many samples are marginal near
80%.
For haplogroups, “80% probability” means that if a large number of
samples with 80% probability were SNP tested, about 80% of
them would test positive for the haplogroup into which they were
predicted. Probability is determined by
correlating STR values with samples that have been tested
for that SNP.
Some assignments are 100% probability
- samples with positive SNP test results, assigned to that haplogroup, and not
given an extended assignment. Actually,
there is no such thing as 100% because the genetic test might be in error, but
it seems from experience that testing errors are much lower than 1%.
I arrive at probabilities with a
combination of calculations and educated estimates. This topic is my explanation.
Confidence is a separate topic,
related to probability. Confidence is next, below.
I figure probability as a decreasing
function of step from a modal
haplotype. My assignment rules are
step distances at which I figure 79% probability. If a sample matches the modal haplotype at less than the 79% step
distance, I assign that sample to the corresponding haplogroup or type or other
category. In practice it’s
complicated. I use an Excel file
for assignment. You can view the file
at www.gwozdz.org/PolishCladesUpdate/Assigner.xls. That may not be the current version. In that file the “PolishProjectRules” sheet
has the list of rules for human reading - next to the coded logic functions for
Excel. If you are a Polish
Project member you can find your kit number and view your step to each category
in the table - “Modal Calculator” sheet.
The following paragraphs explain how
I figure probability for types. This is not something I proved in my publication, but it seems to
me that my publication makes it reasonable.
I hope you the reader find the following method reasonable. I expect this method will be proven with
time as most of my predicted types are validated.
If a type has 90% probability of
being valid and a particular sample has STR values that match the type with 90%
probability, those two numbers get multiplied for net probability. That particular sample has 81% net
probability of validity, and 19% probability of invalidity. I do not actually calculate this. This paragraph is a conceptual explanation
introducing the explanation in the following paragraphs.
My publication
has detailed discussion of my statistical method for types. Briefly, I use SBP as a
quality measure. SBP is a measure of
the background - the percent of samples that match
the type but really do not belong. For
example if SBP = 15%, that means 15% is a measure of how many samples within
the type (step less than cutoff)
really do not belong to the type. For
this example, a typical sample in the type has 85% probability of really
belonging to the type.
It is not possible to calculate the
probability that a type really is a clade that will be validated some day by an SNP not yet
discovered. Although 100% minus SBP is
not the probability of type validity, 100% minus SBP is closely related to
validity. Certainly a type with high
SBP has low probability of being valid.
Certainly a type with SBP less than 15% has high probability of
validity.
SBP is a deliberately high estimate, designed
for a roughly 70% confidence interval, with additional
increase for many statistical reasons explained in my publication. That’s why I call it “Statistical Background
Percent”. This statistical increase is
small for small SBP and larger for larger SBP.
The way SBP is calculated, it goes over 100% for type candidates with
high background; SBP should not be used
over 50%.
The best estimate for background percent
is lower than SBP. However, as
explained a few paragraphs above, the net percent of invalid samples (net
invalidity) is higher in the cluster of a type,
because of the unknown probability that the type itself is invalid as a
whole. It is convenient for me to
assume these two considerations cancel each other. I use SBP as my estimate for the net background percent of
invalid samples in a type.
A sample that matches the modal
haplotype has close to 100% probability of belonging to the corresponding
type. For a type with a high cutoff,
this is true even for a sample a few steps away from the modal haplotype. The reason is that the vast majority of
haplotypes in a type are at the highest step numbers, so that is where most of
the background is. This is explained in
the discussion of Table 1 on page 145 of my publication.
So here is my method: I figure an assignment rule “step < S” to
assign samples, where the samples at step S and greater, equal to about SBP
percent of the type cluster, do not get assigned.
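The rule "step < S" can be sketched as a small search. This is an illustration under assumptions, not the logic of my Excel file: the function name is hypothetical, the step counts are made up, and S is chosen so that the excluded tail is as close as possible to SBP percent of the cluster (the text only says "about").

```python
def assignment_rule(step_counts, cutoff, sbp_percent):
    """Pick S for the rule "step < S".

    step_counts maps step -> number of samples; only steps below the cutoff
    belong to the type cluster. Returns the S whose excluded tail (steps S up
    to cutoff - 1) is closest to sbp_percent of the cluster. Choosing the
    closest match is an assumption of this sketch.
    """
    cluster = {s: n for s, n in step_counts.items() if s < cutoff}
    total = sum(cluster.values())
    if total == 0:
        return cutoff
    best_s, best_gap = cutoff, None
    for s in range(1, cutoff + 1):
        tail = sum(n for st, n in cluster.items() if st >= s)
        gap = abs(100.0 * tail / total - sbp_percent)
        if best_gap is None or gap < best_gap:
            best_s, best_gap = s, gap
    return best_s

# Made-up step frequencies for a type with cutoff 7 and SBP = 15%.
step_counts = {0: 5, 1: 12, 2: 20, 3: 18, 4: 10, 5: 6, 6: 4, 7: 1, 8: 0, 9: 3}
print(assignment_rule(step_counts, cutoff=7, sbp_percent=15))  # 5
```

In this made-up example the cluster has 75 samples; the 10 samples at steps 5 and 6 are about 13% of it, so the rule becomes "step < 5".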
This finishes my brief justification
for using SBP as a guide for assignment.
More discussion of details:
There are other calculations in
addition to SBP, for example haplogroup correlations mentioned above.
Another is the calculation of
correlations for 37 marker rules, which are similar to haplogroup
correlations. Using 67 marker data for
a type, the 37 marker data for those samples provide probabilities that other samples
with only 37 markers belong to this same type.
After I do a particular calculation
many times, I feel confident glancing at new data and making quick estimates
for new rules if the number of samples does not justify detailed calculation.
Let me repeat what I said above: I arrive at probabilities (assignment rules
are 80% estimated minimum probability) with a combination of calculations and
educated estimates.
Mayka, who does
the assignments for most categories other than R1a, does not use my calculation
methods, but insofar as he uses his experience to judge STR correlations, he is
really performing estimated correlation calculations.
When a probability is judged close to
the 80% minimum for assignment based on
STR correlations (step close to the rule limit), there are a number of
additional factors that can be considered.
The following paragraphs are examples.
More examples are in my publication.
Mayka uses similar considerations for assignments:
Geographic concentration. P type is an example. P type is concentrated in Poland. I considered P type as more likely valid
because it is geographically concentrated, before it was validated by an
SNP. Back then I considered a Polish
family name associated with a sample as marginal additional evidence of
belonging to P type. Today that
consideration applies to a sample that marginally matches the P type haplogroup
with STR values but has not been measured for the L260 SNP.
Ethnicity. For example, there are a number of haplotypes known to be common
among Jews, so a Jewish name associated with a sample is marginal additional
evidence that the sample belongs to a corresponding haplogroup or type.
Stragglers. We tend to avoid categories for only one or a few samples, so if
one or two samples have 70% probability as a best estimate it makes sense to
adjust the rule a little looser so that the rule picks up those few samples
that do not quite fit, rather than create a borderline category (discussion
below). Conversely, it makes sense to
be a bit stricter for type assignments if a borderline category is available.
67 markers. We are marginally more liberal with assignments using the full 67
markers and marginally stricter for samples with fewer, because those with
fewer can get more accurate assignments by procuring the remaining markers.
Men with closely matching STR values should
be classified together, particularly if the family name is the same.
We avoid changing assignment rules
too often, so some assignment rules may remain in place for a while even after
new data has provided slightly better rules.
For a valid type SBP comes down as
data accumulates, with better statistics.
I avoid introducing a new small type with SBP above 25%, because I
expect it to improve with time.
Technically, SBP = 40% means 60% of the samples can be introduced as a
new type category, but I prefer to wait a few months for more data, so that a
new type is substantial at introduction.
For some types, many of the samples
near the cutoff have already been assigned with high
probability to another type. So those
assigned samples should not be included in the SBP calculation. K type is an
example. Although my published SBP for
K type is 26%, many samples at the cutoff are assigned with high confidence to
other types, including many P type that have tested positive for the L260 SNP. The true
background for K type is much less than 26%, although I have not taken the time
to do an adjusted SBP calculation.
We do not wish to be dismissed by
others with experience evaluating STR data.
On the other hand, we do not wish to have others point out that samples
are being left without obvious assignment.
I suppose the goal should be that the number of people complaining that
assignments are too liberal turns out to be about equal to the number of people
complaining that assignments are too conservative (people with experience
evaluating STR data who have read and understood my documentation here).
A person who assigns samples to hypothetical haplogroups based on STR
values acts like a bookie who provides advance estimates for gambling odds,
using a combination of calculations, educated guesses, and intuition. A bookie’s estimates are usually tested by
reality very quickly. Probabilities of
an STR estimator may not be verified or falsified by a new SNP for years. You need to be skeptical of STR based
predictions. In the past, a number of
STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you
can read as much as you wish about our (Mayka’s and my) methods, judging for yourself the reliability of
our probability estimates.
I use Borderline
categories when a significant number of samples have 50% to 79% probability for
one haplogroup or type, and have less than 20% probability for any other
haplogroup or type that I know of. The
80% rule does not apply to samples in a borderline category, although it would
be correct to say those samples have minimum 80% probability of belonging
either to the corresponding type or to an undiscovered type with similar STR
values, as opposed to belonging to a specific known type or to another
borderline category.
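The 80% rule and the borderline rule can be summarized in a short Python sketch. The function name, the input format (a dictionary of estimated probabilities for each type), and the fallback label "Other" are assumptions made only for illustration. In practice these rules are applied with judgment, not by a program.

```python
def categorize(probabilities):
    """Return (category, best_type) for one sample.

    probabilities: dict mapping a type name to its estimated probability (0 to 1).
    """
    # Rank the candidate types from most to least probable.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    best_type, best_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0

    if best_p >= 0.80:
        return ("Assigned", best_type)    # the minimum 80% rule
    if 0.50 <= best_p < 0.80 and runner_up_p < 0.20:
        return ("Borderline", best_type)  # 50% to 79%, no rival at 20% or more
    return ("Other", None)                # hypothetical label (Remainder, Unassigned, etc.)
```

For example, categorize({"K": 0.6, "I": 0.1}) would come out as Borderline for K type, because no other known type reaches 20%.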
For R1a, I also use a Remainder category, and an Unassigned
category for samples with fewer than 67 markers.
The Polish Project rarely
uses a Cluster category, which is similar to a
borderline category, except the category itself has less than 80% probability
of corresponding to a valid clade, so the 80% rule does not apply.
Borderline, Remainder, Unassigned,
and Cluster samples also have their FTDNA
assignment which is either 100% (green) or 99% (red).
Confidence and Probability; More Statistical Comments
{This
entire topic needs rewrite. This is an
old version. I moved the probability
discussion to a new topic, above. Much
of this topic is OK as is for explanation of “confidence”, but most is
redundant. Watch this space for a
rewrite.}
See the Assignments
topic for a brief discussion of Polish Project assignments. The previous topic explains how I figure minimum 80% probability for assignments
of individual samples (men), for the Polish Project. My publication
explains my statistical methods. There
is a summary of my mountain method above.
This topic is about confidence. I’m not trying to be statistically exact
here. I’m just trying to explain a
point that may not be obvious to everyone:
Confidence is not the same as probability. For example, I could calculate a 90% probability of no rain today
based on data showing that on this day in this place, over a large number of
years, it only rained on this day for 10% of the years. However, if I can see storm clouds in the
distance, I have much less than 90% confidence of no rain.
My minimum 80%
probability rule for assignments also means minimum 80% confidence. I give an example in the next paragraph of
one method to calculate confidence.
However, most of my confidence values for assignments are based on educated
estimates, not exact calculations.
Confidence interval example: By 80% confidence I mean 80% is the lower
number of the 80% confidence interval.
For example, 80% confidence might mean that the actual probability is
90% but the 80% confidence interval is 80% to 96%. In the following paragraphs I
{continue
the edit here}
As an example, consider a situation
where 10 samples match a type with an STR
test. Suppose there is a definitive SNP test available, and 9 of those 10 samples test positive for
the SNP, and 1 negative. That means 9
of the 10 really belong to the haplogroup and that 1
mismatch must come from a different haplogroup that matched the STRs by the
luck of mutations. Next, consider a new
sample that matches that same STR test.
What is the confidence that the new sample will pass the SNP test for the
haplogroup? The probability is 90%
because we know that 9 out of 10 previous samples like this matched the
SNP. However, 1 out of 10 is a very
small sample. As explained in my publication, I use Poisson statistics for quick
calculation of confidence interval.
Poisson statistics is simple to calculate in Excel. My tool Type.xls has an
“SBP” sheet with a set of cells for quick Poisson calculations.
80% confidence interval of 1 is 0.11
to 3.89, which is 1.1% to 38.9% out of 10, so subtracting from 100%, the 80% confidence
interval of a match comparing to 9 out of 10 is 61.1% to 98.9%; that lower number 61.1% means the 80%
confidence ranges to lower than 80%, so net confidence is lower than 80%.
70% confidence interval of 1 is 0.16
to 3.37, which is 1.6% to 33.7%, lower number 66.3%; net confidence lower than 70%.
60% confidence interval of 1 is 0.22
to 2.99, lower number 70.1%; confidence higher than 60%.
67.3% confidence interval of 1 is
0.18 to 3.26, lower number 67.4%. So
that’s my one number: 67% confidence.
In other words, if 9 out of 10
samples that match an STR also match the SNP test, we have at least 67%
confidence a particular future sample matching the STR test will also match the
SNP test.
For 18 out of 20, the probability is
still 90%, but a similar calculation shows 75% confidence.
For 36 out of 40, the probability is
still 90%, but a similar calculation shows 80% to 96% confidence interval,
minimum 80% confidence, which is my example that I started with above. These calculations take less than a minute
using my Excel cells.
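For readers without Excel, the same kind of calculation can be sketched in Python. This is not the formula from the "SBP" sheet of Type.xls. It is a minimal sketch that assumes the interval is the standard exact two-sided Poisson interval, solved here by bisection. The function names are made up for this illustration.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson count X with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def mean_for_cdf(k, target):
    """Find mu such that P(X <= k) equals target (the CDF falls as mu grows)."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(k, hi) > target:  # widen the bracket until it holds the root
        hi *= 2.0
    for _ in range(100):                # plain bisection
        mid = (lo + hi) / 2.0
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def poisson_interval(k, conf):
    """Two-sided interval for the Poisson mean, given an observed count k."""
    tail = (1.0 - conf) / 2.0
    lower = 0.0 if k == 0 else mean_for_cdf(k - 1, 1.0 - tail)
    upper = mean_for_cdf(k, tail)
    return lower, upper

def net_confidence(misses, n):
    """Confidence level c at which the lower end of the match interval equals c.

    Assumes misses is well below n, as in the examples in the text.
    """
    lo, hi = 0.001, 0.999
    for _ in range(60):
        c = (lo + hi) / 2.0
        _, upper = poisson_interval(misses, c)
        if 1.0 - upper / n > c:  # lower end of the match interval still above c
            lo = c
        else:
            hi = c
    return (lo + hi) / 2.0
```

Under that assumption, poisson_interval(1, 0.80) should come out close to the 0.11 to 3.89 quoted above, and net_confidence(1, 10), net_confidence(2, 20), and net_confidence(4, 40) should come out close to 67%, 75%, and 80%.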
Statistical Background Percent: SBP. I use SBP as a net confidence estimate for
the background (samples that match the STR values but
really do not belong to the clade of a type). My publication does not go into the details
of confidence intervals. That is the
purpose of the explanation here in this topic.
SBP is my estimate for the net statistical confidence before any SNP has
been discovered to validate a hypothetical type. 100% minus SBP is my estimated confidence that a sample in the mountain cluster belongs to the
corresponding hypothetical clade.
A mountain cluster corresponding to a
type might include outliers from other clades, or might
include foreign clades. These and other caveats associated with STR prediction are
discussed in detail in my publication, where I
point out that the confidence for all such caveats cannot be calculated. I estimate the background by using the low
frequency of samples in the gap as representative of the
background throughout the haplospace
neighborhood. My SBP formula (available
in the tools) includes an increase in SBP to account for
all such caveats.
Part I of my publication
explains: “Much of the background is
probably at the last step of the mountain, just before
the cutoff. Much of the remainder is
probably at the previous step, much of the remainder after that at the previous
step, etc.” My Part I Table 2 justifies
this by demonstrating how the number of possible haplotypes increases very
rapidly with step. In other words, SBP
is a good worst case overall estimate of background percent within a type, but
background percent is very low at step zero and increases rapidly with
step. My publication does not provide a
formula for background vs step and in fact I have not derived a formula. For assignment of samples, I estimate the
confidence vs step in a manner to provide a rapid decrease in confidence near
the last step, in a manner to produce overall confidence roughly equal to 100%
minus SBP. Step zero is my rough
estimate that the type is a valid clade, since the step zero samples belong to
the clade with very high probability if the type is valid.
Some outliers from the type
statistically fall within or even beyond the gap, so confidence is not zero at
the cutoff.
Confidence also depends upon the size
of the gap. A wide gap with zero
samples means even samples in the gap near the mountain have reasonable
confidence percent.
Estimates vs Calculations vs Adjustments: A person who assigns samples to hypothetical clades based on
STR values acts like a bookie who provides advance estimates for gambling odds,
using a combination of calculations, educated guesses, and intuition. A bookie’s estimates are usually tested by
reality very quickly. Probabilities of
an STR estimator may not be verified or falsified by a new SNP for years. You need to be skeptical of STR based
predictions. In the past, a number of
STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you
can read as much as you wish about my methods, judging for yourself the
reliability of my estimates and net probabilities.
The first confidence interval example
above, confidence of STR predictions calibrated to SNP data, can be pure
statistical calculation without any estimates.
However, judgment is involved.
Even such SNP predictions should be split into parts based on the step
value of the samples within a type.
However, if split down to individual steps, the statistics are very poor
due to small sample size, so steps are best combined in batches. For the first data from a new SNP it is
necessary to combine all the steps, so the predictions benefit from an
estimated confidence by step. So the
judgments and calculations can get quite complicated, and often I just estimate
the confidence from experience rather than do the calculations every day as
data comes in.
I avoid changing assignment rules
often, so some assignment rules remain in place even after new data has
provided better rules.
My standard is 80% confidence, but I
avoid introducing a new type until the confidence is a bit higher, because a
new 80% confidence type would provide only a few samples at step zero on the
day when enough data has accumulated.
After waiting for more data, I tend to bend the guidelines a bit below
80% confidence in order to introduce more samples with a new type. Also, if I notice an individual coming out
at 75% when I’m updating rules I’ll tweak the rule to include him.
I tend to be generous in estimates
for samples with all 67 markers, and I tend to be conservative with samples
having fewer than 67. I update the
rules more often at 67. After all,
samples with fewer than 67 markers can get much better confidence by ordering
more markers, and 67 is the most available as a standard commercial test.
I do not look forward to a man
feeling slighted when he is not assigned to a type that is a reasonable fit to
his STR data. On the other hand, I do
not wish to be dismissed by others with experience evaluating STR data, so I
try to be conservative in my probability estimates that particular clades in
fact exist. I will have achieved my
goal if the number of people complaining that I assign too liberally turns out
to be somewhat greater than the number of people complaining that I am too
conservative (people who have read and understood my documentation).
Naturally, my confidence changes from
month to month as more M458 and STR data accumulates, for better statistics.
Assignments at fewer than 67 markers: There are two ways. First way: Some
types have low SBP and seem 80% valid using 37 or only 25 markers, at least for
samples at low step, so samples can be directly assigned.
Second way: I check for correlation using the samples with 67 markers to see
what percent of samples at a given genetic distance using fewer markers end up
in the corresponding type at 67 markers.
The confidence of a sample at fewer markers is that percent
multiplied by the corresponding confidence at 67 markers.
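The second way amounts to multiplying two fractions. Here is a minimal Python sketch. The function name and the counts in the example are hypothetical, chosen only to show the arithmetic.

```python
def confidence_at_fewer_markers(n_matching_fewer, n_confirmed_at_67, confidence_at_67):
    """Estimate confidence for a sample tested at fewer than 67 markers.

    n_matching_fewer:  67-marker samples that fall within the given genetic
                       distance when only the smaller marker set is compared.
    n_confirmed_at_67: how many of those end up in the type at all 67 markers.
    confidence_at_67:  confidence for the assignment at 67 markers (0 to 1).
    """
    if n_matching_fewer <= 0:
        raise ValueError("need at least one comparison sample")
    correlation = n_confirmed_at_67 / n_matching_fewer
    return correlation * confidence_at_67

# Hypothetical numbers: 18 of 20 samples carry over, and confidence at 67 markers is 0.90.
example = confidence_at_fewer_markers(20, 18, 0.90)
```

With those made-up numbers the result is 0.90 times 0.90, or about 81%. The correlation percent is itself based on a limited number of samples, so the small sample caution from the confidence interval example applies here too.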
This topic was updated 29 Dec 2010.
I look forward
to the discovery of SNPs validating more than 80%, probably more than 90%, of
my R1a Polish Project type assignments.
I introduced P, N, and K types in the Fall of 2007, publishing this web page 6 Dec of
that year. I did not predict that P and
N were brother clades; in fact it looked to me like P was
closer to K. I did not make predictions
about the P, N, K structure because the statistics did not justify such predictions. I assigned samples to P and N with 80% probability, remarking that my overall confidence that P and N were valid (confidence at step zero) was 95% in 2008.
I stated my overall confidence in the subtypes of K type as only 80%,
but again my confidence in K type at step zero was (and still is) 95%.
P type has been validated as R1a1a7b,
defined by the SNP L260.
N type is very
close to the same as R1a1a7*, the paragroup
defined by the SNP M458 minus L260. This is not exactly a validation, because
there is a low percent of M458 samples (2 so far
at 67 markers) that seem to be older than N type, which implies that a future
SNP, younger than M458, may be discovered as equivalent to N type. In previous versions of this document, I
explained: “A new SNP marker may not
fall at the node defining a type.” A
new SNP might be younger, including mostly the samples with low step from the
corresponding type. A new SNP might be
older, including the corresponding type plus some samples with step beyond the cutoff for the type.
K type is not
validated yet.
In Fall 2007 I also introduced R (Remainder) as the 4th division of Polish R1a, for
those samples that do not fit P, N, or K.
K type plus the R category are equivalent to R1a1a* (M17, M198,
M458-). The R1a table assigns new types
to either K or R. In the detailed discussion of the types I
discuss which types have: (a) high confidence as subtypes of K; (b) high
confidence as not subtypes of K so surely go into R; and (c) lower confidence
of assignment to K or R so are assigned with a best guess. A new SNP for K type might include a few of
these subtypes, and may include some of R, depending upon the age of such a new
SNP.
This topic uses R1a as an example,
but the same discussion applies to other haplogroup assignments.
My publications
have several references of general interest and relevance to my web documents.
My Tools and
data for STR analysis are Excel files.
These are available at the JoGG publication site as Supplementary
Data: www.jogg.info/52/files/cpcindex.htm.
Polish
Clades Update. This folder is for
update of Tools and for new data: www.gwozdz.org/PolishCladesUpdate
Pawlowski
(2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish). This reference is listed in my
publications. I specifically mention it
here because this is where I originally found the common Polish haplotype that I now call P type. Link to English abstract: Pawlowski
2002.
Lawrence Mayka
is the Administrator of the Polish Project. Larry helped me to get started when I was
new to genetic genealogy, providing helpful criticism & suggestions. He reviewed & approved my 80%
probability rule for assignments on the Polish Project web page. He also reviewed the original drafts of my
publications. A number of my types were originally suggested to me as STR clusters by
Larry. Larry continues to provide data
for this web page. Many of my
references to other websites in this document were suggested to me by Larry.
Cyndi Rutledge
is the administrator of the R1aY-Haplogroup Project. Larry and Cyndi had been sending me M458 test results when that SNP was
new. SNP results are now listed at
project web pages.
Lukasz Lapinski
is the administrator of the R1a Project.
Anatole Klyosov
published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of
publications. Some of the STR types
that I independently discovered I later found as 25 marker modal haplotypes in
Klyosov’s web documents (before his
publication in JoGG - some in Russian).
It was encouraging to me seeing independent identification of clusters
by different methods. He emailed to me
an English version of one of his 2008 publications. His Fall JoGG articles have references to his other
publications. Here is a web link: Klyosov Home.
Russian
web sites: Semargl, http://www.r1a.org; http://www.rodstvo.ru; http://dnatree.ru/; http://molgen.org/. These have been active analyzing R1a,
brought to my attention by others, particularly by Mayka,
who worked out a correlation with my types. These sites clearly have proposed
subdivisions of R1a based on STR data, but I cannot quickly understand these
due to the language barrier. Klyosov seems to be active at these sites. The sites make use of the FTDNA
projects and Ysearch.
Kenneth Nordtvedt
published an article about calculating TMRCA in the Fall 2008 issue of JoGG. His Excel files of data and tools are available at his web site. Ken has been active in web discussions,
suggesting many STR based clusters.
FTDNA
link: www.familytreedna.com. This is a commercial DNA testing
company. I make extensive use of the
project databases maintained by FTDNA.
These are my primary sources of data.
Click on the “Projects” tab at the home page to look for projects. Also, the project name can be substituted
for /polish/ in the following URL.
WTY. “Walk Through the Y”. This is a commercial product by FTDNA, for
reading about 200,000 base pairs of your Y chromosome, in a search for new SNPs
in your branch of the Y-DNA tree. Here
is a direct link to a WTY
description. You can read about my WTY at another of my web pages.
Polish
Project link: www.familytreedna.com/public/polish. One of many FTDNA projects. This is my primary source for Polish
data. The Polish Project tracks both
Y-DNA and mtDNA; click on “Y-DNA
Results” on the left to see the data that I use.
R1aY-Haplogroup
Project link: www.familytreedna.com/public/R1aY-Haplogroup. Original R1a project.
R1a Project
link: http://www.familytreedna.com/public/R1a. Newer R1a project, with multiple
co-administrators, active in subdividing R1a data into hypothetical
haplogroups. The project home page has
a summary chart of R1a SNP subdivision, and other reference links. Lapinski emails to
me correlations between my code names and the code names for this project; here is the Aug 2011 update.
Ysearch
link: www.ysearch.org. Ysearch is the largest web database for
Y-DNA, run by FTDNA, open to all men, including men who also register with
projects and including men with data from other testing services. I use Ysearch often for analysis so of
course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your
data with Ysearch. Or you can type your Y-STR data into Ysearch. I am not associated with the company
FTDNA. I have Instructions for comparing your STR data to
my types (modal haplotypes)
that I have entered into Ysearch.
Yhrd
link: www.yhrd.org. A forensic Y-DNA database. Data are separated by city, with many Polish
cities. I relied on Yhrd to figure out
the geography of the various haplotypes.
I wrote Yhrd
Reminders for myself so that I won’t forget how to navigate the Yhrd web
site; click on that link if you need
some hints.
Semargl. R1a site by Vladimir Tangankin.
R1a tree in pie chart format using 111 marker data, Oct 2012: http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009tmb700.png
Blowup: http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009.png
Sorenson
link: http://www.smgf.org/. Another DNA testing company.
ISOGG
link: http://isogg.org/tree/ Y-DNA tree with the most recent SNPs and
corresponding alphanumeric codes.
FTDNA
Draft Tree link: http://ytree.ftdna.com/index.php?name=Draft
Another Y-DNA tree with recent SNPs.
recLOH: A technical detail discussed in many publications, for example http://en.wikipedia.org/wiki/RecLOH. I discuss this and other compound marker issues, and how step is calculated,
in the “Documentation” sheet for my Calculator.xls
tool.
DYS389: Another
technical detail, also discussed on the web and in my Calculator.xls. Briefly, 389II is the sum of 389I plus
another STR, so 389II should be figured in terms of the delta value.
Chandler
mutation rates. Mentioned in my
publication. From Chandler, Fall 2006 www.jogg.info, 37 markers. 67 marker extension on line at mutation
rates.
Peter Gwozdz
I’m a very rare type in Poland -
E1b1b1a2. My maternal 1st cousins are R1a1a.
That means my late maternal grandfather was R1a1a. I became interested in Y-DNA in 2004. My maternal family name is Iwanowicz. I discovered a family with that name in my
maternal grandfather’s home town in Poland.
They are the only Iwanowicz family within 50 miles, so I was suspicious
they might be my 3rd or 4th cousins. I
brought a cheek swab kit when I visited them the second time in 2006. Sure enough, the son is a perfect 25
STR marker match to my 1st cousin. I
didn’t get around to checking the web for a year. I was shocked to discover that these maternal cousins matched 80
people in the FTDNA database, with a perfect 12 out of 12
STR markers. That’s a hell of a lot of
matches in the summer of 2007. Most of
these matches are Polish. I did some
research and found an article by Pawlowski (reference
in my publication) about this most common Polish
haplotype, which I now call P type. That got me interested in doing more
research, leading to this web page for others to see my results. My experience, however, is a reminder that
statistics can be misleading. I was
confident that my grandfather’s haplotype was P type, based on a perfect match
at the first 12 markers. I now (June
2010) figure that the probability was really about 93%, because 13 out of the
14 current Polish Project members who have 67 markers and who also match P type
perfectly at 12 markers are in fact P type as judged by all 67 markers. My grandfather does not match P type at 67
markers. My grandfather is that 14th
one. He matches the small hypothetical
clade that I call I type, which is also concentrated in
Poland. But my confidence on that I
type assignment is only 80%, so maybe statistics is fooling me again. That’s how an outsider ended up studying P
type and R1a1a, and writing web pages and articles about common Polish Y-DNA
clades.
Revision History
2007 Dec 6 First web posting of this file
2007 Dec Two revisions
2008 8 revisions
2009 33 revisions
2010 36 revisions
2011 26 revisions
2012 Feb 5 Update Pc Cluster
2012 Feb 7 Mention Pc in the News at the top
2012 Feb 8 Start updating the Results
2012 Feb 15 Finish updating the Results
2012 Feb 28 Update; multiple edits
2012 Feb 29 Update N type and M458
2012 Mar 8 Update; multiple edits, mostly SNP update related
2012 Mar 17 Ysearch instructions updated to include Haplogroups I, N, R1b; very limited results
2012 Mar 17 N-G type update
2012 Mar 20 N-M cluster update
2012 Mar 21 edit N-G and N-M
2012 Mar 25 update I-CE
2012 Mar 28 update I-P
2012 Mar 29 edit I-P discussion; M458 news about L1029
2012 Jul 18 update L260 & M458
2012 Oct 17 update L1029 and Np
2012 Oct 20 update of both Abstracts, official drop of K and R categories, update of U, a few other minor updates