Polish Y-DNA Clades
14-Aug-10
Peter Gwozdz
pete2g2@comcast.net
If this is your first time here,
consider jumping down to the Abstract for a summary of
this web document.
My methods
and results have been published.
The Polish
Project has assignments of men to types as a
subdivision of haplogroup R1a, which is a
category of Y-DNA. The Results Table has a summary of the R1a
assignments. If you know your
assignment you can click on the link in the right column of the Results Table
to read more about your assignment category.
Lawrence Mayka,
administrator of the Polish Project, provides data for this web site of
mine. This web document is for
explanation, details, and update news.
R1b & I1 Haplogroup News
On 20 July I added three R1b Types to the analysis discussion in this web document.
On 31 July I added two I Types to the analysis discussion in this web
document.
This news topic was updated 9 Aug
2010.
SNP results
continue to validate P type and N type.
The SNP called L260
is equivalent to what I have been calling P
type. (Types are
defined by STR values.)
This is preliminary news, based on the first 43 results available to
me. Of 30 positive samples
(L260+ results), all are predicted P type based on STR values. Of 13 negative samples (L260- results), none
are predicted P type based on STR values.
All L260+ are also M458+, except eight L260+ not
tested for M458. 41 of these 43 have
all 67 standard STR markers.
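The tally above can be laid out as a small cross-table of STR prediction against L260 result. This is a minimal sketch in Python using only the counts quoted above; the variable names are illustrative and not part of any published method.

```python
# Counts quoted above: 43 L260 results, cross-tabulated with STR prediction.
# Keys are (predicted P type from STR values, L260 result).
counts = {
    (True, "L260+"): 30,   # predicted P type, tested positive
    (False, "L260+"): 0,   # not predicted P type, tested positive
    (True, "L260-"): 0,    # predicted P type, tested negative
    (False, "L260-"): 13,  # not predicted P type, tested negative
}

total = sum(counts.values())
agree = counts[(True, "L260+")] + counts[(False, "L260-")]
print(f"{agree} of {total} results agree with the STR prediction")
```

With so few results, full agreement is encouraging but still preliminary, as stated above.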
P type and N type samples are coming
out M458+. So far there are no M458-
samples that fit P type or N type. This
is based on 96 M458- test results. 84
of the 96 have all 67 standard STR markers, and none are predicted P type or N
type - not even borderline. This is
consistent with my previous analysis, indicating
that both P type and N type seem to be clades that
resulted from severe population
bottlenecks followed by relatively recent rapid population expansions. P type and N type cannot be completely alone
in haplospace, so eventually a few M458- are bound to
show up that fit P type or N type based on close STR match; that will provide a
measure of the background from other clades.
Two M458+ samples do not match either
P type or N type at 67 markers, not
even borderline. In addition, the
borderline M458+ and L260+ samples (just beyond the types in STR values) are more numerous than
statistically expected. These might be
representatives of small clades that branched off the M458 haplogroup tree
before the main population expansions that led to P and N.
See L260
and M458 Test Results for more discussion about the data available.
Because P type and N type are so well
isolated, these two clades are good examples of the caveats for age
estimation. Traditional models of STR
mutation diffusion do not apply to the M458 haplogroup. More discussion is in my age caveats topic. See my age of R1a1a7 topic
for very rough estimated ages.
News update 14 Aug: One sample just came out M458- that fits P
Borderline, 7 STR mutation steps from the P type definition. P
type has cutoff 5 and gap 3, so this
is the last step of the gap separating P type from the remainder of
haplospace. This is the closest M458-
to P type I have seen; a couple of
previous M458- results were at step 8.
No doubt a closer M458- will be seen as data accumulate over the next
few months. This result reminds us that
the border of a type cannot be predicted exactly. It’s amazing that up to now all samples with
P step less than 8 have been coming out M458+ and L260+. This particular M458- sample is step 9 from
N type, one step beyond the N type gap;
3 other M458- had been seen at step 9.
Polish Project R1a Assignment News
This topic was updated 5 Jul 2010.
If you are R1a
but not a Polish Project member, the Ysearch instructions topic has my method for matching to my types. The news in this topic applies to you if you
know your assignment.
If you are P type
or N type you would likely come out positive in the SNP test for M458 (M458+). If you are P type you are likely L260+. N type is
likely L260-. If you have not already
tested, you can pay the small fee for these SNP tests to confirm that
you belong to the corresponding haplogroup.
If you are assigned to P borderline
or to N borderline you would benefit more from the M458 and L260 tests, because
that would provide for you a definite assignment within R1a.
Assignments are made only at
high probability, so if you are unassigned (category U) there is a low
probability that you would test positive for M458, and that probability
decreases with your step (genetic mutation distance) from P
or N.
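For readers who want to see what a “step” count looks like in practice, here is a minimal sketch in Python. It assumes the simplest stepwise counting, where the distance is the sum of the absolute differences in repeat count over the markers compared; the published method may treat some markers differently. The marker values below are made up for illustration and are not the real P type or N type definition.

```python
def step_distance(sample, modal):
    """Sum of absolute repeat-count differences over shared markers.

    Assumption: simple stepwise counting; multi-copy markers and any
    special rules of the published method are not handled here.
    """
    shared = sample.keys() & modal.keys()
    return sum(abs(sample[m] - modal[m]) for m in shared)


# Hypothetical values, for illustration only (not a real type definition).
modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10}
sample = {"DYS393": 13, "DYS390": 24, "DYS19": 16, "DYS391": 12}

print(step_distance(sample, modal))  # 1 step at DYS390 plus 2 at DYS391
```

A sample identical to the modal haplotype is at step 0; each single-repeat difference adds one step.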
If you have fewer than the standard 67 STR markers it
is generally better to purchase the remaining markers. That way, you are more likely to get an
assignment, because the statistics for STRs improve with more markers. Nevertheless, if you are not many steps from
P or N you might consider doing the M458 test.
There is a slight chance that you might test
positive for L260 or M458 even if you do not match P or N. The haplogroup corresponding to M458 is old enough that there
may be small clades with STR markers very different than P or N. I have not seen one yet, but there is no way
to estimate this probability. I
hesitate to recommend the M458 SNP test for men whose samples are distant from
both P and N in STR values. I admit you
can just wait to see if anyone with STR values similar to yours matches an SNP,
then test for that SNP. However, we all
benefit when some men test for all the new SNPs within an established
haplogroup, because that way we find out the size and rough age of the
corresponding new haplogroup branches. FTDNA offers “deep clade” test packages to test for all
possible haplogroup branches, but my understanding is that L260 and M458 are
not yet included in the R1a deep clade test.
You need to purchase them separately from the advanced markers menu. No doubt FTDNA will add them soon to the
deep clade package.
The Fall issue of the Journal of Genetic Genealogy came out on 21
Nov. My publication is split into two
parts there:
Part I is my “mountains in
haplospace” method for evidence that certain “types” of STR clusters correspond
to clades.
Part II is the application
of that method to Common Polish Clades.
That article has a lot more detail than this web page, but that article
was last updated in September, so this web page is an update.
PolishCladesUpdate is my
folder for future updates to those two articles.
This web page will continue as an
introduction and summary, without as much jargon and detail as the articles and
update folder.
R1a Worldwide
Wikipedia has a nice R1a entry with primary
contribution by Andrew Lancaster.
18 June 2010 confusion comment:
In 2008, new SNPs were discovered to
better define R1a. The prior R1a1 (M17,
M198) is now called R1a1a (M17, M198). FTDNA still assigns samples using the older (early 2008)
nomenclature. There is a draft Y tree at the
FTDNA site, so no doubt they will update soon.
ISOGG has even more detail, with the latest SNP
discoveries. As I rewrite topics here,
I use the ISOGG names for discussion purposes, but I still mention the FTDNA
assignments, which use the old nomenclature.
The old R1a1a (M56) is very rare, not
discussed here. It is now called R1a1a1
(M56).
More than 97% of R1a men are R1a1a
(M17, M198), but FTDNA (and Ysearch and projects) is conservative about
predictions, so many are listed as R1a if they have not tested for recent
SNPs. All others are currently listed as
R1a1 (old nomenclature M17, M198).
22 June 2010; this topic is still
current:
L260 is a new SNP, not published
yet. It has been available as an SNP
test since early April at FTDNA.
M458 is a new SNP, published by Underhill. It has
been available as an SNP test since early November at FTDNA.
FTDNA
has not yet assigned haplogroup names to these, so men who test positive are
not reported on-line yet at FTDNA nor at Ysearch, nor at
the projects supported by FTDNA, which include the Polish
Project.
Both L260 and M458 are listed at
ISOGG, where M458 is called R1a1a7 and L260 is called R1a1a7b. In the draft Y tree at the FTDNA site M458
is called R1a1a7, but L260 is not yet assigned a name there.
22 June 2010 update:
Almost all of R1a divides into R1a1a* (M17, M198), R1a1a7 (M458), and R1a1a7b (L260). These correspond to my original predicted division.
R1a also has several known rare groups: R1a*, R1a1*, and R1a1aN, where N = 1 to 6 and
8. There is also a very rare
R1a1a7a. The asterisk is used for paragroups; R1a1a* means haplogroup R1a1a without any
of those 8 known branches.
The rare R1a groups are not in my R1a
Table. It’s a shame the
corresponding STRs are generally not published in SNP announcements. I don’t know if the rare groups all together
add up to 0.1% or 1% of R1a. Surely
they are less than 3%. My percentage
calculations in my R1a Table do not need adjustment because any Ysearch samples that might belong to these rare clades would
probably have unusual STR values, not falling into one of my types, but still
be counted in the totals. In my R1a Table, rare samples are included in row
R. That row R might have a few percent
from these rare groups, but I don’t know exactly how many.
Underhill mentions 7 samples
(men) from R1a*, 9 from R1a1*, 14 from R1a1a6, and 1 from R1a1a7a.
Lawrence Mayka, the
administrator of the Polish Project, had been assuring me by email that all the
Polish Project member tests within R1a had been coming out negative for all the
rare SNP subgroups. So if you are a
Polish R1a, you are almost surely R1a1a, the same haplogroup as about half the
men from Poland. About half of these -
about 1/4 of men from Poland - are R1a1a7.
These two “about” estimates are approximate; my data on these SNPs are not
random samples, so my population estimates are derived from the types in my table, which are STR based.
On 17 June Mayka informed me of the
first R1a1* (SRY10831.2) (R1a* in the older nomenclature) member in the Polish
Project. My table
does not show this single exception because the table is for samples with 67
markers, which that one exception does not have. On 19 June Mayka informed me of evidence that C
type might define a new rare subdivision of R1a slightly older than
R1a1a; if this turns out correct it
will be less than 1% of R1a.
An article was published online, 4
Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.
Abstract STR
Data See www.gwozdz.org/R1a.html for more
discussion
I call this article “Underhill” for
short, because his is the lead name in the list of 34 authors for this major
work.
This web page about Polish Clades was
completely rewritten using this new information. Recent L260 and M458 test results
are consistent with (albeit not full proof of) my previous R1a subdivision into
“types” here on this web page about Polish Clades.
Briefly, most of R1a1a is split by
this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a*
(M458-). See R1a Subdivision
for a brief summary of other groups, and for a clarification of what R1a1a*
means.
R1a1a7 is the new M458
haplogroup. R1a1a7 includes what I have
been calling P type and N type here on this web page, even before M458 was
available.
R1a1a* is a new paragroup. This is M458 negative. It includes all my other types, particularly
K type.
This Underhill article has data for
158 “Poland” samples (Table 2):
R1a1a*: 71 samples 44.9%
R1a1a7: 87 samples 55.1%
The 70% confidence interval for
R1a1a7 is about 50% to 60% in the Underhill Poland data.
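The interval quoted above can be reproduced approximately with a short calculation. This is a sketch under one assumption: a normal approximation to the binomial distribution, which the text does not state was the method actually used. It gives roughly 51% to 59%, in line with “about 50% to 60%”.

```python
from math import sqrt
from statistics import NormalDist

n = 158   # Underhill "Poland" samples (Table 2)
k = 87    # R1a1a7 (M458+) among them

p = k / n
# Two-sided 70% interval: 15% of the probability in each tail.
z = NormalDist().inv_cdf(0.5 + 0.70 / 2)
half_width = z * sqrt(p * (1 - p) / n)   # normal approximation (assumption)
low, high = p - half_width, p + half_width
print(f"{p:.1%} with 70% interval {low:.1%} to {high:.1%}")
```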
Worldwide 77% of the Underhill data is
R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
M458 results are coming in now for this new SNP test, and the Polish Project R1a is splitting about evenly, with a
few percent more R1a1a7 than R1a1a*, although the latter is more common
worldwide.
Abstract rewrite 22 Jul 2010.
This Abstract is for people
reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you
might prefer to read the Introduction first.
This web document has three
purposes: 1. More detailed explanations
for the men (samples) that I assign to types
in the Polish
Project. 2. Summary of my published results.
3. Update with recent results.
The topic is common Polish Y-DNA clades - identification of male line Y-DNA clusters
that are concentrated in Poland.
Since I originally posted this in
December 2007, emphasis has been haplogroup R1a, because
about half of Polish men are R1a, with no subdivision at that time. A new division,
roughly 50-50, between R1a1a* and R1a1a7, became available in November
2009. In July 2010 I added some R1b analysis. I
will soon expand this page to include clades from other haplogroups that seem
to be concentrated in Poland.
I use the word type
to mean an STR cluster with statistical validity as
established by my Mountain Method. I expect my types to be validated some day by discovery of new SNPs that will qualify them as haplogroups. I chose the word “type” because it is not
generally used in genetic genealogy and I wish to distinguish my types from
haplogroups and from other clusters.
All types have associated clusters but not all clusters qualify as
types. In my publications
and web pages I make it clear which types I have discovered in web data and
which types were suggested to me by others, with references. Usually when I discover a type I later find
out someone else had mentioned it earlier on the web; let me know if you the reader have more clues and references for
me.
Most of the types that I discuss seem to
be 1,000 to 5,000 years old, so all the men in each type seem to be descended
in direct male lines from one man (MRCA) who lived that
long ago (TMRCA).
A few of my types might be younger or older than that range.
I use phrases like “seem to be” over
and over because the methods are statistical.
Click here for a summary of the
conservative automatic haplogroup
assignments in the Polish Project, for an explanation of the extended assignments, for a discussion of the minimum 80% probability for assignment, and for the R1a Results Table.
The Polish Project is considered
representative of Historical Poland, with caveats explained in my Publication.
Abstract rewrite 4 Jul 2010.
About half the men of Polish male
line ancestry belong to the R1a haplogroup.
These are all R1a1a with about 99%
probability. This R1a Abstract is a
summary discussion of the R1a Results Table.
U category. Unassigned.
This is the largest category. On
the Polish
Project Y-DNA Results page, detailed assignments are made with
minimum 80% probability. Because of the restriction to 80% probability, many R1a men in
the Polish Project are not assigned to detailed categories at the Polish
Project web page. Those men go into
this “Unassigned” category. These still
have either R1a or R1a1 automatically
assigned by FTDNA. If you are in
this U category, you can promote yourself out by purchasing the full 67 marker
STR set, since all R1a samples with 67 markers get a detailed
assignment.
I consider the R1a Polish data as 4
major categories based on STR data.
About half the men of Polish male line ancestry belong to the R1a
haplogroup, and that group divides roughly equally into these 4
categories. Since 2007, I have been
calling them P type, N type, K type, and R category. P and N are in the new R1a1a7. K and R are R1a1a*.
R, Remainder, is not a type. I use R for samples that do not belong to
any of the types I have identified in R1a1a* so far.
My overall confidence
in K type is only 85% because there seem to be unidentified types with STR
values close to K. The modal
haplotype for K is essentially the same as the modal
haplotype for all of R1a. However, I
have identified subtypes of K that have much higher confidence. In other words I have higher confidence for
many individual samples. I have high
confidence in the subtypes although I am not sure all the subtypes assigned to
K belong to exactly the same clade along with all the other samples that I have
assigned to K outside the subtypes.
Even if K is not a true clade as defined, however, it is clear that the K
samples belong to branches in the R1a1a* tree with nodes very close to each
other. The only uncertainty is that
there are likely many other samples that belong in other branches just as close
to K.
Borderline
categories are not types but are samples that match types with less than 80%
probability. Each Borderline category
has discussion below.
P type is
concentrated in Poland, rare with increasing distance from Poland. N type seems to be
mostly Slavic, widespread in eastern Europe.
K type corresponds to one of the two largest R1a1
clusters. Another large R1a1a cluster,
the one I call L type, is not common in Poland.
In the table I assign each R1a1a*
(M458-) subtype into either K or R based on how distant the STR values are from
K. Some of these are borderline
however. There is no clean separation
of K from R, so the table should not be considered a high probability separation
of K subtypes from the R remainder subtypes.
Read the individual type discussions to see which subtypes fit K with
high probability; A
type is an example.
Thanks go to Lawrence
Mayka, Polish Project administrator, for extensive email information and
assistance.
You can compare data to my types by
clicking this link to instructions for Ysearch.
Reminder: I am concentrating on Poland.
The statistics of STR clusters depend a lot on the database. For example, P type stands out dramatically
in Polish data. In other countries P
type is rare. If you belong to an R1a1
cluster that is rare in Poland, I’m sorry, but I’m not covering you. K type is an example of a type that is
common both in Poland and elsewhere. M type is common in northwest Europe but so far absent in the
Polish Project.
This Introduction is for people
unfamiliar with the jargon of genetic genealogy.
There are quite a few web sites with
a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. Back issues of JoGG
are good general references. The Y Chromosome Wikipedia
article is about male line DNA, also called Y-DNA.
The following several paragraphs are
a brief introduction to genetic genealogy for Y-DNA, providing some definitions
of jargon needed to read my web pages.
The definition words are boldface.
I often use links to those definitions when I use a jargon word for the
first time in a topic. There are more
boldface definitions in the summary of my Methods.
The Y chromosome gets passed from father
to son, so it works just like a male family name. Men are divided into haplogroups
based on known rare mutations (called SNP) in the Y
chromosome. Division into haplogroups
is done in a manner that has virtually 100% confidence. I say “virtually” because your confidence in
your DNA result from your DNA testing company might be 98% or 99% or
99.9%; the confidence for haplogroups
is better than that. We can be virtually
certain that all the men in a haplogroup descend in direct male lines from one
man, called the “Most Recent Common Ancestor” (MRCA)
for that haplogroup. Time of the Most
Recent Common Ancestor (TMRCA) is an estimate of how
long ago he lived - the age of the haplogroup.
Lots of people are working hard to discover more SNPs on the Y
chromosome so that the haplogroups can be divided further into smaller
haplogroups. I’m doing some work on
this, but I’m not discussing it in this web document.
Haplogroups have alphanumeric codes,
like R1a1a. A paragroup
is a haplogroup considered without its known haplogroup branches. When a new branch is discovered within a
paragroup, it gets removed from the definition; that changes the meaning of that paragroup. An asterisk is usually used in paragroup
codes, like R1a1a*.
Other people, like me in this
document, try to “stay ahead” of the haplogroups by analyzing other mutations
that are not so rare (called STR) on the Y
chromosome. Men submit their Y-DNA data
to various web sites. There are lots of
STR data available on the web. Men are
divided into STR clusters as hypothetical subdivisions of the
haplogroups. All such clusters are
hypothetical. Some will be validated in
the future by new SNP discoveries.
There are various statistical methods for estimating the confidence of
STR clusters. I recently published a method that I developed. That publication has references to other
methods. There is a brief summary of my
method below.
A few STR clusters are small family
clusters, with the same family name.
Y-DNA is biologically accurate, so some men discover that their Y-DNA
does not match the DNA of their male line cousins identified by genealogy
research, due to secret adoptions, illegitimacies, etc. This is one of the reasons some people
prefer to avoid genetic genealogy. The
male line associated with the Y-chromosome is only one ancestral line. Humans have 23 pairs of chromosomes. Anyone who tries to make a family tree going
back 300 years has more than a thousand root tips to be filled by names of
ancestors who lived back then; the one
man at the tip of the male line root is only one of those thousand. That is another reason some genealogists
avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of
many. That said, many people enjoy the
challenging hobby of figuring out to which ancient extended male line they
belong.
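The “more than a thousand root tips” figure is simple doubling arithmetic. A minimal sketch, assuming about 30 years per generation (a round figure chosen here for illustration; the text does not give one):

```python
YEARS_BACK = 300
YEARS_PER_GENERATION = 30   # assumption; the text does not give a figure

generations = YEARS_BACK // YEARS_PER_GENERATION
# Each generation back doubles the number of ancestor slots in the tree.
ancestors = 2 ** generations
print(generations, ancestors)
```

Only one of those slots is the male line ancestor that Y-DNA traces.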
Most STR based clusters have an MRCA
who lived thousands of years ago, before family names were common, so most men
assigned to a typical cluster do not have the same family name.
Many SNP based haplogroups have an
MRCA who lived more than ten thousand years ago, so these span multiple ethnic
groups and nationalities. For example,
the R1a
haplogroup is of interest to me. R1a is
most common in Slavic countries but calling R1a Slavic is misleading because it
is found throughout Europe and west Asia.
The MRCA lived so long ago that he may have spoken a language that we
would not consider Slavic if we could hear it.
It is possible that he did not even live in what is now the Slavic
region of Europe; maybe his descendants
moved there in a massive migration from the Asian steppes, or from India. No one knows for sure. Even if he was proto-Slavic in language and
culture, by now some of his descendants long ago moved to other parts of Europe
and Asia. One of the appeals of genetic
genealogy is trying to figure out ethnic descent and migration from the
statistics of haplogroups. Some people
object, pointing out that ethnicity cannot be defined genetically because of
all the moving and mixing of people over the millennia, and because the Y
chromosome is only one of many. True
enough. Some individuals and some web
sites go too far with genetic claims.
That said, statistical analysis of haplogroup data provides many clues
on human origins.
Again, some people try to stay ahead
of haplogroups, using statistical analysis of STR based clusters to gain
insight into more recent human origins.
I am one of those people. My
interest is Polish origins. This web
document, however, is not for the historical analysis and conclusions, except
for occasional comments to remind us of the goal. This document is dedicated to STR data and analysis, identifying
clusters concentrated in Poland, with detailed explanations.
The bottom of my Method section has
more definitions for a number of genetic genealogy
terms.
There are a number of organizations
and commercial companies on the web where you can order a cheek swab kit to
mail in for genetic genealogy analysis, for example FTDNA. I am not associated with the company
FTDNA; I mention them because I make
extensive use of their data; check
Google for competitors. At FTDNA, click
on Products for cheek swab kits. DNA
results are confidential unless you register the data at a database; at FTDNA, click on Projects to register your
data into one of the many databases;
for example, most of my analysis is from the data in the FTDNA Polish Project.
I use the FTDNA standard set of 67 STR markers (plus a few
non-standard ones occasionally). I do
some analysis using the standard FTDNA 12, 25, or 37 STR marker sets. Other companies use standard marker sets
that may not overlap with all the FTDNA markers.
Ysearch is the
largest web database for Y-DNA, run by FTDNA, open to all men, including men
who also register with projects and including men with data from other testing
services. I use Ysearch often for
analysis so of course I encourage you to register your Y-DNA data at
Ysearch. From the FTDNA site, you can
register your data with Ysearch. Or you
can type your Y-STR data into Ysearch.
Format
Up to here, I have tried to write
this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity
with genetic genealogy jargon has understood.
If you read this top to bottom, it gets progressively more detailed,
with more and more jargon. I’m sorry
about that, but the audience is also readers with genetic genealogy experience
who want to know how I came to my conclusions.
If you cannot follow some of this, it is written in a manner that you
can jump around and pick out what you do understand, then come back after you
have read more about genetic genealogy.
If you open this html document with
Word, all the link targets (bookmarks) can be viewed alphabetically or by
location.
This topic was updated 9 Aug 2010.
This topic provides discussion. Read L260 and M458
News first, for a summary. For more
numbers see the following topic.
L260 and M458 are the names of two SNP tests.
Lawrence Mayka
is the administrator of the Polish Project. SNP results are not posted on the web. Most of my SNP
data comes from Mayka. Some of my data
comes from Cyndi Rutledge, the administrator of the R1a Project.
Many men join both projects, but of course many men purchase the L260 or
M458 test and do not join either. If
you are an administrator of an FTDNA project (or a project at another database)
you may send me the L260 and M458 results for your project for merging into my
analysis, if you wish. Karen Melis, the
administrator of the Zamagurie Project, also sent me a few M458 results.
Data with the 67
standard markers is most common because Mayka and I selected these for the
initial tests. In addition, men who
have purchased fewer than the standard 67 markers are less likely to purchase SNP
tests.
Mayka and I purchased many L260 and
M458 tests for Polish Project members, so test results available to me are
biased toward Polish data. Also, I
suppose men who notice my publication and web pages about Polish types are more
likely to purchase the L260 and M458 tests, so even data not available to me
might be biased toward Polish data. At
first we were concentrating on samples that match P type and N type very well,
so much of the data available to me are biased toward P type and N type, of course. Later we concentrated on borderline samples
that just barely match P type and N type, in order to better define the borders
in STR haplospace.
If there are clades from outside M458 (outside haplogroup R1a1a7) that just
happen to have STR values that match P type or N type we will discover them
quickly, but not if they are concentrated far from Poland, and particularly not
if they are concentrated in any Eurasian lands where men do not tend to get DNA
tests. If there are M458 clades with
STR values very different from P type or N type it will take some time to
discover them all, because finding them requires men without
an M458 prediction to take “deep clade” tests or do the M458 test anyway. So far I have results for only 2
such samples distant from both P and N, and they are very different from each
other in STR values; they might be
statistical flukes due to the random nature of STR mutations, or they may be
valid singleton representatives of 2 small older clades that branched off the
M458 tree trunk long ago. These two are
discussed further below.
The SNP results do not provide
estimates of population frequency because we are selecting the most interesting
samples for SNP tests. However, since
the SNP tests verify my type classification, my STR types provide credible
frequency estimates. My Results Table is still the best estimate of
frequencies in Poland: P type for M458+
L260+; N type for M458+ L260-.
P Borderline comments at 67 markers: The data are listed in the following topic. Two of the L260+ have P=7 (step
7 from the P type modal haplotype),
which is the 3rd step beyond the P type cluster (cutoff
definition P<5). P type has a gap of 3 steps (P = 5 to 7) where sample frequency is low and prediction is uncertain. I would have predicted these two “P
Borderline” a couple of years ago, because the N and K steps are high. The Polish Project has only 7 samples in the
P gap, including these two; 6 have been tested L260+. The 7th, at the cutoff P=5, not tested for SNPs, is classified P
Borderline for now. I also have data
for another gap sample at P=6 not in the Polish Project, tested L260+. The gap is the most likely place to find P
Borderline predicted samples that are L260-, but I have seen none so far. With only 8 samples, the probability is
uncertain. P=8 is common; that is
beyond the borderline. Most P=8 are
predicted N type and some are predicted other types with M458-; there are no L260+ outliers yet
at P=8. P=4 is common, with 7 samples in the
Polish Project; 3 have been tested for L260, all L260+. Although this is little data at P=4, the
last step of the P type mountain, confidence in P type
prediction is boosted by the absence of background in the gap. Conclusion: P type samples with cutoff 5 are haplogroup M458+L260+ with high
probability. Gap samples with P = 5 to
7 are probably P type with uncertain probability.
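The cutoff and gap described above can be written as a small lookup. This is a sketch of the geometry only, assuming that steps below the cutoff are inside the type, the next “gap” steps form the gap, and anything further is beyond it. Actual assignments also weigh the steps from other types and any SNP results, which this function (a hypothetical helper) ignores.

```python
def classify_step(step, cutoff, gap):
    """Place a step count relative to a type's cutoff and gap.

    Assumption: steps below the cutoff are inside the type's mountain,
    the next `gap` steps are the gap, and anything further is beyond it.
    """
    if step < cutoff:
        return "type"
    if step < cutoff + gap:
        return "gap"
    return "beyond"


P_CUTOFF, P_GAP = 5, 3   # from the text: cutoff P<5, gap P = 5 to 7

for step in (4, 5, 7, 8):
    print(step, classify_step(step, P_CUTOFF, P_GAP))
```

The same function applies to N type with cutoff 7 and gap 2, where step 9 falls one step beyond the gap.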
N Borderline comments at 67
markers: The analysis is similar to P
type. The cutoff is 7 and the gap is 2
(N = 7 to 8):
N Step | Number in Polish Project | Predicted N Type | Number tested | Others Tested (not in Polish Project)
5 | 12 | 12 | 6 M458+ | 1 N type M458+
6 | 7 | 7 | 2 M458+ | 2 N type M458+
7 | 3 | 3 | 3 M458+ |
8 | 3 | 1 | 1 N type M458+ | 1 N type M458+
9 | 15 | 3 | 2 N type M458+ |
At N = 5 or 6, samples without the
M458 test are classified N Borderline at the Polish Project web page because
the statistics are just now approaching 80% probability
for N type assignment near the cutoff without the SNP test.
At N=7 cutoff, 3 samples: As with P type,
the frequency at the cutoff is low. All
3 would be predicted N borderline and all 3 came out M458+.
At N=8, 2 of the 3 samples would have
been predicted N Borderline a couple of years ago. One of those is now N type because of
an M458+ test result. The other is now
G type, a type that did not have enough data for prediction a couple of years ago. The 3rd one is now classified D type, a
subtype of K. One additional sample
from the R1a project has N=8, also tested M458+.
N=9 is not a good indicator for N
type because many of these are P type or M458- types. Only those that do not match any of the other types can be predicted
N Borderline, with low confidence. The
Polish Project has 2 of these; both are
M458+.
N=10 does not have any of these. There are two outliers, one each at N=11 and
N=12, that would not be predicted N borderline but came out M458+. Those latter two might be outliers due to
the luck of random mutations. I prefer
the interpretation that N type is an isolated mountain
in haplospace due to a vigorous population expansion in recent history
following a bottleneck. On this basis,
I speculate that those two at N = 11,12 represent small clades that existed
before the bottleneck and survived, with nodes on the M458 tree older than the
bottleneck. I classify these two as
“M458+ Remainder” to distinguish them from the main N type clade. What I really mean is a “paratype” of
samples that are M458+ but neither N type nor P type. (The Results Table needs to be updated for this new category.) There is no way to figure the probability
that my interpretation is correct, but my classification tells you that I’m
guessing 80% probability. My publication has discussion
about how a type without structure should not have a tail to the distribution
of frequency vs step, so I am confident that the high N value samples represent
structure in the tree of the type.
Those other two at N=9 might also be such structure, with a node older
than a prominent bottleneck, but I’m leaving them in N for now (N Borderline without
M458+ result) because the 80% probability rule does
not apply to borderline categories.
Summary: so far all the M458- samples are falling well beyond the N type
cutoff. M458+ samples have a
distribution tail of outliers beyond N.
Conclusion: N type samples
with cutoff 7 are paragroup M458+L260- with high probability. Samples beyond N type belong to this
paragroup with low probability. Both P
type and N type are coming out very isolated - no foreign samples have shown up
yet close to the two mountains in haplospace.
L260 and M458 Test Results; Details
This topic was updated 9 Aug 2010.
42 L260 Polish Project
    29 L260+
    13 L260-
1 L260 R1a Project, counting only those not in the Polish Project
    1 L260+
43 L260 Total
    30 L260+
    13 L260-
138 M458 Polish Project
    70 M458+
    68 M458-
38 M458 R1a Project, counting only those not in the Polish Project
    10 M458+
    28 M458-
7 M458 Zamagurie Project, not in lists above
    7 M458+
183 M458 Total (sum from above)
    87 M458+
    96 M458-
226 Total SNP test results (sum from above)
191 unique samples (191 different men)
    161 have all 67 standard STR markers
    27 have the 37 standard set
    1 has the 25 standard set
    2 have the 12 standard set
76 M458+ results at 67 markers
(includes 8 L260+ not tested for M458, assumed M458+):
    35 P<5 solid P type (P<5 means <5 mutation steps from P type definition)
    (includes 7 L260+ not M458 tested)
        2 N=9 all solid P type well isolated from N
        5 N=10
        28 N>10
    3 P=5 at cutoff just outside the P mountain; all 3 are L260+
        these would be classified P Borderline without that L260 test
        N=11,11,13 well isolated (one L260+ not M458 tested)
    2 P=6, cutoff +1, N=14,16
        P still well isolated from N, but not statistically significant yet
    2 P=7, cutoff +2, N=13,14
    0 P type with P>7; the remainder are all M458+ and none are L260+
    25 N<7 solid N type
        2 P=8 not considered P Borderline because N=4,6
        3 P=9 not P Borderline; N=2,4,4
        20 P>9
    3 N=7 at cutoff just outside the N mountain
        1 P=8 not considered P Borderline because P is 3 steps beyond P cutoff
        2 P=11,12
    2 N=8, cutoff +1, P=8,9
    2 N=9, cutoff +2, P=11,15
    0 N=10
    1 N=11, P=9 very far from both P and N borders (cutoff +4)
    1 N=12, P=12 very far from both P and N borders
        this one has DYS572 null, value 8 used to provide step 3 at this marker
    Those last 2 samples are mentioned in the L260 and M458 News and in the previous topic.
    These are possible representatives of small old M458 clades.
    That N=12 sample is L260-, but that N=11 sample has not been tested for L260, probably negative.
    We are calling these two “M458+ Remainder”.
76 M458+ results at 67 markers again, this time sorted by K (cutoff 4)
    1 K=4, (P,N)=(2,10) P type L260+
    8 K=5, all P type, P = 1 to 5, one P=5 is L260+; N = 10 to 13
    12 K=6; all P type
    55 K = 7 to 13
85 M458- results at 67 markers; none match closely to P or N type in STR values
    0 at the cutoffs (P=5, N=7)
    0 at cutoff +1
    2 at cutoff +2 - Borderline
        2 N=9, P=11,12 - these are G and D type
    4 at cutoff +3 - Borderline, but pretty far out
        2 P=8, N=13,15 - these are I and K Borderline
        2 N=10, P=13 - these are D and D Borderline
    79 N>10, P>8 clearly STR prediction not M458 - verified
37 marker comments (27 samples):
    18 M458+ categorized P, N, N Borderline
        There is one sample L260+ with only 37 markers, at P=1, consistent with a P type prediction.
        All but one are consistent with the 67 marker results - P type and N type - although some are N
        Borderline, where it is difficult to predict at 37 markers. That one exception has P,N = 10,10, but
        DYS388=8 (modal 12), producing 4 of those 10 steps at one marker, so this one
        may just be a statistical outlier.
    9 M458- all have high P and N values
        none predicted P or N
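The tallies in this topic are sums of smaller counts, so they can be cross-checked mechanically. The following is a minimal sketch in Python, assuming only the counts quoted above; the variable and function names are illustrative and not part of any project file.

```python
# Counts copied from the SNP tallies in this topic (9 Aug 2010 update).
# Each entry is (positive results, negative results).
l260 = {"Polish Project": (29, 13), "R1a Project only": (1, 0)}
m458 = {
    "Polish Project": (70, 68),
    "R1a Project only": (10, 28),
    "Zamagurie Project only": (7, 0),
}


def totals(results):
    """Return (positives, negatives, all results) summed over projects."""
    positives = sum(pos for pos, _ in results.values())
    negatives = sum(neg for _, neg in results.values())
    return positives, negatives, positives + negatives


l260_totals = totals(l260)
m458_totals = totals(m458)
all_snp_results = l260_totals[2] + m458_totals[2]

# Unique men grouped by STR marker set: 67, 37, 25, 12 markers.
unique_men = sum([161, 27, 1, 2])

# Top-level lines of the M458+ list at 67 markers (P side, then N side).
m458_pos_67 = sum([35, 3, 2, 2, 0]) + sum([25, 3, 2, 2, 0, 1, 1])
# Top-level lines of the M458- list at 67 markers.
m458_neg_67 = sum([0, 0, 2, 4, 79])

print(l260_totals, m458_totals, all_snp_results, unique_men)
print(m458_pos_67, m458_neg_67)
```

The same pattern can be rerun whenever a count is updated, so that a changed line and its parent total do not drift apart.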
Haplogroups in the Polish Project
The Polish Project on 25 May 2010 had
1306 members, of which 17 are family set duplicates, for 1289
independent members. This reduction for
family sets is explained in my publication.
Here are the FTDNA haplogroup
assignments. The left side of the table
has the totals by major haplogroup. The
right side provides only those haplogroup branches for which there are more
than 25 samples; the many
haplogroup branches with fewer than 25 samples are not listed here:
| Haplogroup | Count | Percent | Common Branch | Count | Percent |
| -          | 1     | 0.1%    |               |       |         |
| C          | 5     | 0.4%    |               |       |         |
| D          | 1     | 0.1%    |               |       |         |
| E          | 88    | 6.8%    | E1b1b1        | 61    | 4.7%    |
| F          | 4     | 0.3%    |               |       |         |
| G          | 37    | 2.9%    |               |       |         |
| H          | 2     | 0.2%    |               |       |         |
| I          | 209   | 16.2%   | I1            | 63    | 4.9%    |
|            |       |         | I2a           | 83    | 6.4%    |
| J          | 104   | 8.1%    | J1            | 28    | 2.2%    |
|            |       |         | J2            | 49    | 3.8%    |
| L          | 3     | 0.2%    |               |       |         |
| N          | 95    | 7.4%    | N             | 54    | 4.2%    |
|            |       |         | N1c1          | 30    | 2.3%    |
| Q          | 16    | 1.2%    |               |       |         |
| R          | 711   | 55.2%   | R1a           | 66    | 5.1%    |
|            |       |         | R1a1          | 509   | 39.5%   |
|            |       |         | R1b1b2        | 76    | 5.9%    |
| T          | 13    | 1.0%    |               |       |         |
| Total      | 1289  | 100%    |               | 1019  | 79%     |
The left side of the table is a good
estimate of the haplogroup frequencies in Historical Poland, insofar as the
Polish Project is representative of Historical Poland, as discussed in my publication.
The right side of the table is not
representative. This is my next point
for discussion:
About half of those haplogroup FTDNA
assignments in the table above are based on SNP measurements (green text for
haplogroup assignments by FTDNA). About
half are haplogroup predictions based on STR values for samples that do not
have SNP measurements, using an FTDNA proprietary method (red text). The FTDNA predictions are very conservative
- at least 99% probability - they almost always come out as predicted when an
SNP test is done. Because they are
conservative, most FTDNA predictions really belong to a branch subdivision of
the trunk haplogroup to which they are assigned. In addition, many of the green assignments based on SNP
measurements are old, so these do not reflect new SNPs that became available
since they were last tested.
In other words, a conservative
prediction has a low probability of being wrong, but it also has a high
probability of being incomplete.
For example, those 76 men assigned to
R1b1b2 are really mostly R1b1b2a1, and there are SNPs available to further
assign them to eleven smaller haplogroup branches of R1b1b2a1, so those 76 men
can easily obtain more detailed assignments by purchasing the SNP tests. Similarly, those E, I, J, and N men have
more detailed assignments available through SNP tests.
R1a is very different. Almost all of those 66 R1a samples are
really R1a1. They are classified R1a
either because (1) their STR values are unusual, so a very conservative
prediction cannot be made, or (2) they have an old R1a test, made before R1a1
was available, so they remain listed R1a even though they could be
conservatively predicted R1a1.
In other words, there are really more
than 570 R1a1 samples, 44% of the men in the Polish Project, for whom more
detailed automatic haplogroup assignment is not available through FTDNA. That’s one motivation for this web page of
mine.
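The percentages in the haplogroup table, and the R1a1 estimate just given, follow directly from the counts. Here is a minimal sketch in Python that recomputes them, assuming only the counts printed in the table; the dictionary and function names are illustrative.

```python
# Counts copied from the haplogroup table (Polish Project, 25 May 2010).
haplogroup_counts = {
    "-": 1, "C": 5, "D": 1, "E": 88, "F": 4, "G": 37, "H": 2,
    "I": 209, "J": 104, "L": 3, "N": 95, "Q": 16, "R": 711, "T": 13,
}
branch_counts = {
    "E1b1b1": 61, "I1": 63, "I2a": 83, "J1": 28, "J2": 49,
    "N": 54, "N1c1": 30, "R1a": 66, "R1a1": 509, "R1b1b2": 76,
}

total_members = sum(haplogroup_counts.values())


def percent(count, digits=1):
    """Share of the independent project members, as a rounded percentage."""
    return round(100 * count / total_members, digits)


haplogroup_percent = {name: percent(n) for name, n in haplogroup_counts.items()}

# Samples covered by the right side of the table (branches with > 25 samples).
listed_in_branches = sum(branch_counts.values())
listed_percent = percent(listed_in_branches, 0)

# Upper bound for true R1a1: every R1a-labelled sample plus the R1a1 samples.
r1a1_upper_bound = branch_counts["R1a"] + branch_counts["R1a1"]
r1a1_upper_percent = percent(r1a1_upper_bound)

print(total_members, haplogroup_percent["R"], listed_in_branches, listed_percent)
print(r1a1_upper_bound, r1a1_upper_percent)
```

Because almost all, but not necessarily all, of the 66 R1a samples are really R1a1, the last figure is an upper bound; the text's "more than 570" sits just below it.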
This topic was updated 5 Jul 2010.
Lawrence Mayka
is the administrator of the Polish Project.
Click on the Polish Project web link to see
how Larry assigns samples (men) to categories. The Polish Project has sections for mtDNA
and for Y-DNA. This web document of mine is restricted to Y-DNA, with emphasis on R1a. I help
Larry with assignments to R1a types.
Haplogroups are defined by SNP mutations. STR mutations are easier to test, so many samples have STR data
without SNP data. Predicted assignments are based on STR
correlations.
I mentioned above that FTDNA automatic haplogroup
predictions (red text means STR predicted vs green text SNP measured) have
about 99% probability. We use minimum 80% estimated probability for each individual sample in
the Polish Project that gets an extended assignment - a subdivision of its
FTDNA assignment. At 80%, many more
assignments are possible. Most extended
assignments are better than 80% probability.
Many are better than 95%.
Many samples do not have extended
assignments, but they still have their FTDNA green measured haplogroup (100%
probability) or their FTDNA red predicted haplogroup (99% probability). These bring up the average for the Polish
Project as a whole.
We are confident that the average is
better than 95%, which is to say that more than 95% of the Polish Project
samples would test positive for the SNP corresponding to their assigned
haplogroup. Excluding R1a the average
is likely more than 97%.
E1b1b1a2 (V13) is an example of a
haplogroup category with some extended assignments: Larry has me in this category, which is 100% correct because I
tested positive for the V13 SNP along with 14 other men in the Polish Project
(on 25 May 2010). However, Larry’s
listing includes 48 men in this category, based on his analysis of STR
correlations:
15 green E1b1b1a2. These are of course certain.
28 red E1b1b1 because FTDNA does not
predict beyond that, but these would likely be E1b1b1a2 if tested, because they
have STR values close to those samples that have tested V13+, and unlike the
samples that have tested positive for other branches of E1b1b1. Each has at least 80% probability, and many
have a higher probability of being correct.
2 green E1b1b1 tested for that
previous SNP but not for the current V13, but matching in STR values.
3 green E1b1b1a tested for that
previous SNP but not for the current V13, but matching in STR values.
Note that other E1b1b1 men, both
green and red, fall into other categories at the Polish Project, because they
do not match V13+ samples closely in STR values.
R1a is unique because almost half the
Polish Project samples are placed together by FTDNA into R1a1 (M198), which is elsewhere called R1a1a (M198). Many of our R1a assignments are to types,
which are hypothetical, without known SNP definitions. The minimum 80%
estimated probability still applies to each sample and again most are much
better than 80%. For type definitions
we are confident that the average is about 90%, which is to say that about 90%
of the Polish Project R1a samples assigned to a type would test positive
someday for an SNP, unique to that type, not yet discovered.
“Cluster” and “Borderline” and “Unassigned”
category probabilities are discussed below.
I have been active helping Larry with
R1a assignments to types since late 2007.
During early 2010, I was busy with other things so Larry introduced a
few more R1a clusters, which I have now (June 2010) analyzed and documented
here as types.
The Polish Project on 25 May 2010 had
639 independent samples (members) with the full 67 marker STR set. Many of the samples with fewer than 67
markers are in category U (unassigned).
Polish Project Assignments at 67 Markers are taken as representative of
Poland, with caveats explained in my Publication.
272 of the 639 are R1a.
Click on the link in the far right
column to jump down to more discussion for that type.
Read the R1a
Abstract for a brief summary of this table.
Those Types and Subtypes
are my own code letters, for brevity.
Please do not confuse these code letters with official haplogroups. I
have been using such code letters for R1a assignments in the Polish Project for over 2 years. The color coding is for ease of comparison
on my web pages.
This
table was updated based on 25 May 2010 assignments, R1a, at 67 markers, 272
samples:
| Cluster | Group  | Type | Subtype | Subcluster | Samples | Polish % | Ysearch | Link |
| P       |        |      |         |            | 58      | 9.1%     |         |      |
|         | R1a1a7 | P    |         |            | 57      | 8.9%     |         |      |
|         | R1a1a7 |      |         | PB         | 1       | 0.2%     |         |      |
| N       |        |      |         |            | 56      | 8.8%     |         |      |
|         | R1a1a7 | N    |         |            | 44      | 6.9%     |         |      |
|         | R1a1a7 |      |         | NB         | 12      | 1.9%     |         |      |
| K       | R1a1a* | K    |         |            | 86      | 13.5%    |         |      |
|         | R1a1a* |      | K       |            | 24      | 3.8%     |         |      |
|         | R1a1a* |      | A       |            | 12      | 1.9%     |         |      |
|         | R1a1a* |      | B       |            | 6       | 0.9%     |         |      |
|         | R1a1a* |      | E       |            | 16      | 2.5%     |         |      |
|         | R1a1a* |      | F       |            | 6       | 0.9%     |         |      |
|         | R1a1a* |      | H       |            | 3       | 0.5%     |         |      |
|         | R1a1a* |      | I       |            | 13      | 2.0%     |         |      |
|         | R1a1a* |      | J       |            | 6       | 0.9%     |         |      |
| R       |        |      |         |            | 72      | 11.3%    |         |      |
|         | R1a1a* |      |         | KB         | 32      | 5.0%     |         |      |
|         | R1a1a* |      | C       |            | 1       | 0.2%     |         |      |
|         | R1a1a* |      | D       |            | 11      | 1.7%     |         |      |
|         | R1a1a* |      | G       |            | 14      | 2.2%     |         |      |
|         | R1a1a* |      |         | R          | 14      | 2.2%     |         |      |
| L       | R1a1a* |      |         |            | 0       | 0%       |         |      |
|         | R1a1a* |      | M       |            | 0       | 0%       |         |      |
| U       | R1a1a* |      |         |            | 0       | 0%       |         |      |
| Totals  | R1a1a  |      |         |            | 272     | 42.6%    |         |      |
The Ysearch
links provide the full modal haplotypes, using
a selected subset of the standard FTDNA set
of 67 markers. I entered this data into
Ysearch for our convenience. All my
modal haplotype definitions
are available in the Excel file Haplotypes.xls,
which also has experimental types not mentioned here. Below are Ysearch
instructions for quickly comparing your haplotype to all my types at once.
Assignment to types is with at least 80% estimated probability.
The estimated percentages for P, N, K,
and R in the Results Table add up to 42.6%, which is the percent of R1a in the
Polish Project at 67 markers.
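The Results Table percentages are each cluster's sample count divided by the 639 samples with 67 markers. A minimal sketch in Python, assuming only the counts shown in the table (the names are illustrative), reproduces them and shows why the R1a total is better computed from the counts than from the rounded percentages:

```python
# Sample counts copied from the Results Table (67 markers, 25 May 2010).
TOTAL_67_MARKER_SAMPLES = 639
cluster_samples = {"P": 58, "N": 56, "K": 86, "R": 72}

# Rows listed under the K and R clusters in the table.
k_rows = {"K": 24, "A": 12, "B": 6, "E": 16, "F": 6, "H": 3, "I": 13, "J": 6}
r_rows = {"KB": 32, "C": 1, "D": 11, "G": 14, "R": 14}


def polish_percent(samples):
    """Share of the 67-marker samples, rounded to one decimal place."""
    return round(100 * samples / TOTAL_67_MARKER_SAMPLES, 1)


cluster_percent = {name: polish_percent(n) for name, n in cluster_samples.items()}

r1a_samples = sum(cluster_samples.values())
r1a_percent = polish_percent(r1a_samples)

# Adding the already-rounded percentages can differ in the last digit.
sum_of_rounded = round(sum(cluster_percent.values()), 1)

print(cluster_percent, r1a_samples, r1a_percent, sum_of_rounded)
```

The per-cluster values are rounded individually, so their sum can land one tenth of a percent away from the total computed directly from 272 of 639.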
Description of the R1a Types
Click the Ysearch web links in the Results Table for modal haplotypes, which are my best
fits of web data to groups of men with similar STR data. See also the haplotype Excel file at PolishCladesUpdate.
Please don’t get confused. The following capital letters are my code
for R1a types.
Capital letters are also used for the large official haplogroups, but that’s different.
Some of the following types have my
Excel analysis file for my November 2009 publication; the files are stored in the Supplementary
folder. Many of the following types
have my update Excel analysis at PolishCladesUpdate.
A.
Ashkenazi. This seems to be a
subtype of K. This type is discussed in
my publication, Part II. I have about 90% confidence in that subtype status,
but I am more than 98% certain that A is a valid clade, not just because of my
work, but because the modal haplotype closely matches the various versions of
the most common Ashkenazi haplotype, which has been widely studied and reported
on the web. It should be emphasized
that not all Ashkenazi match this type, and some men in this type may not be
descended from Ashkenazi. This type is
not restricted to Poland. Levy-Coffman wrote an article
about Ashkenazi genetic genealogy; I
noticed discussion in a recent Science
article.
B.
Another subtype of K, recently identified by Mayka. Concentrated in Poland. The B data cluster lies at the edge of the K
cluster. The node for B type in the R1a
tree might be slightly younger or slightly older than the K definition
node. I estimate the former is about
80% probability - that B is truly a subtype of K; if not then B probably lies just outside of K (node slightly
older). Individual assignments to B
type have 80% to 90% probability.
C.
Added to Polish Project in Dec 2009 by Mayka, who
points out that Didier Vernade originally pointed out the unusual DYS392=13
value in 2007. Very small cluster. Only 2 Polish Project members, only 1 at 67
markers, but this type is well isolated on Ysearch, with 4 different samples
with 67 markers. None on Ysearch are
identified as “Poland”. Because this
type differs very much in STR values from the rest of R1a1, C type might be a
very old clade, with a node in the R1a1 tree much older than the other types. Mayka pointed out to me that DYS392=11 is
almost universal in R1a1a, and the 13 value for C type might be the ancestral
value for R1a. One of the C type Polish
Project members has SNP data through 23andMe, and Mayka is working to verify if
this data defines a new paragroup level within R1a.
D.
Concentrated in Poland. This
type was added to the Polish Project in Jan 2010. The cluster was brought to my attention by Mayka,
who points out that Nordtvedt mentioned the cluster in
web discussions some time ago, based on the very rare DYS462=12 value. DYS462 is not one of the FTDNA standard markers; it is a standard at Sorenson; DYS462 is available in data on Ysearch. I did an
analysis using the 67 FTDNA markers;
the SBP came out 18.4%, better than 80% probability just on that basis for the best fit
samples. However, 462 would
significantly reduce SBP, so probability of validity of a clade corresponding
to D is quite high considering 462. On
9 Jan, only 5 of the samples that fit D type in the Polish Project had been
tested for 462 and all 5 had that rare 12 value. Since January more men from D type, plus men with STR values just
beyond D type, have been testing for DYS462.
At this update, 15 March, there are 8 men confirmed DYS462=12 and 4 men
with STR values close to D type that have DYS462=11 (standard value for
R1a). The original modal haplotype used
61 markers; last week I spent a little
more time on analysis and found a 66 marker modal haplotype, with SBP = 18.1%,
cutoff 14; I updated the Ysearch D
type. All 12 of the DYS462=12 men come
out in the D type cluster using this D66 modal. One man at step 6 has not been measured for DYS462, but step 6 is
well within D type, so I classified him D type.
Two more men without DYS462 evaluated come out at distance 9 and 12, so
these are classified D Borderline pending their DYS462 measurement, which is
coming.
Two of the DYS462=12 men in D type
come out at step 13, the last step (cutoff 14). There are also two men with DYS462=11 at step 13, the only men
with 11 in the type. The other men with
11 are at steps 16 and 17. I did not
use DYS462 in the D type definition; if
I had used it those 2 men with the 11 value at step 13 would come out step 14,
so on that basis I removed those 2 men from D type, for 11 total D type at this
time. If all this discussion is
difficult to follow, please consult column CL of the sheet “Calculator” in the
file “DType.xls”, at my update folder, where
you can see the data arranged in order of step.
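The step values used throughout this discussion can be illustrated with a short calculation. The sketch below, in Python, assumes that step is the sum of absolute repeat-count differences between a sample and the modal haplotype over the markers in the type definition, which is consistent with the DYS388=8 versus modal 12 example counting as 4 steps. The tiny haplotypes are illustrative only, not real project data, and any special handling of multi-copy or null markers in the Excel files is not modelled.

```python
def step_distance(sample, modal):
    """Sum of absolute repeat-count differences over the modal's markers.

    Markers missing from the sample are skipped (an assumption of this sketch).
    """
    return sum(
        abs(sample[marker] - modal_value)
        for marker, modal_value in modal.items()
        if marker in sample
    )


# One marker far from the modal contributes several steps at once.
dys388_steps = step_distance({"DYS388": 8}, {"DYS388": 12})

# Illustrative three-marker definition and a sample two steps away from it.
modal_without_462 = {"DYS388": 12, "DYS392": 11, "DYS481": 21}
sample = {"DYS388": 12, "DYS392": 11, "DYS481": 23, "DYS462": 11}
steps_without_462 = step_distance(sample, modal_without_462)

# Adding DYS462=12 to the definition moves a DYS462=11 sample out by one step.
modal_with_462 = dict(modal_without_462, DYS462=12)
steps_with_462 = step_distance(sample, modal_with_462)

print(dys388_steps, steps_without_462, steps_with_462)
```

This is the mechanism behind the two DYS462=11 men at step 13: including DYS462 in the definition would add one step for each of them and place them at the cutoff of 14.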
D type also has the unusual DYS481=21
value; only 10 samples in the Polish
Project R1a have this value, and all 10 come out D type. One man has the very unusual 20 value (one
of those 2 with DYS462 coming soon). 23
is the standard value for DYS481, and all the men with step just beyond D type
have 23. In other words, the STR pair
(481,462) = (21,12) seems to be a very high confidence indicator of D type
within R1a1a. I confidently expect that
someday an SNP will be discovered corresponding to this STR pair of values,
elevating it to a haplogroup.
Unfortunately, Sorenson does not use the 481 marker, so there are only 3
R1a1 samples on Ysearch with the D type signature pair (481,462) =
(21,12); all 3 are Polish Project
members now assigned to D type. (There
are 2 others on Ysearch with this very rare signature pair in other haplogroups
- coincidence - a reminder that the STR pair (481,462) should not be used
without first establishing membership in R1a1a.)
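The signature test described above has two parts: haplogroup membership first, then the STR pair. A minimal sketch in Python follows, with a hypothetical function name and a plain dictionary standing in for a haplotype record; it is an illustration of the rule, not a tool used by the project.

```python
# The D type signature pair quoted above: (481, 462) = (21, 12).
D_SIGNATURE = {"DYS481": 21, "DYS462": 12}


def has_d_signature(haplotype, is_r1a1a):
    """True only for R1a1a samples carrying both signature values.

    A sample missing either marker returns False, since the pair
    cannot be confirmed without both results.
    """
    if not is_r1a1a:
        # The same pair occurs by coincidence in other haplogroups.
        return False
    return all(haplotype.get(m) == v for m, v in D_SIGNATURE.items())


# Illustrative inputs only.
print(has_d_signature({"DYS481": 21, "DYS462": 12}, is_r1a1a=True))
print(has_d_signature({"DYS481": 21, "DYS462": 12}, is_r1a1a=False))
print(has_d_signature({"DYS481": 23, "DYS462": 11}, is_r1a1a=True))
```

Returning False for an untested marker is a design choice of this sketch; an actual assignment would still weigh the step distance at the other markers.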
D type is clearly a Polish type: In the Polish Project 8 of those 11 D type
indicate “Poland” ancestry; the
exceptions are 1 blank, 1 obvious Polish family name with no country, and 1
Slovakia. On Ysearch, 5 of the 7 best
fits (with D step <6 on 7 Jan using the original 61 marker modal) indicate
“Poland”, while at steps 6&7 (cutoff 7 for that modal) only 1 of 9
indicates “Poland”. That is a hint of a
non-Polish clade close to the edge of D type, which might be the reason the SBP
for D type on Ysearch is 22%, not as good as that 18% in the Polish Project. Or maybe this is a hint of a larger parent
clade that is not Polish. Four of the
samples just beyond D type are I type, which is a hint these may be brother
types. D type is very young, about
1,000 years TMRCA (very rough estimate result), and seems to be composed of
subtypes Da and Db (not yet statistically significant). D looks like a subtype of K, but see my K Borderline discussion in this regard. For more details, see the “Documentation”
sheet in my analysis file “DType.xls”, at my update
folder. That Documentation explains
my very rough age estimate of 1,000 years.
E. V.
Rudich entered a modal for this cluster into Ysearch as ID mW7DP,
named “North Eurasian”. Mayka modified
it slightly for the modal used here by me, GNYBG,
named “Belarus”. It’s an excellent
type; on 25 May it has 16 samples at 67
markers in the Polish Project, with SBP = 14%.
F. This type was recently
suggested by Mayka.
Only 6 members of the Polish Project belong to F type. This type has DYS452= 28 vs modal 30, but
452 is not one of the FTDNA standard markers,
so not many Polish Project members have this marker evaluated. SBP = 34% without using 452, but when the
data is adjusted for consideration of 452 SBP = 27%, which is quite good for
such a small type. See H type below, which also has 452=28. It’s not clear if these two types are
closely related, or if the 452 mutation independently happened twice. The F definition uses 46 markers.
G.
Pomeranian. This type was
suggested to me by Mayka, who calls it the Pomeranian
cluster. Pomerania is the name of the
region on the south shore of the Baltic Sea including regions of both Germany
and Poland. Marcin
Wozniak found the G modal haplotype (at 12 markers) to be very common among
Kashubians. Kashubians consider
themselves an ethnic group or nationality within Poland. It will be interesting to determine if
Kashubians in Poland have a higher % concentration of G type than German
Pomeranians. Meanwhile, “Pomeranian” is
a convenient neutral name, suggests Mayka.
G type is mentioned only briefly in
my publication because not much data was
available to me at that time. My GType.xls update
analysis file with June 2010 data has excellent results: There are 12 samples in a nice type with SBP
= 11.2%. There is preliminary evidence
of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in
Ysearch; see Haplotypes.xls
for a list including hypothetical working modals.
H. This type was recently
suggested by Mayka.
Only 3 members of the Polish Project belong to H type. This type has DYS452= 28 vs modal 30, but
452 is not one of the FTDNA standard markers,
so not many Polish Project members have this marker evaluated. SBP = 20% without using 452, which is quite
good for such a small type. Using 452
does not improve SBP because of neighboring F type samples which also have
452=28. See F type above. The H
definition uses 60 markers. I consider
this type too rare to justify a Ysearch entry.
I.
Concentrated in Poland. This
type is discussed in my publication, Part
II. About 85% confidence of
validity. About 80% net confidence that
both A and I are subtypes of K.
J. This type was recently suggested by Mayka. Only 6 members in the Polish Project, but
this type is well isolated at SBP= 13%.
K. This
seems to be a main R1a1a type. K type
is discussed at length in my publication, Part
II. It is larger than others in the
Slavic lands. P and N (below) are just
as close in STR values to K as they are to each other,
probably because the K modal haplotype is the same as the R1a1 modal haplotype
(using the best 34 markers for K). So
far I have discerned a few subtypes of K in my List of R1a types,
but I do not have high confidence that they are all exact subtypes of K, as
explained in my K Borderline discussion. I suppose that as data accumulates more subtypes
will become clear within K and K Borderline.
In the Results
I use K* to signify those samples that match type K but do not match one of the
subtypes. Although I have high overall
confidence in the validity of K type, individual assignments to K* are not as
confident. Because K is located at the
modal heart of R1a, I expect some outlier samples from distantly related clades to match K* fairly
closely just due to the statistics of random STR mutations. Because of the possibility of foreign
outliers, I consider samples at K step 3 to be K Borderline, even though the cutoff for the K definition is
4. Even K* samples with step <3 have
confidence of only 80 to 90%. That’s in
Poland, where K is fairly well defined with SBP = 26%. Worldwide K* cannot be discerned with
confidence. The Ysearch SBP for K is
71%, not significant. That means there
are K borderline clades close to the K cutoff that are rare in Poland but
causing interference on Ysearch. This
is evident by a glance at the K type results on Ysearch, where “Poland” origin
is concentrated at steps <3, and “Poland” becomes progressively less common
at higher steps. A
type is a very high confidence subtype of K, so these caveats about K* do
not apply to the very high confidence of individual assignments to A type, and
similarly to the other subtypes.
The Kurgans are the ones
who domesticated the horse more than 6,000 years ago. Many scientists think that one pre-Kurgan man is the male line
ancestor of all R1a1 men who live today.
The Kurgan hypothesis is controversial, and not necessary for this web
page. You may have noticed that I used
the letters of “Kurgan” for my original types and categories during 2008.
L. This
cluster is highly hypothetical. It is
rare in Poland, but second in size to K in European R1a1. Larry Mayka suggested
this cluster to me. It is a well known
Scandinavian cluster. I checked
it briefly, and it seems to be a “type” by my definition. However, no Polish Project sample matches at
80% probability yet, so I am not yet using it for classification here. More documentation about L will be available
here when I find time to study it.
N.
Concentrated in Slavic countries.
This type is discussed in my publication,
Part II. This is a type that according
to Yhrd seems to be spread all around the Slavic lands and
central Europe, from East Germany to Russia.
N has more mutations than P, so that means it is older. Within Poland N seems to be slightly smaller
than P, but overall N is larger than P.
Previous versions of this page had Na and Nb as speculative subtypes,
but I removed those because it seems N type should be properly studied in a database
that is not restricted to Poland.
However, I’ll continue to watch the Polish Project, because it will be
interesting if more data provides a Polish subtype within N.
There are web comments about a new
R1a1 SNP, to be announced shortly. My
guess is that this new SNP might correspond to the cluster of data associated
with what I call N type.
P.
Concentrated in Poland. This
type is discussed at length, in my publication,
Part II. It seems that about 8% of Polish
male line ancestry men belong to this type.
According to Pawlowski, this cluster is
concentrated in Poland. I verify Polish
types using both Yhrd and Ysearch. P has fewer mutations than N and K, so it
must be younger. My TMRCA
age assessment is 1600 years old, but in light of age caveats P type might be
1 to 3 thousand years old. Regardless
of age, P type seems to have had a population expansion less than 1 thousand
years ago. My publication
provides details on the size and age calculations along with evidence regarding
the validity of P type. In my R1a web
document, I used P type as an example for a discussion of the caveats associated
with TMRCA calculations, and also as an example to explain the possibility of hidden clades, and
also as an example for population
bias in databases such as Ysearch, so you can find lots more discussion
about P type by clicking on those links.
I identified P type and submitted my analysis for publication before the
M458 mutation was announced by Underhill.
Pc & Pg. These subclusters have about 70% confidence, so no assignments
yet. Previous versions of this web page
used Pa & Pb & Pe. The new
versions, Pc & Pg, are different, so they got a different subscript letter,
although I have modified the same Ysearch IDs.
I have a Pd and other subtypes that are too speculative to mention at
this time.
R.
Remainder. Updated 2 Jul
2010. This is not a haplogroup or a
type. This is a category for samples
that are distant in STR values from all the R1a1a types I have defined so
far. If you are in this category, I
highly recommend that you get all 67 markers plus the M458 test. More markers will help me define a new type
for you. Your M458 test is unlikely to
come out positive, but if it does that means you would be the first member of a
new type within M458.
I also recommend that you test for
all the several SNPs that FTDNA considers equivalent to R1a1 (called R1a1a by others). Your
unusual STR values make you a candidate for an unusual small clade that has a
very old node in the R1a tree. Each
SNP is unlikely to come out negative.
In fact, all such tests most likely will come out positive. But if one comes out negative that’s
excellent, because you will join a very rare group, perhaps even define a new
haplogroup. If you cannot afford all these
tests, OK, just hope for people with STR values close to yours to do the tests
and watch this web page for your sample to move into a new category.
R
is equivalent to a paragroup. Just like R1a1a* means only R1a1a samples
that are negative for all known SNP subdivisions, my R category extends that to
mean only samples that do not match any of my known types. At 67 markers, R also means that the sample
does not qualify for one of my borderline categories. I have a policy not to use the U category for samples with all 67
markers, so in some cases I need to make a close call on a sample that is on
the edge of a borderline category - some R samples are right at my cutoff at 67
markers.
For a sample with 37 or fewer markers,
I require 80% probability that the sample would not match one of my types if
all 67 markers were obtained. There
used to be quite a few R at 37 markers back when I had only a few types, but
there are none right now (July 2010) because there are none that have STR
markers so unusual that they are far from all types.
The 80% rule
does not apply to R. If a sample has
30% probability of belonging to its best fit type it would be assigned to
R. That means it only has 70% probability
of being a true R. R samples still have
their FTDNA assignment which is either
100% (green) or 99% (red).
When I started this hobby a couple
years ago, R was the 2nd biggest category after U. I now have enough types that R is small.
In June 2010 I subdivided R into two
categories. R (M458-) is those tested
negative for M458. R (needs M458) is
those not tested for that SNP. If an R
sample would test positive it would be moved to the NR category.
U.
Unassigned. Updated 3 Jul
2010. This is not a cluster, but a
holding place for samples with less than 80% probability
for assignment. I use U in the Polish
Project for R1a uncertain samples with less than 67 markers. Samples with all 67 standard markers are not assigned to U, but
instead are assigned to the R (remainder) category, or into “Borderline”
categories such as N Borderline or K Borderline. U is 0% in the Results Table, which is samples with 67 markers,
but considering all samples U is the largest category in the Polish Project,
with 200 members on 25 May 2010 - 15% of the project, 35% of R1a. If you are classified U you can be
promoted to another category by obtaining results for the remainder of the 67
markers.
The 80% rule
does not apply to U. If a sample has
70% probability of belonging to its best fit type it would be assigned to
U. That means it only has 30%
probability of being a true U. Many U
have >30% probability of belonging to two or more different types. U samples still have their FTDNA assignment which is either 100%
(green) or 99% (red).
Probabilities
include estimates, so they are not exact.
I tend to be strict for samples with fewer than 67 markers, using U for
marginal situations. At 67 markers I
do not use U - I use R, and I’m not strict at 67. Also, I concentrate my time on improving the assignment rules at
67 markers and have not yet found time for 37 marker rules for some of the
newer small types.
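The category rules described in the R and U topics can be summarized as a small decision function. The sketch below, in Python, is a simplification under stated assumptions: the function name, the `near_borderline` flag, and the reduction of each sample to a single best-fit probability are all hypothetical, and the separate rule for assigning R at 37 or fewer markers is left out.

```python
MIN_PROBABILITY = 0.80  # minimum estimated probability for a type assignment
FULL_MARKER_SET = 67


def assign_category(best_type, probability, markers, near_borderline=False):
    """Simplified category choice for one R1a sample.

    best_type: code letter of the best fitting type, e.g. "N"
    probability: estimated probability of belonging to best_type
    markers: number of standard STR markers tested
    near_borderline: judgment call that the sample sits near the type edge
    """
    if probability >= MIN_PROBABILITY:
        return best_type
    if markers >= FULL_MARKER_SET:
        # U is not used at 67 markers: Borderline or Remainder instead.
        return f"{best_type} Borderline" if near_borderline else "R"
    # Fewer than 67 markers and under 80%: held as Unassigned.
    return "U"


print(assign_category("P", 0.85, 67))
print(assign_category("N", 0.30, 67))
print(assign_category("N", 0.70, 37))
print(assign_category("K", 0.60, 67, near_borderline=True))
```

The second and third calls mirror the two worked examples in the text: 30% for the best fit type at 67 markers goes to R, and 70% with fewer than 67 markers goes to U.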
R1b Haplogroup Types
On 20 July I added the following
three R1b Types to this web document (next three subtopics,
L23EE, L47P, L47A).
Mayka had
already added these three to the Polish Project web
page during the previous week, based on my
recommendation, based on my SBP analysis.
I independently found these three by
analyzing the Polish Project R1b data, but Mayka pointed out they were
previously known as clusters. We judge that my analysis justifies adding them to our list of
types. Since I’m using 639 samples with
67 marker data as representative of Poland, a small
type clade at 1% of the Polish population would be expected to have roughly 6
samples in the database (70% confidence interval 4 to 10). These three small types are roughly 1% each.
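The arithmetic behind that expectation can be sketched in a few lines of Python. This is only an illustration under a simple binomial assumption (each of the 639 samples independently belongs to the clade with probability 1%). The function names are hypothetical, and the interval from this simple model need not agree exactly with the 4 to 10 quoted above, which may have been derived differently.

```python
from math import comb

def binomial_cdf(k, n, p):
    """Probability of k or fewer clade members among n samples."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def central_interval(n, p, confidence=0.70):
    """Counts at which the cumulative probability first reaches each tail bound."""
    tail = (1 - confidence) / 2
    lower = next(k for k in range(n + 1) if binomial_cdf(k, n, p) >= tail)
    upper = next(k for k in range(n + 1) if binomial_cdf(k, n, p) >= 1 - tail)
    return lower, upper

n_samples = 639    # 67 marker samples taken as representative of Poland
frequency = 0.01   # a small clade at 1% of the population

expected = n_samples * frequency
low, high = central_interval(n_samples, frequency)
print(f"expected count: {expected:.1f}")
print(f"central 70% interval (binomial assumption): {low} to {high}")
```

The point of the sketch is only that a 1% clade is expected to contribute a handful of samples, with a wide spread around that handful.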
I’m following the current ISOGG codes for these types, which may be confusing compared to the current FTDNA codes.
The STR definitions for these are
available at Haplotypes.xls. PolishCladesUpdate has a
link to an Excel analysis file for each of these three types.
Instructions
for Ysearch comparison are below.
Here is the “UserIDs” bar for R1b comparison:
USEID, CX94E, MKM4R, 7HB9C
Change USEID to your User ID.
Reminder: These three types are calibrated to Polish data. The definition modal haplotypes may not be
optimal for other regions. If you have
Polish ancestors, and if you have all 67
markers, and if you match one of these within a step distance of 10 there
is more than 80% probability that you belong to the corresponding clade. Up to step 15 there is lower probability
that you belong. You should test the
appropriate SNPs (explained below) for higher confidence. If your ancestors are not from Eastern
Europe and you are a marginal match (step distance 5 to 15) for one of these,
it is not very probable that you belong to the corresponding Polish clade,
because each of these types has some overlap with other clades that are rare in
Poland.
L23EE. 20 Jul
2010 documentation: This type is positive for
the L23 SNP, hence this type is a hypothetical future
haplogroup within the current haplogroup R1b1b2a. This type is negative for L51, the only current known branch -
R1b1b2a1 - of L23.
Nordtvedt
pointed out the cluster for this type some
years ago, calling it R1b-EE (Eastern Europe). Mayka suggested the L23EE code to me.
There are only 6 samples in the
Polish Project in this type (13 Jul 2010). SBP = 10.7% using all 67 markers, which is excellent for such a
small type. The cutoff is 12, but if you match at step 10 through 12 I
estimate your probability of belonging at slightly better than 80%, so you
really should test for the L51 SNP - a negative result would boost the
probability to about 95%. In the Polish
Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of
the samples from step 17 to 20 are L51+.
So this type is very well isolated in
haplospace in Poland.
On Ysearch (code CX94E) there are also 6 samples in this
type (13 Jul 2010), but 3 are the same as in the Polish Project. There are 7 samples at step 12 (vs zero in
the Polish Project) and only 2 of those 7 are East European - one each in
Germany and Russia. That means this
type is not well isolated world wide, meaning samples near the cutoff are
highly uncertain. I interpret this as
evidence that my definition of L23EE type is really a Polish subtype within a
larger L23EE cluster.
This type has evidence of structure. A number of markers are bimodal with no
obvious correlation. To me, that means
there are probably at least 3 sub-clades that may become evident as data
accumulates.
If you match this type closely at 37 markers I highly
recommend getting the full 67, because the
statistics for assignment are not convincing at 37 markers. Even at 67 markers, I recommend the L51
test; a negative result confirms membership
in this hypothetical clade, and a positive result means you are not a
member. We do not know the probability
of outsiders matching L23EE in STR values, particularly outside Poland, so
there is still a slim chance of a surprise - a close match to the definition
but with L51+.
L47P. 20 Jul 2010 documentation: This type is positive for
the L47 SNP, hence this type is a hypothetical future
haplogroup within the current haplogroup R1b1b2a1a1d1. This type is probably negative for L44, the
only current known branch - R1b1b2a1a1d1a - of L47, but that L44 negative
indication is based on only one sample so far so it is not certain.
Mayka announced the cluster
corresponding to this type on the web in
March 2009.
There are only 4 samples in the
Polish Project in this type (13 Jul 2010). SBP = 9.3% using 64 markers, which is excellent for such a small
type. The cutoff is 7 and the gap is 10.
There are no samples from step 7 to 16.
Although samples in that wide gap are expected as data accumulates, this
type is very well isolated in haplospace in
Polish data.
This type is very robust;
the same 4 samples are selected using any number of markers from 10
to 67 with SBP <25%.
Actually, this type is even better than the SBP = 9.3%
indicates, because some of the samples at step 17 and beyond have tested
negative for the SNPs in the R1b trunk leading to L47 so they clearly do not
belong to this L47P hypothetical clade.
Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but
3 of them are the same as the Polish Project.
Ysearch has 8 samples at steps 8 to 12, so the type is not as well
isolated worldwide.
The “P” in the code L47P represents
my hypothesis that this type is Polish.
Members of this type should test for
L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b,
which is equivalent to P312, an “uncle” haplogroup, that is L47 negative. That means there may be some interference in
STR matching, probably less than 10% in Polish data, but I do not know what the
exact percent interference will be until more data accumulates.
See the last paragraph of L47A, next
topic, for more comments.
L47A. 20 Jul 2010 documentation: This type is positive for
the L47 SNP, hence this type is another hypothetical future
haplogroup within the current haplogroup R1b1b2a1a1d1. I do not know yet if this type is negative
for L44, a known branch of L47.
Mayka suggested
the “A” code, since this type is obviously Ashkenazi, based on family names
(see also Ysearch results, a few paragraphs down). I presume this one is known to the administrators of Jewish DNA
projects, although I did not do the research to find a first web publication at
67 markers; I would appreciate an email
of a reference to add here, even if it does not exactly match my definition. It’s OK if an international modal haplotype
differs by a few markers from a haplotype determined in Poland, particularly if
the difference is at markers that are bimodal, indicating subtype structure.
There are only 5 samples in the
Polish Project in this type (13 Jul 2010). SBP = 7.6% using all 67 markers, which is excellent for such a
small type. The cutoff is 10 and the gap is 9. There are no samples from steps 9 to
18. Although samples in that wide gap
are expected as data accumulates, this type is very well isolated in haplospace.
This type is very robust;
the same 4 samples are selected using any number of markers from 30
to 67 with SBP <10%.
This type is better yet on Ysearch (code 7HB9C), with 18 samples
(13 Jul 2010)
for better statistics; SBP = 4.6%,
which is remarkable. It might be even
better with an optimized definition; I
used the modal haplotype that I extracted from the 4 Polish Project samples.
This
one does not seem as Polish as L47P, although those 18 Ysearch samples are
concentrated in "Greater Poland" including Lithuania.
So far, see ISOGG, L47 and L148 are
the only two known branch haplogroups of L48.
In the Polish Project so far (20 July), no one has tested yet for L148,
and all L48 so far at 67 markers are either L47P (previous topic) or L47A. SNP data is not posted on the web, so I do
not know the frequency (prediction probability) of L48 samples that do not
match either L47P or L47A so belong to yet other clades. I also have not searched the web for the STR
values expected for L148. (There are
two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for
statistical estimation.) All this will
quickly become visible when FTDNA updates their haplotree. As of 20 Jul 2010, L48 is a terminal branch
at FTDNA, so only administrators have visibility of SNP test results beyond
L48, including L47 and L148. Mayka provided the SNP data that I have documented here.
At the end of July I added the
following two I Types to this web document (next two
subtopics, M223CE and M253P).
Mayka added
these two to the Polish Project web page, based on
my recommendation, based on my SBP
analysis.
I independently found these two by
analyzing the Polish Project I data.
They were previously known as clusters,
hypothetical clades, discussed some time ago by Nordtvedt. Since
I’m using 639 samples with 67 marker data as
representative of Poland, a small clade at 1% of the Polish population
would be expected to have roughly 6 samples in the database (70% confidence
interval 4 to 10). These two small
types have only 4 and 5 statistically independent samples, so each is probably
slightly less than 1%.
The
STR definitions for these are available at Haplotypes.xls. PolishCladesUpdate has a
link to an Excel analysis file for each of these two types.
Instructions
for Ysearch comparison are below.
Here is the “UserIDs” bar for I comparison:
USEID, WC8JD, SB6YK
Change USEID to your User ID.
Reminder: These types are calibrated to Polish data. My definition
modal haplotypes may not be optimal for other regions. If you have Polish ancestors, and if you
have all 67 markers, and if you match one
of these (step distance less than 4 for WC8JD - M253P, or
less than 16 for SB9YK - M223CE) I figure there is more than 80% probability that you belong to the corresponding
clade. At higher step there is lower
probability that you belong. You should
test the appropriate SNPs (explained below) for higher probability. If your ancestors are not from the region
around Poland you should follow the links to Nordtvedt’s
worldwide definitions to compare your data.
Comment about validity: Neither of these two types forms a
statistically compelling type on Ysearch, although the
types seem convincing within Poland.
There seem to be clades elsewhere, particularly in western Europe, with
similar STR values, so those others interfere with M253P and M223CE on
Ysearch. I’m using 80% probability in my predictions, and in the case of
these two, I estimate the probability at roughly 10% that either type is not
valid as a unique clade. In addition,
even if both types are valid, men of Polish male ancestry who match at high
step close to the cutoff have up to 10% probability of
being descended from a distantly related clade due to the statistics of STR
mutations (for example a descendant of a recent immigrant from one of those
western clades on Ysearch).
General comments about haplogroup
I: I1 and I2a are each large in Poland. I2b1 is smaller but significant.
I can easily form types for all three haplogroups, and the STR values do
not overlap. Various definitions work
well for those three haplogroups. That
means they are much younger than the ancestral I haplogroup, which experienced
extreme population bottleneck leading to these three haplogroups that are very
well isolated as mountains in haplospace. It appears that only small isolated
populations survived the bottleneck, with an MRCA for each
population very distantly related to the MRCAs for the other populations (I1 vs
I2a vs I2b1 vs a few other small haplogroups in I). There is evidence of additional structure with I1 and I2a -
common STR haplotypes (see Nordtvedt). However, with quite a bit of effort I could
not form more types with low SBP. That
means I1 and I2a grew rapidly in population, so that each is relatively
homogeneous in STR values, at least as represented in the Polish Project, which
is the source of my data for my analysis.
More SNP data is required to further subdivide these two large
haplogroups with high confidence in the Polish Project.
My Haplotypes.xls
file has other haplotypes defining clusters in the I haplogroup that I am
watching as data accumulates, but only the following two are significant enough
to warrant discussion here:
M223CE. On 31 July
I added this Central European type for I2b1 haplogroup to this web page. This type has been known as a cluster for a
few years. Mayka points out that Nordtvedt listed it on the web.
It has been known for some time now that
this cluster is positive for the M223 SNP, which corresponds to I2b1. One of the Polish Project samples in this
type is positive for P95, which is I2b1d.
I hesitate to predict P95 for this type, because individual samples are
roughly 80% probability. It does not
matter if M223CE turns out to be P95+ or P95-.
Either way, P95 will serve as a test for higher probability
predictions. So if you match M223CE you
should test for both the M223 and P95 SNPs to determine your category with high
confidence. I’ll update this topic when
more P95 data is available.
SBP comes out 7.6%
for the 5 samples in M223CE in the Polish Project. However, two of these are a family pair where one encouraged the
other to join, so only 4 of these 5 samples should be used for statistical
purposes. SBP calculated on the basis
of 4 samples is 10.3%, which is still a high confidence result.
A good signature
is (392, 437, 450, 446) = (12, 14, 9, 11), which distinguishes these samples from
all others in the Polish Project, although exceptions are bound to show up in
the future.
This M223CE type does not seem to be
a unique Polish clade because all 4 samples have step values of 16 or more from
each other, even though there is a gap from step 16 to 24 (relative to the 66
marker modal definition), with no samples in the Polish Project, separating
these samples from all others.
This type is equivalent to
Nordtvedt’s I2b1-Continental. A
definition is available on Ysearch, code 4H6C9, using 62 of the 67 standard
markers plus 8 additional markers (on 31 Jul 2010); this definition differs from mine by only 7 mutation steps.
M253P. On 26 July I added this Polish
type for I1 haplogroup to this web page.
This type has been known as a cluster for a few years. Mayka points out that Nordtvedt
listed it on the web. Marek Skarbek
Kozietulski has studied this cluster quite a bit, since he’s a member. I mentioned this type briefly in my publication, where I was previously calling it Y
type, considering it not high confidence based on the
data available last year.
It has been known for some time now
that this cluster is positive for I1 (M253) and negative for the known branches
I1a through I1e. So this is a type
within the paragroup I1*, although not all samples
from I1* are members of this M253P type.
SBP comes out 6.4%
for the 9 samples in M253P in the Polish Project. However, Marek informs me that he had identified 4 men who
matched at 12 markers and actively recruited them to obtain all 67 markers and
to join the Polish Project. That means
only 5 of these 9 samples should be used for statistical purposes. SBP calculated on the basis of 5 samples is
13.6%, which is still a high confidence result.
I used all 9 samples in my analysis
file at PolishCladesUpdate
in order to best estimate the modal haplotype and definition, which are
available at Haplotypes.xls.
The best signature
marker for M253P is 392=12; that marker
alone distinguishes the M253P samples from all other I1 samples in the Polish
Project, although quite a few I2 samples have this value. Exceptions are bound to turn up. That marker alone does not distinguish
worldwide M253P samples on Ysearch.
The best trio signature, (391, 392,
447) = (11, 12, 24) captures 8 of the 9 M253P at step 0. At step 1, (mismatch of 1), that 9th sample
matches, but 1 other sample from I1 and 6 from I2 also match. This demonstrates how short signatures can
be useful but not foolproof. On
Ysearch that trio does not work well.
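Signature matching of this kind can be sketched as follows. The trio values are the ones quoted above; the three samples are invented for illustration and are not Polish Project data, and the function name is hypothetical. Mismatch is counted here as the sum of absolute differences over the signature markers, which is an assumption about how "step" is applied to a short signature.

```python
def signature_mismatch(sample, signature):
    """Sum of absolute differences between a sample and a signature,
    taken over the signature markers only."""
    return sum(abs(sample[marker] - value) for marker, value in signature.items())

# Trio signature for M253P: (391, 392, 447) = (11, 12, 24)
M253P_TRIO = {"391": 11, "392": 12, "447": 24}

# Hypothetical samples, invented for this example
samples = {
    "kit_a": {"391": 11, "392": 12, "447": 24},  # exact match, step 0
    "kit_b": {"391": 10, "392": 12, "447": 24},  # one step away
    "kit_c": {"391": 10, "392": 11, "447": 23},  # three steps away
}

mismatches = {kit: signature_mismatch(values, M253P_TRIO)
              for kit, values in samples.items()}
for kit, steps in mismatches.items():
    print(kit, steps)
```

Allowing a mismatch of 1 picks up kit_b as well as kit_a, which mirrors how loosening a short signature captures more true members but also admits outsiders.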
Nordtvedt uses a broader definition
for his cluster equivalent to M253P, which I understand he calls I1-ASP, for
Anglo Saxon Polish Pomeranian Prussian.
My 54 marker definition at cutoff 4 is narrower, but my definition seems
to capture most of the I1-ASP samples at higher step values. The corresponding SBP at higher cutoff is
not low enough to qualify as a type. The future may provide an SNP defining a
parent clade broader than M253P along these lines, although I doubt the STRs
will score a low SBP value.
Mayka informs me that further M253
work includes testing members for the many SNPs considered equivalent to
M253. If someone shows up negative for
one of those, that will define a new haplogroup in I1.
Here is some interesting speculation
for which I do not have convincing statistical evidence: Marek points out that the sample at step 4
on Ysearch is Danish, which adds to his evidence that there might be a related
clade in Denmark, perhaps with a node in the I1 tree older than the node for
the Polish clade but younger than the node for the parent I1-ASP clade.
Meanwhile, if you match M253P, it is
a good idea to confirm with SNP testing:
you should be positive for M253 and negative for the 4 known branch SNPs
listed at ISOGG.
Instructions
for Use of Ysearch
Link to the site: http://www.ysearch.org. Brief description of Ysearch.
Click on the Create A New User tab,
where you can upload your Y-DNA STR data from a number of testing
services. Or, you can type in your data. You end up with a “User ID”.
Ysearch has a Research Tools tab to
click, where you can type in other User ID’s for comparison.
Cluster Genetic Distance
Method; for: P - Pc - Pg - N - K - A -
I - B - D - E - H - M - G:
Click here: Research
Tools
Copy the following line into the
“UserIDs” bar at the Research Tools page:
USEID, 8U92G, RQK32, 92HEK, 3SEJK, MN8R3, FCUFG, EKVHX, RU8Z8,
K49NZ, GNYBG, 559EE, 24MB4, ZD29Z
Change USEID to your User ID.
You need to type the Captcha puzzle
for access.
Click on “Show genetic distance
report”. You get a table of results.
Result: If there is a small genetic distance result
(3 or less) for one of these types, you have a high probability of belonging to
that type. There are more detailed
rules available, assignment rules above,
followed by several paragraphs of discussion.
Reminder: this web page is for men with R1a1a type Y-DNA. If you are not R1a1a, these instructions
will not produce a matching result, except very rarely, in which case the
result would be meaningless.
The emphasis is on men of Polish male
line ancestry. Just about all R1a Polish
line men are R1a1a. Anyone from the
haplogroup R1a1a from other countries may get good results, but that may be
misleading if there are other types, rare in Poland, not noticed by me, but
with haplotypes that overlap one of these 9 types. Many men of Polish male line ancestry do not match any of these
types. For non-Polish there is a higher
probability of not matching any of these types.
37 Marker Network
Lawrence Mayka
(independently, March 2007) constructed a “median joining network” Network
for the 37 marker samples of the Polish Project. This network supports the definitions of the P & N clusters,
and of the A subcluster. The P cluster
is the left side of Mayka’s network; N
is the top branch, and A is a small branch on the lower right.
29 March 2010 correspondence: I mentioned Russian sites
for R1a clusters in my publication. It’s not easy for me to figure out which of
those clusters correspond to my types. Mayka worked out a correspondence on 29 March, warning me
that the correspondence is not exact.
Some of the Russian clusters are broader than my types; some are narrower. Here are Mayka’s findings:
My Type code vs Russian cluster name:
A Ashkenazi Jewish
B Western Eurasian
C Old European
D Baltic - Carpathian
E Northern Eurasian
F Central Eurasian
G Central Eurasian
H Western Carpathian
I Northern Carpathian
N Central European
P Western Slavic
Here is a summary of terms (in
boldface) that I defined for my “Mountains in Haplospace” method. For more explanation, see the fall issue of JoGG. By haplospace I mean
multidimensional sets of STR values;
each haplotype is a point in haplospace.
A cluster
qualifies as a type if the graph of step frequency (number of samples at
that step) vs step looks like an isolated mountain. The step is the
genetic distance (mutation count) from the modal haplotype of the cluster. I use the method of Ysearch to calculate
step. The cutoff
is the next step just beyond the mountain.
A good type has low step frequency in a “gap” of
step values including the cutoff (only the cutoff for a gap of 1). In other words, the cluster forms a mountain
at step values less than the cutoff, separated by a gap from the rest of the
database (the parent haplogroup usually) at higher step numbers.
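The terms modal haplotype, step, step frequency, cutoff and gap can be illustrated with a short Python sketch. The data and marker names below are invented, and the function names are hypothetical. Step is simplified here to the sum of absolute differences from the modal haplotype; the Ysearch method may treat some markers differently, and that is not reproduced. The gap test is also stricter than the "low step frequency" described above, since it requires the gap to be completely empty.

```python
from collections import Counter

def modal_haplotype(cluster):
    """Most common value at each marker across the cluster samples."""
    markers = cluster[0].keys()
    return {m: Counter(s[m] for s in cluster).most_common(1)[0][0] for m in markers}

def step(sample, modal):
    """Simplified genetic distance from the modal haplotype."""
    return sum(abs(sample[m] - modal[m]) for m in modal)

def step_frequency(database, modal):
    """Number of samples at each step value."""
    return Counter(step(s, modal) for s in database)

def gap_is_empty(frequency, cutoff, gap):
    """True if no samples fall at steps cutoff .. cutoff + gap - 1."""
    return all(frequency.get(k, 0) == 0 for k in range(cutoff, cutoff + gap))

# Invented three-marker data: four cluster samples plus two distant samples
cluster = [
    {"a": 13, "b": 24, "c": 11},
    {"a": 13, "b": 24, "c": 11},
    {"a": 13, "b": 25, "c": 11},
    {"a": 14, "b": 24, "c": 11},
]
database = cluster + [
    {"a": 16, "b": 27, "c": 12},
    {"a": 10, "b": 21, "c": 13},
]

modal = modal_haplotype(cluster)
frequency = step_frequency(database, modal)
for k in sorted(frequency):
    print(f"step {k}: {frequency[k]} sample(s)")
print("gap of 5 beyond cutoff 2 is empty:", gap_is_empty(frequency, cutoff=2, gap=5))
```

In this toy data the "mountain" sits at steps 0 and 1, the cutoff is 2, and the rest of the database does not appear until step 7.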
The Statistical Background Percent
(SBP) is an objective measure of the quality of the
type. Low SBP is taken as evidence that
a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP
(yet to be discovered). Larger types
with lower gaps have lower SBP. SBP is
intended as an estimate of the background percent
of samples in a type that really do not belong to the corresponding
hypothetical clade. SBP is increased to
account for the estimated probability of outliers from other clades. An outlier is a
sample that has very unusual STR values due to the luck of mutations. SBP is also increased to account for the
estimated probability of small foreign clades
that just happen to have the same STR values but are not closely related to the
type. The SBP is also increased to
provide the rough equivalent of the maximum in a confidence
interval. Small sample counts have wide
confidence intervals. So larger types
(more samples) automatically get lower SBP.
For a valid clade, SBP should decrease with time as data accumulates in
a database. A very well isolated clade
will have a low SBP even with only a few samples. SBP < 5% is very rare - a very well isolated type, very likely
to be a clade. SBP < 25% is good
enough to be published. SBP < 50% is
a type worth watching as data accumulates with time. The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look
like mountains. The number of markers
in the definition should be chosen to provide as
small an SBP as possible; my Excel
tools provide automatic rank of markers
as an aid; human judgment can be used
to include or exclude markers with obvious problems. A signature is a small set of
markers that rank best, convenient for publication of a type, and for simple
demonstration of the correlation of STR values.
I use the word “type”
to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes
that differ from the modal haplotype by step less than the cutoff. The definition
of a type is the modal haplotype plus cutoff.
The definition uses only those STR markers that provide the lowest SBP,
but the definition uses as many STR markers as possible. The definition of a valid type may change
slightly as data accumulates.
Here are
some common terms (in boldface) for genetic genealogy. I did not define these, although I use them
in a restricted sense: A marker (also “locus”, plural loci) is a DNA location for an
SNP or STR or other kind of mutation. A
haplotype is a set of gene values at any number of
markers, here restricted to Y-DNA STR values.
I use the word sample (plural samples or data or database) for the Y-DNA STR values from one man. A sample is also commonly called a
haplotype, but I avoid calling a sample a haplotype to make it clear that a
haplotype may or may not be present in a particular database of samples. A clade is a
general term for common descent, so an SNP haplogroup is one kind of
clade. I use the word clade in general,
when meaning a Y-DNA clade that may or may not be a defined official
haplogroup. All types
have associated hypothetical clades, but most clades cannot be isolated as
types with low SBP. A cluster is a set of samples with similar STR values. All types have associated clusters but not
all clusters are associated with types.
The modal value for a marker is the most common
value in the cluster. The modal haplotype is the set of most common values,
usually the most common haplotype in a cluster. Many people use the adjective “modal” as a noun, meaning “modal
haplotype”; so do I; I tried to avoid that in this web document.
Not all Y-DNA STR data separates into
types because the distribution of STR values tends to be continuous. A type corresponds to a clade that
experienced a population
bottleneck - isolation or migration or very rapid population growth.
Probability Minimum
80% for Polish Project R1a Assignments
This topic was updated 5 Jul 2010.
See Polish Project Assignments for a brief
overall explanation of how assignments are done. This topic provides more detailed discussion. This topic focuses on the R1a categories,
but some of this discussion obviously applies to other categories.
Each sample
(individual man) is assigned to a category.
Some categories are known haplogroups. Other than R1a, all categories in the Polish
Project are haplogroups or paragroups. At this time there are only two in R1a.
Haplogroups are defined by SNPs, but not all
haplogroups are supported by FTDNA
assignments, which may cause some confusion.
Some categories are types,
which are hypothetical haplogroups.
Borderline and cluster categories are discussed here a few paragraphs
down. Click on Remainder
and Unassigned for discussion of those two categories
elsewhere.
The assignment guideline is at least
80% probability for each individual sample. Using an 80% minimum, most assignments
are better than 80%, of course. So the
average probability for a category is higher than 80%, and the average varies
by category depending upon how many samples are marginal near 80%.
For haplogroups, “80% probability” means that if a large number of
samples with 80% probability were SNP tested, about 80% of
them would test positive for the haplogroup into which they were
predicted. Probability is determined by
correlating STR values with samples that have been tested
for that SNP.
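One simple way to picture that correlation is to tabulate, for each step value, what fraction of SNP-tested samples came out positive. The sketch below uses invented test results and a hypothetical function name; it is not the calculation in the Assigner file, only an illustration of the idea.

```python
def positive_fraction_by_step(tested):
    """tested: list of (step, snp_positive) pairs for SNP-tested samples.
    Returns the fraction of positive results at each step value."""
    totals, positives = {}, {}
    for step_value, positive in tested:
        totals[step_value] = totals.get(step_value, 0) + 1
        positives[step_value] = positives.get(step_value, 0) + (1 if positive else 0)
    return {s: positives[s] / totals[s] for s in sorted(totals)}

# Invented results: (step from the modal haplotype, tested positive?)
tested = (
    [(0, True)] * 5
    + [(3, True)] * 4 + [(3, False)]
    + [(6, True)] + [(6, False)] * 3
)

fractions = positive_fraction_by_step(tested)
for step_value, fraction in fractions.items():
    print(f"step {step_value}: {fraction:.0%} positive")
```

With these made-up numbers, samples at step 3 would sit right at the 80% guideline, and samples at step 6 would fall well below it.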
Some assignments are 100% probability
- samples with positive SNP test results, assigned to that haplogroup, and not
given an extended assignment. Actually,
there is no such thing as 100% because the genetic test might be in error, but
it seems from experience that testing errors are much lower than 1%.
I arrive at probabilities with a
combination of calculations and educated estimates. This topic is my explanation.
Confidence is a separate topic,
related to probability. Confidence is next, below.
I figure probability as a decreasing
function of step from a modal
haplotype. My assignment rules are
step distances at which I figure 79% probability. If a sample matches the modal haplotype at less than the 79% step
distance, I assign that sample to the corresponding haplogroup or type or other
category. In practice it’s
complicated. I use an Excel file for
assignment. You can view the file at www.gwozdz.org/PolishCladesUpdate/Assigner.xls. That may not be the current version. In that file the “PolishProjectRules” sheet
has the list of rules for human reading - next to the coded logic functions for
Excel. If you are a Polish Project
member you can find your kit number and view your step to each category in the
table - “Modal Calculator” sheet.
The following paragraphs explain how
I figure probability for types. This is not something I proved in my publication, but it seems to
me that my publication makes it reasonable.
I hope you the reader find the following method reasonable. I expect this method will be proven with
time as most of my predicted types are validated.
If a type has 90% probability of
being valid and a particular sample has STR values that match the type with 90%
probability, those two numbers get multiplied for net probability. That particular sample has 81% net
probability of validity, and 19% probability of invalidity. I do not actually calculate this. This paragraph is a conceptual explanation
introducing the explanation in the following paragraphs.
My publication
has detailed discussion of my statistical method for types. Briefly, I use SBP as a
quality measure. SBP is a measure of
the background - the percent of samples that match
the type but really do not belong. For
example if SBP = 15%, that means 15% is a measure of how many samples within
the type (step less than cutoff)
really do not belong to the type. For
this example, a typical sample in the type has 85% probability of really
belonging to the type.
It is not possible to calculate the
probability that a type really is a clade that will be validated some day by an SNP not yet
discovered. Although 100% minus SBP is
not the probability of type validity, 100% minus SBP is closely related to
validity. Certainly a type with high
SBP has low probability of being valid.
Certainly a type with SBP less than 15% has high probability of
validity.
SBP is deliberately calculated on the high side, designed
for a roughly 70% confidence interval, with additional
increase for many statistical reasons explained in my publication. That’s why I call it “Statistical Background
Percent”. This statistical increase is
small for small SBP and larger for larger SBP.
The way SBP is calculated, it goes over 100% for type candidates with
high background; SBP should not be used
over 50%.
The best estimate for background
percent is lower than SBP. However, as
explained a few paragraphs above, the net percent of invalid samples (net
invalidity) is higher in the cluster of a type,
because of the unknown probability that the type itself is invalid as a
whole. It is convenient for me to
assume these two considerations cancel each other. I use SBP as my estimate for the net background percent of
invalid samples in a type.
A sample that matches the modal
haplotype has close to 100% probability of belonging to the corresponding
type. For a type with a high cutoff,
this is true even for a sample a few steps away from the modal haplotype. The reason is that the vast majority of
haplotypes in a type are at the highest step numbers, so that is where most of
the background is. This is explained in
the discussion of Table 1 on page 145 of my publication.
So here is my method: I figure an assignment rule “step < S” to
assign samples, where the samples at step S and greater, equal to about SBP
percent of the type cluster, do not get assigned.
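The rule can be sketched in Python as choosing the limit S whose excluded share of the cluster is closest to SBP. The step values below are invented, the function name is hypothetical, and "closest to SBP" is one reading of "about SBP percent"; the actual rules also involve judgment, as described elsewhere in this topic.

```python
def assignment_step_limit(cluster_steps, sbp_percent):
    """Pick S so that the share of cluster samples at step >= S
    is as close as possible to SBP percent."""
    total = len(cluster_steps)

    def excluded_percent(limit):
        return 100 * sum(1 for s in cluster_steps if s >= limit) / total

    candidates = range(min(cluster_steps), max(cluster_steps) + 2)
    return min(candidates, key=lambda limit: abs(excluded_percent(limit) - sbp_percent))

# Invented steps for a cluster of 20 samples, all below the cutoff
cluster_steps = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]

limit = assignment_step_limit(cluster_steps, sbp_percent=15)
assigned = [s for s in cluster_steps if s < limit]
print(f"assignment rule: step < {limit}")
print(f"assigned {len(assigned)} of {len(cluster_steps)} samples")
```

Here the three samples at steps 6 and 7 make up 15% of the cluster, so the rule becomes "step < 6" and those three are left unassigned.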
This finishes my brief justification
for using SBP as a guide for assignment.
More discussion of details:
There are other calculations in addition
to SBP, for example the haplogroup correlations mentioned above.
Another is the calculation of
correlations for 37 marker rules, which are similar to haplogroup
correlations. Using 67 marker data for
a type, the 37 marker data for those samples provide probabilities that other
samples with only 37 markers belong to this same type.
After I do a particular calculation
many times, I feel confident glancing at new data and making quick estimates
for new rules if the number of samples does not justify detailed calculation.
Let me repeat what I said above: I arrive at probabilities (assignment rules
are 80% estimated minimum probability) with a combination of calculations and
educated estimates.
Mayka, who does
the assignments other than R1a, does not use my calculation methods, but
insofar as he uses his experience to judge STR correlations, he is really
performing estimated correlation calculations.
When a probability is judged close to
the 80% minimum for assignment based on
STR correlations (step close to the rule limit), there are a number of
additional factors that can be considered.
The following paragraphs are examples.
More examples are in my publication.
Mayka uses similar considerations for assignments:
Geographic concentration. P type is an example. P type is concentrated in Poland. I considered P type as more likely valid
because it is geographically concentrated, before it was validated by an
SNP. Back then I considered a Polish
family name associated with a sample as marginal additional evidence of
belonging to P type. Today that
consideration applies to a sample that marginally matches the P type haplogroup
with STR values but has not been measured for the L260 SNP.
Ethnicity. For example, there are a number of haplotypes known to be common
among Jews, so a Jewish name associated with a sample is marginal additional
evidence that the sample belongs to a corresponding haplogroup or type.
Stragglers. We tend to avoid categories for only one or a few samples, so if
one or two samples have 70% probability as a best estimate it makes sense to
adjust the rule a little looser so that the rule picks up those few samples
that do not quite fit, rather than create a borderline category (discussion
below). Conversely, it makes sense to
be a bit stricter for type assignments if a borderline category is available.
67 markers. We are marginally more liberal with assignments using the full 67
markers and marginally stricter for samples with fewer, because those with
fewer can get more accurate assignments by procuring the remaining markers.
Men with closely matching STR values should
be classified together, particularly if the family name is the same.
We avoid changing assignment rules
too often, so some assignment rules may remain in place for a while even after
new data has provided slightly better rules.
For a valid type SBP comes down as
data accumulates, with better statistics.
I avoid introducing a new small type with SBP above 25%, because I
expect it to improve with time.
Technically, SBP = 40% means 60% of the samples can be introduced as a
new type category, but I prefer to wait a few months for more data, so that a
new type is substantial at introduction.
We do not wish to be dismissed by
others with experience evaluating STR data.
On the other hand, we do not wish to have others point out that samples
are being left without obvious assignment.
I suppose the goal should be that the number of people complaining that
assignments are too liberal turns out to be about equal to the number of people
complaining that assignments are too conservative (people with experience
evaluating STR data who have read and understood my documentation here).
A person who assigns samples to hypothetical haplogroups based on STR
values acts like a bookie who provides advance estimates for gambling odds,
using a combination of calculations, educated guesses, and intuition. A bookie’s estimates are usually tested by
reality very quickly. Probabilities of
an STR estimator may not be verified or falsified by a new SNP for years. You need to be skeptical of STR based
predictions. In the past, a number of
STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you
can read as much as you wish about our (Mayka’s and my) methods, judging for yourself the reliability of
our probability estimates.
I use Borderline
categories when a significant number of samples have 50% to 79% probability for
one haplogroup or type, and have less than 20% probability for any other
haplogroup or type that I know of.
These fit too well to go into a Remainder
category. The 80% rule does not apply
to samples in a borderline category, although it would be correct to say those
samples have minimum 80% probability of belonging in that borderline category
as opposed to another type or another borderline category. Borderline samples still have their FTDNA assignment which is either 100%
(green) or 99% (red).
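The thresholds in the preceding paragraphs (80% for assignment, 50% to 79% with no rival at 20% or more for borderline) can be written as a small decision function. This is only a sketch: the function name, the dictionary input, and the category labels are assumptions for illustration, and it ignores the requirement that a borderline category is only created when a significant number of samples qualify.

```python
def categorize(probabilities, assign_min=0.80, border_min=0.50, rival_max=0.20):
    """Classify one sample from its estimated probabilities per type.

    probabilities: dict mapping a type name to an estimated probability
    (hypothetical input; in practice these come from calculations plus
    educated estimates).
    Returns a (category, type_name) pair.
    """
    if not probabilities:
        return ("Remainder", None)
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    best_type, best_p = ranked[0]
    # Probability of the strongest competing type, if there is one.
    rival_p = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_p >= assign_min:
        return ("Assigned", best_type)
    if best_p >= border_min and rival_p < rival_max:
        return ("Borderline", best_type)
    return ("Remainder", None)
```

For example, a sample estimated at 65% for one type and 10% for its nearest rival would land in the borderline category for that type, while 55% against a 30% rival would fall to the Remainder.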
The Polish Project only rarely uses
a cluster category, which is similar to a
borderline category, except the category itself has less than 80% probability
of corresponding to a valid clade, so the 80% rule does not apply.
Confidence and Probability; More Statistical Comments
{This
entire topic needs rewrite. This is an
old version. I moved the probability
discussion to a new topic, above. Much
of this topic is OK as is for explanation of “confidence”, but most is
redundant. Watch this space for a
rewrite.}
See the Assignments
topic for a general discussion of Polish Project assignments. This topic explains how I figure percent
confidence for assignments of individual samples (men), for the R1a section at the Polish
Project. My publication
explains my statistical methods. There
is a summary of my mountain method above.
Confidence interval example: Confidence is not the same as probability. By 80% confidence I mean 80% is the lower
number of the 80% confidence interval.
For example, 80% confidence might mean that the actual probability is
90% but the 80% confidence interval is 80% to 96%. As an example, consider a situation where 10 samples match a type with an STR test. Suppose there is a definitive SNP
test available, and 9 of those 10 samples test positive for the SNP, and 1
negative. That means 9 of the 10 really
belong to the haplogroup and that 1 mismatch must
come from a different haplogroup that matched the STRs by the luck of
mutations. Next, consider a new sample
that matches that same STR test. What is
the confidence that the new sample will pass the SNP test for the
haplogroup? The probability is 90%
because we know that 9 out of 10 previous samples like this matched the
SNP. However, 1 out of 10 is a very
small sample. As explained in my publication, I use Poisson statistics for quick
calculation of confidence interval.
Poisson statistics is simple to calculate in Excel. My tool Type.xls has an
“SBP” sheet with a set of cells for quick Poisson calculations.
80% confidence interval of 1 is 0.11
to 3.89, which is 1.1% to 38.9% out of 10, so subtracting from 100%, the 80%
confidence interval of a match comparing to 9 out of 10 is 61.1% to 98.9%; that lower number 61.1% means the 80% confidence
interval ranges to lower than 80%, so net confidence is lower than 80%.
70% confidence interval of 1 is 0.16
to 3.37, which is 1.6% to 33.7%, lower number 66.3%; net confidence lower than 70%.
60% confidence interval of 1 is 0.22
to 2.99, lower number 70.1%; confidence higher than 60%.
67.3% confidence interval of 1 is
0.18 to 3.26, lower number 67.4%. So
that’s my one number: 67% confidence.
In other words, if 9 out of 10
samples that match an STR test also match the SNP test, we have 67% confidence a
particular future sample matching the STR test will also match the SNP test.
For 18 out of 20, the probability is
still 90%, but a similar calculation shows 75% confidence.
For 36 out of 40, the probability is
still 90%, but a similar calculation shows 80% to 96% confidence interval, net
80% confidence, which is my example that I started with above. These calculations actually take less than a
minute using my Excel cells.
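For readers without Excel, the calculation above can be sketched in Python. This sketch assumes the standard exact two-sided Poisson limits (solved here by bisection with only the standard library); those limits reproduce the intervals quoted above, but the cells in Type.xls may be computed differently. The function names are made up for this illustration. “Net confidence” is taken to be the confidence level at which the lower end of the interval for the success rate equals the level itself.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def _bisect(f, lo, hi, iters=100):
    """Root of a monotonic function f on [lo, hi] by bisection."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_interval(k, level):
    """Exact two-sided interval for the Poisson mean given k observed counts."""
    tail = (1 - level) / 2
    # Lower limit: the mean at which k or more counts has probability `tail`.
    if k == 0:
        lower = 0.0
    else:
        lower = _bisect(lambda mu: (1 - poisson_cdf(k - 1, mu)) - tail, 0.0, float(k))
    # Upper limit: the mean at which k or fewer counts has probability `tail`.
    upper = _bisect(lambda mu: poisson_cdf(k, mu) - tail, float(k), k + 100.0)
    return lower, upper


def net_confidence(failures, n):
    """Level c at which the lower bound on the success rate equals c.

    failures: number of STR matches that failed the SNP test.
    n: total number of STR matches tested.
    """
    def gap(level):
        _, upper = poisson_interval(failures, level)
        return (1 - upper / n) - level

    return _bisect(gap, 0.0, 0.999999)
```

Calling `poisson_interval(1, 0.80)` gives roughly 0.11 to 3.89, and `net_confidence(1, 10)`, `net_confidence(2, 20)` and `net_confidence(4, 40)` come out near 67%, 75% and 80%, matching the three examples above.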
Statistical Background Percent: SBP. I use SBP as a net confidence estimate for
the background (samples that match the STR values but
really do not belong to the clade of a type). My publication does not go into the details
of confidence intervals. That is the
purpose of the explanation here in this topic.
SBP is my estimate for the net statistical confidence before any SNP has
been discovered to validate a hypothetical type. 100% minus SBP is my estimated confidence that a sample in the mountain cluster belongs to the
corresponding hypothetical clade.
A mountain cluster corresponding to a
type might include outliers from other clades, or might
include foreign clades. These and other caveats associated with STR prediction are
discussed in detail in my publication, where I
point out that the confidence for all such caveats cannot be calculated. I estimate the background by using the low
frequency of samples in the gap as representative of the
background throughout the haplospace
neighborhood. My SBP formula (available
in the tools) includes an increase in SBP to account for
all such caveats.
Part I of my publication
explains: “Much of the background is
probably at the last step of the mountain, just before
the cutoff. Much of the remainder is
probably at the previous step, much of the remainder after that at the previous
step, etc.” My Part I Table 2 justifies
this by demonstrating how the number of possible haplotypes increases very
rapidly with step. In other words, SBP
is a good worst case overall estimate of background percent within a type, but
background percent is very low at step zero and increases rapidly with step. My publication does not provide a formula
for background vs step, and in fact I have not derived a formula. For assignment of samples, I estimate the
confidence vs step so that confidence decreases rapidly near
the last step and the overall confidence comes out roughly equal to 100%
minus SBP. My confidence at step zero is my rough
estimate that the type is a valid clade, since the step zero samples belong to
the clade with very high probability if the type is valid.
Some outliers from the type
statistically fall within or even beyond the gap, so confidence is not zero at
the cutoff.
Confidence also depends upon the size
of the gap. A wide gap with zero
samples means even samples in the gap near the mountain have reasonable
confidence percent.
Estimates vs Calculations vs Adjustments: A person who assigns samples to hypothetical clades based on
STR values acts like a bookie who provides advance estimates for gambling odds,
using a combination of calculations, educated guesses, and intuition. A bookie’s estimates are usually tested by
reality very quickly. Probabilities of
an STR estimator may not be verified or falsified by a new SNP for years. You need to be skeptical of STR based
predictions. In the past, a number of
STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you
can read as much as you wish about my methods, judging for yourself the
reliability of my estimates and net probabilities.
The first confidence interval example
above, confidence of STR predictions calibrated to SNP data, can be pure
statistical calculation without any estimates.
However, judgment is involved.
Even such SNP predictions should be split into parts based on the step
value of the samples within a type.
However, if split down to individual steps, the statistics are very poor
due to small sample size, so steps are best combined in batches. For the first data from a new SNP it is
necessary to combine all the steps, so the predictions benefit from an
estimated confidence by step. So the
judgments and calculations can get quite complicated, and often I just estimate
the confidence from experience rather than do the calculations every day as
data comes in.
I avoid changing assignment rules
often, so some assignment rules remain in place even after new data has provided
better rules.
My standard is 80% confidence, but I
avoid introducing a new type until the confidence is a bit higher, because a
new 80% confidence type would provide only a few samples at step zero on the
day when enough data has accumulated.
After waiting for more data, I tend to bend the guidelines a bit below
80% confidence in order to introduce more samples with a new type. Also, if I notice an individual coming out
at 75% when I’m updating rules I’ll tweak the rule to include him.
I tend to be generous in estimates
for samples with all 67 markers, and I tend to be conservative with samples
having fewer than 67. I update the
rules more often at 67. After all,
samples with fewer than 67 markers can get much better confidence by ordering
more markers, and 67 is the most available as a standard commercial test.
I do not look forward to a man
feeling slighted when he is not assigned to a type that is a reasonable fit to
his STR data. On the other hand, I do
not wish to be dismissed by others with experience evaluating STR data, so I
try to be conservative in my probability estimates that particular clades in
fact exist. I will have achieved my
goal if the number of people complaining that I assign too liberally turns out
to be somewhat greater than the number of people complaining that I am too
conservative (people who have read and understood my documentation).
Naturally, my confidence changes from
month to month as more M458 and STR data accumulates, for better statistics.
Assignments at fewer than 67 markers: There are two ways. First way: some
types have low SBP and seem 80% valid using 37 or only 25 markers, at least for
samples at low step, so samples can be directly assigned.
Second way: I check for correlation using the samples with 67 markers to see
what percent of samples at a given genetic distance using fewer markers end up
in the corresponding type at 67 markers.
The confidence of a sample at fewer markers is that percent multiplied
by the corresponding confidence at 67 markers.
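The second way amounts to one division and one multiplication. Here is a minimal sketch; the function name, argument names, and the example numbers are hypothetical and only illustrate the arithmetic.

```python
def confidence_at_fewer_markers(in_type_at_67, matched_at_fewer, confidence_at_67):
    """Confidence for a sample that has fewer than 67 markers.

    matched_at_fewer: number of 67-marker samples that, judged only on the
        reduced marker set, sit at the given genetic distance from the type.
    in_type_at_67: how many of those end up in the type using all 67 markers.
    confidence_at_67: confidence (0 to 1) of the type assignment at 67 markers.
    """
    if matched_at_fewer <= 0:
        raise ValueError("need at least one matching 67-marker sample")
    correlation = in_type_at_67 / matched_at_fewer
    return correlation * confidence_at_67


# Made-up example: 18 of 20 samples carry over, and the 67-marker
# confidence is 85%, giving 0.9 * 0.85 for the reduced marker set.
example = confidence_at_fewer_markers(18, 20, 0.85)
```

In this made-up example the result is a little under 77%, so the sample would fall just short of the 80% standard on the reduced marker set alone.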
This topic updated 22 Jun 2010.
I look
forward to the discovery of SNPs validating more than 80%, probably more than
90%, of my R1a Polish Project type assignments.
I introduced P, N, and K types in the Fall of 2007, publishing this web page 6 Dec of
that year. I did not predict that P and
N were brother clades; in fact it looked to me like P was
closer to K. I did not make predictions
about the P, N, K structure because the statistics did not justify such
predictions. I assigned samples to P
and N with 80% probability, remarking that my
overall confidence that P and N were valid
(confidence at step zero) was 95% in 2008. I stated my overall confidence in the
subtypes of K type as only 80%, but again my confidence in K type at step zero
was (and still is) 95%.
P type has been validated as R1a1a7b,
defined by the SNP L260.
N type is very
close to the same as R1a1a7*, the paragroup
defined by the SNP M458 minus L260. This is not exactly a validation, because
there is a low percentage of M458 samples (2 so far
at 67 markers) that seem to be older than N type, which implies that a future
SNP, younger than M458, may be discovered as equivalent to N type. In previous versions of this document, I
explained: “A new SNP marker may not
fall at the node defining a type.” A
new SNP might be younger, including mostly the samples with low step from the
corresponding type. A new SNP might be
older, including the corresponding type plus some samples with step beyond the cutoff for the type.
K type is not
validated yet.
In Fall 2007 I also introduced R (Remainder) as the 4th division of Polish R1a, for
those samples that do not fit P, N, or K.
K type plus the R category are equivalent to R1a1a* (M17). The R1a table assigns new types to either K
or R. In the detailed discussion of the
types I discuss which types I assign with: (a) high confidence that they are subtypes of K; (b) high
confidence that they are not subtypes of K, so they surely go into R; and (c) lower confidence
of assignment to K or R, so they are assigned with a best guess. A new SNP for K type might include a few of
these subtypes, and may include some of R, depending upon the age of such a new
SNP.
My publications
have several references of general interest and relevance to my web documents.
My Tools and
data for STR analysis are Excel files.
These are available at the JoGG publication site as Supplementary
Data: www.jogg.info/52/files/cpcindex.htm.
Polish
Clades Update. This folder is for
update of Tools and for new data: www.gwozdz.org/PolishCladesUpdate
Pawlowski
(2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish). This reference is listed in my
publications. I specifically mention it
here because this is where I originally found the common Polish haplotype that I now call P type. Link to English abstract: Pawlowski
2002.
Lawrence Mayka
is the Administrator of the Polish Project. Larry helped me to get started when I was
new to genetic genealogy, providing helpful criticism & suggestions. He reviewed & approved my 80%
probability rule for assignments on the Polish Project web page. He also reviewed the original drafts of my
publications. A number of my types were originally suggested to me as STR clusters by
Larry. Larry continues to provide data
for this web page. Many of my
references to other websites in this document were suggested to me by Larry.
Cyndi Rutledge
is the administrator of the R1a Project. Larry and Cyndi send me M458 test results,
which are not listed on the web.
Anatole Klyosov
published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of
publications. Some of the STR types
that I independently discovered I later found as 25 marker modal haplotypes in
Klyosov’s web documents (before his
publication in JoGG - some in Russian).
It was encouraging to me seeing independent identification of clusters
by different methods. He emailed to me
an English version of one of his 2008 publications. His Fall JoGG articles have references to his other
publications. Here is a web link: Klyosov Home.
Russian
web sites: http://www.rodstvo.ru; http://dnatree.ru/; http://molgen.org/. These have been active analyzing R1a,
brought to my attention by others, particularly by Mayka,
who worked out a correlation with my types. These sites clearly have proposed
subdivisions of R1a based on STR data, but I cannot quickly understand these
due to the language barrier. Klyosov seems to be active at these sites. The sites make use of the FTDNA
projects and Ysearch.
Kenneth Nordtvedt
published an article about calculating TMRCA in the Fall 2008 issue of JoGG. His Excel files of data and tools are available at his web site. Ken has been active in web discussions,
suggesting many STR based clusters.
FTDNA
link: www.familytreedna.com. This is a commercial DNA testing
company. I make extensive use of the
project databases maintained by FTDNA.
These are my primary sources of data.
Click on the “Projects” tab at the home page to look for projects. Also, the project name can be substituted
for /polish/ in the following URL.
Polish
Project link: www.familytreedna.com/public/polish. One of many FTDNA projects. This is my primary source for Polish
data. The Polish Project tracks both
Y-DNA and mtDNA; click on “Y-DNA Results”
on the left to see the data that I use.
R1a Project
link: www.familytreedna.com/public/R1aY-Haplogroup. Another source.
Ysearch
link: www.ysearch.org. Ysearch is the largest web database for Y-DNA,
run by FTDNA, open to all men, including men who also register with projects
and including men with data from other testing services. I use Ysearch often for analysis so of
course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your
data with Ysearch. Or you can type your Y-STR data into Ysearch. I am not associated with the company
FTDNA. I have Instructions for comparing your STR data to
my types (modal haplotypes)
that I have entered into Ysearch.
Yhrd
link: www.yhrd.org. A forensic Y-DNA database. Data are separated by city, with many Polish
cities. I relied on Yhrd to figure out
the geography of the various haplotypes.
I wrote Yhrd
Reminders for myself so that I won’t forget how to navigate the Yhrd web
site; click on that link if you need
some hints.
Sorenson
link: http://www.smgf.org/. Another DNA testing company.
ISOGG link: http://isogg.org/tree/ Y-DNA tree with the most recent SNPs and
corresponding alphanumeric codes.
Peter Gwozdz
I’m a very rare type in Poland -
E1b1b1a2. My maternal 1st cousins are R1a1a. That means my late maternal grandfather was
R1a1a. I became interested in Y-DNA in
2004. My maternal family name is
Iwanowicz. I discovered a family with
that name in my maternal grandfather’s home town in Poland. They are the only Iwanowicz family within 50
miles, so I was suspicious they might be my 3rd or 4th cousins. I brought a cheek swab kit when I visited
them the second time in 2006. Sure
enough, the son is a perfect 25 STR marker match to my 1st cousin. I didn’t get around to checking the web for
a year. I was shocked to discover that
these maternal cousins matched 80 people in the FTDNA database,
for a perfect 12 out of 12 STR markers.
That’s a hell of a lot of matches in the summer of 2007. Most of these matches are Polish. I did some research and found an article by Pawlowski (reference in my
publication) about this most common Polish haplotype, which I now call P type. That got me
interested in doing more research, leading to this web page for others to see
my results. My experience, however, is
a reminder that statistics can be misleading.
I was confident that my grandfather’s haplotype was P type, based on a
perfect match at the first 12 markers.
I now (June 2010) figure that the probability was really about 93%,
because 13 out of the 14 current Polish Project members who have 67 markers and
who also match P type perfectly at 12 markers are in fact P type as judged by
all 67 markers. My grandfather does not
match P type at 67 markers. My
grandfather is that 14th one. He
matches the small hypothetical clade that I call I type,
which is also concentrated in Poland.
But my confidence in that I type assignment is only 80%, so maybe
statistics is fooling me again. That’s
how an outsider ended up studying P type and R1a1a, and writing web pages and
articles about common Polish Y-DNA
clades.
Revision History
2007 Dec 6  First web posting of this file
2007 Dec  Two revisions
2008  8 revisions
2009  33 revisions
2010 Jan - May  12 revisions
2010 Jun 1  finished update of L260 & M458 tests, with discussion
2010 Jun 6  update L260 & discussion
2010 Jun 10  G type documentation; modal haplotype update; Ysearch modal comparison update
2010 Jun 12  EFHJ types documentation plus edit comments on other types
2010 Jun 13  update general rewrite is in process; not finished
2010 Jun 18  more general rewrite
2010 Jun 22  continue rewrite, including update of the R1a Results Table
2010 Jul 5  rewrite “Assignments” topic, new “80% Probability” topic
2010 Jul 6  update SNP analysis - 3 topics
2010 Jul 20  add 3 R1b types
2010 Jul 21  add 3 R1b modal haplotype references & Ysearch codes
2010 Jul 22  drop “R1a” from the title, and a few rewording changes
2010 Jul 26  add an I1 Type, M253P
2010 Jul 31  add an I2b1 Type, M223CE
2010 Aug 1  edit the I haplogroup discussion
2010 Aug 9  update the SNP analysis
2010 Aug 14  update news: one M458- at step 7 from P