Polish Y-DNA Clades

14-Aug-10

Peter Gwozdz

pete2g2@comcast.net

 

           If this is your first time here, consider jumping down to the Abstract for a summary of this web document.

           My methods and results have been published.

           The Polish Project has assignments of men to types as a subdivision of haplogroup R1a, which is a category of Y-DNA.  The Results Table has a summary of the R1a assignments.  If you know your assignment you can click on the link in the right column of the Results Table to read more about your assignment category.

           Lawrence Mayka, administrator of the Polish Project, provides data for this web site of mine.  This web document is for explanation, details, and update news.

 

R1b & I1 Haplogroup News

 

           On 20 July I added three R1b Types to the analysis discussion in this web document.

           On 31 July I added two I Types to the analysis discussion in this web document.

 

L260 and M458 News

 

           This news topic was updated 9 Aug 2010.

           SNP results continue to validate P type and N type.

           The SNP called L260 is equivalent to what I have been calling P type.  (Types are defined by STR values.)  This is preliminary news, based on the first 43 results available to me.  Of 30 positive samples (L260+ results), all are predicted P type based on STR values.  Of 13 negative samples (L260- results), none are predicted P type based on STR values.  All L260+ are also M458+, except eight L260+ not tested for M458.  41 of these 43 have all 67 standard STR markers.

           P type and N type samples are coming out M458+.  So far there are no M458- samples that fit P type or N type.  This is based on 96 M458- test results.  84 of the 96 have all 67 standard STR markers, and none are predicted P type or N type - not even borderline.  This is consistent with my previous analysis, indicating that both P type and N type seem to be clades that resulted from severe population bottlenecks followed by relatively recent rapid population expansions.  P type and N type cannot be completely alone in haplospace, so eventually a few M458- are bound to show up that fit P type or N type based on close STR match;  that will provide a measure of the background from other clades.

           Two samples M458+ do not match either P type or N type at 67 markers.  Not even borderline.  In addition, the borderline M458+ and L260+ (just beyond the types in STR values) are more than statistically expected.  These might be representatives of small clades that branched off the M458 haplogroup tree before the main population expansions that led to P and N.

           See L260 and M458 Test Results for more discussion about the data available.

           Because P type and N type are so well isolated, these two clades are good examples of the caveats for age estimation.  Traditional models of STR mutation diffusion do not apply to the M458 haplogroup.  More discussion is in my age caveats topic.  See my age of R1a1a7 topic for very rough estimated ages.

           News update 14 Aug:  One sample just came out M458- that fits P Borderline, 7 STR mutation steps from the P type definition.  P type has cutoff 5 and gap 3 (steps 5 to 7), so step 7 is the last step of the gap separating P type from the remainder of haplospace.  This is the closest M458- to P type I have seen;  a couple of previous M458- results were at step 8.  No doubt a closer M458- will be seen as data accumulates over the next few months.  This result reminds us that the border of a type cannot be predicted exactly.  It’s amazing that up to now all samples with P step less than 8 have been coming out M458+ and L260+.  This particular M458- sample is step 9 from N type, one step beyond the N type gap;  3 other M458- had been seen at step 9.

 

Polish Project R1a Assignment News

 

           This topic was updated 5 Jul 2010.

           If you are R1a but not a Polish Project member, the Ysearch instructions topic has my method for matching to my types.  The news in this topic applies to you if you know your assignment.

           If you are P type or N type you would likely come out positive in the SNP test for M458 (M458+).  If you are P type you are likely L260+.  N type is likely L260-.  If you have not already tested you can pay the small fee for these SNP tests to confirm that you belong to the corresponding haplogroup.

           If you are assigned to P borderline or to N borderline you would benefit more from the M458 and L260 tests, because that would provide for you a definite assignment within R1a.

           The assignment rules are done with high probability, so if you are unassigned (category U) there is a low probability that you would test positive for M458, with probability that decreases with your step (genetic mutation distance) from P or N.

           If you have fewer than the standard 67 STR markers it is generally better to purchase the remaining markers.  That way, you are more likely to get an assignment, because the statistics for STRs improve with more markers.  Nevertheless, if you are not many steps from P or N you might consider doing the M458 test.

           There is a slight chance that  you might test positive for L260 or M458 even if you do not match P or N.  The haplogroup corresponding to M458 is old enough that there may be small clades with STR markers very different than P or N.  I have not seen one yet, but there is no way to estimate this probability.  I hesitate to recommend the M458 SNP test for men whose samples are distant from both P and N in STR values.  I admit you can just wait to see if anyone with STR values similar to yours matches an SNP, then test for that SNP.  However, we all benefit when some men test for all the new SNPs within an established haplogroup, because that way we find out the size and rough age of the corresponding new haplogroup branches.  FTDNA offers “deep clade” test packages to test for all possible haplogroup branches, but my understanding is that L260 and M458 are not yet included in the R1a deep clade test.  You need to purchase them separately from the advanced markers menu.  No doubt FTDNA will add them soon to the deep clade package.

 

Fall Issue of JoGG

 

           The Fall issue of the Journal of Genetic Genealogy came out on 21 Nov.  My publication is split into two parts there:

           Part I is my “mountains in haplospace” method for evidence that certain “types” of STR clusters correspond to clades.

           Part II is the application of that method to Common Polish Clades.  That article has a lot more detail than this web page, but that article was last updated in September, so this web page is an update.

           PolishCladesUpdate is my folder for future updates to those two articles.

           This web page will continue as an introduction and summary, without as much jargon and detail as the articles and update folder.

 

R1a Worldwide

 

           Wikipedia has a nice R1a entry, with Andrew Lancaster as the primary contributor.

 

R1a1a Confusion

 

           18 June 2010 confusion comment:

           In 2008, new SNPs were discovered to better define R1a.  The prior R1a1 (M17, M198) is now called R1a1a (M17, M198).  FTDNA still assigns samples using the older (early 2008) nomenclature.  There is a draft Y tree at the FTDNA site, so no doubt they will update soon.  ISOGG has even more detail, with the latest SNP discoveries.  As I rewrite topics here, I use the ISOGG names for discussion purposes, but I still mention the FTDNA assignments, which use the old nomenclature.

           The old R1a1a (M56) is very rare, not discussed here.  It is now called R1a1a1 (M56).

           More than 97% of R1a men are R1a1a (M17, M198), but FTDNA (and Ysearch and projects) is conservative about predictions, so many are listed as R1a if they have not tested for recent SNPs.  All others are currently listed as R1a1 (old nomenclature M17, M198).

 

L260 and M458

 

           22 June 2010; this topic is still current:

           L260 is a new SNP, not published yet.  It has been available as an SNP test since early April at FTDNA. 

           M458 is a new SNP, published by Underhill.  It has been available as an SNP test since early November at FTDNA.

           FTDNA has not yet assigned haplogroup names to these, so men who test positive are not reported on-line yet at FTDNA nor at Ysearch, nor at the projects supported by FTDNA, which include the Polish Project.

           Both L260 and M458 are listed at ISOGG, where M458 is called R1a1a7 and L260 is called R1a1a7b.  In the draft Y tree at the FTDNA site M458 is called R1a1a7, but L260 is not yet assigned a name there.

 

R1a Subdivision

 

           22 June 2010 update:

           Almost all of R1a divides into R1a1a* (M17, M198), R1a1a7 (M458), and R1a1a7b (L260).  These correspond to my original predicted division.

           R1a also has several known rare groups:  R1a*, R1a1*, R1a1aN, where N = 1 to 6 and 8.  There is also a very rare R1a1a7a.  The asterisk is used for paragroups;  R1a1a* means haplogroup R1a1a without any of those 8 known branches.

           The rare R1a groups are not in my R1a Table.  It’s a shame the corresponding STRs are generally not published in SNP announcements.  I don’t know if the rare groups all together add up to 0.1% or 1% of R1a.  Surely they are less than 3%.  My percentage calculations in my R1a Table do not need adjustment because any Ysearch samples that might belong to these rare clades would probably have unusual STR values, not falling into one of my types, but still be counted in the totals. In my R1a Table, rare samples are included in row R.  That row R might have a few percent from these rare groups, but I don’t know exactly how many.

           Underhill mentions 7 samples (men) from R1a*, 9 from R1a1*, 14 from R1a1a6, and 1 from R1a1a7a.

           Lawrence Mayka, the administrator of the Polish Project, had been assuring me by email that all the Polish Project member tests within R1a had been coming out negative for all the rare SNP subgroups.  So if you are a Polish R1a, you are almost surely R1a1a, the same haplogroup as about half the men from Poland.  About half of these - about 1/4 of men from Poland - are R1a1a7.  These two “about” estimates are approximate;  my data on these SNPs are not random samples, so my population estimates are derived from the types in my table, which are STR based.

           On 17 June Mayka informed me of the first R1a1* (SRY10831.2) (R1a* in the older nomenclature) member in the Polish Project.  My table does not show this single exception because the table is for samples with 67 markers, which that one exception does not have.  On 19 June Mayka informed me of evidence that C type might define a new rare subdivision of R1a slightly older than R1a1a;  if this turns out correct it will be less than 1% of R1a.

 

Underhill

 

           An article was published online, 4 Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.

                       Abstract                     STR Data                   See www.gwozdz.org/R1a.html for more discussion

           I call this article “Underhill” for short, because his is the lead name in the list of 34 authors for this major work.

           This web page about Polish Clades was completely rewritten using this new information.  Recent L260 and M458 test results are consistent with (albeit not full proof of) my previous R1a subdivision into “types” here on this web page about Polish Clades.

           Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-).  See R1a Subdivision for a brief summary of other groups, and for a clarification of what R1a1a* means.

           R1a1a7 is the new M458 haplogroup.  R1a1a7 includes what I have been calling P type and N type here on this web page, even before M458 was available.

           R1a1a* is a new paragroup.  This is M458 negative.  It includes all my other types, particularly K type.

           This Underhill article has data for 158 “Poland” samples (Table 2):

                       R1a1a*:           71 samples      44.9%

                       R1a1a7:           87 samples      55.1%

           The 70% confidence interval for R1a1a7 is about 50% to 60% in the Underhill Poland data.
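
           The interval quoted above can be checked with a short calculation.  This is only a sketch:  it assumes the simple normal approximation to the binomial, which may not be the method I used originally, and the function name proportion_interval is made up for illustration.  The counts (87 of 158) are from the Underhill Poland data above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, total, confidence):
    """Two-sided interval for a proportion (normal approximation, an assumption)."""
    p = successes / total
    # z value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Underhill Poland data: 87 R1a1a7 samples out of 158
low, high = proportion_interval(87, 158, 0.70)
print(f"{low:.1%} to {high:.1%}")
```

           With these counts the result is roughly 51% to 59%, consistent with the "about 50% to 60%" stated above.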

           Worldwide 77% of the Underhill data is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).

           M458 Results are coming in now for this new SNP test and the Polish Project R1a is splitting about evenly, with a few percent more R1a1a7 than R1a1a*, although the latter is more common worldwide.

 

Abstract

 

           Abstract rewrite 22 Jul 2010.

           This Abstract is for people reasonably familiar with the jargon of genetic genealogy.  If you are new to genetic genealogy you might prefer to read the Introduction first.

           This web document has three purposes:  1. More detailed explanations for the men (samples) that I assign to types in the Polish Project.  2. Summary of my published results.  3. Update with recent results.

           The topic is common Polish Y-DNA clades - identification of male line Y-DNA clusters that are concentrated in Poland.

           Since I originally posted this in December 2007, emphasis has been haplogroup R1a, because about half of Polish men are R1a, with no subdivision at that time.  A new division, roughly 50-50, between R1a1a* and R1a1a7, became available in November 2009.  In July 2010 I added some R1b analysis.  I will soon expand this page to include clades from other haplogroups that seem to be concentrated in Poland.

           I use the word type to mean an STR cluster with statistical validity as established by my Mountain Method.  I expect my types to be validated some day by discovery of new SNPs that will qualify them as haplogroups.  I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters.  All types have associated clusters but not all clusters qualify as types.  In my publications and web pages I make it clear which types I have discovered in web data and which types were suggested to me by others, with references.  Usually when I discover a type I later find out someone else had mentioned it earlier on the web;  let me know if you the reader have more clues and references for me.

           Most of the types that I discuss seem to be 1,000 to 5,000 years old, so all the men in each type seem to be descended in direct male lines from one man (MRCA) who lived that long ago (TMRCA).  A few of my types might be younger or older than that range.

           I use phrases like “seem to be” over and over because the methods are statistical.

           Click here for a summary of the conservative automatic haplogroup assignments in the Polish Project, for an explanation of the extended assignments, for a discussion of the minimum 80% probability for assignment, and for the R1a Results Table.

           The Polish Project is considered representative of Historical Poland, with caveats explained in my Publication.

 

R1a Abstract

 

           Abstract rewrite 4 Jul 2010.

           About half the men of Polish male line ancestry belong to the R1a haplogroup.  These are all R1a1a with about 99% probability.  This R1a Abstract is a summary discussion of the R1a Results Table.

           U category.  Unassigned.  This is the largest category.  On the Polish Project Y-DNA Results page, detailed assignments are made with minimum 80% probability.  Because of the restriction to 80% probability, many R1a men in the Polish Project are not assigned to detailed categories at the Polish Project web page.  Those men go into this “Unassigned” category.  These still have either R1a or R1a1 automatically assigned by FTDNA.  If you are in this U category, you can promote yourself out by purchasing the full 67 marker STR set, since all R1a samples with 67 markers get a detailed assignment.

           I consider the R1a Polish data as 4 major categories based on STR data.  About half the men of Polish male line ancestry belong to the R1a haplogroup, and that group divides roughly equally into these 4 categories.  Since 2007, I have been calling them P type, N type, K type, and R category.  P and N are in the new R1a1a7.  K and R are R1a1a*.

           R, Remainder, is not a type.  I use R for samples that do not belong to any of the types I have identified in R1a1a* so far.

           My overall confidence in K type is only 85% because there seem to be unidentified types with STR values close to K.  The modal haplotype for K is essentially the same as the modal haplotype for all of R1a.  However, I have identified subtypes of K that have much higher confidence.  In other words I have higher confidence for many individual samples.  I have high confidence in the subtypes although I am not sure all the subtypes assigned to K belong to exactly the same clade along with all the other samples that I have assigned to K outside the subtypes.  Even if K is not a true clade as defined, however, it is clear that the K samples belong to branches in the R1a1a* tree with nodes very close to each other.  The only uncertainty is that there are likely many other samples that belong in other branches just as close to K.

           Borderline categories are not types but are samples that match types with less than 80% probability.  Each Borderline category has discussion below.

           P type is concentrated in Poland, rare with increasing distance from Poland.  N type seems to be mostly Slavic, widespread in eastern Europe.  K type corresponds to one of the two largest R1a1 clusters.  Another large R1a1a cluster, the one I call L type, is not common in Poland.

           In the table I assign each R1a1a* (M458-) subtype into either K or R based on how distant the STR values are from K.  Some of these are borderline however.  There is no clean separation of K from R, so the table should not be considered a high probability separation of K subtypes from the R remainder subtypes.  Read the individual type discussions to see which subtypes fit K with high probability;  A type is an example.

           Thanks go to Lawrence Mayka, Polish Project administrator, for extensive email information and assistance.

           You can compare data to my types by clicking this link to instructions for Ysearch.

           Reminder:  I am concentrating on Poland.  The statistics of STR clusters depend a lot on the data base.  For example, P type stands out dramatically in Polish data.  In other countries P type is rare.  If you belong to an R1a1 cluster that is rare in Poland, I’m sorry, but I’m not covering you.  K type is an example of a type that is common both in Poland and elsewhere.  M type is common in northwest Europe but so far absent in the Polish Project.

 

Introduction

 

           This Introduction is for people unfamiliar with the jargon of genetic genealogy.

           There are quite a few web sites with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic.  Back issues of JoGG are good general references.  The Y Chromosome Wikipedia article is about male line DNA, also called Y-DNA.

           The following several paragraphs are a brief introduction to genetic genealogy for Y-DNA, providing some definitions of jargon needed to read my web pages.  The definition words are boldface.  I often use links to those definitions when I use a jargon word for the first time in a topic.  There are more boldface definitions in the summary of my Methods.

           The Y chromosome gets passed from father to son, so it works just like a male family name.  Men are divided into haplogroups based on known rare mutations (called SNP) in the Y chromosome.  Division into haplogroups is done in a manner that has virtually 100% confidence.  I say “virtually” because your confidence in your DNA result from your DNA testing company might be 98% or 99% or 99.9%;  the confidence for haplogroups is better than that.  We can be virtually certain that all the men in a haplogroup descend in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup.   Time of the Most Recent Common Ancestor (TMRCA) is an estimate of how long ago he lived - the age of the haplogroup.  Lots of people are working hard to discover more SNPs on the Y chromosome so that the haplogroups can be divided further into smaller haplogroups.  I’m doing some work on this, but I’m not discussing it in this web document.

           Haplogroups have alphanumeric codes, like R1a1a.  A paragroup is a haplogroup considered without its known haplogroup branches.  When a new branch is discovered within a paragroup, it gets removed from the definition;  that changes the meaning of that paragroup.  An asterisk is usually used in paragroup codes, like R1a1a*.

           Other people, like me in this document, try to “stay ahead” of the haplogroups by analyzing other mutations that are not so rare (called STR) on the Y chromosome.  Men submit their Y-DNA data to various web sites.  There are lots of STR data available on the web.  Men are divided into STR clusters as hypothetical subdivisions of the haplogroups.  All such clusters are hypothetical.  Some will be validated in the future by new SNP discoveries.  There are various statistical methods for estimating the confidence of STR clusters.  I recently published a method that I developed.  That publication has references to other methods.  There is a brief summary of my method below.

           A few STR clusters are small family clusters, with the same family name.  Y-DNA is biologically accurate, so some men discover that their Y-DNA does not match the DNA of their male line cousins identified by genealogy research, due to secret adoptions, illegitimacies, etc.  This is one of the reasons some people prefer to avoid genetic genealogy.  The male line associated with the Y chromosome is only one ancestral line, and the Y is only one of the 24 distinct human chromosomes (22 autosomes plus X and Y).  Anyone who tries to make a family tree going back 300 years has more than a thousand root tips to be filled by names of ancestors who lived back then;  the one man at the tip of the male line root is only one of those thousand.  That is another reason some genealogists avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of many.  That said, many people enjoy the challenging hobby of figuring out to which ancient extended male line they belong.
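
           The "more than a thousand root tips" figure is simple doubling arithmetic.  Here is a sketch, assuming 30 years per generation (my round number for illustration, not a measured value);  the function name ancestor_slots is made up.

```python
def ancestor_slots(years_back, years_per_generation=30):
    """Number of ancestor positions in the family tree at a given depth.

    Each generation back doubles the number of positions. The default of
    30 years per generation is an assumption for illustration only.
    """
    generations = years_back // years_per_generation
    return 2 ** generations


# 300 years is 10 generations at 30 years each, so 2**10 positions
print(ancestor_slots(300))  # 1024
```

           Exactly one of those 1024 positions is the male line ancestor traced by Y-DNA.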

           Most STR based clusters have an MRCA who lived thousands of years ago, before family names were common, so most men assigned to a typical cluster do not have the same family name.

           Many SNP based haplogroups have an MRCA who lived more than ten thousand years ago, so these span multiple ethnic groups and nationalities.  For example, the R1a haplogroup is of interest to me.  R1a is most common in Slavic countries but calling R1a Slavic is misleading because it is found throughout Europe and west Asia.  The MRCA lived so long ago that he may have spoken a language that we would not consider Slavic if we could hear it.  It is possible that he did not even live in what is now the Slavic region of Europe;  maybe his descendants moved there in a massive migration from the Asian steppes, or from India.  No one knows for sure.  Even if he was proto-Slavic in language and culture, by now some of his descendants long ago moved to other parts of Europe and Asia.  One of the appeals of genetic genealogy is trying to figure out ethnic descent and migration from the statistics of haplogroups.  Some people object, pointing out that ethnicity cannot be defined genetically because of all the moving and mixing of people over the millennia, and because the Y chromosome is only one of many.  True enough.  Some individuals and some web sites go too far with genetic claims.  That said, statistical analysis of haplogroup data provides many clues on human origins.

           Again, some people try to stay ahead of haplogroups, using statistical analysis of STR based clusters to gain insight into more recent human origins.  I am one of those people.  My interest is Polish origins.  This web document, however, is not for the historical analysis and conclusions, except for occasional comments to remind us of the goal.  This document is dedicated to STR data and analysis, identifying clusters concentrated in Poland, with detailed explanations.

           The bottom of my Method section has more definitions for a number of genetic genealogy terms.

           There are a number of organizations and commercial companies on the web where you can order a cheek swab kit to mail in for genetic genealogy analysis, for example FTDNA.  I am not associated with the company FTDNA;  I mention them because I make extensive use of their data;  check Google for competitors.  At FTDNA, click on Products for cheek swab kits.  DNA results are confidential unless you register the data at a database;  at FTDNA, click on Projects to register your data into one of the many databases;  for example, most of my analysis is from the data in the FTDNA Polish Project.

           I use the FTDNA standard set of 67 STR markers (plus a few non-standard ones occasionally).  I do some analysis using the standard FTDNA 12, 25, or 37 STR marker sets.  Other companies use standard marker sets that may not overlap with all the FTDNA markers.

           Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services.  I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch.  From the FTDNA site, you can register your data with Ysearch.  Or you can type your Y-STR data into Ysearch.

 

Format

 

           Up to here, I have tried to write this web page as news and summary, with links to more discussion below.  I hope anyone having minimal familiarity with genetic genealogy jargon has understood.  If you read this top to bottom, it gets progressively more detailed, with more and more jargon.  I’m sorry about that, but the audience is also readers with genetic genealogy experience who want to know how I came to my conclusions.  If you cannot follow some of this, it is written in a manner that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.

           If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location.

 

L260 and M458 Test Results

 

           This topic was updated 9 Aug 2010.

           This topic provides discussion.  Read L260 and M458 News first, for a summary.  For more numbers see the following topic.

           L260 and M458 are the names of two SNP tests.

           Lawrence Mayka is the administrator of the Polish Project.  SNP results are not posted on the web.  Most of my SNP data comes from Mayka.  Some of my data comes from Cyndi Rutledge, the administrator of the R1a Project.  Many men join both projects, but of course many men purchase the L260 or M458 test and do not join either.  If you are an administrator of an FTDNA project (or a project at another database) you may send me the L260 and M458 results for your project for merging into my analysis, if you wish.  Karen Melis, the administrator of the Zamagurie Project, also sent me a few M458 results.

           Data with the 67 standard markers is most common because Mayka and I selected these for the initial tests.  In addition, men who have purchased less than the standard 67 markers are less likely to purchase SNP tests.

           Mayka and I purchased many L260 and M458 tests for Polish Project members, so test results available to me are biased toward Polish data.  Also, I suppose men who notice my publication and web pages about Polish types are more likely to purchase the L260 and M458 tests, so even data not available to me might be biased toward Polish data.  At first we were concentrating on samples that match P type and N type very well, so much of the data available to me are biased toward P type and N type, of course.  Later we concentrated on borderline samples that just barely match P type and N type, in order to better define the borders in STR haplospace.  If there are clades from outside M458 (outside haplogroup R1a1a7) that just happen to have STR values that match P type or N type we will discover them quickly, but not if they are concentrated far from Poland, and particularly not if they are concentrated in any Eurasian lands where men do not tend to get DNA tests.  If there are M458 clades with STR values very different than P type or N type it will take some time to discover them all, because those will require “deep clade” tests by men without an M458 prediction to do the M458 test anyway.  So far I have results for only 2 such samples distant from both P and N, and they are very different from each other in STR values;  they might be statistical flukes due to the random nature of STR mutations, or they may be valid singleton representatives of 2 small older clades that branched off the M458 tree trunk long ago.  These two are discussed further below.

           The SNP results do not provide estimates of population frequency because we are selecting the most interesting samples for SNP tests.  However, since the SNP tests verify my type classification, my STR types provide credible frequency estimates.  My Results Table is still the best estimate of frequencies in Poland:  P type for M458+ L260+;  N type for M458+ L260-.

           P Borderline comments at 67 markers:  The data is listed in the following topic.  Two of the L260+ have P=7 (step 7 from the P type modal haplotype), which is the 3rd step beyond the P type cluster (cutoff definition P<5).  P type has a gap of 3 steps (P = 5 to 7) where sample frequency is low and prediction is uncertain.  I would have predicted these two “P Borderline” a couple years ago, because the N and K steps are high.  The Polish Project has only 7 samples in the P gap, including these two;  6 have been tested L260+.  The 7th, at the cutoff P=5, not tested for SNPs, is classified P Borderline for now.  I also have data for another gap sample at P=6 not in the Polish Project, tested L260+.  The gap is the most likely place to find P Borderline predicted samples that are L260-, but I have seen none so far.  With only 8 gap samples, the probability is uncertain.  P=8 is common - that is beyond the borderline.  Most P=8 are predicted N type and some are predicted other types with M458-;  no L260+ outliers yet at P=8.  P=4 is common;  of the 7 samples in the Polish Project, 3 have been tested for L260, all L260+.  Although this is little data at P=4, the last step of the P type mountain, confidence in P type prediction is boosted by the absence of background in the gap.  Conclusion:  P type samples below cutoff 5 are haplogroup M458+L260+ with high probability.  Gap samples with P = 5 to 7 are probably P type with uncertain probability.
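
           The cutoff and gap rule described above can be sketched in a few lines.  This is a simplified illustration only.  It assumes the step is the sum of absolute differences in repeat counts from the modal haplotype, which may differ in detail from the counting rules in my publication.  The marker names and repeat values are hypothetical placeholders, not the real P type definition.  The sketch also ignores the other inputs to an actual assignment, such as steps from other types and SNP results.

```python
def step_distance(sample, modal):
    """Mutation steps from a modal haplotype.

    Assumption: one step per unit difference in repeat count, summed over
    the markers in the modal definition.
    """
    return sum(abs(sample[marker] - modal[marker]) for marker in modal)


def classify(step, cutoff, gap):
    """Place a step value relative to a type's cutoff and gap."""
    if step < cutoff:
        return "type"        # inside the mountain
    if step < cutoff + gap:
        return "borderline"  # in the gap, prediction uncertain
    return "outside"         # beyond the gap


# Hypothetical 3-marker example (placeholder names and values)
modal = {"STR_A": 13, "STR_B": 25, "STR_C": 16}
sample = {"STR_A": 13, "STR_B": 24, "STR_C": 18}
step = step_distance(sample, modal)        # 0 + 1 + 2 = 3
print(step, classify(step, cutoff=5, gap=3))

# P type as described above: cutoff 5, gap 3
for p in (4, 5, 7, 8):
    print(p, classify(p, cutoff=5, gap=3))
```

           With cutoff 5 and gap 3, steps 0 to 4 fall inside the type, steps 5 to 7 fall in the gap, and step 8 is beyond the borderline, matching the description of P type above.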

           N Borderline comments at 67 markers:  The analysis is similar to P type.  The cutoff is 7 and the gap is 2 (N = 7 to 8):

           N Step    Number in Polish Project    Predicted N Type    Number Tested       Others Tested (not in Polish Project)
           5         12                          12                  6 M458+             1 N type M458+
           6         7                           7                   2 M458+             2 N type M458+
           7         3                           3                   3 M458+
           8         3                           1                   1 N type M458+      1 N type M458+
           9         15                          3                   2 N type M458+

           At N = 5 or 6, samples without the M458 test are classified N Borderline at the Polish Project web page because the statistics are just now approaching 80% probability for N type assignment near the cutoff without the SNP test.

At N=7 cutoff, 3 samples:  As with P type, the frequency at the cutoff is low.  All 3 would be predicted N borderline and all 3 came out M458+.

           At N=8, 2 of the 3 N type would have been predicted N Borderline a couple years ago and one now is N type because of an M458+ test result.  The other is now G type, which type did not have enough data for prediction a couple years ago.  That 3rd one is now classified D type, a subtype of K.  One additional sample from the R1a project has N=8, also tested M458+.

           N=9 is not a good indicator for N type because many of these are P type or M458- types.  Only those that do not match any of the other types can be predicted N Borderline, with low confidence.  The Polish Project has 2 of these;  both are M458+.

           There are no such samples at N=10.  There are two outliers, one each at N=11 and N=12, that would not be predicted N Borderline but came out M458+.  Those two might be outliers due to the luck of random mutations.  I prefer the interpretation that N type is an isolated mountain in haplospace due to a vigorous population expansion in recent history following a bottleneck.  On this basis, I speculate that those two at N = 11,12 represent small clades that existed before the bottleneck and survived, with nodes on the M458 tree older than the bottleneck.  I classify these two as “M458+ Remainder” to distinguish them from the main N type clade.  What I really mean is a “paratype” of samples that are M458+ but neither N type nor P type.  (The Results Table needs to be updated for this new category.)  There is no way to figure the probability that my interpretation is correct, but my classification tells you that I’m guessing 80% probability.  My publication has discussion about how a type without structure should not have a tail to the distribution of frequency vs step, so I am confident that the high N value samples represent structure in the tree of the type.  Those other two at N=9 might also be such structure, with a node older than a prominent bottleneck, but I’m leaving them in N for now (N Borderline without M458+ result) because the 80% probability rule does not apply to borderline categories.

           Summary:  so far all the M458- samples are falling well beyond the N type cutoff.  M458+ samples have a distribution tail of outliers beyond N.  Conclusion:  N type samples with cutoff 7 are paragroup M458+L260- with high probability.  Samples beyond N type belong to this paragroup with low probability.  Both P type and N type are coming out very isolated - no foreign samples have shown up yet close to the two mountains in haplospace.
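
           The cutoff and gap rules in the P and N Borderline comments above can be sketched in a few lines of Python.  This is only an illustration, not my Excel analysis:  it assumes that a step value is the sum of absolute STR differences from the modal haplotype, the function names are hypothetical, and the three-marker modal and sample are made up.  It ignores SNP results and the other types, both of which can override a prediction (for example, an L260+ result settles a gap sample, and an N=9 sample may still be N Borderline with low confidence).

```python
# Sketch only. The marker values below are made up for illustration, and
# step_distance / classify_step are hypothetical helper names.

def step_distance(sample, modal):
    """Sum of absolute STR differences over the markers in the modal definition."""
    return sum(abs(sample[marker] - value) for marker, value in modal.items())


def classify_step(step, cutoff, gap_width):
    """Place a step value relative to a type's mountain, gap, and background."""
    if step < cutoff:
        return "type"          # inside the mountain, e.g. P < 5
    if step < cutoff + gap_width:
        return "borderline"    # in the gap, e.g. P = 5 to 7
    return "beyond"            # past the gap


P_CUTOFF, P_GAP = 5, 3   # P type: cutoff 5, gap P = 5 to 7
N_CUTOFF, N_GAP = 7, 2   # N type: cutoff 7, gap N = 7 to 8

# Hypothetical three-marker modal and sample (real definitions use many more markers).
modal = {"DYS388": 12, "DYS392": 11, "DYS481": 23}
sample = {"DYS388": 8, "DYS392": 11, "DYS481": 24}

step = step_distance(sample, modal)   # 4 + 0 + 1 = 5
print(step, classify_step(step, P_CUTOFF, P_GAP))
```

           A single marker can contribute several steps, as with the DYS388 value in this made-up sample, so a high step value does not always mean many independent mutations.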

 

L260 and M458 Test Results;  Details

 

           This topic was updated 9 Aug 2010.

42 L260 Polish Project

           29 L260+

           13 L260-

1 L260 R1a Project, counting only those not in the Polish Project

             1 L260+

43 L260 Total

           30 L260+

           13 L260-

138 M458 Polish Project

           70 M458+

           68 M458-

38 M458 R1a Project, counting only those not in the Polish Project

           10 M458+

           28 M458-

7 M458 Zamagurie Project, not in lists above 

           7 M458+

183 M458 Total (sum from above)

           87 M458+

           96 M458-

226 Total SNP test results (sum from above)

191 unique samples (191 different men)

           161 have all 67 standard STR markers

           27 have the 37 standard set

           1 has the 25 standard set

           2 have the 12 standard set

76 M458+ results at 67 markers

                       (includes 8 L260+ not tested for M458 assumed M458+):

           35 P<5 solid P type (P<5 means <5 mutation steps from P type definition)

                                  (includes 7 L260+ not M458 tested)

                       2 N=9  all solid P type well isolated from N

                       5 N=10

                       28 N>10

           3 P=5 at cutoff just outside the P mountain;  all 3 are L260+

                       these would be classified P Borderline without that L260 test

                       N=11,11,13 well isolated (one L260+ not M458 tested)

           2 P=6, cutoff +1, N=14,16

                       P still well isolated from N, but not statistically significant yet

           2 P=7, cutoff +2, N=13,14

           0 P type with P>7;  the remainder are all M458+ and none are L260+

           25 N<7 solid N type

                       2 P=8 not considered P Borderline because N=4,6

                       3 P=9 not P Borderline; N=2,4,4

                       20 P>9

           3 N=7 at cutoff just outside the N mountain

                       1 P=8 not considered P Borderline because P is 3 steps beyond P cutoff

                       2 P=11,12

           2 N=8, cutoff +1, P=8,9

           2 N=9, cutoff +2, P=11,15

           0 N=10

           1 N=11, P=9 very far from both P and N borders (cutoff +4)

           1 N=12, P=12 very far from both P and N borders

                       this one has DYS572 null, value 8 used to provide step 3 at this marker

           Those last 2 samples are mentioned in the L260 and M458 News and in the previous topic.

                       These are possible representatives of small old M458 clades

                       That N=12 sample is L260- but that N=11 sample has not been tested for L260, probably negative

                       We are calling these two “M458+ Remainder”

76 M458+ results at 67 markers again, this time sorted by K (cutoff 4)

           1 K=4, (P,N)=(2,10) P type L260+

           8 K=5, all P type, P = 1 to 5, one P=5 is L260+; N = 10 to 13

           12 K=6;  all P type

           55 K = 7 to 13

85 M458- results at 67 markers;  none match closely to P or N type in STR values

           0 at the cutoffs (P=5, N=7)

           0 at cutoff +1

           2 at cutoff +2 - Borderline

                       2 N=9, P=11,12 - these are G and D type

           4 at cutoff +3 - Borderline, but pretty far out

                       2 P=8, N=13,15 - these are I and K Borderline

                       2 N=10, P=13 - these are D and D Borderline

           79 N>10, P>8 - STR prediction is clearly not M458 - verified

37 marker comments (27 samples):

           18 M458+ categorized P, N, N Borderline

                       There is one sample L260+ with only 37 markers, at P=1, consistent with a P type prediction.

                       All but one are consistent with the 67 marker results - P type and N type - although some are N Borderline, where it is difficult to predict at 37 markers.  That one exception has P,N = 10,10, but DYS388=8 (modal 12), producing 4 of those 10 steps at one marker, so this one may just be a statistical outlier.

           9 M458-;  all have high P and N values, and none are predicted P or N
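
           The counts in this topic can be cross-checked with a short script.  This is just an arithmetic sketch:  the numbers are copied from the lists above, and the variable and function names are my own labels for illustration.

```python
# Counts copied from the lists in this topic; names are illustrative labels only.
# Each entry is (positive results, negative results).
l260 = {"Polish Project": (29, 13), "R1a Project": (1, 0)}
m458 = {"Polish Project": (70, 68), "R1a Project": (10, 28), "Zamagurie Project": (7, 0)}


def totals(results):
    """Return (positives, negatives, all results) summed over the projects."""
    plus = sum(p for p, _ in results.values())
    minus = sum(m for _, m in results.values())
    return plus, minus, plus + minus


# Unique samples by number of STR markers.
markers = {67: 161, 37: 27, 25: 1, 12: 2}

# Breakdown of the 67 marker samples, in the order listed above.
m458_plus_67 = [35, 3, 2, 2, 25, 3, 2, 2, 0, 1, 1]
m458_minus_67 = [0, 0, 2, 4, 79]

print("L260:", totals(l260))
print("M458:", totals(m458))
print("SNP results:", totals(l260)[2] + totals(m458)[2])
print("unique samples:", sum(markers.values()))
print("67 marker samples:", sum(m458_plus_67), "+", sum(m458_minus_67))
```

           The M458+ and M458- breakdowns at 67 markers sum to 76 and 85, which together account for the 161 samples that have all 67 markers.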

 

Haplogroups in the Polish Project

 

           The Polish Project on 25 May 2010 had 1306 members, of which 17 are family set duplicates, for 1289 independent members.  This reduction for family sets is explained in my publication.

           Here are the FTDNA haplogroup assignments.  The left side of the table has the totals by major haplogroup.  The right side provides only those haplogroup branches for which there are more than 25 samples;  most branches have fewer than 25 samples and are not listed here:

 

Haplogroup  Count  Percent    Common Branch  Count  Percent
-           1      0.1%
C           5      0.4%
D           1      0.1%
E           88     6.8%       E1b1b1         61     4.7%
F           4      0.3%
G           37     2.9%
H           2      0.2%
I           209    16.2%      I1             63     4.9%
                              I2a            83     6.4%
J           104    8.1%       J1             28     2.2%
                              J2             49     3.8%
L           3      0.2%
N           95     7.4%       N              54     4.2%
                              N1c1           30     2.3%
Q           16     1.2%
R           711    55.2%      R1a            66     5.1%
                              R1a1           509    39.5%
                              R1b1b2         76     5.9%
T           13     1.0%
Total       1289   100%                      1019   79%

 

           The left side of the table is a good estimate of the haplogroup frequencies in Historical Poland, insofar as the Polish Project is representative of Historical Poland, as discussed in my publication.

           The right side of the table is not representative.  This is my next point for discussion:

           About half of those haplogroup FTDNA assignments in the table above are based on SNP measurements (green text for haplogroup assignments by FTDNA).  About half are haplogroup predictions based on STR values for samples that do not have SNP measurements, using an FTDNA proprietary method (red text).  The FTDNA predictions are very conservative - at least 99% probability - they almost always come out as predicted when an SNP test is done.  Because they are conservative, most FTDNA predictions really belong to a branch subdivision of the trunk haplogroup to which they are assigned.  In addition, many of the green assignments based on SNP measurements are old, so these do not reflect new SNPs that became available since they were last tested.

           In other words, a conservative prediction has a low probability of being wrong, but it also has a high probability of being incomplete.

           For example, those 76 men assigned to R1b1b2 are really mostly R1b1b2a1, and there are SNPs available to further assign them to eleven smaller haplogroup branches of R1b1b2a1, so those 76 men can easily obtain more detailed assignments by purchasing the SNP tests.  Similarly, those E, I, J, and N men have more detailed assignments available through SNP tests.

           R1a is very different.  Almost all of those 66 R1a samples are really R1a1.  They are classified R1a either because (1) their STR values are unusual, so a very conservative prediction cannot be made, or (2) they have an old R1a test, made before R1a1 was available, so they remain listed R1a even though they could be conservatively predicted R1a1.

           In other words, there are really more than 570 R1a1 samples, 44% of the men in the Polish Project, for whom more detailed automatic haplogroup assignment is not available through FTDNA.  That’s one motivation for this web page of mine.
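
           For readers who want to check the figure, here is the arithmetic as a small Python sketch.  The counts come from the table above;  the variable names are only illustrative, and the upper bound assumes every one of the 66 R1a samples is really R1a1.

```python
# Counts from the haplogroup table above; variable names are illustrative.
INDEPENDENT_MEMBERS = 1289
r1a_listed = 66      # listed R1a, almost all really R1a1
r1a1_listed = 509    # listed R1a1

# Upper bound: assume every listed R1a sample is really R1a1.
upper_bound = r1a_listed + r1a1_listed
share = 100 * upper_bound / INDEPENDENT_MEMBERS

print(upper_bound, round(share, 1))
```

           The upper bound is 575 samples, so "more than 570" corresponds to a share between about 44.2% and 44.6% of the 1289 independent members.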

 

Polish Project Assignments

 

           This topic was updated 5 Jul 2010.

           Lawrence Mayka is the administrator of the Polish Project.  Click on the Polish Project web link to see how Larry assigns samples (men) to categories.  The Polish Project has sections for mtDNA and for Y-DNA.  This web document of mine is restricted to Y-DNA, with emphasis on R1a.  I help Larry with assignments to R1a types.

           Haplogroups are defined by SNP mutations.  STR mutations are easier to test, so many samples have STR data without  SNP data.  Predicted assignments are based on STR correlations.

           I mentioned above that FTDNA automatic haplogroup predictions (red text means STR predicted vs green text SNP measured) have about 99% probability.  We use minimum 80% estimated probability for each individual sample in the Polish Project that gets an extended assignment - a subdivision of its FTDNA assignment.  At 80%, many more assignments are possible.  Most extended assignments are better than 80% probability.  Many are better than 95%.

           Many samples do not have extended assignments, but they still have their FTDNA green measured haplogroup (100% probability) or their FTDNA red predicted haplogroup (99% probability).  These bring up the average for the Polish Project as a whole.

           We are confident that the average is better than 95%, which is to say that more than 95% of the Polish Project samples would test positive for the SNP corresponding to their assigned haplogroup.  Excluding R1a the average is likely more than 97%.

           E1b1b1a2 (V13) is an example of a haplogroup category with some extended assignments:  Larry has me in this category, which is 100% correct because I tested positive for the V13 SNP along with 14 other men in the Polish Project (on 25 May 2010).  However, Larry’s listing includes 48 men in this category, based on his analysis of STR correlations:

           15 green E1b1b1a2.  These are of course certain.

           28 red E1b1b1 because FTDNA does not predict beyond that, but these would likely be E1b1b1a2 if tested, because they have STR values close to those samples that have tested V13+, and unlike the samples that have tested positive for other branches of E1b1b1.  Each has at least 80% probability, and many are even more probably correct.

           2 green E1b1b1 tested for that previous SNP but not for the current V13, but matching in STR values.

           3 green E1b1b1a tested for that previous SNP but not for the current V13, but matching in STR values.

           Note that other E1b1b1 men, both green and red, fall into other categories at the Polish Project, because they do not match V13+ samples closely in STR values.
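
           As a rough worked example of what the 80% minimum implies for this category, the sketch below computes a lower bound on how many of the 48 listed men would be expected to test V13+.  It assumes the 15 tested men count as certain and that each of the other 33 sits exactly at the 80% minimum, which is pessimistic because many are more probable than that.  The variable names are illustrative.

```python
# Counts from the V13 example above; names are illustrative.
confirmed = 15            # green E1b1b1a2, tested V13+
predicted = 28 + 2 + 3    # red E1b1b1, green E1b1b1, green E1b1b1a
MIN_PROBABILITY = 0.80    # minimum for any extended assignment

listed = confirmed + predicted
# Pessimistic case: every predicted sample is exactly at the 80% minimum.
expected_min = confirmed + MIN_PROBABILITY * predicted
fraction_min = expected_min / listed

print(listed, round(expected_min, 1), round(100 * fraction_min))
```

           Under these assumptions at least about 41 of the 48, roughly 86%, would be expected to come out V13+, and the real figure should be higher.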

           R1a is unique because almost half the Polish Project samples are placed together by FTDNA into R1a1 (M198), which is elsewhere called R1a1a (M198).  Many of our R1a assignments are to types, which are hypothetical, without known SNP definitions.  The minimum 80% estimated probability still applies to each sample and again most are much better than 80%.  For type definitions we are confident that the average is about 90%, which is to say that about 90% of the Polish Project R1a samples assigned to a type would test positive someday for an SNP, unique to that type, not yet discovered.

           “Cluster” and “Borderline” and “Unassigned” category probabilities are discussed below.

           I have been active helping Larry with R1a assignments to types since late 2007.  During early 2010, I was busy with other things so Larry introduced a few more R1a clusters, which I have now (June 2010) analyzed and documented here as types.

 

R1a Results Table

 

           The Polish Project on 25 May 2010 had 639 independent samples (members) with the full 67 marker STR set.  Many of the samples with fewer than 67 markers are in category U (unassigned).  Polish Project Assignments at 67 Markers are taken as representative of Poland, with caveats explained in my Publication.

           272 of the 639 are R1a.

           Click on the link in the far right column to jump down to more discussion for that type.

           Read the R1a Abstract for a brief summary of this table.

           Those Types and Subtypes are my own code letters, for brevity.  Please do not confuse these code letters with official haplogroups.  I have been using such code letters for R1a assignments in the Polish Project for over 2 years.  The color coding is for ease of comparison on my web pages.

 

This table was updated based on 25 May 2010 assignments, R1a, at 67 markers, 272 samples:

Cluster  Group    Type  Subtype  Subcluster  Samples  Polish %  Ysearch  Link
P                                            58       9.1%
         R1a1a7   P                          57       8.9%      8U92G    P
         R1a1a7                  PB          1        0.2%               PB
N                                            56       8.8%
         R1a1a7   N                          44       6.9%      3SEJK    N
         R1a1a7                  NB          12       1.9%               NB
K        R1a1a*   K                          86       13.5%     MN8R3    K
         R1a1a*         K                    24       3.8%
         R1a1a*         A                    12       1.9%      FCUFG    A
         R1a1a*         B                    6        0.9%      RU8Z8    B
         R1a1a*         E                    16       2.5%      GNYBG
         R1a1a*         F                    6        0.9%      EFQM7
         R1a1a*         H                    3        0.5%      559EE
         R1a1a*         I                    13       2.0%      EKVHX    I
         R1a1a*         J                    6        0.9%      743N9
R                                            72       11.3%
         R1a1a*                  KB          32       5.0%               KB
         R1a1a*         C                    1        0.2%               C
         R1a1a*         D                    11       1.7%      K49NZ    D
         R1a1a*         G                    14       2.2%      ZD29Z    G
         R1a1a*                  R           14       2.2%               R
L        R1a1a*                              0        0%                 L
         R1a1a*         M                    0        0%        24MB4    M
U        R1a1a*                              0        0%                 U
Totals   R1a1a                               272      42.6%

           The Ysearch links provide the full modal haplotypes, using a selected subset of the standard FTDNA set of 67 markers.  I entered this data into Ysearch for our convenience.  All my modal haplotype definitions are available in the Excel file Haplotypes.xls, which also has experimental types not mentioned here.  Below are Ysearch instructions for quickly comparing your haplotype to all my types at once.

           Assignment to types is with at least 80% estimated probability.

           The estimated percentages for P, N, K, and R in the Results Table add up to 42.6%, which is the percent of R1a in the Polish Project at 67 markers.
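
           The percentages in the Results Table can be recomputed from the sample counts.  The sketch below is only a check of the arithmetic;  the counts are copied from the table, and the names are illustrative.

```python
# Counts copied from the Results Table (25 May 2010, 67 markers).
TOTAL_67 = 639   # independent Polish Project samples with 67 markers
clusters = {"P": 58, "N": 56, "K": 86, "R": 72}
k_subtypes = {"K": 24, "A": 12, "B": 6, "E": 16, "F": 6, "H": 3, "I": 13, "J": 6}


def percent(count, total=TOTAL_67):
    """Share of the 67 marker samples, in percent, to one decimal."""
    return round(100 * count / total, 1)


r1a_total = sum(clusters.values())
for name, count in clusters.items():
    print(name, count, percent(count))
print("R1a", r1a_total, percent(r1a_total))
```

           The 42.6% total comes from the counts (272 of 639).  Adding the four rounded cluster percentages instead gives 42.7%, which is only a rounding effect.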

 

Description of the R1a Types

 

           Click the Ysearch web links in the Results Table for modal haplotypes, which are my best fits of web data to groups of men with similar STR data.  See also the haplotype Excel file at PolishCladesUpdate.

           Please don’t get confused.  The following capital letters are my code for R1a types.  Capital letters are also used for the large official haplogroups, but that’s different.

           Some of the following types have my Excel analysis file for my November 2009 publication;  the files are stored in the Supplementary folder.  Many of the following types have my update Excel analysis at PolishCladesUpdate.

 

           A.  Ashkenazi.  This seems to be a subtype of K.  This type is discussed in my publication, Part II.  I have about 90% confidence in that subtype status, but I am more than 98% certain that A is a valid clade, not just because of my work, but because the modal haplotype closely matches the various versions of the most common Ashkenazi haplotype, which has been widely studied and reported on the web.  It should be emphasized that not all Ashkenazi match this type, and some men in this type may not be descended from Ashkenazi.  This type is not restricted to Poland.  Levy-Coffman wrote an article about Ashkenazi genetic genealogy;  I noticed discussion in a recent Science article.

 

           B.  Another subtype of K, recently identified by Mayka.  Concentrated in Poland.  The B data cluster lies at the edge of the K cluster.  The node for B type in the R1a tree might be slightly younger or slightly older than the K definition node.  I estimate the former is about 80% probability - that B is truly a subtype of K;  if not then B probably lies just outside of K (node slightly older).  Individual assignments to B type have 80% to 90% probability.

 

           C.  Added to Polish Project in Dec 2009 by Mayka, who points out that Didier Vernade originally pointed out the unusual DYS392=13 value in 2007.  Very small cluster.  Only 2 Polish Project members, only 1 at 67 markers, but this type is well isolated on Ysearch, with 4 different samples with 67 markers.  None on Ysearch are identified as “Poland”.  Because this type differs very much in STR values from the rest of R1a1, C type might be a very old clade, with a node in the R1a1 tree much older than the other types.  Mayka pointed out to me that DYS392=11 is almost universal in R1a1a, and the 13 value for C type might be the ancestral value for R1a.  One of the C type Polish Project members has SNP data through 23andMe, and Mayka is working to verify if this data defines a new paragroup level within R1a.

 

           D.  Concentrated in Poland.  This type was added to the Polish Project in Jan 2010.  The cluster was brought to my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some time ago, based on the very rare DYS462=12 value.  DYS462 is not one of the FTDNA standard markers;  it is a standard at Sorenson;  DYS462 is available in data on Ysearch.  I did an analysis using the 67 FTDNA markers;  the SBP came out 18.4%, better than 80% probability just on that basis for the best fit samples.  However, 462 would significantly reduce SBP, so probability of validity of a clade corresponding to D is quite high considering 462.  On 9 Jan, only 5 of the samples that fit D type in the Polish Project had been tested for 462 and all 5 had that rare 12 value.  Since January more men from D type, plus men with STR values just beyond D type, have been testing for DYS462.  At this update, 15 March, there are 8 men confirmed DYS462=12 and 4 men with STR values close to D type that have DYS462=11 (standard value for R1a).  The original modal haplotype used 61 markers;  last week I spent a little more time on analysis and found a 66 marker modal haplotype, with SBP = 18.1%, cutoff 14;  I updated the Ysearch D type.  All 12 of the DYS462=12 men come out in the D type cluster using this D66 modal.  One man at step 6 has not been measured for DYS462, but step 6 is well within D type, so I classified him D type.  Two more men without DYS462 evaluated come out at distance 9 and 12, so these are classified D Borderline pending their DYS462 measurement, which is coming.

           Two of the DYS462=12 men in D type come out at step 13, the last step (cutoff 14).  There are also two men with DYS462=11 at step 13, the only men with 11 in the type.  The other men with 11 are at steps 16 and 17.  I did not use DYS462 in the D type definition;  if I had used it those 2 men with the 11 value at step 13 would come out step 14, so on that basis I removed those 2 men from D type, for 11 total D type at this time.  If all this discussion is difficult to follow, please consult column CL of the sheet “Calculator” in the file “DType.xls”, at my update folder, where you can see the data arranged in order of step.

           D type also has the unusual DYS481=21 value;  only 10 samples in the Polish Project R1a have this value, and all 10 come out D type.  One man has the very unusual 20 value (one of those 2 with DYS462 coming soon).  23 is the standard value for DYS481, and all the men with step just beyond D type have 23.  In other words, the STR pair (481,462) = (21,12) seems to be a very high confidence indicator of D type within R1a1a.  I confidently expect that someday an SNP will be discovered corresponding to this STR pair of values, elevating it to a haplogroup.  Unfortunately, Sorenson does not use the 481 marker, so there are only 3 R1a1 samples on Ysearch with the D type signature pair (481,462) = (21,12);  all 3 are Polish Project members now assigned to D type.  (There are 2 others on Ysearch with this very rare signature pair in other haplogroups - coincidence - a reminder that the STR pair (481,462) should not be used without first establishing membership in R1a1a.)

           D type is clearly a Polish type:  In the Polish Project 8 of those 11 D type indicate “Poland” ancestry;  the exceptions are 1 blank, 1 obvious Polish family name with no country, and 1 Slovakia.  On Ysearch, 5 of the 7 best fits (with D step <6 on 7 Jan using the original 61 marker modal) indicate “Poland”, while at steps 6&7 (cutoff 7 for that modal) only 1 of 9 indicates “Poland”.  That is a hint of a non-Polish clade close to the edge of D type, which might be the reason the SBP for D type on Ysearch is 22%, not as good as that 18% in the Polish Project.  Or maybe this is a hint of a larger parent clade that is not Polish.  Four of the samples just beyond D type are I type, which is a hint these may be brother types.  D type is very young, about 1,000 years TMRCA (very rough estimate result), and seems to be composed of subtypes Da and Db (not yet statistically significant).  D looks like a subtype of K, but see my K Borderline discussion in this regard.  For more details, see the “Documentation” sheet in my analysis file “DType.xls”, at my update folder.  That Documentation explains my very rough age estimate of 1,000 years.
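
           The signature pair rule for D type described above can be written as a small check.  This is a sketch under stated assumptions:  the function name and arguments are hypothetical, membership in R1a1a is taken as an input rather than determined here, and a marker that has not been tested is passed as None.

```python
# Hypothetical helper for the D type signature pair (DYS481, DYS462) = (21, 12).
D_SIGNATURE = {"DYS481": 21, "DYS462": 12}


def has_d_signature(is_r1a1a, dys481, dys462):
    """Return True/False, or None when a signature marker has not been tested.

    The pair is only meaningful inside R1a1a, so other haplogroups get False
    even if both values happen to match.
    """
    if not is_r1a1a:
        return False
    if dys481 is None or dys462 is None:
        return None   # cannot decide until the marker is measured
    return dys481 == D_SIGNATURE["DYS481"] and dys462 == D_SIGNATURE["DYS462"]


print(has_d_signature(True, 21, 12))     # both signature values, inside R1a1a
print(has_d_signature(True, 23, 11))     # standard R1a values
print(has_d_signature(False, 21, 12))    # same pair in another haplogroup
print(has_d_signature(True, 21, None))   # DYS462 not yet tested
```

           Checking haplogroup membership first mirrors the caution above about the two coincidental matches on Ysearch in other haplogroups.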

 

           E.  V. Rudich entered a modal for this cluster into Ysearch as ID mW7DP, named “North Eurasian”.  Mayka modified it slightly for the modal used here by me, GNYBG, named “Belarus”.  It’s an excellent type;  on 25 May it has 16 samples at 67 markers in the Polish Project, with SBP = 14%.

 

           F.  This type was recently suggested by Mayka.  Only 6 members of the Polish Project belong to F type.  This type has DYS452= 28 vs modal 30, but 452 is not one of the FTDNA standard markers, so not many Polish Project members have this marker evaluated.  SBP = 34% without using 452, but when the data is adjusted for consideration of 452 SBP = 27%, which is quite good for such a small type. See H type below, which also has 452=28.  It’s not clear if these two types are closely related, or if the 452 mutation independently happened twice.  The F definition uses 46 markers.

 

           G.  Pomeranian.  This type was suggested to me by Mayka, who calls it the Pomeranian cluster.  Pomerania is the name of the region on the south shore of the Baltic Sea including regions of both Germany and Poland.  Marcin Wozniak found the G modal haplotype (at 12 markers) to be very common among Kashubians.  Kashubians consider themselves an ethnic group or nationality within Poland.  It will be interesting to determine if Kashubians in Poland have a higher % concentration of G type than German Pomeranians.  Meanwhile, “Pomeranian” is a convenient neutral name, suggests Mayka.

           G type is mentioned only briefly in my publication because not much data was available to me at that time.  My GType.xls update analysis file with June 2010 data has excellent results:  There are 12 samples in a nice type with SBP = 11.2%.  There is preliminary evidence of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in Ysearch;  see Haplotypes.xls for a list including hypothetical working modals.

 

           H.  This type was recently suggested by Mayka.  Only 3 members of the Polish Project belong to H type.  This type has DYS452= 28 vs modal 30, but 452 is not one of the FTDNA standard markers, so not many Polish Project members have this marker evaluated.  SBP = 20% without using 452, which is quite good for such a small type.  Using 452 does not improve SBP because of neighboring F type samples which also have 452=28. See F type above.  The H definition uses 60 markers.  I consider this type too rare to justify a Ysearch entry.

 

           I.  Concentrated in Poland.  This type is discussed in my publication, Part II.  About 85% confidence of validity.  About 80% net confidence that both A and I are subtypes of K.

 

           J. This type was recently suggested by Mayka.  Only 6 members in the Polish Project, but this type is well isolated at SBP= 13%.

 

           K.  This seems to be a main R1a1a type.  K type is discussed at length in my publication, Part II.  It is larger than others in the Slavic lands.  P and N (below) are just as close in STR values to K as they are to each other, probably because the K modal haplotype is the same as the R1a1 modal haplotype (using the best 34 markers for K).  So far I have discerned a few subtypes of K in my List of R1a types, but I do not have high confidence that they are all exact subtypes of K, as explained in my K Borderline discussion.  I suppose that as data accumulates more subtypes will become clear within K and K Borderline.

           In the Results I use K* to signify those samples that match type K but do not match one of the subtypes.  Although I have high overall confidence in the validity of K type, individual assignments to K* are not as confident.  Because K is located at the modal heart of R1a, I expect some outlier samples from distantly related clades to match K* fairly closely just due to the statistics of random STR mutations.  Because of the possibility of foreign outliers, I consider samples at K step 3 to be K Borderline, even though the cutoff for the K definition is 4.  Even K* samples with step <3 have confidence of only 80 to 90%.  That’s in Poland, where K is fairly well defined with SBP = 26%.  Worldwide K* cannot be discerned with confidence.  The Ysearch SBP for K is 71%, not significant.  That means there are K borderline clades close to the K cutoff that are rare in Poland but causing interference on Ysearch.  This is evident by a glance at the K type results on Ysearch, where “Poland” origin is concentrated at steps <3, and “Poland” becomes progressively less common at higher steps.  A type is a very high confidence subtype of K, so these caveats about K* do not apply to the very high confidence of individual assignments to A type, and similarly to the other subtypes.

           The Kurgans are the ones who domesticated the horse more than 6,000 years ago.  Many scientists think that one pre-Kurgan man is the male line ancestor of all R1a1 men who live today.  The Kurgan hypothesis is controversial, and not necessary for this web page.  You may have noticed that I used the letters of “Kurgan” for my original types and categories during 2008.

 

           L.  This cluster is highly hypothetical.  It is rare in Poland, but second in size to K in European R1a1.  Larry Mayka suggested this cluster to me.  It is a well known Scandinavian cluster.  I checked it briefly, and it seems to be a “type” by my definition.  However, no Polish Project sample matches at 80% probability yet, so I am not yet using it for classification here.  More documentation about L will be available here when I find time to study it.

 

           M.  Needs documentation.

 

           N.  Concentrated in Slavic countries.  This type is discussed in my publication, Part II.  This is a type that according to Yhrd seems to be spread all around the Slavic lands and central Europe, from East Germany to Russia.  N has more mutations than P, so that means it is older.  Within Poland N seems to be slightly smaller than P, but overall N is larger than P.  Previous versions of this page had Na and Nb as speculative subtypes, but I removed those because it seems N type should be properly studied in a database that is not restricted to Poland.  However, I’ll continue to watch the Polish Project, because it will be interesting if more data provides a Polish subtype within N.

           There are web comments about a new R1a1 SNP, to be announced shortly.  My guess is that this new SNP might correspond to the cluster of data associated with what I call N type.

 

           P.  Concentrated in Poland.  This type is discussed at length, in my publication, Part II.  It seems that about 8% of Polish male line ancestry men belong to this type.  According to Pawlowski, this cluster is concentrated in Poland.  I verify Polish types using both Yhrd and Ysearch.  P has fewer mutations than N and K, so it must be younger.  My TMRCA age assessment is 1600 years old, but in light of age caveats P type might be 1 to 3 thousand years old.  Regardless of age, P type seems to have had a population expansion less than 1 thousand years ago.  My publication provides details on the size and age calculations along with evidence regarding the validity of P type.  In my R1a web document, I used P type as an example for a discussion of the caveats associated with TMRCA calculations, and also as an example to explain the possibility of hidden clades, and also as an example for population bias in databases such as Ysearch, so you can find lots more discussion about P type by clicking on those links.  I identified P type and submitted my analysis for publication before the M458 mutation was announced by Underhill.

 

           Pc & Pg.  These subclusters have about 70% confidence, so no assignments yet.  Previous versions of this web page used Pa & Pb & Pe.  The new versions, Pc & Pg, are different, so they got a different subscript letter, although I have modified the same Ysearch IDs.  I have a Pd and other subtypes that are too speculative to mention at this time.

 

           R.  Remainder.  Updated 2 Jul 2010.  This is not a haplogroup or a type.  This is a category for samples that are distant in STR values from all the R1a1a types I have defined so far.  If you are in this category, I highly recommend that you get all 67 markers plus the M458 test.  More markers will help me define a new type for you.  Your M458 test is unlikely to come out positive, but if it does that means you would be the first member of a new type within M458.

           I also recommend that you test for all of the several SNPs that FTDNA considers equivalent to R1a1 (called R1a1a by others).  Your unusual STR values make you a candidate for an unusual small clade that has a very old node in the R1a tree.  Each SNP is unlikely to come out negative.  In fact, all such tests most likely will come out positive.  But if one comes out negative that’s excellent, because you will join a very rare group, perhaps even define a new haplogroup.  If you cannot afford all these tests, OK, just hope for people with STR values close to yours to do the tests and watch this web page for your sample to move into a new category.

           R is equivalent to a paragroup.  Just as R1a1a* means only R1a1a samples that are negative for all known SNP subdivisions, my R category extends that to mean only samples that do not match any of my known types.  At 67 markers, R also means that the sample does not qualify for one of my borderline categories.  I have a policy not to use the U category for samples with all 67 markers, so in some cases I need to make a close call on a sample that is on the edge of a borderline category - some R samples are right at my cutoff at 67 markers.

           For a sample with 37 or fewer markers, I require 80% probability that the sample would not match one of my types if all 67 markers were obtained.  There used to be quite a few R at 37 markers back when I had only a few types, but there are none right now (July 2010) because there are none that have STR markers so unusual that they are far from all types.

           The 80% rule does not apply to R.  If a sample has 30% probability of belonging to its best fit type it would be assigned to R.  That means it only has 70% probability of being a true R.  R samples still have their FTDNA assignment which is either 100% (green) or 99% (red).

           When I started this hobby a couple years ago, R was the 2nd biggest category after U.  I now have enough types that R is small.

           In June 2010 I subdivided R into two categories.  R (M458-) is those tested negative for M458.  R (needs M458) is those not tested for that SNP.  If an R sample would test positive it would be moved to the NR category.

 

           U.  Unassigned.  Updated 3 Jul 2010.  This is not a cluster, but a holding place for samples with less than 80% probability for assignment.  I use U in the Polish Project for R1a uncertain samples with fewer than 67 markers.  Samples with all 67 standard markers are not assigned to U, but instead are assigned to the R (remainder) category, or into “Borderline” categories such as N Borderline or K Borderline.  U is 0% in the Results Table, which is samples with 67 markers, but considering all samples U is the largest category in the Polish Project, with 200 members on 25 May 2010 - 15% of the project, 35% of R1a.  If you are classified U you can be promoted to another category by obtaining results for the remainder of the 67 markers.

           The 80% rule does not apply to U.  If a sample has 70% probability of belonging to its best fit type it would be assigned to U.  That means it only has 30% probability of being a true U.  Many U have >30% probability of belonging to two or more different types.  U samples still have their FTDNA assignment which is either 100% (green) or 99% (red).

           Probabilities include estimates, so they are not exact.  I tend to be strict for samples with fewer than 67 markers, using U for marginal situations.  At 67 markers I do not use U - I use R, and I’m not strict at 67.  Also, I concentrate my time on improving the assignment rules at 67 markers and have not yet found time for 37 marker rules for some of the newer small types.

 

R1b Haplogroup Types

 

           On 20 July I added the following three R1b Types to this web document (next three subtopics, L23EE, L47P, L47A).

           Mayka had already added these three to the Polish Project web page during the previous week, based on my recommendation, based on my SBP analysis.

           I independently found these three by analyzing the Polish Project R1b data, but Mayka pointed out they were previously known as clusters.  We judge that my analysis justifies adding them to our list of types.  Since I’m using 639 samples with 67 marker data as representative of Poland, a small type clade at 1% of the Polish population would be expected to have roughly 6 samples in the database (70% confidence interval 4 to 10).  These three small types are roughly 1% each.
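
           The expected sample count quoted above is simple arithmetic (639 samples times 1%).  The following is a minimal sketch, assuming the number of samples from a small clade follows a Poisson distribution;  that model is an assumption made here for illustration, not necessarily the calculation behind the interval quoted above, so the printed probability need not equal 70%.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k samples when `mean` are expected (Poisson model)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def prob_count_between(low, high, mean):
    """Probability that the sample count falls in the inclusive range low..high."""
    return sum(poisson_pmf(k, mean) for k in range(low, high + 1))

database_size = 639     # 67-marker samples, from the text
clade_fraction = 0.01   # a clade at 1% of the population, from the text

expected = database_size * clade_fraction
print(f"expected samples: {expected:.2f}")
print(f"P(4 <= count <= 10): {prob_count_between(4, 10, expected):.2f}")
```

           The same two functions can be reused for a different database size or clade fraction by changing the two input values.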

           I’m following the current ISOGG codes for these types, which may be confusing compared to the current FTDNA codes.

           The STR definitions for these are available at Haplotypes.xls.  PolishCladesUpdate has a link to an Excel analysis file for each of these three types.

           Instructions for Ysearch comparison are below.  Here is the “UserIDs” bar for R1b comparison:

USEID, CX94E, MKM4R, 7HB9C

           Change USEID to your User ID.

           Reminder:  These three types are calibrated to Polish data.  The definition modal haplotypes may not be optimal for other regions.  If you have Polish ancestors, and if you have all 67 markers, and if you match one of these within a step distance of 10 there is more than 80% probability that you belong to the corresponding clade.  Up to step 15 there is lower probability that you belong.  You should test the appropriate SNPs (explained below) for higher confidence.  If your ancestors are not from Eastern Europe and you are a marginal match (step distance 5 to 15) for one of these, it is not very probable that you belong to the corresponding Polish clade, because each of these types has some overlap with other clades that are rare in Poland.

 

           L23EE.  20 Jul 2010 documentation:  This type is positive for the L23 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a.  This type is negative for L51, the only current known branch - R1b1b2a1 - of L23.

           Nordtvedt pointed out the cluster for this type some years ago, calling it R1b-EE (Eastern Europe).  Mayka suggested the L23EE code to me.

           There are only 6 samples in the Polish Project in this type (13 Jul 2010).  SBP = 10.7% using all 67 markers, which is excellent for such a small type.  The cutoff is 12, but if you match at step 10 through 12 I estimate your probability of belonging at slightly better than 80%, so you really should test for the L51 SNP - a negative result would boost the probability to about 95%.  In the Polish Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of the samples from step 17 to 20 are L51+.  So this type is very well isolated in haplospace in Poland.

           On Ysearch (code CX94E) there are also 6 samples in this type (13 Jul 2010), but 3 are the same as in the Polish Project.  There are 7 samples at step 12 (vs zero in the Polish Project) and only 2 of those 7 are East European - one each in Germany and Russia.  That means this type is not well isolated worldwide, meaning samples near the cutoff are highly uncertain.  I interpret this as evidence that my definition of L23EE type is really a Polish subtype within a larger L23EE cluster.

           This type has evidence of structure.  A number of markers are bimodal with no obvious correlation.  To me, that means there are probably at least 3 sub-clades that may become evident as data accumulates.

           If you match this type closely at 37 markers I highly recommend getting the full 67, because the statistics for assignment are not convincing at 37 markers.  Even at 67 markers, I recommend the L51 test;  a negative result confirms membership in this hypothetical clade, and a positive result means you are not a member.  We do not know the probability of outsiders matching L23EE in STR values, particularly outside Poland, so there is still a slim chance of a surprise - a close match to the definition but with L51+.

 

           L47P.  20 Jul 2010 documentation:  This type is positive for the L47 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1.  This type is probably negative for L44, the only current known branch - R1b1b2a1a1d1a - of L47, but that L44 negative indication is based on only one sample so far, so it is not certain.

           Mayka announced the cluster corresponding to this type on the web in March 2009.

           There are only 4 samples in the Polish Project in this type (13 Jul 2010).  SBP = 9.3% using 64 markers, which is excellent for such a small type.  The cutoff is 7 and the gap is 10.  There are no samples from step 7 to 16.  Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace in Polish data.

           This type is very robust;  the same 4 samples are selected using any number of markers from 10 to 67 with SBP <25%.

           Actually, this type is even better than the SBP = 9.3% indicates, because some of the samples at step 17 and beyond have tested negative for the SNPs in the R1b trunk leading to L47 so they clearly do not belong to this L47P hypothetical clade.

           Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but 3 of them are the same as the Polish Project.  Ysearch has 8 samples at steps 8 to 12, so the type is not as well isolated worldwide.

           The “P” in the code L47P represents my hypothesis that this type is Polish.

           Members of this type should test for L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b, which is equivalent to P312, an “uncle” haplogroup, that is L47 negative.  That means there may be some interference in STR matching, probably less than 10% in Polish data, but I do not know what the exact percent interference will be until more data accumulates.

           See the last paragraph of L47A, next topic, for more comments.

 

           L47A.  20 Jul 2010 documentation:  This type is positive for the L47 SNP, hence this type is another hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1.  I do not know yet if this type is negative for L44, a known branch of L47.

           Mayka suggested the “A” code, since this type is obviously Ashkenazi, based on family names (see also Ysearch results, a few paragraphs down).  I presume this one is known to the administrators of Jewish DNA projects, although I did not do the research to find a first web publication at 67 markers;  I would appreciate an email of a reference to add here, even if it does not exactly match my definition.  It’s OK if an international modal haplotype differs by a few markers from a haplotype determined in Poland, particularly if the difference is at markers that are bimodal, indicating subtype structure.

           There are only 5 samples in the Polish Project in this type (13 Jul 2010).  SBP = 7.6% using all 67 markers, which is excellent for such a small type.  The cutoff is 10 and the gap is 9.  There are no samples from steps 10 to 18.  Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace.

           This type is very robust;  the same 4 samples are selected using any number of markers from 30 to 67 with SBP <10%.

           This type is better yet on Ysearch (code 7HB9C), with 18 samples (13 Jul 2010) for better statistics;  SBP = 4.6%, which is remarkable.  It might be even better with an optimized definition;  I used the modal haplotype that I extracted from the 4 Polish Project samples.

           This one does not seem as Polish as L47P, although those 18 Ysearch samples are concentrated in "Greater Poland" including Lithuania.

           So far (see ISOGG), L47 and L148 are the only two known branch haplogroups of L48.  In the Polish Project so far (20 July), no one has tested yet for L148, and all L48 so far at 67 markers are either L47P (previous topic) or L47A.  SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that do not match either L47P or L47A and so belong to yet other clades.  I also have not searched the web for the STR values expected for L148.  (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.)  All this will quickly become visible when FTDNA updates their haplotree.  As of 20 Jul 2010, L48 is a terminal branch at FTDNA, so only administrators have visibility of SNP test results beyond L48, including L47 and L148.  Mayka provided the SNP data that I have documented here.

 

I Haplogroup Types

 

           At the end of July I added the following two I Types to this web document (next two subtopics, M223CE and M253P).

           Mayka added these two to the Polish Project web page, based on my recommendation, based on my SBP analysis.

           I independently found these two by analyzing the Polish Project I data.  They were previously known as clusters, hypothetical clades, discussed some time ago by Nordtvedt.  Since I’m using 639 samples with 67 marker data as representative of Poland, a small clade at 1% of the Polish population would be expected to have roughly 6 samples in the database (70% confidence interval 4 to 10).  These two small types have only 4 and 5 statistically independent samples, so each is probably slightly less than 1%.

           The STR definitions for these are available at Haplotypes.xls.  PolishCladesUpdate has a link to an Excel analysis file for each of these two types.

           Instructions for Ysearch comparison are below.  Here is the “UserIDs” bar for I comparison:

USEID, WC8JD, SB6YK

           Change USEID to your User ID.

           Reminder:  These types are calibrated to Polish data.  My definition modal haplotypes may not be optimal for other regions.  If you have Polish ancestors, and if you have all 67 markers, and if you match one of these (step distance less than 4 for WC8JD - M253P, or less than 16 for SB6YK - M223CE) I figure there is more than 80% probability that you belong to the corresponding clade.  At higher step there is lower probability that you belong.  You should test the appropriate SNPs (explained below) for higher probability.  If your ancestors are not from the region around Poland you should follow the links to Nordtvedt’s worldwide definitions to compare your data.

           Comment about validity:  Neither of these two types forms a statistically compelling type on Ysearch, although the types seem convincing within Poland.  There seem to be clades elsewhere, particularly in western Europe, with similar STR values, so those others interfere with M253P and M223CE on Ysearch.  I’m using 80% probability in my predictions, and in the case of these two, I estimate the probability at roughly 10% that either type is not valid as a unique clade.  In addition, even if both types are valid, men of Polish male ancestry who match at high step close to the cutoff have up to 10% probability of being descended from a distantly related clade due to the statistics of STR mutations (for example a descendant of a recent immigrant from one of those western clades on Ysearch).

           General comments about haplogroup I:  I1 and I2a are each large in Poland.  I2b1 is smaller but significant.  I can easily form types for all three haplogroups, and the STR values do not overlap.  Various definitions work well for those three haplogroups.  That means they are much younger than the ancestral I haplogroup, which experienced extreme population bottleneck leading to these three haplogroups that are very well isolated as mountains in haplospace.  It appears that only small isolated populations survived the bottleneck, with an MRCA for each population very distantly related to the MRCAs for the other populations (I1 vs I2a vs I2b1 vs a few other small haplogroups in I).  There is evidence of additional structure with I1 and I2a - common STR haplotypes (see Nordtvedt).  However, with quite a bit of effort I could not form more types with low SBP.  That means I1 and I2a grew rapidly in population, so that each is relatively homogeneous in STR values, at least as represented in the Polish Project, which is the source of my data for my analysis.  More SNP data is required to further subdivide these two large haplogroups with high confidence in the Polish Project.

           My Haplotypes.xls file has other haplotypes defining clusters in the I haplogroup that I am watching as data accumulates, but only the following two are significant enough to warrant discussion here:

 

           M223CE.  On 31 July I added this Central European type for I2b1 haplogroup to this web page.  This type has been known as a cluster for a few years.  Mayka points out that Nordtvedt listed it on the web.

           It has been known for some time now that this cluster is positive for the M223 SNP, which corresponds to I2b1.  One of the Polish Project samples in this type is positive for P95, which is I2b1d.  I hesitate to predict P95 for this type, because individual samples are roughly 80% probability.  It does not matter if M223CE turns out to be P95+ or P95-.  Either way, P95 will serve as a test for higher probability predictions.  So if you match M223CE you should test for both the M223 and P95 SNPs to determine your category with high confidence.  I’ll update this topic when more P95 data is available.

           SBP comes out 7.6% for the 5 samples in M223CE in the Polish Project.  However, two of these are a family pair where one encouraged the other to join, so only 4 of these 5 samples should be used for statistical purposes.  SBP calculated on the basis of 4 samples is 10.3%, which is still a high confidence result.

           A good signature is (392, 437, 450, 446) = (12, 14, 9, 11), which distinguishes these samples from all others in the Polish Project, although exceptions are bound to show up in the future.

           This M223CE type does not seem to be a unique Polish clade because all 4 samples have step values of 16 or more from each other, even though there is a gap from step 16 to 24 (relative to the 66 marker modal definition), with no samples in the Polish Project, separating these samples from all others.

           This type is equivalent to Nordtvedt’s I2b1-Continental.  A definition is available on Ysearch, code 4H6C9, using 62 of the 67 standard markers plus 8 additional markers (on 31 Jul 2010);  this definition differs from mine by only 7 mutation steps.

 

           M253P.  On 26 July I added this Polish type for I1 haplogroup to this web page.  This type has been known as a cluster for a few years.  Mayka points out that Nordtvedt listed it on the web.  Marek Skarbek Kozietulski has studied this cluster quite a bit, since he’s a member.  I mentioned this type briefly in my publication, where I was previously calling it Y type, considering it not high confidence based on the data available last year.

           It has been known for some time now that this cluster is positive for I1 (M253) and negative for the known branches I1a through I1e.  So this is a type within the paragroup I1*, although not all samples from I1* are members of this M253P type.

           SBP comes out 6.4% for the 9 samples in M253P in the Polish Project.  However, Marek informs me that he had identified 4 men who matched at 12 markers and actively recruited them to obtain all 67 markers and to join the Polish Project.  That means only 5 of these 9 samples should be used for statistical purposes.  SBP calculated on the basis of 5 samples is 13.6%, which is still a high confidence result.

           I used all 9 samples in my analysis file at PolishCladesUpdate in order to best estimate the modal haplotype and definition, which are available at Haplotypes.xls.

           The best signature marker for M253P is 392=12;  that marker alone distinguishes the M253P samples from all other I1 samples in the Polish Project, although quite a few I2 samples have this value.  Exceptions are bound to turn up.  That marker alone does not distinguish worldwide M253P samples on Ysearch.

           The best trio signature, (391, 392, 447) = (11, 12, 24), captures 8 of the 9 M253P at step 0.  At step 1 (mismatch of 1), that 9th sample matches, but 1 other sample from I1 and 6 from I2 also match.  This demonstrates how short signatures can be useful but not foolproof.  On Ysearch that trio does not work well.
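
           Checking a sample against a short signature can be sketched as follows.  This is a minimal illustration, assuming step is counted as the sum of absolute differences at the signature markers;  the sample values are hypothetical, invented only to show a match at step 0, at step 1, and beyond.

```python
# Trio signature for M253P from the text: (391, 392, 447) = (11, 12, 24).
SIGNATURE = {"391": 11, "392": 12, "447": 24}

def signature_step(sample, signature=SIGNATURE):
    """Sum of absolute differences between a sample and the signature values.

    Assumption: a plain stepwise count; Ysearch's own distance rules may differ.
    """
    return sum(abs(sample[marker] - value) for marker, value in signature.items())

# Hypothetical samples, invented for illustration only.
samples = {
    "sample_a": {"391": 11, "392": 12, "447": 24},  # exact match
    "sample_b": {"391": 10, "392": 12, "447": 24},  # one step away
    "sample_c": {"391": 10, "392": 11, "447": 23},  # three steps away
}

for name, values in samples.items():
    print(name, "step", signature_step(values))
```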

           Nordtvedt uses a broader definition for his cluster equivalent to M253P, which I understand he calls I1-ASP, for Anglo Saxon Polish Pomeranian Prussian.  My 54 marker definition at cutoff 4 is narrower, but my definition seems to capture most of the I1-ASP samples at higher step values.  The corresponding SBP at higher cutoff is not low enough to qualify as a type.  The future may provide an SNP defining a parent clade broader than M253P along these lines, although I doubt the STRs will score a low SBP value.

           Mayka informs me that further M253 work includes testing members for the many SNPs considered equivalent to M253.  If someone shows up negative for one of those, that will define a new haplogroup in I1.

           Here is some interesting speculation for which I do not have convincing statistical evidence:  Marek points out that the sample at step 4 on Ysearch is Danish, which adds to his evidence that there might be a related clade in Denmark, perhaps with a node in the I1 tree older than the node for the Polish clade but younger than the node for the parent I1-ASP clade.

           Meanwhile, if you match M253P, it is a good idea to confirm with SNP testing:  you should be positive for M253 and negative for the 4 known branch SNPs listed at ISOGG.

 

Instructions for Use of Ysearch

 

           Link to the site:  http://www.ysearch.org.  Brief description of Ysearch.

           Click on the Create A New User tab, where you can upload your Y-DNA STR data from a number of testing services.  Or, you can type in your data.  You end up with a “User ID”.

           Ysearch has a Research Tools tab to click, where you can type in other User ID’s for comparison.

Cluster Genetic Distance Method; for:  P - Pc - Pg - N - K - A - I - B - D - E - H - M - G:

           Click here:  Research Tools

           Copy the following line into the “UserIDs” bar at the Research Tools page:

USEID, 8U92G, RQK32, 92HEK, 3SEJK, MN8R3, FCUFG, EKVHX, RU8Z8, K49NZ, GNYBG, 559EE, 24MB4, ZD29Z

           Change USEID to your User ID.

           You need to type the Captcha puzzle for access.

           Click on “Show genetic distance report”.  You get a table of results.

           Result:  If there is a small genetic distance result (3 or less) for one of these types, you have a high probability of belonging to that type.  More detailed rules are available:  see the assignment rules above, which are followed by several paragraphs of discussion.

           Reminder:  this web page is for men with R1a1a type Y-DNA.  If you are not R1a1a, these instructions will not produce a matching result, except very rarely, in which case the result would be meaningless.

           The emphasis is on men of Polish male line ancestry.  Just about all R1a Polish line men are R1a1a.  Anyone from the haplogroup R1a1a from other countries may get good results, but that may be misleading if there are other types, rare in Poland, not noticed by me, but with haplotypes that overlap one of these types.  Many men of Polish male line ancestry do not match any of these types.  For non-Polish there is a higher probability of not matching any of these types.
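
           The “UserIDs” line from the instructions above can also be prepared with a few lines of code.  This is only a convenience sketch;  the function name and the example ID ABCDE are hypothetical, and the result still has to be pasted into the Research Tools page by hand.

```python
# The R1a comparison line from the instructions above, copied verbatim.
R1A_TEMPLATE = ("USEID, 8U92G, RQK32, 92HEK, 3SEJK, MN8R3, FCUFG, EKVHX, "
                "RU8Z8, K49NZ, GNYBG, 559EE, 24MB4, ZD29Z")

def user_ids_bar(user_id, template=R1A_TEMPLATE):
    """Replace the USEID placeholder with a real Ysearch User ID."""
    return template.replace("USEID", user_id.strip())

# "ABCDE" is a made-up User ID, used only as an example.
print(user_ids_bar("ABCDE"))
```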

 

37 Marker Network

 

           Lawrence Mayka (independently, March 2007) constructed a “median joining network” Network for the 37 marker samples of the Polish Project.  This network supports the definitions of the P & N clusters, and of the A subcluster.  The P cluster is the left side of Mayka’s network;  N is the top branch, and A is a small branch on the lower right.

 

Russian Sites

 

           29 March 2010 correspondence:  I mentioned Russian sites for R1a clusters in my publication.  It’s not easy for me to figure out which of those clusters correspond to my types.  Mayka worked out a correspondence on 29 March, warning me that the correspondence is not exact.  Some of the Russian clusters are broader than my types;  some are narrower.  Here are Mayka’s findings:

My Type code vs Russian cluster name:

A         Ashkenazi Jewish

B         Western Eurasian

C         Old European

D         Baltic - Carpathian

E         Northern Eurasian

F         Central Eurasian

G         Central Eurasian

H         Western Carpathian

I          Northern Carpathian

N         Central European

P         Western Slavic

 

My Mountain Method

 

           Here is a summary of terms (in boldface) that I defined for my “Mountains in Haplospace” method.  For more explanation, see the fall issue of JoGG.  By haplospace I mean multidimensional sets of STR values;  each haplotype is a point in haplospace.

           A cluster qualifies as a type if the graph of step frequency (number of samples at that step) vs step looks like an isolated mountain.  The step is the genetic distance (mutation count) from the modal haplotype of the cluster.  I use the method of Ysearch to calculate step.  The cutoff is the next step just beyond the mountain.  A good type has low step frequency in a “gap” of step values including the cutoff (only the cutoff for a gap of 1).  In other words, the cluster forms a mountain at step values less than the cutoff, separated by a gap from the rest of the database (the parent haplogroup usually) at higher step numbers.
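
           The step frequency graph, cutoff and gap described above can be sketched in a few lines.  This is a minimal illustration, not the Excel tools:  it assumes step is the sum of absolute STR differences from the modal haplotype (the Ysearch method may treat some markers differently), it treats a gap as completely empty steps, and the marker names and sample values are hypothetical.

```python
from collections import Counter

def step(sample, modal):
    """Genetic distance as the sum of absolute STR differences from the modal."""
    return sum(abs(sample[m] - modal[m]) for m in modal)

def step_frequency(samples, modal):
    """Number of samples at each step from the modal haplotype."""
    return Counter(step(s, modal) for s in samples)

def gap_width(freq, cutoff):
    """Count consecutive empty steps starting at the cutoff.

    Counting stops at the first occupied step, or at the largest observed step.
    """
    largest = max(freq, default=-1)
    width = 0
    while cutoff + width <= largest and freq[cutoff + width] == 0:
        width += 1
    return width

# Hypothetical modal haplotype and samples (marker names m1..m4 are made up).
modal = {"m1": 13, "m2": 24, "m3": 11, "m4": 14}
samples = [
    {"m1": 13, "m2": 24, "m3": 11, "m4": 14},  # step 0
    {"m1": 14, "m2": 24, "m3": 11, "m4": 14},  # step 1
    {"m1": 13, "m2": 23, "m3": 11, "m4": 14},  # step 1
    {"m1": 12, "m2": 24, "m3": 12, "m4": 14},  # step 2
    {"m1": 16, "m2": 26, "m3": 11, "m4": 15},  # step 6, outside the mountain
    {"m1": 10, "m2": 22, "m3": 13, "m4": 14},  # step 7, outside the mountain
]

cutoff = 3  # the next step just beyond the mountain in this toy data
freq = step_frequency(samples, modal)
members = sum(n for s, n in freq.items() if s < cutoff)
print("step frequency:", sorted(freq.items()))
print("members below cutoff:", members, " gap width:", gap_width(freq, cutoff))
```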

           The Statistical Background Percent (SBP) is an objective measure of the quality of the type.  Low SBP is taken as evidence that a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP (yet to be discovered).  Larger types with lower gaps have lower SBP.  SBP is intended as an estimate of the background percent of samples in a type that really do not belong to the corresponding hypothetical clade.  SBP is increased to account for the estimated probability of outliers from other clades.  An outlier is a sample that has very unusual STR values due to the luck of mutations.  SBP is also increased to account for the estimated probability of small foreign clades that just happen to have the same STR values but are not closely related to the type.  The SBP is also increased to provide the rough equivalent of the maximum in a confidence interval.  Small sample counts have wide confidence intervals.  So larger types (more samples) automatically get lower SBP.  For a valid clade, SBP should decrease with time as data accumulates in a database.  A very well isolated clade will have a low SBP even with only a few samples.  SBP < 5% is very rare - a very well isolated type, very likely to be a clade.  SBP < 25% is good enough to be published.  SBP < 50% is a type worth watching as data accumulates with time.  The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look like mountains.  The number of markers in the definition should be chosen to provide as small an SBP as possible;  my Excel tools provide automatic rank of markers as an aid;  human judgment can be used to include or exclude markers with obvious problems.  A signature is a small set of markers that rank best, convenient for publication of a type, and for simple demonstration of the correlation of STR values.
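
           The SBP thresholds above can be written as a small lookup.  This sketch only labels an SBP value that has already been calculated;  it does not reproduce the SBP equation, which is in the Excel worksheet.  The text gives no label for values from 50% to 100%, so the function says so rather than guess.

```python
def sbp_grade(sbp_percent):
    """Label an SBP value using the thresholds stated in the text."""
    if sbp_percent < 5:
        return "very well isolated type, very likely a clade"
    if sbp_percent < 25:
        return "good enough to be published"
    if sbp_percent < 50:
        return "worth watching as data accumulates"
    if sbp_percent > 100:
        return "cluster does not look like a mountain"
    return "no label given in the text"

# SBP values quoted elsewhere in this document.
for value in (4.6, 9.3, 13.6):
    print(f"SBP = {value}%: {sbp_grade(value)}")
```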

           I use the word “type” to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes that differ from the modal haplotype by step less than the cutoff.  The definition of a type is the modal haplotype plus cutoff.  The definition uses only those STR markers that provide the lowest SBP, but the definition uses as many STR markers as possible.  The definition of a valid type may change slightly as data accumulates.

           Here are some common terms (in boldface) for genetic genealogy.  I did not define these, although I use them in a restricted sense:  A marker (also “locus”, plural loci) is a DNA location for an SNP or STR or other kind of mutation.  A haplotype is a set of gene values at any number of markers, here restricted to Y-DNA STR values.  I use the word sample (plural samples or data or database) for the Y-DNA STR values from one man.  A sample is also commonly called a haplotype, but I avoid calling a sample a haplotype to make it clear that a haplotype may or may not be present in a particular database of samples.  A clade is a general term for common descent, so an SNP haplogroup is one kind of clade.  I use the word clade in general, when meaning a Y-DNA clade that may or may not be a defined official haplogroup.  All types have associated hypothetical clades, but most clades cannot be isolated as types with low SBP.  A cluster is a set of samples with similar STR values.  All types have associated clusters but not all clusters are associated with types.  The modal value for a marker is the most common value in the cluster.  The modal haplotype is the set of most common values, usually the most common haplotype in a cluster.  Many people use the adjective “modal” as a noun, meaning “modal haplotype”;  so do I;  I tried to avoid that in this web document.
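
           The modal haplotype defined above is easy to compute from a cluster.  This is a minimal sketch with hypothetical marker names and values;  ties between equally common values are broken arbitrarily here, which is a choice made for this example only.

```python
from collections import Counter

def modal_haplotype(cluster):
    """Most common value at each marker across the samples in a cluster.

    Ties go to whichever value Counter.most_common lists first.
    """
    markers = cluster[0].keys()
    return {m: Counter(s[m] for s in cluster).most_common(1)[0][0] for m in markers}

# Hypothetical cluster of four samples at three made-up markers.
cluster = [
    {"m1": 13, "m2": 24, "m3": 11},
    {"m1": 13, "m2": 25, "m3": 11},
    {"m1": 14, "m2": 25, "m3": 11},
    {"m1": 13, "m2": 25, "m3": 10},
]

print(modal_haplotype(cluster))
```

           In this toy cluster the modal haplotype happens to equal the second sample, but in general the modal haplotype need not be present in the database at all.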

           Not all Y-DNA STR data separates into types because the distribution of STR values tends to be continuous.  A type corresponds to a clade that experienced a population bottleneck - isolation or migration or very rapid population growth.

 

Probability Minimum 80% for Polish Project R1a Assignments

 

           This topic was updated 5 Jul 2010.

           See Polish Project Assignments for a brief overall explanation of how assignments are done.  This topic provides more detailed discussion.  This topic focuses on the R1a categories, but some of this discussion obviously applies to other categories.

           Each sample (individual man) is assigned to a category.  Some categories are known haplogroups.  Other than R1a, all categories in the Polish Project are haplogroups or paragroups.  At this time there are only two in R1a.  Haplogroups are defined by SNPs, but not all haplogroups are supported by FTDNA assignments, which may cause some confusion.

           Some categories are types, which are hypothetical haplogroups.  Borderline and cluster categories are discussed here a few paragraphs down.  Click on Remainder and Unassigned for discussion of those two categories elsewhere.

           The assignment guideline is at least 80% probability for each individual sample. Using an 80% minimum, most assignments are better than 80%, of course.  So the average probability for a category is higher than 80%, and the average varies by category depending upon how many samples are marginal near 80%.

            For haplogroups, “80% probability” means that if a large number of samples with 80% probability were SNP tested, about 80% of them would test positive for the haplogroup into which they were predicted.  Probability is determined by correlating STR values with samples that have been tested for that SNP.

           Some assignments are 100% probability - samples with positive SNP test results, assigned to that haplogroup, and not given an extended assignment.  Actually, there is no such thing as 100% because the genetic test might be in error, but it seems from experience that testing errors are much lower than 1%.

           I arrive at probabilities with a combination of calculations and educated estimates.  This topic is my explanation.

           Confidence is a separate topic, related to probability.  Confidence is next, below.

           I figure probability as a decreasing function of step from a modal haplotype.  My assignment rules are step distances at which I figure 79% probability.  If a sample matches the modal haplotype at less than the 79% step distance, I assign that sample to the corresponding haplogroup or type or other category.  In practice it’s complicated.  I use an Excel file for assignment.  You can view the file at www.gwozdz.org/PolishCladesUpdate/Assigner.xls.  That may not be the current version.  In that file the “PolishProjectRules” sheet has the list of rules for human reading - next to the coded logic functions for Excel.  If you are a Polish Project member you can find your kit number and view your step to each category in the table - “Modal Calculator” sheet.

           The following paragraphs explain how I figure probability for types.  This is not something I proved in my publication, but it seems to me that my publication makes it reasonable.   I hope you the reader find the following method reasonable.  I expect this method will be proven with time as most of my predicted types are validated.

           If a type has 90% probability of being valid and a particular sample has STR values that match the type with 90% probability, those two numbers get multiplied for net probability.  That particular sample has 81% net probability of validity, and 19% probability of invalidity.  I do not actually calculate this.  This paragraph is a conceptual explanation introducing the explanation in the following paragraphs.

           My publication has detailed discussion of my statistical method for types.  Briefly, I use SBP as a quality measure.  SBP is a measure of the background - the percent of samples that match the type but really do not belong.  For example if SBP = 15%, that means 15% is a measure of how many samples within the type (step less than cutoff) really do not belong to the type.  For this example, a typical sample in the type has 85% probability of really belonging to the type.

           It is not possible to calculate the probability that a type really is a clade that will be validated some day by an SNP not yet discovered.  Although 100% minus SBP is not the probability of type validity, 100% minus SBP is closely related to validity.  Certainly a type with high SBP has low probability of being valid.  Certainly a type with SBP less than 15% has high probability of validity.

           SBP is deliberately a high estimate, designed for a roughly 70% confidence interval, with an additional increase for many statistical reasons explained in my publication.  That’s why I call it “Statistical Background Percent”.  This statistical increase is small for small SBP and larger for larger SBP.  The way SBP is calculated, it goes over 100% for type candidates with high background;  SBP should not be used over 50%.

           The best estimate for background percent is lower than SBP.  However, as explained a few paragraphs above, the net percent of invalid samples (net invalidity) is higher in the cluster of a type, because of the unknown probability that the type itself is invalid as a whole.  It is convenient for me to assume these two considerations cancel each other.  I use SBP as my estimate for the net background percent of invalid samples in a type.

           A sample that matches the modal haplotype has close to 100% probability of belonging to the corresponding type.  For a type with a high cutoff, this is true even for a sample a few steps away from the modal haplotype.  The reason is that the vast majority of haplotypes in a type are at the highest step numbers, so that is where most of the background is.  This is explained in the discussion of Table 1 on page 145 of my publication.

           So here is my method:  I figure an assignment rule “step < S” to assign samples, where the samples at step S and greater, equal to about SBP percent of the type cluster, do not get assigned.
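
           A minimal sketch of this rule in Python, for illustration only.  The step counts, the SBP value, and the function name choose_cutoff are hypothetical;  the real rules are worked out in Excel with judgment, not by this code.

```python
def choose_cutoff(step_counts, sbp_percent):
    """Pick S for the rule "step < S".

    step_counts maps step (distance from the modal haplotype) to the
    number of samples in the type cluster at that step.  S is chosen so
    that the samples at step >= S are as close as possible to
    sbp_percent of the whole cluster.
    """
    total = sum(step_counts.values())
    max_step = max(step_counts)
    best_s, best_gap = None, None
    for s in range(0, max_step + 2):
        # Samples that would NOT be assigned under the rule "step < s".
        tail = sum(n for step, n in step_counts.items() if step >= s)
        gap = abs(100.0 * tail / total - sbp_percent)
        if best_gap is None or gap < best_gap:
            best_s, best_gap = s, gap
    return best_s


# Hypothetical cluster of 40 samples and a hypothetical SBP of 15%.
cluster = {0: 2, 1: 5, 2: 9, 3: 12, 4: 8, 5: 4}
print(choose_cutoff(cluster, 15))
```

           With these made-up numbers the 4 samples at step 5 are 10% of the cluster, which is the closest available match to 15%, so the rule becomes “step < 5”.  In practice the cutoff is also adjusted for the other factors discussed in this topic.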

           This finishes my brief justification for using SBP as a guide for assignment.  More discussion of details:

           There are other calculations in addition to SBP, for example the haplogroup correlations mentioned above.

           Another is the calculation of correlations for 37 marker rules, which are similar to haplogroup correlations.  Using 67 marker data for a type, the 37 marker data for those samples provide probabilities that other samples with only 37 markers belong to this same type.

           After I do a particular calculation many times, I feel confident glancing at new data and making quick estimates for new rules if the number of samples does not justify detailed calculation.

           Let me repeat what I said above:  I arrive at probabilities (assignment rules are 80% estimated minimum probability) with a combination of calculations and educated estimates.

           Mayka, who does the assignments other than R1a, does not use my calculation methods, but insofar as he uses his experience to judge STR correlations, he is really performing estimated correlation calculations.

           When a probability is judged close to the 80% minimum for assignment based on STR correlations (step close to the rule limit), there are a number of additional factors that can be considered.  The following paragraphs are examples.  More examples are in my publication.  Mayka uses similar considerations for assignments:

           Geographic concentration.  P type is an example.  P type is concentrated in Poland.  I considered P type as more likely valid because it is geographically concentrated, before it was validated by an SNP.  Back then I considered a Polish family name associated with a sample as marginal additional evidence of belonging to P type.  Today that consideration applies to a sample that marginally matches the P type haplogroup with STR values but has not been measured for the L260 SNP.

           Ethnicity.  For example, there are a number of haplotypes known to be common among Jews, so a Jewish name associated with a sample is marginal additional evidence that the sample belongs to a corresponding haplogroup or type.

           Stragglers.  We tend to avoid categories for only one or a few samples, so if one or two samples have 70% probability as a best estimate it makes sense to adjust the rule a little looser so that the rule picks up those few samples that do not quite fit, rather than create a borderline category (discussion below).  Conversely, it makes sense to be a bit stricter for type assignments if a borderline category is available.

           67 markers.  We are marginally more liberal with assignments using the full 67 markers and marginally stricter for samples with fewer, because those with fewer can get more accurate assignments by procuring the remaining markers.

           Men with closely matching STR values should be classified together, particularly if the family name is the same.

           We avoid changing assignment rules too often, so some assignment rules may remain in place for a while even after new data has provided slightly better rules.

           For a valid type SBP comes down as data accumulates, with better statistics.  I avoid introducing a new small type with SBP above 25%, because I expect it to improve with time.  Technically, SBP = 40% means 60% of the samples can be introduced as a new type category, but I prefer to wait a few months for more data, so that a new type is substantial at introduction.

           We do not wish to be dismissed by others with experience evaluating STR data.  On the other hand, we do not wish to have others point out that samples are being left without obvious assignment.  I suppose the goal should be that the number of people complaining that assignments are too liberal turns out to be about equal to the number of people complaining that assignments are too conservative (people with experience evaluating STR data who have read and understood my documentation here).

           A person who assigns samples to hypothetical haplogroups based on STR values acts like a bookie who provides advance estimates for gambling odds, using a combination of calculations, educated guesses, and intuition.  A bookie’s estimates are usually tested by reality very quickly.  Probabilities of an STR estimator may not be verified or falsified by a new SNP for years.  You need to be skeptical of STR based predictions.  In the past, a number of STR based assignments have been shown wrong by new SNP discoveries.  This long web document is provided so you can read as much as you wish about our (Mayka’s and my) methods, judging for yourself the reliability of our probability estimates.

           I use Borderline categories when a significant number of samples have 50% to 79% probability for one haplogroup or type, and have less than 20% probability for any other haplogroup or type that I know of.  These fit too well to go into a Remainder category.  The 80% rule does not apply to samples in a borderline category, although it would be correct to say those samples have minimum 80% probability of belonging in that borderline category as opposed to another type or another borderline category.  Borderline samples still have their FTDNA assignment which is either 100% (green) or 99% (red).
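
           The thresholds in this paragraph can be written as a small decision function.  This Python sketch is only an illustration:  the function name, the argument names, and the example probabilities are hypothetical, and it ignores the requirement that a significant number of samples must fall in the same range before a borderline category is created.

```python
def category_kind(probabilities, minimum=0.80,
                  borderline_floor=0.50, rival_ceiling=0.20):
    """Classify one sample from its estimated probability per type.

    probabilities maps a haplogroup or type name to the estimated
    probability (0 to 1) that the sample belongs to it.
    Returns (kind, name): "assigned", "borderline", or "other".
    "other" stands for the Remainder and Unassigned categories,
    which are decided elsewhere.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_p = ranked[0]
    rivals = [p for _, p in ranked[1:]]
    if best_p >= minimum:
        return ("assigned", best_name)
    if best_p >= borderline_floor and all(p < rival_ceiling for p in rivals):
        return ("borderline", best_name)
    return ("other", None)


print(category_kind({"P": 0.90, "N": 0.05}))
print(category_kind({"P": 0.60, "N": 0.10}))
print(category_kind({"P": 0.60, "N": 0.30}))
```

           The first hypothetical sample meets the 80% minimum, the second falls in the 50% to 79% range with no rival at 20% or more, and the third has a rival that is too strong for a borderline category.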

           The Polish Project occasionally but rarely uses a cluster category, which is similar to a borderline category, except the category itself has less than 80% probability of corresponding to a valid clade, so the 80% rule does not apply.

 

Confidence and Probability;  More Statistical Comments

 

{This entire topic needs rewrite.  This is an old version.  I moved the probability discussion to a new topic, above.  Much of this topic is OK as is for explanation of “confidence”, but most is redundant.  Watch this space for a rewrite.}

           See the Assignments topic for a general discussion of Polish Project assignments.  This topic explains how I figure percent confidence for assignments of individual samples (men), for the R1a section at the Polish Project.  My publication explains my statistical methods.  There is a summary of my mountain method above.

           Confidence interval example:  Confidence is not the same as probability.  By 80% confidence I mean 80% is the lower number of the 80% confidence interval.  For example, 80% confidence might mean that the actual probability is 90% but the 80% confidence interval is 80% to 96%.  As an example, consider a situation where 10 samples match a type with an STR test.  Suppose there is a definitive SNP test available, and 9 of those 10 samples test positive for the SNP, and 1 negative.  That means 9 of the 10 really belong to the haplogroup and that 1 mismatch must come from a different haplogroup that matched the STRs by the luck of mutations.  Next, consider a new sample that matches that same STR test.  What is the confidence that the new sample will pass the SNP test for the haplogroup?  The probability is 90% because we know that 9 out of 10 previous samples like this matched the SNP.  However, 1 out of 10 is a very small sample.  As explained in my publication, I use Poisson statistics for quick calculation of confidence interval.  Poisson statistics is simple to calculate in Excel.  My tool Type.xls has an “SBP” sheet with a set of cells for quick Poisson calculations.

           80% confidence interval of 1 is 0.11 to 3.89, which is 1.1% to 38.9% out of 10.  Subtracting from 100%, the 80% confidence interval for a match, compared to 9 out of 10, is 61.1% to 98.9%.  That lower number, 61.1%, is below 80%, so net confidence is lower than 80%.

           70% confidence interval of 1 is 0.16 to 3.37, which is 1.6% to 33.7%, lower number 66.3%;  net confidence lower than 70%.

           60% confidence interval of 1 is 0.22 to 2.99, lower number 70.1%; confidence higher than 60%.

           67.3% confidence interval of 1 is 0.18 to 3.26, lower number 67.4%.  So that’s my one number:  67% confidence.

           In other words, if 9 out of 10 samples that match an STR also match the SNP test, we have 67% confidence a particular future sample matching the STR test will also match the SNP test.

           For 18 out of 20, the probability is still 90%, but a similar calculation shows 75% confidence.

           For 36 out of 40, the probability is still 90%, but a similar calculation shows 80% to 96% confidence interval, net 80% confidence, which is my example that I started with above.  These calculations actually take less than a minute using my Excel cells.
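
           The interval arithmetic in this example can be sketched in Python.  This is only an illustration, not the Excel cells in Type.xls:  it assumes the exact central Poisson interval (solved here by bisection), which appears to reproduce the numbers quoted above, and the function names poisson_interval and net_confidence are hypothetical.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def _solve(f, lo, hi, iterations=200):
    """Bisection root of f on [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def poisson_interval(k, confidence):
    """Central confidence interval for the Poisson mean, given k observed."""
    tail = (1.0 - confidence) / 2.0
    ceiling = k + 10.0 * math.sqrt(k + 1.0) + 20.0  # generous search bound
    upper = _solve(lambda mu: poisson_cdf(k, mu) - tail, 0.0, ceiling)
    if k == 0:
        return 0.0, upper
    lower = _solve(lambda mu: 1.0 - poisson_cdf(k - 1, mu) - tail, 0.0, ceiling)
    return lower, upper


def net_confidence(mismatches, n):
    """Confidence level c at which the lower match fraction also equals c."""
    def gap(c):
        _, upper = poisson_interval(mismatches, c)
        return (1.0 - upper / n) - c
    return _solve(gap, 0.01, 0.99)


low, high = poisson_interval(1, 0.80)
print(round(low, 2), round(high, 2))
for mismatches, n in [(1, 10), (2, 20), (4, 40)]:
    print(n - mismatches, "of", n, round(100 * net_confidence(mismatches, n)))
```

           For 1 mismatch the 80% interval should come out near 0.11 to 3.89, and the three net confidence values should land near 67%, 75%, and 80%, in line with the text.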

           Statistical Background Percent:  SBP.  I use SBP as a net confidence estimate for the background (samples that match the STR values but really do not belong to the clade of a type).  My publication does not go into the details of confidence intervals.  That is the purpose of the explanation here in this topic.  SBP is my estimate for the net statistical confidence before any SNP has been discovered to validate a hypothetical type.  100% minus SBP is my estimated confidence that a sample in the mountain cluster belongs to the corresponding hypothetical clade.

           A mountain cluster corresponding to a type might include outliers from other clades, or might include foreign clades.  These and other caveats associated with STR prediction are discussed in detail in my publication, where I point out that the confidence for all such caveats cannot be calculated.  I estimate the background by using the low frequency of samples in the gap as representative of the background throughout the haplospace neighborhood.  My SBP formula (available in the tools) includes an increase in SBP to account for all such caveats.

           Part I of my publication explains:  “Much of the background is probably at the last step of the mountain, just before the cutoff.  Much of the remainder is probably at the previous step, much of the remainder after that at the previous step, etc.”  My Part I Table 2 justifies this by demonstrating how the number of possible haplotypes increases very rapidly with step.  In other words, SBP is a good worst case overall estimate of background percent within a type, but background percent is very low at step zero and increases rapidly with step.  My publication does not provide a formula for background vs step and in fact I have not derived a formula.  For assignment of samples, I estimate the confidence vs step so that confidence decreases rapidly near the last step and the overall confidence is roughly equal to 100% minus SBP.  Step zero is my rough estimate that the type is a valid clade, since the step zero samples belong to the clade with very high probability if the type is valid.

           Some outliers from the type statistically fall within or even beyond the gap, so confidence is not zero at the cutoff.

           Confidence also depends upon the size of the gap.  A wide gap with zero samples means even samples in the gap near the mountain have reasonable confidence percent.

           Estimates vs Calculations vs Adjustments:    A person who assigns samples to hypothetical clades based on STR values acts like a bookie who provides advance estimates for gambling odds, using a combination of calculations, educated guesses, and intuition.  A bookie’s estimates are usually tested by reality very quickly.  Probabilities of an STR estimator may not be verified or falsified by a new SNP for years.  You need to be skeptical of STR based predictions.  In the past, a number of STR based assignments have been shown wrong by new SNP discoveries.  This long web document is provided so you can read as much as you wish about my methods, judging for yourself the reliability of my estimates and net probabilities.

           The first confidence interval example above, confidence of STR predictions calibrated to SNP data, can be pure statistical calculation without any estimates.  However, judgment is involved.  Even such SNP predictions should be split into parts based on the step value of the samples within a type.  However, if split down to individual steps, the statistics are very poor due to small sample size, so steps are best combined in batches.  For the first data from a new SNP it is necessary to combine all the steps, so the predictions benefit from an estimated confidence by step.  So the judgments and calculations can get quite complicated, and often I just estimate the confidence from experience rather than do the calculations every day as data comes in.

           I avoid changing assignment rules often, so some assignment rules remain in place even after new data has provided better rules.

           My standard is 80% confidence, but I avoid introducing a new type until the confidence is a bit higher, because a new 80% confidence type would provide only a few samples at step zero on the day when enough data has accumulated.  After waiting for more data, I tend to bend the guidelines a bit below 80% confidence in order to introduce more samples with a new type.  Also, if I notice an individual coming out at 75% when I’m updating rules I’ll tweak the rule to include him.

           I tend to be generous in estimates for samples with all 67 markers, and I tend to be conservative with samples having fewer than 67.  I update the rules more often at 67.  After all, samples with fewer than 67 markers can get much better confidence by ordering more markers, and 67 is the most available as a standard commercial test.

           I do not look forward to a man feeling slighted when he is not assigned to a type that is a reasonable fit to his STR data.  On the other hand, I do not wish to be dismissed by others with experience evaluating STR data, so I try to be conservative in my probability estimates that particular clades in fact exist.  I will have achieved my goal if the number of people complaining that I assign too liberally turns out to be somewhat greater than the number of people complaining that I am too conservative (people who have read and understood my documentation).

           Naturally, my confidence changes from month to month as more M458 and STR data accumulate, for better statistics.

           Assignments at fewer than 67 markers:  There are two ways:  Some types have low SBP and seem 80% valid using 37 or only 25 markers, at least for samples at low step, so samples can be directly assigned.

           Second way:  I check for correlation using the samples with 67 markers to see what percent of samples at a given genetic distance using fewer markers end up in the corresponding type at 67 markers.  The confidence of a sample at fewer markers is that percent multiplied by the corresponding confidence at 67 markers.
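
           A minimal Python sketch of this second way, for illustration only.  The sample list, the 90% confidence at 67 markers, and the function names are hypothetical;  with real data the steps are usually combined in batches because single steps have few samples.

```python
def fraction_in_type(samples, distance):
    """Fraction of samples at a given fewer-marker distance that are in
    the type when judged by all 67 markers.

    samples is a list of (distance_at_fewer_markers, in_type_at_67) pairs
    taken from men who have all 67 markers.
    Returns None when no sample sits at that distance.
    """
    hits = [in_type for dist, in_type in samples if dist == distance]
    if not hits:
        return None
    return sum(hits) / len(hits)


def confidence_at_fewer_markers(samples, distance, confidence_at_67):
    """Multiply the correlation fraction by the 67-marker confidence."""
    frac = fraction_in_type(samples, distance)
    return None if frac is None else frac * confidence_at_67


# Hypothetical 67-marker samples: (distance at 37 markers, in type at 67?)
samples = [(0, True), (0, True), (1, True), (1, True), (1, False),
           (1, True), (2, True), (2, False), (2, False)]
print(confidence_at_fewer_markers(samples, 1, 0.90))
```

           In this made-up data 3 of the 4 samples at distance 1 end up in the type, so a new 37-marker sample at distance 1 gets 75% times 90%, which is below the 80% minimum for assignment.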

 

Validation Comments

 

           This topic updated 22 Jun 2010.

           I look forward to the discovery of SNPs validating more than 80%, probably more than 90%, of my R1a Polish Project type assignments.

           I introduced P, N, and K types in the Fall of 2007, publishing this web page 6 Dec of that year.  I did not predict that P and N were brother clades;  in fact it looked to me like P was closer to K.  I did not make predictions about the P, N, K structure because the statistics did not justify such predictions.  I assigned samples to P and N with 80% probability, remarking that my overall confidence that P and N were valid (confidence at step zero) was 95% in 2008.  I stated my overall confidence in the subtypes of K type as only 80%, but again my confidence in K type at step zero was (and still is) 95%.

           P type has been validated as R1a1a7b, defined by the SNP L260.

           N type is very close to the same as R1a1a7*, the paragroup defined by the SNP M458 minus L260.  This is not exactly a validation, because a low percentage of M458 samples (2 so far at 67 markers) seem to be older than N type, which implies that a future SNP, younger than M458, may be discovered as equivalent to N type.  In previous versions of this document, I explained:  “A new SNP marker may not fall at the node defining a type.”  A new SNP might be younger, including mostly the samples with low step from the corresponding type.  A new SNP might be older, including the corresponding type plus some samples with step beyond the cutoff for the type.

           K type is not validated yet.

           In Fall 2007 I also introduced R (Remainder) as the 4th division of Polish R1a, for those samples that do not fit P, N, or K.  K type plus the R category are equivalent to R1a1a* (M17).  The R1a table assigns new types to either K or R.  In the detailed discussion of the types I indicate which ones:  (a) I have high confidence are subtypes of K;  (b) I have high confidence are not subtypes of K, so surely go into R;  and (c) I have lower confidence about, so they are assigned to K or R with a best guess.  A new SNP for K type might include a few of these subtypes, and may include some of R, depending upon the age of such a new SNP.

 

References and Sources

 

           My publications have several references of general interest and relevance to my web documents.

           My Tools and data for STR analysis are Excel files.  These are available at the JoGG publication site as Supplementary Data:  www.jogg.info/52/files/cpcindex.htm. 

           Polish Clades Update.  This folder is for update of Tools and for new data:  www.gwozdz.org/PolishCladesUpdate

           Pawlowski (2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish).  This reference is listed in my publications.  I specifically mention it here because this is where I originally found the common Polish haplotype that I now call P type.  Link to English abstract:  Pawlowski 2002.

           Lawrence Mayka is the Administrator of the Polish Project.  Larry helped me to get started when I was new to genetic genealogy, providing helpful criticism & suggestions.  He reviewed & approved my 80% probability rule for assignments on the Polish Project web page.  He also reviewed the original drafts of my publications.  A number of my types were originally suggested to me as STR clusters by Larry.  Larry continues to provide data for this web page.  Many of my references to other websites in this document were suggested to me by Larry.

           Cyndi Rutledge is the administrator of the R1a Project.  Larry and Cyndi send me M458 test results, which are not listed on the web.

           Anatole Klyosov published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of publications.  Some of the STR types that I independently discovered I later found as 25 marker modal haplotypes in Klyosov’s web documents (before his publication in JoGG - some in Russian).  It was encouraging to me to see independent identification of clusters by different methods.  He emailed me an English version of one of his 2008 publications.  His Fall JoGG articles have references to his other publications.  Here is a web link:  Klyosov Home.

           Russian web sites:  http://www.rodstvo.ru;  http://dnatree.ru/;  http://molgen.org/.  These have been active analyzing R1a, brought to my attention by others, particularly by Mayka, who worked out a correlation with my types.  These sites clearly have proposed subdivisions of R1a based on STR data, but I cannot quickly understand these due to the language barrier.  Klyosov seems to be active at these sites.  The sites make use of the FTDNA projects and Ysearch.

           Kenneth Nordtvedt published an article about calculating TMRCA in the Fall 2008 issue of JoGG.  His Excel files of data and tools are available at his web site.  Ken has been active in web discussions, suggesting many STR based clusters.

           FTDNA link:  www.familytreedna.com.  This is a commercial DNA testing company.  I make extensive use of the project databases maintained by FTDNA.  These are my primary sources of data.  Click on the “Projects” tab at the home page to look for projects.  Also, the project name can be substituted for /polish/ in the following URL.

           Polish Project link:  www.familytreedna.com/public/polish.  One of many FTDNA projects.  This is my primary source for Polish data.  The Polish Project tracks both Y-DNA and mtDNA;  click on “Y-DNA Results” on the left to see the data that I use.

           R1a Project link:  www.familytreedna.com/public/R1aY-Haplogroup.  Another source.

           Ysearch link:  www.ysearch.org.  Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services.  I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch.  From the FTDNA site, you can register your data with Ysearch.  Or you can type your Y-STR data into Ysearch.  I am not associated with the company FTDNA.  I have Instructions for comparing your STR data to my types (modal haplotypes) that I have entered into Ysearch.

           Yhrd link:  www.yhrd.org.  A forensic Y-DNA database.  Data is separated by city, with many Polish cities.  I relied on Yhrd to figure out the geography of the various haplotypes.  I wrote Yhrd Reminders for myself so that I won’t forget how to navigate the Yhrd web site;  click on that link if you need some hints.

           Sorenson link:  http://www.smgf.org/.  Another DNA testing company.

           ISOGG link:  http://isogg.org/tree/  Y-DNA tree with the most recent SNPs and corresponding alphanumeric codes.

 

Peter Gwozdz;  My Interest

Peter Gwozdz

pete2g2@comcast.net

           I’m a very rare type in Poland - E1b1b1a2.  My maternal 1st cousins are R1a1a.  That means my late maternal grandfather was R1a1a.  I became interested in Y-DNA in 2004.  My maternal family name is Iwanowicz.  I discovered a family with that name in my maternal grandfather’s home town in Poland.  They are the only Iwanowicz family within 50 miles, so I suspected they might be my 3rd or 4th cousins.  I brought a cheek swab kit when I visited them the second time in 2006.  Sure enough, the son is a perfect 25 STR marker match to my 1st cousin.  I didn’t get around to checking the web for a year.  I was shocked to discover that these maternal cousins matched 80 people in the FTDNA data base, for a perfect 12 out of 12 STR markers.  That was a hell of a lot of matches for the summer of 2007.  Most of these matches are Polish.  I did some research and found an article by Pawlowski (reference in my publication) about this most common Polish haplotype, which I now call P type.  That got me interested in doing more research, leading to this web page for others to see my results.  My experience, however, is a reminder that statistics can be misleading.  I was confident that my grandfather’s haplotype was P type, based on a perfect match at the first 12 markers.  I now (June 2010) figure that the probability was really about 93%, because 13 out of the 14 current Polish Project members who have 67 markers and who also match P type perfectly at 12 markers are in fact P type as judged by all 67 markers.  My grandfather does not match P type at 67 markers.  My grandfather is that 14th one.  He matches the small hypothetical clade that I call I type, which is also concentrated in Poland.  But my confidence in that I type assignment is only 80%, so maybe statistics is fooling me again.  That’s how an outsider ended up studying P type and R1a1a, and writing web pages and articles about common Polish Y-DNA clades.

 

Revision History

 

2007 Dec 6  First web posting of this file

2007 Dec  Two revisions

2008  8 revisions

2009 33 revisions

2010 Jan - May 12 revisions

2010 Jun 1 finished update of L260 & M458 tests, with discussion

2010 Jun 6 update L260 & discussion

2010 Jun 10 G type documentation;  modal haplotype update;  Ysearch modal comparison update

2010 Jun 12 EFHJ types documentation plus edit comments on other types

2010 Jun 13 update general rewrite is in process;  not finished

2010 Jun 18 more general rewrite

2010 Jun 22 continue rewrite, including update of the R1a Results Table

2010 Jul 5 rewrite “Assignments” topic, new “80% Probability” topic

2010 Jul 6 update SNP analysis - 3 topics

2010 Jul 20 add 3 R1b types

2010 Jul 21 add 3 R1b modal haplotype references & Ysearch codes

2010 Jul 22 drop “R1a” from the title, and a few rewording changes

2010 Jul 26 add an I1 Type, M253P

2010 Jul 31 add an I2b1 Type, M223CE

2010 Aug 1 edit the I haplogroup discussion

2010 Aug 9 update the SNP analysis

2010 Aug 14 update news:  one M458- at step 7 from P