Polish Y-DNA Clades

20-Nov-11

Peter Gwozdz

pete2g2@comcast.net

News

 

           20 Nov R1a news:  The new SNP paragroup Z280* is looking very similar to what I have been calling K type.  A type is clearly Z280- Z93+, not part of K.  Preliminary data:  E type seems to be Z280+ Z92+, not part of K.  D type seems to be K.  More SNP data will soon clarify where I type and other types fall.

 

Common Polish Y-DNA Haplogroups

 

           Comment 20 Nov 2011:  this table needs an update, but the links get you quickly to the appropriate sections.

           Data from the Polish Project.  This is an abbreviated version of the more detailed Results Table.  In the “Haplogroup” column, click on the link to jump directly to the corresponding section of the Results Table.  In the “Description” column, click on the link to jump to a description of that particular clade.  There is a topic below with Explanation of the Results Table.

                               Number                                  Number
Haplogroup                    Samples       %   Description           Samples       %
CD                                  6     0.7
E                                  48     5.2
FGH                                29     3.2
I                                 131    14.3
                                                I1-M253-P type              7     0.8
                                                I2b-M223-CE type            5     0.5
J                                  72     7.8
L                                   3     0.3
N                                  86     9.4
                                                N1c1-M178-G type            8     0.9
                                                N1c1-M178-M cluster         6     0.7
Q                                  15     1.6
R (56.5%)
  R1a
    R1a (M417-)                     2     0.2
    R1a1a1 (M417) + no SNP        206    22.4
                                                K cluster                 130    14.2
    R1a1a1 (M417) + new SNP        56     6.1
                                                G type                     23     2.5
    R1a1a1g (L260-)                87     9.5   N type                     87     9.5
    R1a1a1g2 (L260)                83     9.0   P type                     83     9.0
  R1b
    R1b (M269-)                     2     0.2
    R1b (M269)                     83     9.0
                                                R1b-L23-EE                 14     1.5
                                                R1b-L47-A                   9     1.0
                                                R1b-L47-P                   6     0.7
  R2                                1     0.1
T                                   8     0.9
Total                             918  100.00
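           The percentages in the table above can be reproduced from the sample counts.  The following is a minimal sketch, not the spreadsheet actually used for the table.  It assumes each percentage is the row count divided by the 918 total, rounded to one decimal, and that the (56.5%) figure for R is the sum of the rounded R rows;  both assumptions are inferred from the numbers, not stated in the text.

```python
# Sample counts for the haplogroup rows, copied from the table above.
COUNTS = {
    "CD": 6, "E": 48, "FGH": 29, "I": 131, "J": 72, "L": 3, "N": 86, "Q": 15,
    "R1a (M417-)": 2,
    "R1a1a1 (M417) + no SNP": 206,
    "R1a1a1 (M417) + new SNP": 56,
    "R1a1a1g (L260-)": 87,
    "R1a1a1g2 (L260)": 83,
    "R1b (M269-)": 2,
    "R1b (M269)": 83,
    "R2": 1,
    "T": 8,
}


def percentages(counts):
    """Percent of the grand total for each row, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


TOTAL = sum(COUNTS.values())
PCT = percentages(COUNTS)

# Assumption: the R figure is the sum of the already rounded R rows.
R_SHARE = round(sum(p for name, p in PCT.items() if name.startswith("R")), 1)

print(TOTAL, PCT["I"], R_SHARE)
```

           Summing the unrounded R counts instead (520 of 918) gives a slightly different value, which is why the sketch sums the rounded rows.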

Abstract

           Edited 10 Sep 2011.  Abstract rewrite 27 Jul 2011.

           The Polish Project has assignments of men to haplogroups based on their Y-DNA data.  Lawrence Mayka, administrator of the Polish Project, provides data for this web site of mine.  This web document is for explanation, details, and update news.

           The Results Table has a summary of assignments.  Some assignment categories have a link to more detailed discussion.  If you know your assignment you can click on the link in the right column of the Table to read more about your assignment category.

           Most of the assignments are to well established haplogroups, as determined by Mayka.  In addition, I hypothetically subdivide haplogroups into types when division can be done with 80% confidence.  With less than 80% confidence, assignment categories are tentative, not called types, often called clusters.

           About half of Polish men belong to haplogroup R1a.  Most of my work has been on R1a. 

           This Abstract is for people reasonably familiar with the jargon of genetic genealogy.  If you are new to genetic genealogy you might prefer to read the Introduction first.

           This web document has three purposes:  1. More detailed explanations for the men (samples) that I assign to types in the Polish Project.  2. Summary of my published results.  3. Update with recent results.

           The topic is common Polish Y-DNA clades - identification of male line Y-DNA clusters that are concentrated in Poland.

           Since I originally posted this in December 2007, emphasis has been haplogroup R1a, because about half of Polish men are R1a, with no subdivision at that time.  A new division, roughly 50-50, between R1a1a* and R1a1a7 (M458), became available in November 2009.  In 2010 I expanded this page to include clades from other haplogroups.

           I use the word type to mean an STR cluster with statistical validity as established by my Mountain Method.  I expect my types to be validated in due time by discovery of new SNPs that will qualify them as haplogroups.  I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters.  All types have associated clusters but not all clusters qualify as types.  In my publications and web pages I make it clear which types I have discovered in web data and which types were suggested to me by others, with references.  Usually when I discover a type I later find out someone else had mentioned it earlier on the web;  let me know if you the reader have more clues and references for me.

           Most types that I discuss seem to be 1,000 to 5,000 years old, so all the men in each type seem to be descended in direct male lines from one man (MRCA) who lived that long ago (TMRCA).  A few of my types might be younger or older than that range.

           I use phrases like “seem to be” over and over because the methods are statistical.

           Click here for a summary of the conservative automatic haplogroup assignments in the Polish Project, for an explanation of the extended assignments, for a discussion of the minimum 80% probability for assignment, and for the Results Table.

           The Polish Project is considered representative of Historical Poland, with caveats explained in my Publication.

           I am interested in Polish origins.  This web document, however, is not for historical analysis and conclusions, except for occasional comments to remind us of the goal.  This document is dedicated to identifying haplogroups and types and clusters concentrated in Poland, with detailed explanations.  I am aware that some people object to the use of Y-DNA for historical analysis, so I try to mention caveats along with my comments.

 

R1a Abstract

           Abstract rewrite 4 Jul 2010.

           About half the men of Polish male line ancestry belong to the R1a haplogroup.  About 99% of Polish R1a are R1a1a.  This R1a Abstract is a summary discussion of the R1a Results Table.

           U category.  Unassigned.  This is the largest category in R1a.  On the Polish Project Y-DNA Results page, detailed assignments are made with minimum 80% probability.  Because of the restriction to 80% probability, many R1a men in the Polish Project are not assigned to detailed categories at the Polish Project web page.  Those men go into this “Unassigned” category.  These still have either R1a or R1a1 automatically assigned by FTDNA.  If you are in this U category, you can promote yourself out by purchasing the full 67 marker STR set, since all R1a samples with 67 markers get a detailed assignment.

           I divide the Polish R1a data into 4 major categories based on STR data.  About half the men of Polish male line ancestry belong to the R1a haplogroup, and that group divides roughly equally into these 4 categories.  Since 2007, I have been calling them P type, N type, K type, and R category.  P and N are in the new R1a1a7 (M458).  P is R1a1a7b (L260).  K is R1a1a*.  R is mostly R1a1a*.

           R, Remainder, is not a type.  I use R for samples that do not belong to any of the types I have identified in R1a1a* so far.

           My overall confidence in K type is only 85% because there seem to be unidentified types with STR values close to K.  The modal haplotype for K is essentially the same as the modal haplotype for all of R1a.  However, I have identified subtypes of K that have much higher confidence.  In other words I have higher confidence for many individual samples.  I have high confidence in the subtypes although I am not sure all the subtypes assigned to K belong to exactly the same clade along with all the other samples that I have assigned to K outside the subtypes.  Even if K is not a true clade as defined, however, it is clear that the K samples belong to branches in the R1a1a* tree with nodes very close to each other.  The only uncertainty is that there are likely many other samples that belong in other branches just as close to K.

           Borderline categories are not types but are samples that match types with less than 80% probability.  Each Borderline category has discussion below.

           P type is concentrated in Poland, rare with increasing distance from Poland.  N type seems to be mostly Slavic, widespread in eastern Europe.  K type corresponds to one of the two largest R1a1 clusters.  Another large R1a1a cluster, the one I call L type, is not common in Poland.

           In the table I assign each R1a1a* (M458-) subtype into either K or R based on how distant the STR values are from K.  Some of these are borderline however.  There is no clean separation of K from R, so the table should not be considered a high probability separation of K subtypes from the R remainder subtypes.  Read the individual type discussions to see which subtypes fit K with high probability;  A type is an example.

           Thanks go to Lawrence Mayka, Polish Project administrator, for extensive email information and assistance.

           You can compare data to my types by clicking this link to instructions for Ysearch.

           Reminder:  I am concentrating on Poland.  The statistics of STR clusters depend a lot on the data base.  For example, P type stands out dramatically in Polish data.  In other countries P type is rare.  If you belong to an R1a1 cluster that is rare in Poland, I’m sorry, but I’m not covering you.  K type is an example of a type that is common both in Poland and elsewhere.  M type is common in northwest Europe but so far absent in the Polish Project.

 

Introduction

 

           This Introduction is for people unfamiliar with the jargon of genetic genealogy.

           There are quite a few web sites with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic.  Back issues of JOGG are good general references.  The Y Chromosome Wikipedia article is about male line DNA, also called Y-DNA.

           The following several paragraphs are a brief introduction to genetic genealogy for Y-DNA, providing some definitions of jargon needed to read my web pages.  The definition words are boldface.  I often use links to those definitions when I use a jargon word for the first time in a topic.  There are more boldface definitions in the summary of my Methods.

           The Y chromosome gets passed from father to son, so it works just like a male family name.  Men are divided into haplogroups based on known rare mutations (most of them are called single nucleotide polymorphisms, or SNPs) in the Y chromosome.  Division into haplogroups is done in a manner that has virtually 100% confidence.  I say “virtually” because your confidence in your DNA result from your DNA testing company might be 98% or 99% or 99.9%;  the confidence for haplogroups is better than that.  We can be virtually certain that all the men in a haplogroup descend in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup.  The MRCA corresponds to the node, or branching point, in the Y-DNA tree of male line ancestry.  Time of the Most Recent Common Ancestor (TMRCA) is an estimate of how long ago he lived - the age of the haplogroup.

           Lots of people, including me, are working to discover more SNPs on the Y chromosome so that the haplogroups can be divided further into smaller haplogroups.

           Haplogroups have alphanumeric codes, like R1a1a.  A paragroup is a haplogroup considered without its known haplogroup branches.  When a new branch is discovered within a paragroup, it gets removed from the definition;  that changes the meaning of that paragroup.  An asterisk is usually used in paragroup codes, like R1a1a*.
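           The paragroup definition above can be expressed as a small filter.  This is only an illustrative sketch:  the sample IDs and test results are hypothetical, and the function name is made up for this example.  It takes the strict reading that a sample counts as a paragroup member only if it has actually tested negative for every known branch SNP.

```python
def paragroup_members(samples, haplogroup_snp, branch_snps):
    """Return IDs of samples positive for `haplogroup_snp` and tested
    negative for every SNP in `branch_snps`.

    `samples` maps a sample ID to a dict of SNP name -> "+" or "-".
    An untested branch SNP leaves the sample out (strict reading).
    """
    members = []
    for sample_id, results in samples.items():
        if results.get(haplogroup_snp) != "+":
            continue
        if all(results.get(snp) == "-" for snp in branch_snps):
            members.append(sample_id)
    return members


# Hypothetical data, for illustration only.
samples = {
    "s1": {"M198": "+", "M458": "+"},
    "s2": {"M198": "+", "M458": "-"},
    "s3": {"M198": "+"},  # never tested for M458
}

before_discovery = paragroup_members(samples, "M198", [])
after_discovery = paragroup_members(samples, "M198", ["M458"])
print(before_discovery, after_discovery)
```

           Adding a newly discovered branch SNP to the list shrinks the result, which is the sense in which a new branch changes the meaning of the paragroup.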

           Many people, including me, try to “stay ahead” of the haplogroups by analyzing other mutations that are not so rare (called STR) on the Y chromosome.  Men submit their Y-DNA data to various web sites.  There are lots of STR data available on the web.  Men are divided into STR clusters as hypothetical subdivisions of the haplogroups.  All such clusters are hypothetical.  Some will be validated in the future by new SNP discoveries.  There are various statistical methods for estimating the confidence of STR clusters.  I recently published a method that I developed.  That publication has references to other methods.  There is a brief summary of my method below.

           A few STR clusters are small family clusters, with the same family name.  Y-DNA is biologically accurate, so some men discover that their Y-DNA does not match the DNA of their male line cousins identified by genealogy research, due to secret adoptions, illegitimacies, etc.  This is one of the reasons some people prefer to avoid genetic genealogy.  The male line associated with the Y-chromosome is only one ancestral line.  Humans have 23 pairs of chromosomes.  Anyone who tries to make a family tree going back 300 years has more than a thousand root tips to be filled by names of ancestors who lived back then;  the one man at the tip of the male line root is only one of those thousand.  That is another reason some genealogists avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of many.  That said, many people enjoy the challenging hobby of figuring out to which ancient extended male line they belong.
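           The “more than a thousand root tips” figure follows from doubling the number of ancestors each generation.  A minimal sketch, assuming about 30 years per generation (an assumption chosen for this example, not a figure from the text):

```python
def ancestor_slots(years_back, years_per_generation=30):
    """Return (generations, number of ancestor slots in that generation).

    Assumes a fixed generation length; each generation back doubles the
    number of slots in the family tree.
    """
    generations = years_back // years_per_generation
    return generations, 2 ** generations


GENERATIONS, SLOTS = ancestor_slots(300)
print(GENERATIONS, SLOTS)
```

           With these assumptions, 300 years is 10 generations and 1024 slots, of which the direct male line fills exactly one.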

           Most STR based clusters have an MRCA who lived thousands of years ago, before family names were common, so most men assigned to a typical cluster do not have the same family name.

           Many SNP based haplogroups have an MRCA who lived more than ten thousand years ago, so these span multiple ethnic groups and nationalities.  For example, the R1a haplogroup is of interest to me.  R1a is most common in Slavic countries but calling R1a Slavic is misleading because it is found throughout Europe and west Asia.  The MRCA lived so long ago that he may have spoken a language that we would not consider Slavic if we could hear it.  It is possible that he did not even live in what is now the Slavic region of Europe;  maybe his descendants moved there in a massive migration from the Asian steppes, or from India.  No one knows for sure.  Even if he was proto-Slavic in language and culture, by now some of his descendants long ago moved to other parts of Europe and Asia.  One of the appeals of genetic genealogy is trying to figure out ethnic descent and migration from the statistics of haplogroups.  Some people object, pointing out that ethnicity cannot be defined genetically because of all the moving and mixing of people over the millennia, and because the Y chromosome is only one of many.  True enough.  Some individuals and some web sites go too far with genetic claims.  That said, statistical analysis of haplogroup data provides many clues on human origins.

           Again, some people try to stay ahead of haplogroups, using statistical analysis of STR based clusters to gain insight into more recent human origins.  I am one of those people.  My interest is Polish origins.  This web document, however, is not for the historical analysis and conclusions, except for occasional comments to remind us of the goal.  This document is dedicated to STR data and analysis, identifying clusters concentrated in Poland, with detailed explanations.

           The bottom of my Method section has more definitions for a number of genetic genealogy terms.

           There are a number of organizations and commercial companies on the web where you can order a cheek swab kit to mail in for genetic genealogy analysis, for example FTDNA.  I am not associated with the company FTDNA;  I mention them because I make extensive use of their data;  check Google for competitors.  At FTDNA, click on Products for cheek swab kits.  DNA results are confidential unless you register the data at a database;  at FTDNA, click on Projects to register your data into one of the many databases;  for example, most of my analysis is from the data in the FTDNA Polish Project.

           I use the FTDNA standard set of 67 STR markers (plus a few non-standard ones occasionally).  I do some analysis using the standard FTDNA 12, 25, 37, or 111 STR marker sets.  Other companies use standard marker sets that may not overlap with all the FTDNA markers.

           Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services.  I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch.  From the FTDNA site, you can register your data with Ysearch.  Or you can type your Y-STR data into Ysearch.

 

L260 and M458 News

 

           Comment 14 May 2011:  recent data continues to confirm the analysis as presented in this topic a few months ago.

           This topic was completely rewritten during Dec & Jan;  last update edit 17 Jan 2011.

           SNP results continue to validate P type and N type.

           The SNP called L260 is almost equivalent to what I have been calling P type.

           The SNP called M458 is almost equivalent to the combination of what I have been calling P type plus N type.  In other words, N type is almost equivalent to M458+ L260- (positive result for the M458 SNP test but negative for the L260 SNP).

           The bottom of this topic has recommendations for testing regarding these two SNPs.

           All L260+ are M458+ if tested for M458, confirming that L260 defines haplogroup R1a1a1g2 within the M458 haplogroup R1a1a1g.
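           The relationship between the two SNP results and the branches described above can be sketched as a lookup.  This is an illustration only, using the haplogroup names given in this topic;  the function name and the wording of the returned labels are invented for the example, and the “inconsistent” case is an added safeguard, since all L260+ samples so far are M458+.

```python
def classify_m458_branch(m458, l260):
    """Map SNP results to a branch label.

    Each argument is "+", "-", or None when the SNP has not been tested.
    """
    if l260 == "+" and m458 == "-":
        # Not observed so far; flagged rather than assigned.
        return "inconsistent results"
    if l260 == "+":
        return "R1a1a1g2 (L260), P branch"
    if m458 == "+" and l260 == "-":
        return "R1a1a1g (M458, L260-), N branch"
    if m458 == "+":
        return "R1a1a1g (M458), L260 not tested"
    if m458 == "-":
        return "outside M458"
    return "undetermined"


print(classify_m458_branch("+", "+"))
print(classify_m458_branch("+", "-"))
print(classify_m458_branch("-", None))
```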

           Test results available to me:  204 M458 and 59 L260, from 213 samples.   The following paragraphs summarize results for the 180 samples that have all 67 standard STR markers.  By “predicted” I mean my type assignment based on STR values, ignoring the SNP results:

           All samples predicted P type are coming out M458+ L260+.  43 of them so far.

           All samples predicted N type are coming out M458+ L260-.  31 of them so far.

           In other words, all samples with 67 markers that fit the P type or N type definitions based on STR values are coming out correctly with SNP tests.  This is 100% accuracy so far for samples predicted P type or N type.  However, I am using the words “almost equivalent” because there are outliers:

           In the P branch there are only 2 outliers:  one with STR values at the cutoff and one that is 1 step beyond the cutoff for P type.

           In the N branch there are 11 outliers;  discussed below.

           The percent of outliers expected in the male population is lower than implied by these results because my SNP data over represents the STR cutoff regions.  Such samples have been prioritized for SNP evaluation in order to better establish the limits of the types.  In the Polish Project, all samples at or just beyond the cutoffs have been SNP tested.

           In addition, all outliers so far are “just beyond” P or N types.  Almost all of these could have been predicted into the correct type based on STR values alone, because so far almost all other “just beyond” M458- samples fit well to other known types outside the M458 haplogroup.  Those 2 P type outliers with SNP data could have been predicted based on STR data, with 100% probability (but only >50% statistical confidence due to the small sample size).  All but 3 of the N type outliers could have been similarly predicted.

           In my discussion topic, I mention a few caveats, including an explanation of why I use the word “branch” not “type” for the outliers, with quantitative explanation of what I mean by “just beyond”.

           Recommendations for R1a men not yet tested for M458 / L260:  If you are a member of the Polish Project with an N Borderline assignment you should purchase the M458 test to determine your haplogroup.  If you have a P Borderline assignment you should purchase the L260 test.  My STR rules for the Polish Project are complicated, and those rules may not apply to R1a men outside Poland, where exceptions to my assignment rules are more likely.

           If you are not a member of the Polish Project, with all 67 markers, you can compare your STR values to P type and N type following the Ysearch instructions below.  If you fit with lower step to one of the known types other than N or P you are less likely to need either SNP test because you would likely come out M458- L260-.  If you do not fit well to another type:  If your step (genetic distance) from P type is less than 6 you are very likely P type;  step greater than 9 is very likely not P type.  From steps 6 to 9 you should purchase the L260 test to determine your status.  If your step from N type is less than 7 you are very likely N type;  step greater than 12 is very likely not N type.  From steps 7 to 12 you should purchase the M458 test.
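           The step thresholds in the paragraph above can be written out as a short sketch.  It assumes that step means the sum of absolute differences in repeat counts over the compared markers, which is a common simple definition but may not match the exact Ysearch calculation.  The marker values are hypothetical, and the function names are invented for this example.  The advice applies only to samples with all 67 markers that do not fit better to another known type.

```python
def step(sample, modal):
    """Sum of absolute repeat-count differences over the markers in `modal`.

    Assumption: a simple stepwise distance, used here for illustration.
    """
    return sum(abs(sample[marker] - modal[marker]) for marker in modal)


def advice(step_value, likely_below, unlikely_above, snp):
    """Apply one pair of thresholds from the text to a step value."""
    if step_value < likely_below:
        return "very likely"
    if step_value > unlikely_above:
        return "very likely not"
    return "purchase the " + snp + " test"


def p_type_advice(step_value):
    # Less than 6: very likely P type; greater than 9: very likely not.
    return advice(step_value, 6, 9, "L260")


def n_type_advice(step_value):
    # Less than 7: very likely N type; greater than 12: very likely not.
    return advice(step_value, 7, 12, "M458")


# Hypothetical three-marker haplotypes; a real comparison uses 67 markers.
toy_modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16}
toy_sample = {"DYS393": 13, "DYS390": 24, "DYS19": 17}
print(step(toy_sample, toy_modal))
print(p_type_advice(7), "/", n_type_advice(7))
```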

           For samples known to be M458+, the single marker DYS385a=10 provides a very high confidence prediction for P type L260+, as explained below.

           Even if your STR values provide a “very likely” assignment, you do everyone a favor if you test SNPs anyway.  In this case you are unlikely to learn anything more about your DNA, but as more men perform these “wildcat” tests, we all gain more confidence that there are no small clades with unusual STR values waiting to be discovered.  There is a slight chance you might discover that you belong to such a small clade with a “wildcat” SNP test.

           See L260 and M458 Test Results for more discussion about the data available.  The end of that topic has brief speculation on the age and structure of the M458 clade.

           See L260 and M458 Test Results; Details for data summaries.

           See L260M458Results.xls for all my SNP data.

           See also L260 and M458 Signatures.

 

Polish Project R1a Assignment News

 

           This topic was updated 5 Jul 2010.

           If you are R1a but not a Polish Project member, the Ysearch instructions topic has my method for matching to my types.  The news in this topic applies to you if you know your assignment.

           If you are P type or N type you would likely come out positive in the SNP test for M458 (M458+).  If you are P type you are likely L260+.  N type is likely L260-.  If you have not already tested you can pay the small fee for these SNP tests to confirm that you belong to the corresponding haplogroup.

           If you are assigned to P borderline or to N borderline you would benefit more from the M458 and L260 tests, because that would provide for you a definite assignment within R1a.

           The assignment rules are done with high probability, so if you are unassigned (category U) there is a low probability that you would test positive for M458, with probability that decreases with your step (genetic mutation distance) from P or N.

           If you have fewer than the standard 67 STR markers it is generally better to purchase the remaining markers.  That way, you are more likely to get an assignment, because the statistics for STRs improve with more markers.  Nevertheless, if you are not many steps from P or N you might consider doing the M458 test even with fewer than 67 markers.

           There is a slight chance that you might test positive for L260 or M458 even if you do not match P or N.  The haplogroup corresponding to M458 is old enough that there may be small clades with STR markers very different from P or N.  I have not seen one yet, but there is no way to estimate this probability.  I hesitate to recommend the M458 SNP test for men whose samples are distant from both P and N in STR values.  I admit you can just wait to see if anyone with STR values similar to yours matches an SNP, then test for that SNP.  However, we all benefit when some men test for all the new SNPs within an established haplogroup, because that way we find out the size and rough age of the corresponding new haplogroup branches.  FTDNA offers “deep clade” test packages to test for all possible haplogroup branches, but my understanding is that L260 and M458 are not yet included in the R1a deep clade test.  You need to purchase them separately from the advanced markers menu.  No doubt FTDNA will add them soon to the deep clade package.

 

Fall Issues of JoGG

 

           The Fall 2009 issue of the Journal of Genetic Genealogy has my publication, which is split into two parts:

           Part I is my “mountains in haplospace” method for evidence that certain “types” of STR clusters correspond to clades.

           Part II is the application of that method to Common Polish Clades.  That article has a lot more detail than this web page, but that article was last updated in September 2009, so this web page is an update.

           PolishCladesUpdate is my folder for future updates of the Excel analysis files for those two articles.

           This web page will continue as an introduction and summary, without as much jargon and detail as the articles and update folder.

           The Fall 2010 issue has my publication announcing the L260 SNP.

 

R1a Worldwide

 

           Wikipedia has a nice R1a entry, with primary contribution by Andrew Lancaster.

 

R1a New SNPs

 

           11 Jan 2011 update:  There is a lot of activity these days in the discovery of new SNPs for dividing R1a into branch haplogroups.  You can follow the activity at the R page of the ISOGG Y-DNA tree, and also at the FTDNA Draft tree.

           The new SNP named L365 includes what I have been calling G type, based on preliminary data.  It is too early to say if other samples in addition to G type are positive for this new SNP.

           The new SNP named M417 excludes what I have been calling C type, based on preliminary data.  So far very few R1a samples are negative for this new SNP, but it is too early to estimate the rarity of M417-.

           In early 2011 FTDNA released some new SNPs for commercial testing, including the following for R1a:  L365, M417, L366, L291, and others.  To order new SNP tests, go to your home page at FTDNA, on the left under “My Account” click on “Order Tests & Upgrades”, then click on “Go To Advanced Orders” and check “SNP”.  Use your browser search to find the SNP of interest.  If you wish to publish your results, join one of the projects (click on “Projects”) and the administrator will analyze your data.

           L260 and M458 are discussed below.

           There are other new experimental SNPs discussed on the web.  I’m not trying to list everything here, just the ones that are of interest for discriminating new R1a haplogroup branches.

 

R1a Confusion

 

           25 Oct 2010 update:  The new SNPs cause confusion in the alphanumeric notation for the haplogroups and paragroups.

           In my fall 2009 publication I used the notation that was well known at the time, where more than 95% of R1a was known to be paragroup R1a1.  The R1a1 samples with one of four very rare SNPs that have been known for a few years were called haplogroups R1a1a through R1a1d.  Ysearch still (25 Oct) uses the notation described in this paragraph.  FTDNA Projects still use this notation for automatic assignment of samples.  Individual samples are not actually assigned to a paragroup because most have not been tested for all SNPs.  Most R1a samples are listed as R1a1.  Many samples are listed as just R1a but almost all of those would come out R1a1 if tested for the appropriate SNP (the well known M17 or M198, or one of the new ones that all seem to be equivalent).  I mentioned in my publication that all Polish Project R1a were coming out R1a1.  Since then only one sample (out of 1441 R1a total in the Polish Project) has come out M198-.

           New SNPs were discovered equivalent to SRY10831.2, the original R1a SNP.  Subsequently, rare samples were found positive for some of these new SNPs but negative for SRY10831.2.  I’ll use L62 to represent these;  there are others that seem to be equivalent.  Those define two small paragroups, R1a(L62, SRY10831.2-) and R1a1(SRY10831.2, M198-).  That previous R1a1 paragroup becomes R1a1a(M198).  Accordingly, when Underhill announced the M458 SNP, he called that haplogroup R1a1a7.  L260 was called R1a1a7b when first discovered.  Last spring I rewrote this entire web page using the notation described in this paragraph.

           The recent new SNPs change the notation again.  I shall not attempt to rewrite this entire web page.  As I update topics, I’ll use the current notation.  For clarity, I’ll add the defining SNP in parenthesis when I do updates.

           For example, what I have been calling P type is equivalent to the haplogroup now called R1a1a1g2(L260).  What I have been calling N type is equivalent to the paragroup R1a1a1g(M458, L260-).
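           The successive renamings described in this topic can be kept straight with a lookup keyed on the defining SNP.  The sketch below is only an illustration:  the names come from the paragraphs above, but the three period labels are shorthand invented for this example, and the table covers only M198, M458, and L260.

```python
# Haplogroup name for each defining SNP under three notations described
# in this topic. Period labels are illustrative shorthand, not official.
NOTATION = {
    "M198": {"fall2009": "R1a1", "spring2010": "R1a1a", "oct2010": "R1a1a1"},
    "M458": {"spring2010": "R1a1a7", "oct2010": "R1a1a1g"},
    "L260": {"spring2010": "R1a1a7b", "oct2010": "R1a1a1g2"},
}


def label(snp, period):
    """Return a name like "R1a1a1g2(L260)", or None if not in the table."""
    name = NOTATION.get(snp, {}).get(period)
    if name is None:
        return None
    # The SNP in parentheses keeps the label unambiguous across notations.
    return name + "(" + snp + ")"


print(label("L260", "spring2010"), "->", label("L260", "oct2010"))
```

           Keying on the SNP rather than on the alphanumeric code reflects the practice adopted above of adding the defining SNP in parentheses.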

           The choice of which SNP to put in parenthesis is arbitrary for haplogroup notation.  For example, R1a1a1(M17), R1a1a1(M198), and a few others, all seem to be equivalent.  But any day now someone might announce a few samples that test negative for one of those SNPs and positive for all the others, which would define a new paragroup and force the renaming of all branches beyond that new node in the tree.

           There is ambiguity in assignment of samples.  For example, a sample that tests negative for M198 might be called R1a(M198-), but it is not clear if this sample belongs to the paragroup R1a(L62) or to the paragroup R1a1(SRY10831.2) if it has not been tested for the latter.

           My types have an uncertainty similar to SNPs.  For example, I said N type is equivalent to R1a1a1g(M458, L260-).  Recently two samples showed up in the Polish Project that are M458, L260- but just beyond N type as defined by STR fit.  We can think of these two as a new “paratype”, although I’ll not use that word.  We classify these two in the Polish Project as “M458+R”, the Remainder in M458 excluding N type and P type.  Actually, as I discuss in the N type topic, it is not statistically certain where to place the cutoff for N type, so you could argue that the M458+R category has more than two samples in the Polish Project.

 

L260 and M458

 

           24 Dec 2010 update.

           L260 is a new SNP.  I published it in the Fall 2010 issue of JOGG.  It has been available as an SNP test since early April 2010 at FTDNA. 

           M458 is a new SNP, published by Underhill.  It has been available as an SNP test since early November 2009 at FTDNA.

           FTDNA has not yet assigned haplogroup names to these, so men who test positive are not reported on-line yet at FTDNA nor at Ysearch, nor at the projects supported by FTDNA, which include the Polish Project.

           Both L260 and M458 are listed at ISOGG and at the FTDNA draft tree, where M458 is called R1a1a1g and L260 is called R1a1a1g2.

 

R1a Subdivision

 

           See R1a Confusion 25 Oct 2010 update.

           9 June comment:  This web page needs an update because a new node has been added to the tree, changing the codes slightly.

           22 June 2010 update:

           Almost all of R1a divides into R1a1a* (M17, M198), R1a1a7 (M458), and R1a1a7b (L260).  These correspond to my original predicted division.

           R1a also has several known rare groups:  R1a*, R1a1*, R1a1aN, where N = 1 to 6 and 8.  There is also a very rare R1a1a7a.  The asterisk is used for paragroups;  R1a1a* means haplogroup R1a1a without any of those 8 known branches.

           The rare R1a groups are not in my R1a Table.  It’s a shame the corresponding STRs are generally not published in SNP announcements.  I don’t know if the rare groups all together add up to 0.1% or 1% of R1a.  Surely they are less than 3%.  My percentage calculations in my R1a Table do not need adjustment because any Ysearch samples that might belong to these rare clades would probably have unusual STR values, not falling into one of my types, but still be counted in the totals. In my R1a Table, rare samples are included in row R.  That row R might have a few percent from these rare groups, but I don’t know exactly how many.

           Underhill mentions 7 samples (men) from R1a*, 9 from R1a1*, 14 from R1a1a6, and 1 from R1a1a7a.

           Lawrence Mayka, the administrator of the Polish Project, had been assuring me by email that all the Polish Project member tests within R1a had been coming out negative for all the rare SNP subgroups.  So if you are a Polish R1a, you are almost surely R1a1a, the same haplogroup as about half the men from Poland.  About half of these - about 1/4 of men from Poland - are R1a1a7.  These two “about” estimates are approximate;  my data on these SNPs are not random samples, so my population estimates are derived from the types in my table, which are STR based.

           On 17 June Mayka informed me of the first R1a1* (SRY10831.2) (R1a* in the older nomenclature) member in the Polish Project.  My table does not show this single exception because the table is for samples with 67 markers, which that one exception does not have.  On 19 June Mayka informed me of evidence that C type might define a new rare subdivision of R1a slightly older than R1a1a;  if this turns out correct it will be less than 1% of R1a.

 

Underhill

 

           An article was published online, 4 Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.

                       Abstract                     STR Data                   See www.gwozdz.org/R1a.html for more discussion

           I call this article “Underhill” for short, because his is the lead name in the list of 34 authors for this major work.

           This web page about Polish Clades was completely rewritten using this new information.  Recent L260 and M458 test results are consistent with (albeit not full proof of) my previous R1a subdivision into “types” here on this web page about Polish Clades.

           Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-).  See R1a Subdivision for a brief summary of other groups, and for a clarification of what R1a1a* means.

           R1a1a7 is the new M458 haplogroup.  R1a1a7 includes what I have been calling P type and N type here on this web page, even before M458 was available.

           R1a1a* is a new paragroup.  This is M458 negative.  It includes all my other types, particularly K type.

           This Underhill article has data for 158 “Poland” samples (Table 2):

                       R1a1a*:           71 samples      44.9%

                       R1a1a7:           87 samples      55.1%

           The 70% confidence interval for R1a1a7 is about 50% to 60% in the Underhill Poland data.

           Worldwide 77% of the Underhill data is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
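
           The percentages and the confidence interval above can be checked with a short calculation.  This is a minimal sketch, assuming the normal approximation to a binomial proportion;  the method behind the quoted interval is not stated here, so the approximation is an assumption.  The counts are the Underhill counts quoted above.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, total, confidence):
    """Return (p, low, high) using the normal approximation to the binomial."""
    p = successes / total
    # Two-sided interval: half of the excluded probability goes in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / total)
    return p, p - margin, p + margin

# Poland counts quoted above: 87 R1a1a7 out of 158 samples.
p, low, high = proportion_interval(87, 158, 0.70)
print(f"R1a1a7 in Poland: {p:.1%} (70% interval {low:.1%} to {high:.1%})")

# Worldwide counts quoted above: 222 R1a1a* and 68 R1a1a7.
worldwide_share = 222 / (222 + 68)
print(f"R1a1a* worldwide: {worldwide_share:.0%}")
```

           With this approximation the interval comes out close to the “about 50% to 60%” quoted above.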

           M458 results are now coming in for this new SNP test, and the Polish Project R1a is splitting about evenly, with a few percent more R1a1a7 than R1a1a*, although the latter is more common worldwide.

 

Format

 

           Up to here, I have tried to write this web page as news and summary, with links to more discussion below.  I hope anyone having minimal familiarity with genetic genealogy jargon has understood.  If you read this top to bottom, it gets progressively more detailed, with more and more jargon.  I’m sorry about that, but the audience is also readers with genetic genealogy experience who want to know how I came to my conclusions.  If you cannot follow some of this, it is written in a manner that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.

           If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location.

 

Polish Project Assignments

 

           This topic was updated 29 Dec 2010.

           Lawrence Mayka is the administrator of the Polish Project.  Click on the Polish Project web link to see how Larry assigns samples (men) to categories.  The Polish Project has sections for mtDNA and for Y-DNA.  This web document of mine is restricted to Y-DNA, with emphasis on R1a.  I help Larry with assignments to types.

           Haplogroups are defined by SNP mutations.  STR mutations are easier to test, so many samples have STR data without  SNP data.  Predicted assignments are based on STR correlations.

           I mentioned above that FTDNA automatic haplogroup predictions (red text means STR predicted vs green text SNP measured) have about 99% probability.  We use minimum 80% estimated probability for each individual sample in the Polish Project that gets an extended assignment - a subdivision of its FTDNA assignment.  At 80%, many more assignments are possible.  Most extended assignments are better than 80% probability.  Many are better than 95%.

           Many samples do not have extended assignments, but they still have their FTDNA green measured haplogroup (100% probability) or their FTDNA red predicted haplogroup (99% probability).  These bring up the average for the Polish Project as a whole.

           We are confident that the average is better than 95%, which is to say that more than 95% of the Polish Project samples would test positive for the SNP corresponding to their assigned haplogroup.  Excluding R1a the average is likely more than 97%.

           Example:  E1b1b1a2 (V13) is a haplogroup category with some extended assignments:  Larry has me in this category, which is 100% probable because I tested positive for the V13 SNP along with 14 other men in the Polish Project (data in this example is from 25 May 2010).  However, Larry’s listing includes 48 men in this category, based on his analysis of STR correlations:

           15 green E1b1b1a2.  These are of course certain.

           28 red E1b1b1 because FTDNA does not predict beyond that, but these would likely be E1b1b1a2 if tested, because they have STR values close to those samples that have tested V13+, and unlike the samples that have tested positive for other branches of E1b1b1.  Each has at least 80% probability, and many have higher probability.

           2 green E1b1b1 tested for that previous SNP but not for the current V13, but matching in STR values.

           3 green E1b1b1a tested for that previous SNP but not for the current V13, but matching in STR values.

           Note that other E1b1b1 men, both green and red, fall into other categories at the Polish Project, because they do not match V13+ samples closely in STR values.

           End of E1b1b1a2 example.
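
           To illustrate how the 80% minimum relates to a category average, here is a small sketch using the V13 breakdown above.  Treating every STR predicted sample at exactly the 80% minimum is an assumption made only to get a floor;  the individual probabilities are not listed here, and most are higher.

```python
# Breakdown of the 48 men in the V13 category (25 May 2010 data, from the text).
breakdown = {
    "green, tested V13+": 15,
    "red E1b1b1, STR predicted": 28,
    "green E1b1b1, STR predicted for V13": 2,
    "green E1b1b1a, STR predicted for V13": 3,
}
total = sum(breakdown.values())

# Tested samples count as certain. Every STR based assignment is taken at the
# 80% minimum (an assumption), so the result is a floor, not an estimate.
certain = breakdown["green, tested V13+"]
predicted = total - certain
MIN_PROBABILITY = 0.80
expected_floor = (certain + MIN_PROBABILITY * predicted) / total

print(total, f"{expected_floor:.1%}")
```

           Even under this worst case assumption, the expected fraction of the category that would test V13+ stays above 80%, because the tested samples pull the average up.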

           R1a is unique because almost half the Polish Project samples are placed together by FTDNA into R1a1 (M198), which is elsewhere called R1a1a (M198).  Many of our R1a assignments are to types, which are hypothetical, without known SNP definitions.  The minimum 80% estimated probability still applies to each sample and again most are much better than 80%.  For type definitions we are confident that the average is about 90%, which is to say that about 90% of the Polish Project R1a samples assigned to a type would test positive someday for an SNP, unique to that type, not yet discovered.

           “Cluster” and “Borderline” and “Unassigned” category probabilities are discussed below.

           I have been active helping Larry with R1a assignments to types since late 2007.

 

Explanation of the Results Table

 

           Note 20 Nov 2011:  I just updated the Results Table with recent data.  Next, I need to rewrite this explanation and update ResultsTable.xls;  coming soon.

           Edit 23 Sep 2011.  Complete rewrite 28 Jul 2011:

           The Results Table is based on data from the Polish Project, at 67 markers.  The data was downloaded on 18 Jul 2011, at which time there were 1743 Y-DNA samples (men), including 951 with 67 or more markers.  Data was edited for family sets, 55 samples, as explained in my publication.  Total at 67 markers after this edit was 918 samples, which is the total in the Results Table.

           Polish Project Assignments at 67 Markers are taken as representative of Poland, with caveats explained in my publication.

           I did the editing and tabulation in an Excel file, which is available:  ResultsTable.xls

           Column Haplogroup has the conventional main branch haplogroup codes.

           Column Haplogroup Category has labels determined by Mayka.  Most of these are branch haplogroup (or paragroup) codes as defined by ISOGG, with the defining SNP in parenthesis.  Some of these are types as defined by me.  A few of these are clusters.  A few of these categories are for borderline samples, or for unassigned samples as explained in the corresponding sections of this web page.

           Column Short Code is my own code for use in this web page.  Some of these have links to jump to a description of that particular clade.  Some have a Ysearch link in the far right column for the modal haplotype.  Many do not have links because I have not found the time to work on them;  my priority is clades that seem to be concentrated in Poland.

           The Num and % columns are the number of samples for each category, and percent of the total.  The number of samples mentioned in those detailed descriptions (below) may not correspond to the numbers in the table because the particular description updates may have been done at different times than the table update.  The description section has descriptions of some experimental subtypes that are not listed in the Results Table.

           ISOGG names change often due to new SNP discoveries.  See R1a Confusion for examples.

           Those types and subtypes and clusters are my own code letters, for brevity.  Please do not confuse these code letters with official haplogroups.  I have been using such code letters for R1a assignments in the Polish Project since 2007.

           The data is based on the 18 Jul 2011 download, but the table has been edited since then with a few new clusters.

           My Update Folder has an Excel analysis file for most types, plus many more files.

           The Ysearch links provide the full modal haplotypes, using a selected subset of the standard FTDNA set of 67 markers.  I entered these data into Ysearch for our convenience.  All my modal haplotype definitions are available in the Excel file Haplotypes.xls, which also has experimental types not mentioned here.  Below are Ysearch instructions for quickly comparing your haplotype to many  of my types at once.

           Assignment to types is with at least 80% estimated probability.

           Column % provides a good estimate of the frequencies in Historical Poland, insofar as the Polish Project is representative of Historical Poland, as discussed in my publication.

           Unassigned:  The Polish Project has many unassigned samples, but most of them have fewer than 67 STR markers.  The Results Table is based on 67 marker data.  At 67 markers, many samples in “unassigned” categories obviously belong to the corresponding paragroup, indicated by * in the Short Code column.  The exception is R1b, where I lumped the unassigned samples into the known categories, in proportion to the number of assignments in each category;  you can examine my method in that Excel file by checking the formulae in the totals for each category.

           The largest group in Poland is the paragroup R1a1a1* defined by M417 with no subsequent SNP.  I split these into 3 categories (way back in 2007 - still useful):  K type, K Borderline, and Remainder (K, KB, and R).  K type is further subdivided into subtypes (hypothetical haplogroup branches of the hypothetical K branch of M417) including K* for the samples that do not fall into subtypes.  R is the category for samples isolated in STR values, apparently due to a large number of small old branches with very divergent STR values.  The KB samples are between K and R in STR fit, so of course those cannot be assigned with confidence to K or R;  in my Excel sheet I split the KB total - half to a larger “K cluster” and half to R - for statistical summaries;  that’s why the K cluster has 130 samples in the Summary.
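
           The two allocation rules described above (unassigned R1b samples spread in proportion to the assigned counts, and the KB total split evenly between the K cluster and R) can be sketched as follows.  This is only an illustration of the arithmetic;  the function names and all the counts are hypothetical and are not taken from ResultsTable.xls.

```python
def allocate_unassigned(assigned, unassigned):
    """Spread unassigned samples over categories in proportion to assigned counts."""
    total_assigned = sum(assigned.values())
    return {
        category: count + unassigned * count / total_assigned
        for category, count in assigned.items()
    }

def split_borderline(k_type, k_borderline, remainder):
    """Give half of the borderline count to the K cluster and half to R."""
    half = k_borderline / 2
    return {"K cluster": k_type + half, "R": remainder + half}

# Hypothetical counts, for illustration only.
adjusted = allocate_unassigned(
    {"category 1": 30, "category 2": 15, "category 3": 5}, unassigned=10
)
totals = split_borderline(k_type=100, k_borderline=20, remainder=40)

print(adjusted)
print(totals)
```

           Both rules keep the overall total unchanged;  they only move uncertain samples into the categories used for the statistical summaries.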

 

Description of the R1a Categories

 

           This large topic has descriptions for the Y-DNA categories at the Polish Project. Some of these are haplogroups, some are types, some are clusters.  Types and clusters are high confidence hypothetical haplogroups.  Borderline categories are lower confidence.  There is also one Unassigned category for uncertain samples.

           Click the Ysearch web links in the Results Table for modal haplotypes, which are my best fits of web data to groups of men with similar STR data.

           Please don’t get confused.  The following capital letter names are my codes for R1a categories.  Capital letters are also used for the large official haplogroups, but that’s different.

           Some of the following types have my Excel analysis file for my November 2009 publication;  the files are stored in the Supplementary folder.  Many of the following types have my update Excel analysis at PolishCladesUpdate.

 

           A.  Ashkenazi.  This topic needs rewrite, because A type is now known to be a subtype of Z93 (L342.2).

           This seems to be a subtype of K.  This type is discussed in my publication, Part II.  I have about 90% confidence in that subtype status, but I am more than 98% certain that A is a valid clade, not just because of my work, but because the modal haplotype closely matches the various versions of the most common Ashkenazi haplotype, which has been widely studied and reported on the web.  It should be emphasized that not all Ashkenazi match this type, and some men in this type may not be descended from Ashkenazi.  This type is not restricted to Poland.  Levy-Coffman wrote an article about Ashkenazi genetic genealogy;  I noticed discussion in a recent Science article.

 

           B.  Another subtype of K, recently identified by Mayka.  Concentrated in Poland.  The B data cluster lies at the edge of the K cluster.  The node for B type in the R1a tree might be slightly younger or slightly older than the K definition node.  I estimate the former is about 80% probability - that B is truly a subtype of K;  if not then B probably lies just outside of K (node slightly older).  Individual assignments to B type have 80% to 90% probability.

 

           C.  Added to Polish Project in Dec 2009 by Mayka, who notes that Didier Vernade first pointed out the unusual DYS392=13 value in 2007.  DYS392=11 is almost universal in R1a1a.  C type is very small.  There are only 2 Polish Project samples in C type, only 1 at 67 markers, but this type is well isolated on Ysearch, with 4 different samples with 67 markers.  I calculated SBP = 7% using only 37 markers with Ysearch data.  None on Ysearch are identified as “Poland”.  C type differs very much in STR values from the rest of R1a1.  That is evidence for an old node for C type in the R1a tree.

           25 Oct 2010 update:  The C type samples are coming out negative for a new SNP called M417.  Other R1a samples are coming out positive so far, so the prediction that C type has an old node in the R1a tree is being verified.  Of course, it is too early to say how rare M417- samples are;  it is possible more will turn up that do not belong to C type.

           M417 is one of a few new SNPs that look like they will receive the notation R1a1a1x, where x = i, j, k, etc.

           I’ll update this topic when M417 becomes available for purchase.

 

           D.  Update 12 Nov 2011:         Based on 1 Nov 2011 Polish Project data.  Analysis file:  DType.xls.  59 marker definition, cutoff = 9, no samples in the gap at 9 to 11;  SBP = 5.3%. 

           Concentrated in Poland:  Ysearch K49NZ;  34% have origin Poland.

           This type was added to the Polish Project in Jan 2010.  The cluster was brought to my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some time ago, based on DYS462=12.

           Signature (460,481,462,560) = (10,<22,12,18).  Any one of these four markers by itself can distinguish D type with high probability from other R1a1a1i (Z280) samples, but those values can be found individually as independent mutations in other R1a clades.  D type cannot be distinguished using the 25 FTDNA standard markers.  At 37 markers, only 460 is available.

           At 67 markers, 481<22  is an effective signature:  16 total D type:  13 D have 481=21, and only one other R1a sample has the 21 value. 2 D have <21, with no other R1a samples.  One D has the 22 value along with several other R1a.  481=25 is modal for R1a.
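
           A signature check of this kind is easy to sketch.  The marker values below are the D type signature given above;  the function name, the data layout, and the two sample haplotypes are hypothetical, for illustration only.  Markers that a sample has not tested are skipped, since 462 and 560 are not part of the 67 marker set.

```python
import operator

# D type signature from the text: (460,481,462,560) = (10,<22,12,18).
D_SIGNATURE = {
    "DYS460": (operator.eq, 10),
    "DYS481": (operator.lt, 22),
    "DYS462": (operator.eq, 12),
    "DYS560": (operator.eq, 18),
}

def signature_matches(haplotype, signature=D_SIGNATURE):
    """Return (matches, tested) for the signature markers present in the sample."""
    matches = tested = 0
    for marker, (compare, value) in signature.items():
        if marker not in haplotype:
            continue  # marker not tested for this sample
        tested += 1
        if compare(haplotype[marker], value):
            matches += 1
    return matches, tested

# Hypothetical samples, for illustration only.
sample_67 = {"DYS460": 10, "DYS481": 21}
sample_111 = {"DYS460": 11, "DYS481": 25, "DYS462": 11, "DYS560": 17}

print(signature_matches(sample_67))
print(signature_matches(sample_111))
```

           A full match is a strong hint, not proof:  as noted above, each of these values can turn up as an independent mutation in other R1a clades.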

           DYS462 is a standard STR marker at Sorenson, and has been  available for years at Ysearch;  462 is now available at FTDNA with the 111 marker set.  In Nov 2011 I noticed that DYS560=18 is another marker for D type from the 111 set, but that is not available at Ysearch (Nov 2011).

           That DType.xls analysis file provides SBP = 5.3%, although I did manual editing of the definition to improve SBP, providing some selection bias.  On the other hand, isolation of D type is even better than indicated by SBP for two reasons:  Samples just beyond D type, steps 12 and 13, all have solid assignments to other types.  Most of the D samples have 462=12 and a few have 560=18, and those samples beyond step 11 with data have other values at those 2 markers, so a future definition using all 111 markers should provide even better (lower) SBP.  Only 3 D type have 111 markers;  most of the DYS462 data was obtained some time ago by purchasing that marker separately.

           D type seems to be Z280+ Z92-, based on only 1 sample (10 Nov 2011 - columns BW and BX in that analysis file).  Z92 is a new SNP, so not much data is available;  confirmation should be available soon.  D is a subtype of what I had been calling K type;  I’m now using K as a code for the paragroup defined by Z280*.

           D type is clearly a Polish type:  In the Polish Project 10 of the 16 D type at 67 markers indicate “Poland” ancestry;  the exceptions are 2 “Unknown” (one with an obvious Polish name and one with a name that might be Polish), 2 Slovakia, 1 Germany, and 1 Czech Republic.

           On Ysearch, there are 32 samples below the D type cutoff, and 11 of them (34.4%) indicate Poland Origin, which is quite high for Ysearch.  SBP is 15% on Ysearch, implying there are clades near the cutoff that are rare in Poland;  indeed none of the 5 samples in the gap at steps 9 and 10 indicate Poland.  For details see the “Ysearch” sheet in DType.xls.

           Age (ASD sheet cell N12) comes out 1,385 years using all 67 markers.  Old human Y-DNA clades have age older than the raw ASD calculation because of population bottlenecks and because of other statistical adjustments.  However, D type is not very old, so this correction may not be needed.  On the far right of that ASD sheet I sorted markers by age, and I added notes about problem values, and suggested four markers that should be masked out, but the age with these 4 masked out (ASD sheet cell N29) is not much different, 1,216 years.  I see evidence of subclades, so D type might be composed of younger subclades that might be identified with more data.
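
           For readers unfamiliar with ASD ages, here is a minimal sketch of a raw calculation.  It assumes ASD means the average squared distance from the modal haplotype, divided by the mean mutation rate to get generations;  the ASD sheet in DType.xls may differ in detail.  The marker names, values, mutation rates, and generation length are all hypothetical.  The masked argument mimics leaving out problem markers.

```python
def asd_age_years(samples, modal, mutation_rates, years_per_generation, masked=()):
    """Raw ASD age estimate relative to the modal haplotype.

    ASD is the squared difference from the modal, averaged over samples and
    markers; dividing by the mean mutation rate gives generations.
    """
    markers = [m for m in modal if m not in masked]
    squared = [(s[m] - modal[m]) ** 2 for s in samples for m in markers]
    asd = sum(squared) / len(squared)
    mean_rate = sum(mutation_rates[m] for m in markers) / len(markers)
    return asd / mean_rate * years_per_generation

# Hypothetical markers, values, rates, and generation length (illustration only).
modal = {"marker_a": 10, "marker_b": 20, "marker_c": 15}
samples = [
    {"marker_a": 10, "marker_b": 20, "marker_c": 15},
    {"marker_a": 11, "marker_b": 20, "marker_c": 15},
    {"marker_a": 10, "marker_b": 19, "marker_c": 17},
    {"marker_a": 10, "marker_b": 20, "marker_c": 15},
]
rates = {"marker_a": 0.002, "marker_b": 0.003, "marker_c": 0.004}

age_all = asd_age_years(samples, modal, rates, years_per_generation=30)
age_masked = asd_age_years(samples, modal, rates, 30, masked=("marker_c",))
print(round(age_all), round(age_masked))
```

           In this toy data one sample has a two step difference at marker_c, which contributes 4 to the squared sum, so masking that marker changes the age noticeably.  This is the kind of sensitivity that comparing the masked and unmasked ages is meant to reveal.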

           I noted three markers (on the far right of the ASD sheet) that I consider hints for subclades.  Last year in this topic I mentioned Da, with the signature (458,576,444)= (16,20,14) and that still looks promising, but not convincing.  One of the three D samples with 111 markers fits Da, and provides a hint that markers 463 and 715 from the 111 extension might help to resolve Da, so it will be interesting to see what happens as more D men order the 111 extension.

 

           E.  V. Rudich entered a modal for this cluster into Ysearch as ID mW7DP, named “North Eurasian”.  Mayka modified it slightly for the modal used here by me, GNYBG, named “Belarus”.  It’s an excellent type;  on 25 May it has 16 samples at 67 markers in the Polish Project, with SBP = 14%.

 

           FH Clade.  F and H types were suggested by Mayka.  They have the signature (439,511,452) = (11,11,28).  They differ from each other, so I could not make a combined FH type.  I can make a reasonable FH cluster, but it is not necessary, since the FH clade can be better defined as the combination of the three types Fa, Fb, and H.  The original F type (introduced Jun 2010) was split into Fa and Fb in Dec 2010.  DYS452 is not one of the FTDNA standard markers, so not many Polish Project members have this marker evaluated.  Mayka and I helped most of the Polish Project members in FH, and members just beyond FH, to get 452 evaluated.  Samples beyond FH have 452=30.  My analysis files do not use 452 for determination of SBP.  452 would not significantly lower SBP because most of the background near the cutoff for each type are samples from the other two.  In other words, Fa, Fb, and H are very well isolated from the rest of R1a, but not so well isolated from each other.  These three FH types do not seem to be specifically concentrated in Poland (per Ysearch) although they are concentrated in Slavic countries including Poland.  All three types seem quite young, with relatively low STR variance (see the ASD sheets in the analysis files).

 

           FH Borderline.  The borderline samples from Fa, Fb, and H are combined into a single FH Borderline category in the Polish Project, because these clearly belong to the FH clade but have less than 80% probability of belonging to any one of the 3 types.

 

           Fa.  Ysearch YQ6D2.  66 markers, cutoff 9, gap 2.  SBP = 27%.  See FH clade, above.

 

           Fb.  Ysearch EFQM7.  56 markers, cutoff 5, gap 4.  SBP = 23%.  These samples were the original F type, before Fa was split off.  See FH clade, above.

 

           H.  Ysearch 559EE.  58 markers, cutoff 7, gap 3.  SBP = 14.5%.  See FH clade, above.

 

           G.  This type was suggested to me by Mayka, who calls it the Pomeranian cluster.  Pomerania is the name of the region on the south shore of the Baltic Sea including regions of both Germany and Poland.  Marcin Wozniak found the G modal haplotype (at 12 markers) to be very common among Kashubians.  Kashubians consider themselves an ethnic group or nationality within Poland.  It will be interesting to determine if Kashubians in Poland have a higher % concentration of G type than German Pomeranians.  Meanwhile, “Pomeranian” is a convenient neutral name, suggests Mayka.

           G type is mentioned only briefly in my publication because not much data was available to me at that time.  My GType.xls update analysis file with June 2010 data has excellent results:  There are 12 samples in a nice type with SBP = 11.2%.  There is preliminary evidence of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in Ysearch;  see Haplotypes.xls for a list including hypothetical working modals.

           11 Jan 2011 news:  Mayka informs me that one of the new SNPs, L365, is positive for all of 5 G type samples that were tested so far.  A few samples from other types all tested negative for L365.  It seems like G type is included in the new haplogroup defined by L365.  One of those 5 is in that tentative Ga subtype.

           Of course, this is very preliminary.  It is possible, if unlikely, that some of the G type samples still might turn out negative for L365.  It is quite possible other samples not matching G type might be found L365 positive.  I’ll provide updates here.

           Those 5 samples are positive for M417, negative for M458, and negative for a few other new SNPs. 

           L365 is one of a few new SNPs that look like they will receive the notation R1a1a1x, where x = i, j, k, etc.

           This type should not be confused with another G type in the N haplogroup.

           14 May 2011 comment:  Sorry I have not taken the time to update this G type topic.  Recent data continues to verify that G type seems the same as the haplogroup defined by L365, now called R1a1a1i.

 

           I.  Minor edits 5 Aug 2011.  Complete rewrite 4 Aug 2011.  Based on 2 Aug 2011 Polish Project data.  Three analysis files:  IType.xls;    IaType.xls;    IbType.xls.

           I type is discussed in my publication, Part II, page 178.

           On Ysearch, I type is concentrated in Poland and in other Eastern European countries.

           On 28 Jun 2011 Lukasz Lapinski suggested two small clusters based on recent I Borderline samples.  These are currently called Ia and Ib types in the Polish Project.  Ia and Ib are probably not really subtypes of I, as discussed in the following paragraphs.

           I type seems to have structure.  Some of the 67 STR markers are bimodal, which hints at subtypes.  The bimodal markers are not correlated with each other, so I have not been able to identify subtypes with confidence.

           My published 2009 definition for I type, I59, uses 59 of the 67 STR markers, cutoff 8.  That definition still works quite well, with SBP 17.8% (Aug 2011).  I consider SBP <20% sufficient to use the term type.  I found a better definition, I62, cutoff 9, SBP 12.3%.  The two definitions are compared in the file IType.xls.  That 2009 definition had 22.4% SBP in 2009, so it did not quite qualify as a type back then.  (Background means foreign samples with matching STRs that do not belong to the hypothetical I type clade;  SBP is a high confidence statistical limit estimate.)  Six of the 24 using that old definition are excluded by the new definition;  if the latter is exactly valid that means background was actually 25%, which is close to that 22.4%.  The new SBP with the old definition is 17.8%, which is lower than 25%, but I’m comfortable with this because most of my published SBP’s have been shown to be larger than subsequent new data, as intended.  The new definition also captures two samples that were previously borderline, one of which was classified I type anyway because that sample has close matches in I type.  The new definition captures an A type sample;  that sample is a good fit to A type;  this false call is not incompatible with the 12.3% SBP which predicts fewer than 3 samples background (12.3% of 20).  More about A type in a paragraph below.
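
           The background arithmetic in this paragraph, written out as a short sketch.  The numbers are the ones quoted above (24 samples under the old definition, 6 of them excluded by the new one, and 12.3% SBP for the 20 samples of the new definition);  the variable names are just labels for this illustration.

```python
old_definition_count = 24   # samples captured by the 2009 definition (I59)
excluded_by_new = 6         # of those, excluded by the new definition (I62)

# If the new definition were exactly right, these 6 were background in the old one.
implied_background = excluded_by_new / old_definition_count

new_sbp = 0.123             # SBP of the new definition
new_definition_count = 20   # samples captured by the new definition

# SBP is an upper limit, so this is the largest background it predicts.
expected_background = new_sbp * new_definition_count

print(f"{implied_background:.0%}", f"{expected_background:.2f}")
```

           One captured A type sample out of 20 is well within that limit.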

           The new I type definition lacks breadth - changing the number of markers increases SBP.  This is displayed in IType.xls as columns for different marker sets.  For such analysis, the database needs to be restricted to the samples with step not too far beyond the cutoff.  For I type the ranking of markers is sensitive to exactly where the database is truncated, so the automatic definition comes out differently for different truncation of the database.  For the database in the Calculator sheet I truncated the database by removing samples at step > 13, except I left in two samples at steps 14 and 15 that had been classified Ib and IB (discussed below).  The definition for I type is also sensitive to exactly which markers are assumed for the first iteration as the type.  The TypeRank sheet in IType.xls uses the 19 I type samples, excluding only that one that is A type.  I tried quite a few other database truncations, and various assumed sets;  those yielded different definitions with higher SBP.  My published SBP formula is defined in a way that provides a larger number to compensate in part for such selection bias.

           On the other hand, for the dozen or so samples that fit I type best, step < 7, the database and the number of markers do not matter;  the same dozen or so samples are captured as I type for any reasonable definition using a wide breadth of markers.  We can be confident that there is a valid clade corresponding to those dozen best I type samples that will some day be captured as a haplogroup by a new SNP.  Beyond those best dozen samples, steps 7 to 8, there are another dozen or so samples that seem to be I type but at lower confidence;  the background might be significantly more than the best fit SBP.  In my publication I explain why background increases very rapidly with step.  I suppose the actual percent of background might vary from maybe about 1% at step 2, to maybe about 40% at step 8.

           What does this mean?  The simplest explanation:  There was a “father” haplogroup thousands of years ago.  Due to population bottlenecks, only a small number of the males from that father haplogroup are MRCA’s (ancestors of clades that exist today).  The descendants of the I type MRCA participated in a significant population expansion.  I type is the only large clade from that haplospace neighborhood showing up today in the Polish Project.  Other smaller “brother” clades show up, and because there are many more haplotypes at larger step values, those brothers are randomly distributed at large steps in my I type analysis.  This is a simple explanation;  more complex explanations are possible - for example involving migration of tribes from distant lands.

           IB are Borderline, at step just beyond the cutoff for I type, not fitting any other known type, with only about 50% confidence that they will someday end up in a haplogroup corresponding to I type.  Samples are also assigned to I Borderline when the nearest matches at 67 markers are I type.  There are two samples at step 10 (new definition) now changed from I type (old definition) to IB using the new definition.  There are 4 more prior IB samples at steps 12 to 15 now changed to K and KB.  The next update of the Results Table will show slightly smaller totals in I and IB.
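
           A rough sketch of how step and cutoff sort samples into type, borderline, and beyond.  It assumes step is the sum of absolute repeat differences from the modal over the definition markers, and that a sample exactly at the cutoff still counts as the type;  the exact rules are in my publication and may differ.  The borderline width of 2 steps, the marker names, and the values are hypothetical.  Real assignments also look at nearest matches and at fit to other types, which this sketch ignores.

```python
def step(sample, modal, markers):
    """Sum of absolute repeat differences from the modal (assumed definition)."""
    return sum(abs(sample[m] - modal[m]) for m in markers)

def classify(sample_step, cutoff, borderline_width=2):
    """Sort a sample by step: at or below the cutoff is the type (assumed inclusive)."""
    if sample_step <= cutoff:
        return "type"
    if sample_step <= cutoff + borderline_width:
        return "borderline"
    return "outside"

# Hypothetical three marker definition, for illustration only.
modal = {"marker_a": 9, "marker_b": 14, "marker_c": 11}
sample = {"marker_a": 9, "marker_b": 16, "marker_c": 12}

sample_step = step(sample, modal, modal.keys())
print(sample_step, classify(sample_step, cutoff=9))

# With cutoff 9: step 10 is borderline, step 12 is beyond borderline.
print(classify(10, cutoff=9), classify(12, cutoff=9))
```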

           As 67 marker data accumulates in the near future, it is likely a slightly better definition may turn up with even lower SBP, and I type may separate into subtypes with <20% SBP.  The 111 marker data is promising (discussed in a following paragraph).

           A clade that is very well isolated (<5% SBP) has a high chance of soon being defined by a newly discovered SNP haplogroup.  For I type with 12.3% SBP, a new SNP might be older, including some small older clades, or a new SNP might be younger, leaving out some marginal I type small clades.  For example, I recently discovered a new SNP in my own Y-DNA that is slightly older than my predicted type - see L540.

           My maternal Iwanowicz grandfather was I type.  This explains my extra effort analyzing I type.  The two Iwanowicz samples are my maternal first cousin and a man that I found in Poland who seems to be my 4th or 5th cousin.  Technically, one of those should be removed for slightly higher SBP because I recruited that data, but the bias for 20 samples is small (SBP becomes 13.0%).

           One of the Iwanowicz samples was removed for the Results Table, along with editing of family sets in other categories.

           SBP for Ia and Ib are 11.9% and 17.0%.  The definitions have breadth.  These are good results, providing better than 80% confidence of validity for each.  However, these all fall outside I type with my new definition.  Even with my old definition, only 4 of these were I type at high step, the rest were IB.  Using an I code was a bit arbitrary.  Now is not a good time to change their code names, because quite a few new SNPs will soon be available.  With more SNP data small types such as these can soon be renamed with more confidence.

           Back in 2009, and still today, A type overlaps with I type at the margin.  So does the newer D type.  However, A type is coming out positive for the new haplogroup based on the L342 mutation, which seems to be rare in Poland.  Mayka informs me that a WTY for one I type sample has come up L342-, as have two D type samples.  In the past, I have always speculated that A type and I type are both subtypes of a larger K type.  It now seems A type is really in a distantly related branch (L342) of the Y-DNA tree with similar STR values by coincidence.  My prediction that I type is a subtype of K type is still a low confidence speculation.

           The best ranked marker for I type is DYS578=9.  DYS578 has the second slowest mutation rate of the 67 standard markers per the Chandler rates.  The ancestral value is 8.  The 9’s are colored orange in that analysis file IType.xls.  From the 450 Polish Project samples at 67 markers, only 6 samples outside I type have the 9 value, one sample has a 7, the remainder are all 8, consistent with very few independent mutations.  In the analysis file, notice that all the predicted I type samples have the 9 value with one exception, that A type (discussed above) at the last step of I has the ancestral 578=8.  There are two A type with 578=9 at steps 11 and 12;  the former has been tested L342+ (coded SNP results are in column BX of the file).  All the other A’s have 578=8, so the obvious interpretation is an independent mutation to 9 within the A type clade.  The only other 9 in that analysis file is an IB sample at step 12;  that one might be another independent mutation;  on the other hand, perhaps the mutation to 9 is much older than the TMRCA for I type, with that one sample representing a very small clade with an older node.  The Ia and Ib samples all have the ancestral value 8;  that’s evidence that Ia and Ib have old nodes with I - older than the 8 to 9 mutation.

           The second best marker is DYS458=14, again orange in the file.  This is a rapid mutator, so there is more variance.  All but 2 of the I type samples with 578=9 have this 14 value.  This is evidence of youth for I type.  Those two, at 15 and 16, are probably independent mutations, although we cannot rule out the speculation that the 15 is the ancestral value telling us that the 458 mutation to 14 came after the 578 mutation.

           Only 8 I type samples have 111 STR marker data and 2 of those are my Iwanowicz samples, so analysis at 111 is premature.  That said, all but 1 of the 8 have DYS532=12;  that one exception has 11.  Value 11 also shows up for the one Ia sample, and for the two IB samples at 111 markers.  DYS532 seems slow, but there are quite a few 11’s and 12’s in the 71 R1a samples at 111, so 532 will not displace 578 as the best marker for I type.  Lapinski pointed out to me that a couple other markers also show promise at 111 markers for I type.

           [Note inserted on 14 Sep 2011:  There are now 9 I type samples and 7 of them have the signature (532,504) = (12,14).  All other R1a samples have the modal (532,504) = (11,>14).  This is evidence that the I type node with the R1a tree is not much older than the M458 mutation.  DYS532 and DYS504 are two of the new 44 markers in the extension from 67 to 111 markers.  I'll call this pair of values the signature for a hypothetical IPN clade.  This is not strong evidence, because there is a small chance those 2 mutations happened twice independently - in the M458 clade and in the I type clade.  The two exception samples were previously classified Ia and IB, so they might be from branches older than the signature mutations.  I need to update my analysis to include these 2 markers, and update this I type topic.  I’ll be busy with other things for a few months, so I added this note.]

           I modified the Ysearch I type definition, EKVHX for the new I62.  I type has no samples at the step 9 cutoff in the Polish Project;  on Ysearch there is only one Russian sample at step 9 (plus a couple modals), so I type is also well isolated on Ysearch, not just in Poland.

           All 67 markers can be used for estimating the age of I type, because there are no significant recLOH problems with the compound markers in the I type data.  Age comes out 1,208 years.  See the ASD sheet in IType.xls.  Raw ASD age is usually adjusted older due to population bottlenecks, as explained in my publication, but the adjustment should be small for I type because it is not very old and because I type obviously went through a population expansion.  ASD age is highly uncertain due to caveats.

 

End of 5 Aug 2011 rewrite of I Type.  Reminder:  most of this web page has not been updated for quite a few months.

 

           J.  This type was suggested by Mayka.  Only 6 members in the Polish Project, but this type is well isolated at SBP=13%.

 

           K.  News 10 Nov 2011:  It’s looking like what I have been calling K type will end up equivalent to the paragroup Z280*.

           In early October, ISOGG officially recognized Z280 as R1a1a1i.

           I’m awaiting results from the new SNP Z92, particularly for I type samples, so that I can construct a new definition for the K paragroup without Z92.

           In the past I considered A type a subtype of K, but A type turned out R1a1a1h (Z280- Z93+).  This change has no effect on assignment of individual samples to A type, which has very low SBP and hence very high confidence of being a valid clade.

           The following paragraphs in this K topic were written more than a year ago.  My next priority is to rewrite the following with more detail:

           This seems to be a main R1a1a type.  K type is discussed at length in my publication, Part II.  It is larger than others in the Slavic lands.  P and N (below) are just as close in STR values to K as they are to each other, probably because the K modal haplotype is the same as the R1a1 modal haplotype (using the best 34 markers for K).  So far I have discerned a few subtypes of K in my List of R1a types, but I do not have high confidence that they are all exact subtypes of K, as explained in my K Borderline discussion.  I suppose that as data accumulates more subtypes will become clear within K and K Borderline.

           In the Results I use K* to signify those samples that match type K but do not match one of the subtypes.  Although I have high overall confidence in the validity of K type, individual assignments to K* are not as confident.  Because K is located at the modal heart of R1a, I expect some outlier samples from distantly related clades to match K* fairly closely just due to the statistics of random STR mutations.  Because of the possibility of foreign outliers, I consider samples at K step 3 to be K Borderline, even though the cutoff for the K definition is 4.  Even K* samples with step <3 have confidence of only 80 to 90%.  That’s in Poland, where K is fairly well defined with SBP = 26%.  Worldwide K* cannot be discerned with confidence.  The Ysearch SBP for K is 71%, not significant.  That means there are K borderline clades close to the K cutoff that are rare in Poland but causing interference on Ysearch.  This is evident by a glance at the K type results on Ysearch, where “Poland” origin is concentrated at steps <3, and “Poland” becomes progressively less common at higher steps.

           The Kurgans are the ones who domesticated the horse more than 6,000 years ago.  Many scientists think that one pre-Kurgan man is the male line ancestor of all R1a1 men who live today.  The Kurgan hypothesis is controversial, and not necessary for this web page.  You may have noticed that I used the letters of “Kurgan” for my original types and categories during 2008.

 

           Kt, Ku, Kx.  New small clusters need documentation here.

           I have been using the subscripts “z”, “y”, “x”, etc backwards through the alphabet because I am running out of letters for new clusters and types.  These small hypothetical clades seem to be subclades of K, although I do not have high confidence about the subclade status.

 

           Kw.  New small haplogroup, equivalent to the new SNP L366.  Need documentation here.

 

           Ky.  Update 7 Oct 2011:         Based on 1 Oct 2011 Polish Project data.  Analysis file:  KyType.xls.  Ysearch BBB9T.

           Ky type was suggested to me by Mayka on 21 Dec 2010.  There were only 3 samples in Ky last year;  now there are 5.

           That KyType.xls file demonstrates that the same 5 samples are extracted using any number of markers from 11 to 67, although at some of those definitions one or two other samples are also extracted.  The full 67 markers work best, SBP=23%.

           Ky was more isolated last year;  a few samples showed up in the gap, reducing SBP.

           I’m using a hand edited definition, Ky63, using 63 markers, for the following reasons:

           Ky is unusual in that 4 of the 5 samples have an unusual value for at least one marker.  I highlighted these values in red in that file.  Notice also the high step values for those four, 8 through 11, using all 67 markers (column BX), although SBP came out 23%, which is an excellent low result for 67 markers.  The obvious (but speculative) interpretation:  each of the 5 samples seems to be a representative of a branch of this hypothetical clade, where each of the 5 branches has a node not much younger than the TMRCA.

           Hand editing like this does introduce some selection bias, so the calculated SBP=13.6% for Ky63 is misleading.  Countering the selection bias, some if not all of those 4 markers that I masked out might represent small tribal sized subclades, so future prediction of new Ky samples should work better using Ky63 with those 4 removed.

           The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out.  You can see that my selection is a bit arbitrary;  I could have masked less than 4, or more than 4.

           ASD age using all 67 markers comes out 917 years, cell N12.  ASD age using the 63 markers not masked out comes out 878 years, cell N29, not much less.  ASD age has a number of caveats, and 5 samples are not significant, so this age is highly uncertain.  Ky seems young, as haplogroups go.

           Ky does not have a prominent signature.

 

           Kz.  Update 5 Oct 2011:         Based on 1 Oct 2011 Polish Project data.  Analysis file:  KzType.xls.  Ysearch 9QJFQ.

           Kz type was suggested to me by Mayka on 6 Oct 2010.  Mayka speculates this might be a clade of Kazakh origin.  There were only 3 samples in Kz last year;  now there are 6.

           That KzType.xls file demonstrates that the same 6 samples are extracted using any number of markers from 2 to 67, so the definition is not critical for this well isolated type.

           Kz is effectively more isolated than the SBP values (row 12 in that file) indicate, because the samples just beyond Kz are all confidently assigned to other clades and types.  For this reason, those SBP values are moot.

           I’m using a hand edited definition, Kz59, using 59 markers, for the following reasons:

           Kz is unusual in that 5 of the 6 samples have an unusual value for at least 2 markers.  I highlighted these values in red in that file.  Notice also the high step values for those 6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%, which is an excellent low result for 67 markers.  The obvious (but speculative) interpretation:  each of the 6 samples seems to be a representative of a branch of this hypothetical clade, where each of the 6 branches has a node not much younger than the TMRCA.

           Hand editing like this does introduce some selection bias, so the calculated SBP=10.7% for Kz59 is misleading (but moot).  Countering the selection bias, many if not most of those 8 markers that I masked out might represent small tribal sized subclades, so future prediction of new Kz samples should work better using Kz59 with those 8 removed.  Again, this is moot, because any number of markers extract the same samples.

           The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out.  You can see that my selection is a bit arbitrary;  I could have masked less than 8, or more than 8.

           ASD age using all 67 markers comes out 724 years, cell N12.  ASD age using the 59 markers not masked out comes out 704 years, cell N29, not much less.  ASD age has a number of caveats, and 6 samples are not significant, so this age is highly uncertain.  Kz is clearly young, as haplogroups go.

           Additional information supplied to me by Mayka:  Three of the Kz type samples are from non-Polish men who suspect they have Polish male line ancestry, so it is not certain Kz type is Polish.  Kit number 152824 in Kz is from a man who purchased WTY and found the new SNP L399, but that SNP appears to be private, restricted to his family.  Insofar as that man recruited 3 more Kz samples into the Polish Project, Kz seems proportionally twice as large.  My next edit of the Results Table will reduce the percent size of Kz.

           Kz has the prominent signature DYS459b=18.  Mayka points out the additional signature DYS461=12, not one of the 67 marker set;  most of the samples in Kz have been verified with this 12 value.  Since the Polish Project neighbors (step at or beyond cutoff of Kz) are all assigned to other hypothetical clades, we do not know if the signature markers define a larger father clade.

 

           L.  This cluster is highly hypothetical.  It is rare in Poland, but second in size to K in European R1a1.  Larry Mayka suggested this cluster to me.  It is a well known Scandinavian cluster.  I checked it briefly, and it seems to be a “type” by my definition.  However, no Polish Project sample matches at 80% probability yet, so I am not yet using it for classification here.  More documentation about L will be available here when I find time to study it.

 

           L260.  See P type.

 

           L342.2.  New topic 30 Oct 2011.  This SNP was recognized as a new haplogroup by ISOGG during the summer of 2011.  This was an L342 haplogroup category at the Polish Project for a short time in the summer and fall of 2011, but it has been replaced by Z93, because it seems all the L342.2+ samples are also Z93+ in the Polish Project.  Apparently there are very few men elsewhere in the world found to be Z93+ L342.2-.

           Z93 is a more reliable SNP than L342.2, so it is recommended that men first test for Z93.  L342.1 is the same mutation as L342.2, discovered earlier in the E haplogroup.  L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests together are more reliable.  These 4 mutations are in the same segment, which is apparently a segment that mutates relatively rapidly.  Z93 is recommended as the better test for R1a samples that do not fit STR definitions of other R1a haplogroups;  the Z93+ samples can do the L342.2 test.  This information about L342.2 was supplied to me by Mayka.

           The Z93 category has the samples that do not fit the two known subdivisions:  A type and L342T cluster (next topic).

 

           L342T.  New topic 30 Oct 2011.  Based on 26 Oct 2011 Polish Project data.  Analysis file:  L542TCluster.xls.  I just noticed this cluster.

           L342T is not a type, because SBP did not come out low enough.  However, I included this cluster discussion here for the following reasons:

           Seven samples at 67 markers fit my new 48 marker definition for L342T.    There are 19 A type samples, which should all be in the same L342.2 (Z93) haplogroup, but those A samples do not fit L342T;  the closest A’s are at step 8, where the cutoff is 6.  There are 5 more L342.2 (Z93) samples at 67 markers, and those 5 also do not fit L342T, falling at steps 11 through 21.  In other words, L342T is well isolated from the other L342.2 (Z93) samples, including the A type branch.  The one background sample (STR values fit the L342T definition) and the four samples beyond the cutoff, are assigned to K type and to subtypes of K;  Z280 has recently become available for K type;  as those background samples get tested in the future for Z280, my L342T cluster will start looking better.  Let me say that another way:  a cluster should be analyzed with data from its own haplogroup, so L342T should be compared only to L342.2 (Z93) data.  But there is very little L342.2 (Z93) data available, so I used the full R1a database in that xls file.  That means L342T is likely more isolated than it seems right now, so it is more likely to correspond to a valid haplogroup.

           Mayka pointed out to me that some of the L342T samples have Tatar ancestors.  That’s why I used the “T” in the code name.  Of course, Tatars may belong to only a branch of L342T;  I have no idea what fraction of L342T in Poland are Tatar.  And of course Tatars are expected to be a mix of multiple haplogroups.

           Three of the L342T samples, with the name Muchla, are apparently a family set, so they count statistically as only one sample, reducing the current count from 6 to 4, so SBP as calculated in that xls file should be increased (not as good).  This is evidence against L342T being valid.
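
           The family set adjustment in the previous paragraph can be sketched in Python.  This is only an illustration:  the kit labels are made up, and the rule assumed here (each family set counts as one sample, every other kit counts as itself) is a simple reading of the paragraph, not necessarily the procedure used in the xls file.

```python
# Hypothetical sketch: count statistically independent samples in a cluster.
# Assumption: each family set counts once; every other kit counts as itself.

def independent_count(kits, family_of):
    """Return the number of samples after collapsing each family set to one."""
    groups = {family_of.get(kit, kit) for kit in kits}
    return len(groups)

# Made-up kit labels, for illustration only (not real Polish Project kits).
toy_kits = ["a", "b", "c", "d", "e", "f"]
toy_family = {"a": "Muchla", "b": "Muchla", "c": "Muchla"}

print(independent_count(toy_kits, toy_family))  # 4
```

           With six kits, three of them in one family set, the count drops from 6 to 4, as in the paragraph above.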

 

           M.  Needs documentation.

 

           N.  I have been rewriting this topic throughout the late summer of 2011.  Finished 25 Sep 2011.

           Based on 5 Aug 2011 Polish Project data.  Analysis file:  NType.xls

           N type is concentrated in Slavic countries.  N type is discussed in my publication, page 179.

           According to Ysearch and Yhrd N type seems to be spread all around the Slavic lands and central Europe, common from East Germany to Russia.  Within Poland N type seems to be about the same size as P type, both about 9% of men.  Worldwide, N is much larger than P.  N type should be properly studied in a database that is not restricted to Poland.  However, there seem to be subtypes of N that are concentrated in Poland.  See the discussions on N subtypes below.  I’ll continue to watch the Polish Project, because it will be interesting if more data provide more Polish subtypes within N.

           During review of my publication in 2009, the SNP called M458 was published.  I added notes about this to my publication on page 184.  The corresponding haplogroup is now called R1a1a1g.  This haplogroup seems to be equivalent to what I have been calling P type (M458+ L260+) plus N type (M458+ L260-).  M458+ samples may turn up someday that do not fit either N type or P type, but I have not noticed any yet.

           My current definition for N type, N46, is a modal haplotype using 46 of the 67 standard markers.  The cutoff is 8, which means all samples less than step (genetic distance) 8 from N46 are predicted N type (predicted M458+ L260-).  That definition is available in the NType.xls analysis file, in my Haplotypes.xls file, and at Ysearch as 3SEJK.
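
           The cutoff rule in the previous paragraph can be sketched in Python.  This is a minimal sketch under stated assumptions:  step is taken to be the sum of absolute differences at the markers used by the definition, and the marker values below are made up, not the real N46 modal (which is in NType.xls and at Ysearch).

```python
# Hypothetical sketch: classify a sample against a modal-haplotype definition.
# Assumption: "step" is the sum of absolute repeat-count differences over the
# markers in the definition; markers outside the definition are ignored.

def step(sample, modal):
    """Genetic distance from a sample to the modal, over the modal's markers."""
    return sum(abs(sample[marker] - value) for marker, value in modal.items())

def is_predicted(sample, modal, cutoff):
    """A sample is predicted to be in the type if its step is below the cutoff."""
    return step(sample, modal) < cutoff

# Made-up three-marker definition and sample, for illustration only.
toy_modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16}
toy_sample = {"DYS393": 13, "DYS390": 24, "DYS19": 17, "DYS391": 10}

print(step(toy_sample, toy_modal))             # 2
print(is_predicted(toy_sample, toy_modal, 8))  # True
```

           A sample at exactly the cutoff step is not captured, which matches the wording "less than step 8" above.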

           N type age (age means TMRCA) is about 2,000 years.  That’s highly uncertain, but I’m 80% confident that age of 2,000 years is not off by more than a factor of 2 - age 1,000 to 4,000 years.  The M458 mutation is likely much older than the age of N type.

           I’m suspicious that N type includes many younger clades that just happen to have similar STR values, difficult to resolve into clusters or types.  I offer some speculation along these lines in the hypothetical subclade topics below.

           I highly recommend that someone from N type purchase WTY, a commercial product for discovering SNPs.  No sample from N type has been submitted for WTY.  That means there is a good chance that the first N man to submit his sample to WTY will discover one or more SNPs - perhaps an SNP that captures all of N type - or perhaps an SNP that captures about half of N type - or perhaps an SNP that captures a small subclade - or perhaps multiple such SNPs.  My WTY was the first in a long time in my haplogroup, so I found 14 new SNPs.

           It’s interesting to wonder why R1a1a1g seems to be composed of two types that differ substantially in STR values (N and P are separated in haplospace).  I speculate about this in the P type topic.  Much of my P type discussion is also related to N type, so I avoided repeating all the details here;  please read my P type discussion if you are interested in more about N type.

           N seems to be older than P.  I wonder if there are subtypes of N about the same age as P.  I avoid too much speculation in this web page - just enough to indicate my motivation.  I’m wondering if there are clades in various haplogroups, mostly P and N, associated with the origin of the Polish nation - a few centuries more than a millennium ago.

           I have only identified 4 small subclades of N so far:  I am quite confident of Ng type, but less confident of N-Ashk type.  The Nt and Ns clusters are hypothetical;  I have about 70% confidence in them.  These 4 are used for assignments at the Polish Project web page.  I also identified a few clusters with roughly 50% confidence;  these are too speculative for formal assignments.  All are discussed below.  I made speculative assignments based on all these types and clusters within N type, in column CD of that file NType.xls, Calculator sheet.  My file NClusterAssignments.xls has lots of details.  If you are N type, you can find your row with your kit number, and see your speculative assignment.  For the “clusters”, I estimate a 50-50 chance an assignment will need to be changed in the next year or so, as more data becomes available.

           In addition, N type has many bimodal markers, hints at yet more subclades not discussed here.  This is evidence that N type experienced population expansion when it was young (not long after the TMRCA).  More discussion below.

           The paragraphs up to here are a brief summary.  The rest of this topic is a detailed discussion about N type and hypothetical subclades:

           This Sep 2011 analysis includes only data from the Polish Project.  I’ll wait a few months before reviewing data outside the Polish Project.  My last analysis including data from outside the Polish Project for P type, N type, L260, and M458 was Jan 2011.  For those last results, see the following topics, which have not been updated for several months:

           L260M458News

           L260M458

           L260M458TestResults

           L260M458TestResultsDetails

           L260M458Signatures.

           L260M458Results.xls

           For the size of N type, please see the table at the top of this page, where N has only 4 more samples than P (87 vs 83 - 5 Aug 2011 data).  In my 2009 publication N had one less than P (28 vs 29, Table 6 page 169).  The 70% confidence interval for 87 samples is 77 to 98 (8.4% to 10.6%) so N and P are equal in the Polish Project (and by implication in Poland) within statistical sampling accuracy, at about 9%.
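
           A rough check of that sampling interval can be sketched in Python.  The sketch assumes the count behaves like a Poisson variable and uses a normal approximation;  that is an assumption for illustration, not necessarily how the interval quoted above was calculated, so the upper end comes out slightly lower than 98.

```python
# Hypothetical sketch: approximate confidence interval for a sample count.
# Assumption: Poisson count with a normal approximation (sd = sqrt(count)).
from math import sqrt
from statistics import NormalDist

def count_interval(count, confidence=0.70):
    """Return an approximate two-sided interval (low, high) for a count."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(count)
    return count - half_width, count + half_width

total = 918  # total samples in the table at the top of this page
low, high = count_interval(87)

print(round(low, 1), round(high, 1))
print(round(100 * low / total, 1), round(100 * high / total, 1))
```

           The 83 samples of P type fall inside the interval for N, which is the sense in which N and P are equal within sampling accuracy.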

           My 2009 published definition for N type, N45, still works very well.  I did not change that definition at my Jan 2011 update and analysis here in this topic.  This Sep 2011 N46 update is just a tweak, adding and subtracting a few markers to better fit the M458+ L260- SNP data that has accumulated over the past year.  Both definitions are compared in that analysis file NType.xls, Calculator sheet, columns BZ to CC.

           Tweaking the definition like this, to better fit SNP data, introduces some selection bias.  I discuss this issue in the P type topic, where I did a similar tweak;  please read that topic if you are interested in the statistical justification.  The justification is not as good for N type, so I’ll return to this issue in the “old branches” paragraph below.

           This new N46 definition fails to capture only one M458+ sample, which falls at the cutoff step 8.  This new N46 definition captures only one foreigner, L540+, at step 7, the last step of the type.  The other samples at step 8 have tested either M458- or L260+, except one that fits D type well, so they are all confirmed as not N type.  Similarly, 7 of the 20 samples at step 9 have been SNP tested, 11 of the 20 are good fits to other types, with only 2 that are Borderline fits to other types.  In other words, the N46 definition captures the M458+ L260- samples with apparent 98% accuracy.  However, my confidence is about 80% for step 7, about 90% for step 6, and 95% or better for step <6.  Again, please see the P type discussion about confidence for a general explanation.  P and N are similar in this regard.  I have related discussion about N type confidence in the “old branches” paragraph below.
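
           The bookkeeping in the previous paragraph (misses and foreigners for a given cutoff) can be sketched in Python.  The kit labels and step values are made up, and reading the 98% figure as "one miss plus one foreigner among 87 samples" is an assumption about how that number was obtained.

```python
# Hypothetical sketch: compare a step-cutoff definition with SNP results.
# Assumption: a member at or beyond the cutoff is a "miss"; a confirmed
# non-member below the cutoff is a "foreigner". Untested kits are ignored.

def evaluate_definition(steps, confirmed, cutoff):
    """Return (missed, foreign) kit lists for one definition and cutoff."""
    missed = [k for k, member in confirmed.items() if member and steps[k] >= cutoff]
    foreign = [k for k, member in confirmed.items() if not member and steps[k] < cutoff]
    return missed, foreign

# Made-up kits, for illustration only.
toy_steps = {"k1": 3, "k2": 7, "k3": 8, "k4": 7, "k5": 9}
toy_confirmed = {"k1": True, "k2": True, "k3": True, "k4": False, "k5": False}

missed, foreign = evaluate_definition(toy_steps, toy_confirmed, cutoff=8)
print(missed, foreign)            # ['k3'] ['k4']

# Arithmetic for one miss plus one foreigner among 87 samples.
print(round(100 * (1 - 2 / 87)))  # 98
```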

           Almost all the samples near the cutoff for the previous N45 definition have been SNP tested.  This high testing rate is not a coincidence;  Mayka and I have been encouraging men with marginal samples to do the M458 and L260 tests.  (We paid if cost was a problem.)

           The NType.xls analysis file has 10 columns (CF to CO in the Calculator sheet) using from 2 to 67 markers as tentative N type definitions, with automatic selection of the best markers.  For each column, I colored the step count violet for samples captured by that definition.  You can see at a glance that any definition using 2 to 67 markers captures more than 80% of the N type (M458+ L260-) samples, and not many foreigners, so just about any definition works surprisingly well.  In other words, N type is very well isolated in haplospace.

           For the two best automatic definitions, I used boldface to highlight the N type samples missed by that definition, and also boldface to highlight the foreign samples captured by that definition.  I used boldface similarly for my prior N45 definition, using 3 columns (BZ to CB) to demonstrate the effect of 3 different cutoff choices.

           You might try resorting the sheet by column (select everything from cell A14 to the end) to better compare the results.

           The issue of SBP is moot for N type now that the SNPs M458 and L260 are available, but an analysis is instructive:  That NType.xls file has automatic marker selection of N type, and automatic calculation of SBP, disregarding the SNP data.  The best automatic definition, N61, has SBP=13.2%, vs N46 with SBP=14.1%.  However, N46 is a better definition because N61 captures only 80 of the 87 N type plus that same one foreigner.  But still, 8 errors out of 87 (7 misses plus the one foreigner) is not bad for N61, better than the 13.2% SBP (SBP is a high estimate for statistical confidence).

            I considered calling N46 a definition for M458+ L260-, with a different definition for N type as a slightly smaller subtype, leaving out some samples that do not fit the N type definition with lowest SBP.  I could not come up with a convincing definition for such a smaller subtype.  So at least for now, I am considering N type as the same as M458+ L260-, with the understanding that may change in the future.

           The summary conclusion for all those columns of trial definitions:  My preferred N46 definition (column CC) does the best job of capturing N type (M458+ L260-).  Most of the other columns are trying to define N type as slightly smaller, leaving out a few of the samples (not always the same samples).  Most definitions for N type have many samples at or near the cutoff.  My explanation is in the next paragraph:

           Old branches:  By my definition, a type is a hypothetical unique clade.  Of course, every clade is composed of subclades - branches in the Y-DNA tree.  Here is a simple explanation for the previous few paragraphs of discussion:  N type seems to have a few small old sub-clades, where the ancestors (MRCAs) of those small clades differed from the main N type MRCA at a few STR values from the standard 67 set.  Those old branches have many younger branches (twigs) that differ at yet more STRs.  In other words:  the N tree might have a few small branches near the ground.  Those small old clades provide samples in the database with large step, but each sample is from a different twig, so these do not correlate into obvious clusters.  Any clade has statistical outliers with large step;  a few small old branches would provide more outliers for N.

           Those old branches may not be small world wide.  One possibility - a large subclade of N concentrated outside Poland might have one small branch in Poland, corresponding to a man or tribe that moved to Poland long ago.  I am watching for evidence along these lines, but so far this paragraph is speculative.

           In addition, there might be additional large old subclades that seem young.  I consider this possibility in the discussions below.  The age of a clade can be much younger than the node.  I discuss this in another topic, where I call such clades smooth branches.  The N tree might have a number of small smooth trunks with nodes near the ground - that would not necessarily be evident as STR correlations.  On the other hand, the N tree might have only one main trunk, almost smooth, with only a few small branches near the ground.  The actual situation might be more complicated, with multiple trunks of various sizes, at various distances from the ground.  I can’t tell yet from the STR data.  Perhaps another year of additional STR data may help.

           Why am I speculating about N type smooth branches?  I see plenty of hints for more branches in the N type data, but little statistical confirmation.  In the discussion below for subclades, I offer evidence (not definitive proof) for many more significant sub clades within N type.

           This discussion is personal.  It is my opinion, based on my statistical analysis.  Someone might send me an email any day now pointing out a convincing cluster or type in N that I missed.  Someone else might disagree with my analysis about particular hypothetical N subclades.

           Reminder:  This discussion is limited to Poland, as represented by the Polish Project.  Outside Poland there is additional probability of M458 branches showing up someday that fit neither N type nor P type.  Outside Poland I expect yet more N type branches.

           Regarding concentration in Poland, I use percent of samples in Ysearch with “Origin” Poland as an objective measure.  This is discussed in my publication, where Table 1 shows P12 (the P type modal haplotype using only the original standard 12 markers from the Polish Project) with 42%, while N12 has only 14%.  Those numbers 42% vs 14% are not calibrated (because of the unknown concentration of men with Poland origin in Ysearch) but those numbers are a relative indication of concentrated in Poland vs not particularly concentrated in Poland.  My file NYsearch.xls has an update with data from 5 Aug 2011, with N12 at 17%, a reasonable drift due to more data.  That same file has the N46 definition at 24%.  This is evidence that N type, defined using 46 of 67 markers, is only slightly more concentrated in Poland than the 12 marker equivalent.  The simplest explanation:  There are probably large M458 clades outside Poland that match N12 and also match N46 at less than the cutoff, but the Polish samples are only twigs on those branches, descended from one man or family or tribe that moved to Poland a millennium or so ago.  It makes sense that clades within M458 might be regionally concentrated.  That 24% concentration for N46 is of course an average;  there are subclades of N with higher and lower concentration.  I found a few, discussed below;  that file NYsearch.xls has a sheet for each subclade analysis.

           Age:  N type comes out 2,340 years old using all 67 markers.  See cell N12 in the ASD sheet in NType.xls.

           Because of recLOH issues, the compound markers 464, YCA, and CDY present difficulties estimating age in the N type data.  Other compound markers are OK.  The ASD sheet allows a mask, row 21, where I masked out the 8 markers for these recLOH difficulties.  The result, using 59 markers, cell N29, is 2,010 years.  That’s my best guess for the age.
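
           The effect of masking markers on an ASD age can be sketched in Python.  This is a minimal sketch, not the calculation in the ASD sheet:  it assumes the pooled form of ASD (summed squared distance from the modal divided by summed mutation rate), and the marker names, mutation rates, and 25 years per generation are placeholders, not the Chandler rates or the values in NType.xls.

```python
# Hypothetical sketch: ASD age from STR data, with optional masked markers.
# Assumptions: pooled estimate = sum of per-marker ASD / sum of per-marker
# mutation rates; rates and generation length below are placeholders.

def marker_asd(samples, modal, marker):
    """Average squared distance from the modal value at one marker."""
    return sum((s[marker] - modal[marker]) ** 2 for s in samples) / len(samples)

def asd_age(samples, modal, rates, years_per_generation, masked=()):
    """Pooled ASD age in years over all markers that are not masked out."""
    used = [m for m in modal if m not in masked]
    total_asd = sum(marker_asd(samples, modal, m) for m in used)
    total_rate = sum(rates[m] for m in used)
    return total_asd / total_rate * years_per_generation

# Made-up data: marker "C" stands in for a compound marker with a large jump.
toy_modal = {"A": 10, "B": 12, "C": 20}
toy_rates = {"A": 0.002, "B": 0.004, "C": 0.02}
toy_samples = [
    {"A": 10, "B": 12, "C": 20},
    {"A": 10, "B": 12, "C": 21},
    {"A": 10, "B": 13, "C": 20},
    {"A": 10, "B": 12, "C": 22},
]

print(round(asd_age(toy_samples, toy_modal, toy_rates, 25)))
print(round(asd_age(toy_samples, toy_modal, toy_rates, 25, masked=("C",))))
```

           In this toy data the masked age comes out younger than the unmasked age, the same direction as the drop from 67 to 59 markers described above.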

           On the far right of the ASD sheet I sorted the markers by apparent age.  YCAb comes out 20,704 years, demonstrating the recLOH problem.

           The second (apparently) oldest marker is DYS454, at 18,744 years.  This old age is due to only 5 mutations in this slow mutator.  DYS454 is clearly bimodal.  In my notes, I use the Nj code for the 2nd mode with these 5 samples, defined by 454>11.  This is evidence of a subclade, but the statistics are not convincing yet.  Maybe with more data in the near future I might call some of these samples the Nj cluster.  It’s not fair to exclude this “old” marker, DYS454, because there are 7 markers with zero age (no mutations in the 87 samples) and there are 7 more markers with less than 1,000 years apparent age.  The reason for averaging markers is that apparently old markers should be averaged out with apparently young markers.  Anyway, you can go ahead and mask out DYS454 by deleting the mask number at cell AE21, and the new age (58 markers) without 454 is 1,990 years, only a 20 year decrease.  I offer this paragraph of discussion as one example of preliminary evidence of an N type subclade, based on 454>11.

           The third oldest marker is DYS531, at 14,319 years;  at this bimodal marker I use the code Np for the 2nd mode value.  Again, I’m waiting for more statistical evidence for a subclade.

           That far right side of the ASD sheet has more notes about markers with old apparent age.

           Age estimation from STR variance is highly uncertain.  At another of my web pages, I use M458 as an example of age caveats.  I have more discussion about age estimation methods in the P type topic;  please read those two topics if you would like more discussion;  N is similar to P in this regard.

           I’m not too concerned about getting the age of N type correct in Polish data because I suspect in less than a year there will be enough evidence to subdivide N - new SNPs and / or more STR data for better statistical significance.  I suspect there will be younger subclades.  Furthermore, M458+ L260- is not really a tree;  it seems to be a branch of the Y-DNA tree that is well isolated - a long smooth segment near the node;  but I mentioned above my suspicion that the main branch might not be really smooth - there might be significant old branches concentrated outside Poland;  if this is true I’ll need to soon redefine N type as younger, excluding any such significant branches.  I’ll leave it for someone else to estimate the age of M458+ L260- from worldwide data;  I’ll concentrate on N type, and hypothetical subclades in Poland.

           There are 12 samples from N type available with the new 111 STR marker set (18 Jul 2010).  Only DYS532=12 is an obvious signature marker for N type from the 44 new markers;  10 of the 12 have this value.  Modal for R1a is 532=11.  P type also has the 532=12 value, also 10 of 12 samples, so this marker also provides a signature for M458 with good statistical significance.  I type also has the 532=12 value;  see the I type discussion below.

           The following topics are my proposed subclades for N type in the Polish Project.  Please consider reading the section P Type Bimodal Markers, if you would like more discussion of how I use bimodal markers as hints for subclades;  that same discussion applies here for N type.  If you are curious about my code names, like Na, Nb, etc, check out Haplotypes.xls.  Near the bottom of the “Haplotypes” sheet is a list of 70 code names for signatures that I considered for N type subdivision.  I discuss only a few of these here.  I spent a lot of time studying tentative subclades of N because I’m anxious to find significant subtypes that are concentrated in Poland.  I uploaded a total of 17 Excel analysis files associated with N and tentative subclades, all discussed above and below.

 

           Ng.  Rewrite finished 22 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NgType.xls.  Ng is a small subtype of N type, but it has highest confidence.

           This is a very small subtype, only 3 samples, but it is very well isolated.  The definition uses 56 markers, cutoff 4, gap 9.   There are no samples in the gap, from step 4 to 12.  SBP = 15.8%.
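
           The relationship between cutoff and gap can be sketched in Python.  This is an illustration under assumptions:  samples with step below the cutoff are taken as members, and the gap is taken to run from the cutoff up to the step just below the nearest non-member.  The step values are hypothetical, chosen only to reproduce a cutoff of 4 with an empty gap from step 4 to 12.

```python
def split_by_cutoff(steps, cutoff):
    """Return (members, gap) for a list of step values.

    members: steps below the cutoff
    gap:     (first_empty_step, last_empty_step), or None if there is no gap
    """
    members = [s for s in steps if s < cutoff]
    outside = [s for s in steps if s >= cutoff]
    if not outside:
        return members, None
    gap_start, gap_end = cutoff, min(outside) - 1
    if gap_end < gap_start:
        return members, None  # a sample sits right at the cutoff
    return members, (gap_start, gap_end)

# Hypothetical step values: 3 close samples, then nothing until step 13.
steps = [0, 1, 2, 13, 13, 14, 15]
members, gap = split_by_cutoff(steps, cutoff=4)
gap_width = gap[1] - gap[0] + 1  # steps 4 through 12 inclusive
```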

           These same 3 samples are present in Ysearch, where the gap with no samples is from 4 to 11.  Two samples at step 12 are from Germany and Unknown.  There are none at step 13 and 11 samples at step 14.  It seems Ng is concentrated in Poland.

           The signature is (537, 492) = (10, 14).  These are the only 3 Polish Project samples in N type that have any mutation from the 12 value at 492, and they have a 2-step mutation.  492 is ranked 18th of 67 in the extended Chandler mutation rates.  The 10 value at 537 is also rare - only these 3 plus 2 other samples have it in N type in the Polish Project.  The same 3 Ng samples are extracted from N type using 1 to 67 markers.  They are well isolated using as few as 7 markers because they have little variation from each other in the rapidly mutating markers, so those rank well for the Ng definition.  ASD age comes out 619 years using all 67 markers but of course that is a very rough estimate.

           The simplest explanation is that the MRCA of Ng type lived in Poland less than a millennium ago and passed on those 2 unusual mutations.

           The 3 Ng samples fall at steps 4, 5, 6 with the N45 definition of N type, a hint that the Ng node is near the center of the N type branch, not one of those old branches I speculated about, but this is just a preliminary hint.

           I introduced Ng type in Oct 2010;  there have been no new 67 marker data in the STR neighborhood of Ng type, so SBP has been 15.8% since, with the same definition.

           Ng also has what I call the Na signature, discussed below.

           The “g” is only my arbitrary code name that I have been using for the DYS492=14 signature.

 

           N-Ashk.  Rewrite finished 25 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NashkType.xls.  N-Ashk is a small subtype of N type.  Only 4 samples.

           These seem to be Ashkenazi samples.  Mayka pointed out to me that the names seem Ashkenazi, per his experience.  The samples beyond the cutoff are apparently not Ashkenazi.

           Signature (19,385a,594) = (15,12,11)

           I introduced this type in Jan 2011, with SBP 23%, slightly more than my stated 20% limit for using the word type.  Two reasons:  First, the Ashkenazi names are independent evidence of a clade.  Second, the N-Ashk modal haplotype differs from the N modal at 6 markers, which is evidence of a fairly old node in the N branch of the Y-DNA tree.

           I introduced this type as Nca type, because of what I have been calling the Nc signature, DYS19=15.  The “a” meant Ashkenazi, but that was confusing because the samples do not match what I have been calling the Na marker.  Nc is large;  I doubt N-Ashk is a twig in a large Nc branch;  the Nc mutation more likely arose independently in the N-Ashk hypothetical clade.

           This Sep 2011 reanalysis makes a cleaner cluster of data, although still small with only 4 samples.  The 594=11 marker is very clean;  these 4 samples are the only R1a samples in the Polish Project with this value.  SBP increased to 47%, so it is a stretch to call this a type, but the Ashkenazi connection is improved now and the 594=11 marker seems to be strong evidence.  Also, I avoid making changes in classification names without significantly more data, so I’ll continue to call this a “type” for now.  There are no longer any N-Ashk Borderline samples at 67 markers;  the Borderline category is used for apparent Ashkenazi samples that match well with only 37 markers.

           The improved definition uses 58 markers, cutoff 3, no samples in the gap at steps 3 and 4.  (The previous definition used 59, cutoff 5.)  The improvement:  I masked out CDY.  The previous definition used CDYb, missing an Ashkenazi sample that fits the type well, but has recLOH, providing a misleading step of 5 at this one marker.  With that new sample the ranking of markers came out slightly differently, so a few other markers were added or removed from the definition.  The old and new definitions are available in NashkType.xls.  The new definition is also available at Ysearch as 2TZKF, and in my Haplotypes.xls file.

           The ASD age comes out only 668 years, cell N29 in the ASD sheet in NashkType.xls.  Age calculated from only 4 samples is highly speculative, but N-Ashk seems young because of little variation in marker values.  The ASD should use (4-1) in the denominator instead of the total 4 samples (although most genetic genealogists do not do this for small sample sizes);  with that adjustment the age comes out 890 years, but that is still highly speculative.  That cell N29 is using 61 markers;  CDY and 464 are masked out.  (The mask is row 21, which you can easily edit.)  All 67 markers yield 1,024 years, cell N12, because of CDY.  DYS464 has no mutations in the set of 4, so including those 4 markers reduces the age, but I left 464 out because most people routinely exclude the 464 set from ASD.
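
           The (4-1) adjustment can be checked with a short Python sketch.  It assumes the adjustment is a simple rescale of the age by n/(n-1), which is what replacing 4 by 3 in the denominator of an average amounts to;  the function name is hypothetical.

```python
def rescale_asd_age(age_years, n_samples):
    """Rescale an age computed with n in the denominator to use n - 1."""
    return age_years * n_samples / (n_samples - 1)

adjusted = rescale_asd_age(668, 4)  # 668 * 4 / 3
```

           This gives a little under 891 years, in line with the 890 years quoted above if the 668 figure is itself rounded.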

           N-Ashk is quite young, but the node seems old because of the 6 marker distinction from  N type.  The simplest explanation:  N-Ashk has a long smooth branch, having an old node with N, but no further branching near that main node.  The samples in the Polish Project all seem to come from twigs with young nodes.  I speculate that there may actually be some branches of N-Ashk outside Poland.  Perhaps the Ashkenazi ancestor of N-Ashk moved to Poland somewhat less than a millennium ago.  More data will eventually confirm or refute this speculation.

           2TZKF is the modal haplotype at Ysearch, where only two of these samples are present, and where there are 2 additional samples in the gap, from Russia and Belarus;  the simple explanation is that N-Ashk is concentrated in Poland, although there is too little data for confidence.  See NYsearch.xls for my Ysearch analysis.

           N-Ashk has what I call the Nb signature, discussed below.

 

           Nt.  Edited 25 Sep 2011.  New topic 20 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NtCluster.xls.

           With 17 samples, Nt cluster is my largest speculative subclade of N type identified so far.

           SBP = 27%;  this cluster is close to the 20% maximum SBP for Polish Project assignments as a type.  I am suspicious of this Nt cluster due to selection bias:  I considered 70 signatures for N type during the summer of 2011, and carefully analyzed more than 30 of them.  With that many attempts, a false positive is likely.  One of the clusters I analyze will necessarily have the lowest SBP, but that might be just the luck of the data.  No one knows how to calculate the statistical confidence in such a case.  I discovered Nt at the end of this major effort.  If SBP improves with more data for Nt I’ll upgrade it to a type, but if SBP gets worse (bigger) as data accumulates I’ll lose interest in Nt.

           If Nt is valid, it is probably concentrated in Poland.  See NYsearch.xls.  See my Ysearch method discussed above.  I consider this additional evidence that Nt corresponds to a clade, boosting my estimated confidence to about 70%.  We don’t always use 70% confidence for assignments, but everyone is anxious for more subdivision of N type in the Polish Project, so we started using Nt in Sep 2011.

           The Nt definition uses 48 markers, cutoff 4, one sample in the gap at step 4.  The definition is available at Ysearch as 2544E.

           Nt is based on the signature DYS442<14.  However, there are 29 samples with that signature, and 5 of the 17 Nt cluster samples have the N modal 14 value at this marker.  My simple speculative explanation:  the 442 mutation from 14 to 13 occurred independently in the Nt clade after the node with the main N type branch.  Other speculative explanations are possible - those 14’s might be a back mutation within a much larger “father” clade that carries the Nt signature on most but not all samples.

           One Nt cluster sample has the 12 value at 442, which could be another mutation or an independent double mutation.

           If we subtract the 12 Nt signature samples with <14, that leaves 17 more samples (not included in my Nt cluster) with this second modal value at 442.  There are only 3 samples at 15 in all of N, and we expect step up to be more common than step down for a slow mutator (see my publication for references), so that still leaves an excess of samples with <14, implying yet another hypothetical clade with an independent mutation, or a larger “father clade” but this paragraph is getting highly speculative.  I have more speculation like this about independent clades vs large clades in the Na, Nb, and Nc topics below, similar speculation applies to Nt.
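
           The counts in the last few paragraphs can be checked with a short Python sketch.  All numbers come from the text above;  only the variable names are made up.

```python
signature_total = 29    # N type samples with DYS442 < 14
nt_cluster_size = 17    # samples in the Nt cluster
nt_with_modal_14 = 5    # Nt cluster samples with the N modal value 14

# Nt cluster samples that carry the <14 signature
nt_with_signature = nt_cluster_size - nt_with_modal_14

# Samples with the signature that are not in the Nt cluster
signature_outside_nt = signature_total - nt_with_signature
```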

           Thirteen Nt samples match what I call the Na signature, discussed below, but two samples match the alternate mode Nb;  the last two samples are one step away from Na.  This is evidence of an even larger Na father clade, but as discussed below the Na vs Nb signatures may have arisen multiple times independently, so I’m not confident to speculate further along these lines.

           See also NclusterAssignments.xls.

 

           Ns.  Edited 23 Sep 2011.  New topic 20 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NsCluster.xls.  Ns cluster is a speculative subclade of Nt cluster.

           With 6 samples and SBP = 27%, this cluster is close to the 20% maximum SBP for Polish Project assignments as a type.  I am suspicious of this Ns cluster for the same reasons given above for Nt.  On the other hand, Ns looks like a credible subclade of Nt, which adds credibility to both of them.

           If Ns is valid, it is probably concentrated in Poland.  See NYsearch.xls.  The 67% concentration is the best I have seen so far, but this % is highly uncertain because it is based on only 2 Ns samples at Ysearch.  Such as it is, I consider this additional evidence that Ns corresponds to a clade, same as my confidence for Nt.

           The definition uses 47 markers, cutoff 2, no samples in the gap at steps 2 and 3.  The definition is available at Ysearch as A5NSG.

           Ns is based on two signatures.  Ns is my code for DYS446=12, 9 samples, vs 446=13 modal for N type.  Nt is my code for DYS442=13, 5 samples, vs 442=14 modal for N type.  The 6 Ns samples are all at steps 0 and 1 with the 47 marker definition;  the other 3 with that signature are at steps 9 and 10, so it is reasonable to suppose the Ns mutation happened twice independently in the N type clade.  Five of the 6 Ns samples have the Nt signature, but that 6th one has the value 12, two steps from the N modal 14, so it should be considered Nt also.

           See also NclusterAssignments.xls.

           All 6 Ns have what I call the Na signature, discussed below.

 

           Nd.  Edited 24 Sep 2011.  New topic 20 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  Nd53Cluster.xls.

           Based on the signature DYS389I = 14, vs N modal 389 = (13,29).  Nine samples have the Nd signature.  Only 3 of these 9 fit Nd53.  My confidence is only about 50% that these 3 samples really belong to the same clade;  I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic.

           DYS389II has the value 30 for Nd but this is not a mutation at 389II.  See compound markers for an explanation.

           I call this Nd53 because the 53 marker definition is somewhat arbitrary - there is no very likely definition.  It is likely I’ll need to change the definition soon, when more STR data becomes available.  Also, “Nd53” makes it clear that this is not the same as the cluster formed using only the Nd signature.

           Nd53 is not used for assignments in the Polish Project;  see NclusterAssignments.xls for speculative assignments.

           The 3 samples do not have Poland as origin, although I suppose those men have suspicion of Polish ancestry, because that is usually the case for Polish Project samples.  On the other hand, Nd53 might be representative of a clade that is concentrated outside Poland.

 

           Ne.  Edited 24 Sep 2011.  New topic 23 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  Ne40Cluster.xls.

           Based on the signature DYS390 = 24, vs N modal 25.  Twelve samples have the Ne signature.  Only 3 of these 12 fit the Ne40 cluster.  My confidence is only 50% that these 3 samples really belong to the same clade;  I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic.  Nd and Ne have similar status.

           I call this Ne40 because it is likely I’ll need to change the 40 marker definition soon, when more STR data becomes available.

           Ne40 is not used for assignments in the Polish Project;  see NclusterAssignments.xls.

           Only one of the 3 samples has Poland as origin, although I suppose the other two Ne men have suspicion of Polish ancestry, because that is usually the case for Polish Project samples.  On the other hand, Ne40 might be representative of a clade that is concentrated outside Poland.

 

           Na and Nb.  I have been rewriting this topic throughout the late summer of 2011.  Finished 24 Sep 2011.  Based on 5 Aug 2011 Polish Project data.

           Clusters based on DYS464, a marker set that is multimodal in N type.  Analysis files Na45Cluster.xls and Nb32Cluster.xls.

           I introduced Na and  Nb in my publication, page 179 and Table 3.  I have been updating the discussion for Na and Nb here at this web page.  I consistently emphasize that these are speculative subclades.  In retrospect, I should have avoided the word “type” for these because more data over the years has convinced me that the explanation for what is going on is not two subtypes of N.  It will take me a few paragraphs to explain the issue of Na and Nb:

           One way to split the N type data, obvious at a glance, is by the number of markers for 464.  Some samples have 4 values, some have 6, just a few have 5 or 7.

           I understand that the 464 set is the most prone to genetic testing evaluation errors, so this or any categorization using 464 will have uncertainties.  If 464 is taken in combination with other markers that means some statistical uncertainty due to possible evaluation errors at 464.  Specifically, a sample in a database with 4 values at 464 might really have 5 or more values, and vice versa.

           Follow my links if you wish to read more about compound markers and recLOH issues, which introduce confusion for the 464 marker set.  Briefly, copy mutations can increase the number of 464 markers, but recLOH mutations might reduce the number.  A single copy mutation can change more than one value in the set.  Copy mutations and recLOH mutations are rare, about the same frequency as very slowly mutating STR markers.  Net mutations in the 464 set are common, with frequency among the fastest in the standard 67 set.  For the Chandler rates, each of the four markers 464a to 464d are assigned a rate 1/4th the net rate for single mutations for the set of 4.

           I use Na as my code for the signature 464 = (12,12,15,15,15,16) - the most common value set for 464.  28 of the 87 samples.  My Nb signature is the next most common, 464 = (12,15,15,16).  16 samples.  I say 464 is multimodal because there are also two sets with 4 samples each;  that’s why I’m using Na as a signature even though it is the modal value for N type as a whole.  This is for the 87 N type samples in my 5 Aug 2011 download of the Polish Project;  the proportions change every few months as data accumulates due to the statistics of small sample sizes.

           Na and Nb differ by 2 steps following the Ysearch method, but that is misleading because Na can turn into Nb in a single recLOH mutation, which might have happened more than once in the past in this N type database.  Nb can turn into Na with a single copy mutation.  I may not be exactly correct in this paragraph if my assumption of the structure of 464 in N type is incorrect, but this paragraph is certainly a brief example of the kind of confusion that arises with 464.

           It is easy to construct clusters using 464 in N type.  Too easy.  Too many choices for clusters, as I discuss in the following.  I could not come up with clusters with good statistical confidence.  My Excel analysis files allow setting maximum step, so I also tried using maximum 1 for the 464 set - 1 step for any variation of a sample from a trial definition;  still I found no clusters with confidence.

           My analysis files allow an alternate method, treating the 464 markers as individual markers.  This is the method I used in my 2009 publication;  still I found no clusters with confidence.

           My default is to follow the Ysearch method for counting step at 464, although this method is obviously less than perfect.

           My list of code names is available in Haplotypes.xls.

           When trying individual markers, DYS464b is best.  In my notes I use Na1 - 464b<14, and Nb1 - 464b>13;  these two signatures neatly split all the N type data.  Na1 captures all the Na samples plus mostly  samples with more than 4 markers;  Nb1 captures all the Nb plus mostly samples with 4 markers;  there are exceptions.  Using Na1 vs Nb1 I come to the same conclusions as using Na and Nb, discussed below.

           DYS464e provides another way to split the data.  In my notes I use Nx - any value for 464e, and Ny - no value for 464e.  Nx captures all the samples with more than 4 markers including the Na samples;  Ny captures all the samples with 4 markers including the Nb samples.  Using Nx vs Ny I come to the same conclusions as using Na and Nb, discussed below.
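
           The three ways of splitting the data by 464 can be sketched in Python.  This is an illustration under assumptions:  the values are taken to be listed in the order 464a, 464b, and so on, so the second value is 464b and a sample has a 464e value whenever it lists more than 4 values.  The function name and return layout are hypothetical.

```python
NA_SIGNATURE = (12, 12, 15, 15, 15, 16)  # Na: most common 464 value set
NB_SIGNATURE = (12, 15, 15, 16)          # Nb: next most common

def classify_464(values):
    """Label one sample's DYS464 value set under the three schemes."""
    values = tuple(values)
    if values == NA_SIGNATURE:
        signature = "Na"
    elif values == NB_SIGNATURE:
        signature = "Nb"
    else:
        signature = None  # neither of the two named signatures
    by_464b = "Na1" if values[1] < 14 else "Nb1"   # 464b<14 vs 464b>13
    by_464e = "Nx" if len(values) > 4 else "Ny"    # any 464e value vs none
    return {"signature": signature, "464b": by_464b, "464e": by_464e}

na_labels = classify_464(NA_SIGNATURE)
nb_labels = classify_464(NB_SIGNATURE)
other_labels = classify_464((12, 13, 14, 14, 15, 16))
```

           The two named signatures land in Na1/Nx and Nb1/Ny respectively, which matches the statements above that Na1 and Nx capture all the Na samples while Nb1 and Ny capture all the Nb samples.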

           Consider my definitions Na45 and Nb32, with 45 and 32 markers.  See those two Excel files for details.  My choices for 45 and 32 are arbitrary.  Those files show columns with trial definitions using a wide range of markers, automatically chosen by rank.  A wide range of marker counts seems roughly equivalent.  It is remarkable how many samples fit very well using up to 50 markers for trial definitions:  Na has 16 samples at step zero using 11 markers, and 15 samples at step less than 2 using 45 markers;  Nb has the same 14 samples at step zero using from 11 to 32 markers.  When the 464 set is excluded from the definition, some Na samples fit the Nb definition, and some Nb samples fit the Na definition.  One simple explanation:  Na45 and Nb32 might correspond to two very young clades.  However, there is an alternate explanation:  Na45 might correspond to two or more young clades, and Nb32 might correspond to two or more young clades, and they may be a “bushy” set of branches where some Na45 clades are connected by nodes to some Nb32 clades.  I see no way to be confident that most of the Na samples are in a branch distinct from a branch with the Nb samples.  I suppose if your sample matches Na45 at step zero or one, there might be better than a 50-50 chance that you and others who match at <2 belong to a unique clade that may someday have an SNP definition, but such a clade will surely exclude some of the step <2 samples, and include some samples from steps 2 and 3, so Na45 does not provide a definition.  The same can be said if you match Nb.

           Some samples that fit the Na signature at 464 = (12,12,15,15,15,16) come out at high step using more markers.  Similarly, some samples that fit the Nb modal at 464 = (12,15,15,16) come out at high Nb step using more markers.  You can see this at a glance in those two files.  Two opposite simple explanations come to mind:  Na and Nb may have independently arisen more than once, followed by population expansion - multiple branches in the N tree.  The opposite explanation:  Na and Nb sets might be signatures for two old clades that each have a few old subclades - two main N branches that have a few old branches and where both Na and Nb have a bushy clump of branches at the ends.  More complicated explanations also come to mind.  That second explanation, two main branches, is attractive, but I see no proof that it is true, or even highly likely.

           In the file NclusterAssignments.xls, I make speculative assignments.  Most of the Na45 and Nb32 samples fit other more believable types and clusters.  I went ahead and assigned the few leftovers to Na and Nb, but these are just speculative assignments, meant to show you which of my clusters you best fit.

           Summary:  There is not enough evidence to consider Na and Nb to be two unique subclades of N.  Maybe Na45 and Nb32 do correspond to the top of two main branches of the N tree, with most of the samples that fit Na45 or Nb32 belonging to the corresponding clades.  Maybe not.  I see no way of ruling out multiple independent clades (branches far apart in the tree) for both Na45 and Nb32, or for any other definitions based on the 464 set.  Perhaps in a year or so more STR data will provide convincing subclades along these lines.  Perhaps in a few years SNPs will be discovered to subdivide N type.

           I have more discussion along these lines below, in the Nc topic.

           At all 67 standard markers, the Na and Nb modal haplotypes are essentially the same for STR markers other than 464.  I say “essentially” because the rapid mutators, particularly the CDY pair and DYS576, typically vary modally from month to month due to the statistics of small samples.  At CDYb, Na type signatures with multiple markers are typically modal 40, while Nb are typically modal 39, but this marker always ranks poorly for definitions because of the wide range of values.  In Nb less than 1/3 of the samples typically have the modal value at CDYb.

           The Russian site independently came up with this same haplotype distinction.  Two modal haplotypes are available on Ysearch, from the Russians.  Each uses 78 markers and each matches my Na and Nb types at 67 markers, including that 39 value for CDYb in Nb.  Central European-1 Modal GTAVR corresponds to my Nb, using only 4 values, 464a-d.  Central European-2 Modal 495M5 corresponds to my Na, using 6 values, 464a-f.

 

           Nc.  New topic 25 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  Nc32Cluster.xls.

           My Nc code is for the signature DYS19 = 15, compared to the modal value of 16.  Similar to Na and Nb, my publication and previous versions of this web page proposed Nc as a tentative subdivision cluster of Nb.  The samples with the 15 value last year had mostly Nb samples, but this year that correlation is not significant.

           My opinion of Nc is very similar to my opinion of Na vs Nb:  No confident conclusion.  Nc might correspond to a single large clade.  Then again, Nc might correspond to independent unrelated clades where the Nc mutation arose independently.

           My Nc analysis complements my Na and Nb analysis:  If you look at Nc32Cluster.xls, you see at a glance that the best fit samples are a mix of Na and Nb.  If you look at Na45Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16.  If you look at Nb32Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16.  If Nc32 vs modal 16 is a valid division of N type, then Na vs Nb cannot be valid.  If Na vs Nb is valid, Nc vs modal 16 cannot be valid.  All three files have, at the bottom, at large step, some Na, Nb, and Nc samples.

           Next, let me consider the 4 combinations using DYS464 and DYS19:

           Nbc42Cluster.xls is my analysis file using both the Nb and Nc signatures together.

           Nac32Cluster.xls is my analysis file using both the Na and Nc signatures together.  This is very different than Nc32;  the latter has a mix of Na and Nb;  the former is a new analysis using the additional restriction to Na match.  They both have 32 markers by coincidence.  As in Na45 and Nb32, the number of markers is my arbitrary choice;  there is no obvious best choice;  the number of markers will likely change as data accumulates for all these definitions where I specify the number of markers in the code name.

           Nb5_37Cluster.xls is my analysis file using my Nb5 signature, which is the 4 Nb DYS 464 markers plus the modal value at DYS19.

           Na7_26Cluster.xls is my analysis file using my Na7 signature, which is the 6 Na DYS 464 markers plus the modal value at DYS19.

           In the file NclusterAssignments.xls, I make speculative assignments to these 4 clusters, but samples that fit one of the more confident types (Ng and N-Ashk) and clusters (Ns and Nt) get that more confident assignment if they also fit these 4 combinations.

           The 3 Ng samples are all Na, but they are a mix of values at DYS19.  The neighborhood (just beyond the Ng cutoff) is all Na.  This is a tantalizing hint of a “father” clade with the Na signature.

           The 4 N-Ashk samples are all Nb, but in this case the neighborhood is a mix of Na and Nb.  This is a hint of an independent mutation to Na somewhat older than N-Ashk.  Three of the 4 N-Ashk are Nc, as are most of the neighborhood.  The other has the modal DYS19=16 value.  This is a hint of a father clade with the Nc signature, DYS19=15, plus recent back mutations to the modal value.

           The 6 Ns samples are all Na, with a neighborhood mostly Na but some Nb.  The Ng, N-Ashk, and Ns samples are all very far from each other.  You can see this in the file NclusterAssignments.xls, where each type and cluster has a column, with step value for each samples.  I consider this strong evidence against a large Na clade;  it seems more likely that the Na (464=12,12,15,15,15,16) set arose independently by copy mutation 3 times in these three hypothetical clades.

           Nt, the purported father of Ns, has 17 samples;  13 Na signature, 2 Nb, 2 one step away from Na.  It is reasonable to speculate that those 2 Nb are due to an independent recLOH in Nt, and that the father clade has the Na signature.  Unfortunately, it is also reasonable to speculate that there were multiple mutations to the Na signature within Nt, making the 464 set irrelevant.

           The 3 Nd samples match Nb but again the immediate neighborhood is a mix of Na and Nb, again evidence for independent mutations at 464.

           Ne is another example of a mixed Na Nb neighborhood.  In this example, 2 of the 3 match Na.  That third one, 464=(12,13,14,14,15,16) is 3 steps away from Na but those two 14 values are a hint at another copy mutation.

           NYsearch.xls has a sheet with Ysearch data analysis for each type or cluster.  The Polish percent, in boldface, is my important result.  Although this analysis is based on very little data for each of those 4 combination clusters here is the tentative finding:  Nbc42 is not concentrated in Poland.  The other 3 seem to be concentrated in Poland;  that is evidence that each of those 3 clusters (Nac32, Nb5_37, and Na7_26) harbors one or more clades that are concentrated in Poland.

           Ns seems related to Na7_26, because 4 of the 6 Ns samples match at step zero, but the other 2 are at steps 2 and 3, so this technique of 4-way combination is good for hints, but not conclusive.

 

           Summary;  Na, Nb, and Nc clusters:  25 Sep 2011.  That was a lot of analysis to justify my opinion that Na, Nb, and Nc, although tantalizing, cannot be trusted without correlation to more markers.  N type probably experienced population expansion not long after the TMRCA, whereby the main N branches come out today with similar STR distributions.  DYS464 is multimodal;  DYS19 is bimodal;  the 4 main combination modes based on 464 and 19 provide evidence of twigs that are concentrated in Poland.  I bet there are many more small Polish clades based on Na, Nb, and Nc waiting to be discovered in N type.  I’ll continue to watch the STR data.  New SNP markers within N type someday will be even better.

 

           P.  Complete rewrite finished 16 Aug 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  PType.xls

           P type is the main topic in my publication, Part II.  P type is significantly concentrated in Poland, and in the Czech Republic.  It is found at lower frequency in other Eastern European countries, and in eastern Germany.  About 9% of Polish males carry P type Y-DNA.

           After my publication, an SNP called L260 was discovered, found to be equivalent to P type, confirming my prediction that P type corresponds to a haplogroup, R1a1a1g2.

           The “father” haplogroup R1a1a1g (M458) is composed of  what I have been calling N type (L260-) and P type (L260+).

           My current definition for P type, P43, is a modal haplotype using 43 of the 67 standard markers.  The cutoff is 7, which means all samples less than step (genetic distance) 7 are predicted P type (predicted L260+).  That definition is available in the PType.xls analysis file, in my Haplotypes.xls file, and at Ysearch as 8U92G.
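
           The prediction rule can be sketched in Python.  This is an illustration under assumptions:  step is computed as the sum of absolute repeat differences over the markers in the definition, which ignores the special handling that compound markers such as 464 need.  The three-marker modal and the samples are hypothetical placeholders, not the P43 values.

```python
def step(sample, modal):
    """Genetic distance: summed absolute differences over the modal's markers."""
    return sum(abs(sample[marker] - value) for marker, value in modal.items())

def predicted_in_type(sample, modal, cutoff):
    """A sample is predicted to belong when its step is below the cutoff."""
    return step(sample, modal) < cutoff

# Hypothetical values for illustration only (not the P43 definition).
modal = {"DYS19": 17, "DYS390": 25, "DYS391": 10}
near = {"DYS19": 17, "DYS390": 24, "DYS391": 10, "DYS385a": 10}
far = {"DYS19": 15, "DYS390": 22, "DYS391": 12}

near_step = step(near, modal)  # markers outside the definition are ignored
far_step = step(far, modal)
```

           With cutoff 7, the sample at step 1 is predicted to belong and the sample at exactly step 7 is not, since the rule is "less than" the cutoff.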

           P type age (age means TMRCA) is about 1,600 years.  That’s highly uncertain, but I’m 80% confident that age of 1,600 years is not off by more than a factor of 1.5 - age 1,100 to 2,400 years.  The L260 mutation is likely quite a bit older than the age of P type.
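
           As a quick illustration of what a factor of 1.5 means here (divide for the lower bound, multiply for the upper bound), assuming the quoted 1,100 is simply the lower bound rounded to two significant figures:

```python
age = 1600      # estimated TMRCA in years
factor = 1.5    # uncertainty factor from the text

low, high = age / factor, age * factor
print(round(low), round(high))
```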

           It’s interesting to wonder if the age of P type is associated with the historical appearance of Poland somewhat more than 1,000 years ago.  It’s also interesting to wonder why P type is so isolated in haplospace - why there are so few men alive today with STR values slightly different than P type.  I added a bit of speculation along these lines to my publication, but frankly, no one knows the answers.  I offer a little more speculation at the end of this topic.

           That was a quick summary.  Next comes detailed discussion:

           My published 2009 definition for P type, P36, still works very well.  My prior update definition, Sep 2010, P46, still works very well.  I updated the definition Aug 2011;  P43.  All 3 definitions are compared in that analysis file PType.xls, Calculator sheet, columns BZ to CB.

           The August change is only a slight tweak;  I dropped 3 slowly mutating markers that are mutated in two samples recently found L260+;  these two were at steps 7 and 8 using the prior P46 definition;  they are now at steps 5 and 6 with the new P43.  More discussion about this below.

           There is only one L260+ sample not captured by P43.  This sample is at step 9 using any of my 3 definitions.  The problem is DYS464, where this sample obviously had a serious recLOH mutation, expanding the number of 464 markers from 4 to 6, yielding step 4 for only that compound marker.  The net step 9 would become step 5 without 464.  Nevertheless, I cannot drop 464 from my definition, because this marker helps a lot to discriminate P type from N type.  I have more discussion below about this outlier sample.

           P43 captures only one sample not P type, an NB sample, which means N Borderline.  Although this sample fits N better than P, hence the NB prediction, it has not been tested for L260 or M458, so its status is uncertain.

           There are 10 samples at step 6 (5 Aug 2011), the last step of the type, where uncertainty is highest.  Seven of these have been tested L260+, confirming membership in this haplogroup.  This high testing rate is not a coincidence;  Mayka and I have been encouraging men with marginal samples to do the L260 test since it became available in Apr 2010.  (We paid if cost was a problem.)  One of the step 6 samples not L260 tested is the NB sample of the previous paragraph.  Another is M458+ and not a fit for N type, so it can be confidently predicted L260+ (although the L260 test would be nice).  The 10th step 6 sample has neither SNP test, and is not a fit for N type, so it is assigned PB, a Borderline assignment intended to encourage SNP testing.  There are two other PB samples that were step 6 using the prior definition;  these are now step 5.  We will probably expand the PB category, so the next assignment update should have a few more PB samples, again to highlight the ones most likely to benefit from SNP testing.  I estimate the PB samples have about 75% probability of being proven L260+.

           P43 summary:  The P43 definition, cutoff 7, captures 90 samples as P type.  One L260+ sample is not captured because of DYS464.  One captured sample at step 6 is probably N type.  So the predicted P type is 90 samples and the predicted (some actual) L260+ is also 90 samples (5 Aug 2011).

           The statistical accuracy of my P type definition may seem like about 98% - 100% below step 6.  However, my confidence is more like 90% - I’m 90% confident that more than 90% of future samples that match P43 below the cutoff step 7 will be L260+ if tested - - 95% confidence below step 6.  That confidence is not calculated - it’s my estimate to account for two issues:  First, I have removed from the definition markers that are mutated only for L260+ samples at high step (mentioned above and discussed further below) but more such mutated markers are bound to show up for future samples, so future predictions are not quite as good as the adjusted fit implies.  Second, there may still be a very small L260- clade that just happens to have STR values close to P43 due to the luck of random STR mutations.  For samples without Polish ancestry the probability is higher for these two issues;  this confidence discussion is limited to Poland, as represented by the Polish Project.

           According to Pawlowski, along with further evidence in my publication, P type (L260+) is concentrated in Poland.  I verified this and other Polish types using both Yhrd and Ysearch.  P has fewer mutations than N and K, so it must be younger.  In my publication I estimated that about 8% of Polish men have P type male line ancestry;  my current estimate, from the Results Table, is 9.0% (calculated from the edited data 28 Jul 2011) -- calculated 70% confidence interval 8.0% to 10.0% -- 95% confidence interval 7.1% to 11.0%.
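
           For readers who want to see how intervals of this kind can arise, here is a rough sketch using the normal approximation to the binomial.  This is an assumption, not necessarily the calculation behind the figures above, and the counts (83 of 918) are taken from the abbreviated table at the top of this page, which may differ from the edited data.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(k, n, confidence):
    """Two-sided normal-approximation interval for a proportion k/n."""
    p = k / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

k, n = 83, 918  # assumed counts: P type samples and total samples
interval_70 = proportion_interval(k, n, 0.70)
interval_95 = proportion_interval(k, n, 0.95)

for label, (low, high) in (("70%", interval_70), ("95%", interval_95)):
    print(f"{label} interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

           The results land close to the quoted intervals but not exactly on them, which is expected if the original calculation used different counts or an exact binomial method.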

           Ludvik Urban pointed out to me that P type is common in the Czech Y-DNA Database.  FTDNA also has a Czech Y-DNA Project.  There is not enough data yet to calculate if the frequency in the Czech Republic is greater or smaller than the approximate 9% frequency in Poland (as represented by the respective projects).

           Karen Melis, administrator of the FTDNA Zamagurie Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on the border of Slovakia with Poland.  I’m not sure of the concentration in Slovakia.

           It will be interesting if more data in the future allows resolution of subtypes of P type by region.

           I added a “Ysearch” sheet to that PType.xls analysis file, with update analysis from Ysearch.  That file has a copy of the 123 matches at step < 9 (12 Aug 2011) from my P43 definition, 8U92G.  Seven of those matches are modals, segregated to the bottom of the sheet and not used for analysis.  The cutoff is 7, same as in the Polish Project, but SBP is 19%, not very good.  The reason is 10 samples at step 7.  Only two of these at 7 indicate “Poland” for Origin, 3 Germany, 2 Scotland, 2 Unknown, and 1 USA.  This may be a sign of a clade outside Poland with STR values close to the P type cutoff;  I doubt that;  more likely, these are outliers from more distant clades, because there are a huge number of samples at step >9 so of course some samples from those clades will fall at step 7 just due to the luck of random mutations.  In other words, P type is a relatively small haplogroup on Ysearch, and the background is larger on Ysearch than in the Polish Project, so of course SBP will be larger.  Still, 19% is pretty good on Ysearch.

           Those Ysearch results include 11 samples with “Unknown” or “USA” for Origin, so I removed those for Origin analysis, 105 net samples.  Below the cutoff step 7, 54% are Poland;  that is very high;  the overall percent of samples in Ysearch from Poland is a very low percent.  At steps 7 and 8, 26% are Poland, showing the expected drop off for outliers.  Germany and other Slavic countries also have significant percent P type;  there is a table with details in that Excel sheet.  This updates my evidence that P type (L260+) is concentrated in Poland.

           The isolation of P type in the Polish Project is now even more impressive than at the time of my publication.  Most of the samples at steps 7 and 8 are good fits to other newly discovered types (see PType.xls, column CB), so there are now fewer borderline samples just beyond the edge of P type.  Two of the step 7 samples are my maternal cousins;  their close match to P type is what got me interested in this topic;  if I had not noticed this someone else may have done a similar study and those two samples would not be in the database;  statistically those two should be edited;  I edited by -1 in the Results Table, but I do not do minor edits in the analysis files.  One of those cousins is tested M458- so I have high confidence both belong to I type, not P type.

           This Aug 2011 analysis does not include L260 data from other projects.  I’ll wait a few months before reviewing L260 data outside the Polish Project.  My last analysis including data from outside the Polish Project for P type, N type, L260, and M458 was Jan 2011.  For those last results, see the following topics, which have not been updated for several months:

           L260M458News

           L260M458

           L260M458TestResults

           L260M458TestResultsDetails

           L260M458Signatures.

           L260M458Results.xls

           P type Age - TMRCA:  My publication explains the ASD method.  The ASD sheet in PType.xls provides 1,778 years using all 67 markers.  However, 385b should not be used because 5 samples have recLOH mutation from 14 to 10, providing the unreasonable ASD age of 11,007 years at this one marker.  Also, 464 has obvious recLOH issues;  my ASD sheet, treating 464a to d as independent markers, comes up with an average of 2,093 years for these 4.  Most people who figure ASD age exclude 464.  It is interesting that 385a has no recLOH (10 to 14) so far;  I do not understand why not.  The other compound markers are not issues because the P type values are such that the apparent recLOH cause only step 1 mutations, so they might as well be included.

           1,637 years is the ASD age, cell N29 of the ASD sheet, using 62 markers; excluding 385b and excluding the four 464.  Exclusion is by typing a blank or zero into a mask, row 21, so you the reader can easily verify that removing compound markers other than 385b has no significant effect.
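
           The following is a simplified sketch of an ASD-style age with a marker mask, not a copy of the ASD sheet.  It assumes the per-marker age is the average squared difference from the modal value divided by that marker's mutation rate, times years per generation, and that the overall age is the average over unmasked markers.  The marker names, mutation rates, and the 25 years per generation are placeholders chosen for illustration.

```python
def asd_age_years(samples, modal, rates, mask, years_per_generation=25):
    """Average, over unmasked markers, of
    (mean squared difference from modal) / mutation rate * years per generation."""
    ages = []
    for marker, keep in mask.items():
        if not keep:  # a 0 (or blank) in the mask excludes the marker
            continue
        asd = sum((s[marker] - modal[marker]) ** 2 for s in samples) / len(samples)
        ages.append(asd / rates[marker] * years_per_generation)
    return sum(ages) / len(ages)

# Toy data: marker "C" has one large 4-step jump, as a recLOH event might produce.
toy_modal = {"A": 10, "B": 12, "C": 14}
toy_rates = {"A": 0.002, "B": 0.004, "C": 0.002}  # placeholder rates per generation
toy_samples = [
    {"A": 10, "B": 12, "C": 14},
    {"A": 11, "B": 12, "C": 10},
    {"A": 10, "B": 13, "C": 14},
    {"A": 10, "B": 12, "C": 14},
]

all_markers = asd_age_years(toy_samples, toy_modal, toy_rates, {"A": 1, "B": 1, "C": 1})
without_c = asd_age_years(toy_samples, toy_modal, toy_rates, {"A": 1, "B": 1, "C": 0})
print(round(all_markers), round(without_c))
```

           Because the difference is squared, a single multi-step jump dominates the average, which is why masking a marker with obvious recLOH events changes the apparent age so much.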

           The far right of the ASD sheet has all the markers ranked by apparent age.  I added a Notes column with explanations for some of them.  Other than 385b, other old markers should not be excluded because the random luck of STR mutations is bound to produce such anomalies, which are statistically balanced by the 9 markers with zero age (no mutations among the 90 samples).  They should all average out.  By the way, the number of markers with apparent zero age has been declining in P type as data accumulated during the past few years, as of course it should, but apparent age averaging many markers has not changed more than statistically expected due to the details of new data.  My 2009 published age was 1601 years;  my update last year on this web page was 1775 years.  I have consistently written “roughly 1600 years” in my discussions.

           There are a number of reasons why “raw” ASD age should be increased, as discussed in my publication, part I.  However, those reasons are mostly due to population bottlenecks in the past.  As discussed below, P type evidently went through a rapid population expansion soon after the TMRCA, so the raw ASD age should be used as a best estimate.  Anyway, there are significant non-statistical age caveats that produce systematic uncertainties as large as the uncertainties due to population bottlenecks, and much larger than the statistical sampling uncertainties from 90 samples.  So any age calculated from ASD (or from any other type of STR variance) should be taken with a grain of salt.  My factor of 1.5 uncertainty quoted above is based on my 80% confidence from experience, not from calculation.

           385a=10 is the best marker for P type.  I have a separate topic for the P type signature.  385a=10 continues to be amazing.  89 of the 90 samples predicted L260+ have the 385a=10 value.  Beyond P type, 385a=10 shows up in only 2 samples at step 7 (my two cousins, mentioned above, who should not both be counted), none at step 8, only 1 at step 9, and 3 at step 11.  The PType.xls database is truncated at step <12;  the full R1a data from the Polish Project - 457 samples - has only 1 more 385a=10 sample beyond step 11.  In other words, this one marker 385a=10 is about 99% effective at capturing P type (future L260+ predictions) plus less than 3% additional falsely predicted foreign samples from the rest of R1a.  385a=11 is ancestral (N type and most of R1a), but so far there are no P type with the ancestral 11 value, strong evidence that the rare mutation from 11 to 10 happened before the TMRCA.  The 385a & b pair are ranked together tied for 41st in the Chandler rates, not very slow.  However, shorter STRs mutate a lot more slowly than longer ones, and step down is slower than step up with stronger effect for shorter STRs.  (Chandler discussed this with me by email - his project did not take these issues into consideration - treating compound markers together, with data combined from all haplogroups).  In other haplogroups 385a values >14 are not uncommon.  So it makes sense that the 385a mutation 11 to 10 should be very rare, explaining why it works so well for P type, although that one P type exception (at step 4) is an even rarer 10 to 9 mutation.

           Column CJ of my analysis file shows that using only the best 5 signature markers, cutoff 2, 83 P type samples are captured and none from outside P.  That’s better than 80% accuracy using only 5 markers, which is very good and unusual in SNP prediction.  Even more unusual is that the one best marker is even better.

           DYS540=11.  A new signature marker.  From the 111 marker STR set recently available commercially.  71 Polish Project R1a samples already have the 111 data, including 12 P type and 12 N type (18 Jul 2011).  11 of the 12 P type have the 540=11 value.  11 of the 12 N type have 540=12.  Since P and N are the two parts of the R1a1a1g (M458) haplogroup, this marker nicely distinguishes the two parts with high probability.  12 is obviously ancestral because that value dominates the R1a data.  540 already does not look as good as 385a for P type, but it’s always nice to have another signature marker.  It is too early to switch definitions to the full 111 set.  I’ll be adding 111 modal haplotypes to my Haplotypes.xls file over the following months;  P and N are already there.

           That Excel analysis file is intended for finding types - hypothetical haplogroups with < 20% SBP.  For P type this is moot because L260 is available.  Nevertheless, I used the file to automatically come up with the best prediction, P54, column CF, with SBP 7.6%.  That SBP means 80% confidence (if L260 were not known) that less than 7.6% of the predicted P type would not actually belong to the predicted haplogroup.  Indeed P54 captures 89 samples, only 3 of which are not P according to my new P43 based on L260 - that’s 3.3% foreigners captured.  Since I published the SBP method in 2009, almost all predictions have been better than SBP.  But I designed SBP to be conservative (higher percent) to account for statistical biases.  I expect eventually to have a few failed predictions (foreign background larger than SBP, or two or more unrelated haplogroups fitting one type definition).

           The main point of that PType.xls file:  Many definitions are displayed, with various marker selections.  I tried a lot more definitions than the ones displayed in that file.  The exact definition does not matter much for P type.  Any reasonable definition of P type captures more than 90% of P type and less than 10% foreigners.  Even the full 67 modal haplotype works OK.  Although that P54 has lower SBP than my current P43 definition (9.2%), P43 is better because I adjusted P43 using L260 results.

           I identified P type and submitted my analysis for publication before the M458 mutation was announced by Underhill.  The end of my Part I mentions M458 -- notes added during publication.  M458 (so far) is composed of P type plus N type plus perhaps a few small clades just outside N.  L260, the SNP that defines the haplogroup corresponding to what I have been calling P type, was discovered by a P type member of the Polish Project, inspired by my publication.  With him and other coauthors, I published a brief letter announcing and describing L260 in the Fall 2009 issue of www.jogg.info.

           P type has obvious structure.  Evidence of sub clades.  Nodes in the P type branch of the Y-DNA tree.  The most obvious evidence is bimodal markers.  The bimodal markers are discussed below as clusters - hypothetical sub clades without high confidence.  The bimodal markers do not correlate with each other, so none of the clusters qualify yet as types.  Future data may provide better statistics with a convincing subtype of P.  If this paragraph is not clear, please read the discussion below for the individual clusters:  Pa, Pc, etc.

           Other evidence of structure:  My two edits of the P type definition.  In Sep 2010, I increased the number of STR markers in the definition, then edited out the markers that have mutations only in L260+ samples at high step, and not in L260- samples at or just beyond the cutoff.  In Aug 2011, I edited out 3 more such markers.  Four samples involved, color coded in columns BZ to CA in the analysis file;  two do not fit my original P36 but fit the other two definitions;  two do not fit the 2010 P46 but fit the other two definitions.  These edited markers are also evidence of structure.  These are all relatively slow mutating markers.  Those samples with such mutations are probably from old nodes in the P branch.  Of course, these cannot all be old nodes because some markers will have mutations only at high P step just due to the luck of random mutations.  Some samples from young nodes will come out at high step due to luck, and some samples from old nodes will come out with low step.  The point of this paragraph is that old nodes defined by rare mutations are expected in any Y-DNA tree, and those samples are evidence of the expected structure in P type.  Another point of this paragraph is justification for my method of editing markers.  You the reader may be concerned by such editing as selection bias to improve the apparent fit of the data.  Indeed there must be such bias in some of the markers that I edited.  However, insofar as some of those edited markers truly correspond to old nodes in the P branch, it is appropriate to edit them;  future distant cousins with the same rare mutation will be better predicted as L260+.  The whole point of using definitions shorter than the full 67 is to remove those markers that define sub clades in order to come up with a proper definition that distinguishes the branch as a whole, as explained in my publication.

           Old node comment.  It is possible the P type data includes samples that really belong to an L260 branch with a node much older than the next youngest node.  In such a case it would not be proper to combine them into the single P type.  That one sample at step 9 (discussed above) is an example of a candidate for such an old branch, but then again that sample might just be an unlucky member of a young node (an outlier).  Those 4 edited samples of the previous paragraph are also examples.  Because there have been very few P type samples beyond my original cutoff, and because all but one of them were easily incorporated with minor edit of the definition, I am comfortable considering them all as a single type until there is evidence of significant L260+ samples beyond P.  At any rate, all markers are included in the age calculation, so any old branches contribute to the estimated age of the oldest node (oldest node means MRCA). This paragraph would be a valid comment about any type analysis, but P type is unusually well isolated in haplospace, so the justification is strong to consider it a single clade.

           The L260 mutation might be about the same age as P type.  Unlikely.  We expect a defining SNP to be more likely older than the TMRCA, perhaps much older.

           The Western Slavic Modal haplotype, Ysearch 28WGP, matches P type perfectly at all 43 markers used in my new definition.  That Western Slavic Modal uses 76 markers, but many of those are highly variable due to high mutation rate.  That modal is one of the Russian site modals.

           The Polish Project makes some assignments to P type for samples with < 67 markers if they match the P type model very well.  I have not updated those assignment rules for a couple years, but I have been quite conservative below 67, so those assignments are still > 80% confidence.

           Let me finish this P type topic with brief speculation about the origin of P type:

           What does P type isolation mean?  One simple explanation:  The M458 father haplogroup for P type and N type seems to have experienced a severe population bottleneck.  The evidence:  P type and N type are very easily separated by STR values.  Both are isolated in haplospace.  No overlap.  They are so far apart that the nearest neighbors (just beyond the cutoff) for P type include outlier samples (from other R1a haplogroups) in addition to N type samples, and nearest neighbors for N include samples other than P.  Apparently, the father haplogroup was quite old at the time of the bottleneck, with lots of variation in STR values.  The bottleneck wiped out most of that population, so today men in that father haplogroup descend from just two ancestors, the MRCAs of P type and N type.

           Why is P type so large and concentrated in Poland?  One obvious explanation is a rapid population expansion not long after the TMRCA.  Evidence:  Subtypes cannot be defined with confidence.  Apparently, the major bimodal markers are due to mutations that happened early in the population expansion, so the branches of P type have similar statistical spread of STR values.  For more discussion along these lines see the discussions of the clusters below.

           There are other explanations to these questions:  P type may represent a huge migration of a single paternal tribe during the dark ages from far away to the region that is now Poland.  Perhaps the related haplogroups in that far away place got wiped out by subsequent famines and wars.  Or maybe they did not get wiped out.  If people in that far away place did not tend to migrate to North America in the past, and today do not tend to get DNA tests, then perhaps there are isolated pockets of L260 clades there waiting to be discovered - some with STRs very similar to P type - some with STRs very different than either P or N.  Maybe in the mountains of western Asia.

           Also, the standard “null” explanation should be considered unless there is strong evidence otherwise.  The null explanation is statistical:  No significant bottleneck or expansion.  Just the luck of random growth of clades in a small human population over the millennia.  The MRCA of P & N perhaps were far apart in STR values just by luck - both being outliers.  No one knows how to calculate the probability that a large P and a larger N clade can be sole survivors of the statistics of clade growth in the Y-DNA tree in only a couple thousand years.  To me it seems highly unlikely.  But I don’t know how to rule this null model out in a convincing way.

           I can think of more complicated models as explanations.  I’m sure you can, too.

           Caveat:  I said M458 consists of P and N.  It is possible some of the outliers from N type might represent small old branches that have nodes older than the node for P & N.  There is no evidence to support this, but then again there is no evidence to rule this out with confidence.  More data will answer this over the next year, perhaps.  Anyway, this is a small detail in the larger picture.

           P type Bimodal Markers.  This sub topic was significantly edited 25 Aug 2011, when I introduced a definition of bimodal.

           The following analysis uses the 90 P type samples (5 Aug 2011) predicted L260+, at 67 markers, discussed above.  I also include some comments about the 12 samples available with 111 markers (on 18 Jul 2011).  A bimodal marker is evidence of structure, but not proof - a hypothetical clade.

           In the past, I have sometimes called these hypothetical types.  I now prefer to reserve the word type for < 20% SBP, which Mayka and I take as evidence for 80% confidence that more than 80% of the samples belong to a clade that will someday be confirmed as a haplogroup by a newly discovered SNP.  Sometimes we make exceptions slightly above 20%, for example when a type is regionally concentrated.

           None of the following bimodal markers qualify as a definition of a type, although some of them might be good enough to be called clusters.

           This is not proof that a specific bimodal marker or cluster does not correspond to a future haplogroup.  It is still possible that 95% of the samples from a particular bimodal marker belong to a unique future haplogroup.  For example, if the son (or grandson, or great great grandson) of the P type MRCA had that defining mutation, and if he participated in the purported P type population expansion, that would explain why his haplogroup (male descendants) has STR values so similar to P type except at the one defining marker.  He had no other mutations that differed from his ancestor among the standard 67 that I’m using today for analysis.

           It is possible as more STR data accumulates some of the following will qualify as types.  Cluster identification is a bit of an art so it is possible I just failed to find a small P sub type and someone else will find it.

           Many of the following are probably not unique clades, but instead represent two clades that have widely separated nodes in the P tree.  Or three or more.

           One characteristic of a type:  It shows up early in the data as a cluster with 20% < SBP < 50%, and the SBP continuously decreases in value as more data shows up, as the SBP penalty for sampling statistics becomes diluted.  This is good - it means false clusters that show up by luck will not last as more data accumulates.  The P bimodal markers that I have been following for a few years (Pa, Pb, Pc, Pd, Pe, Pg) all have increased in SBP, which I take as evidence that they will probably not become types.

           Excel files for Pc and Pg are in the on line data with my 2009 publication;  I am not updating those or adding any others because none are good enough to stand out.  Nevertheless, some merit discussion:

           Pa Bimodal Marker.  Defined by DYS389 delta = 18.  DYS389=13,31.  18 samples (among 90 P at 67).  P modal values are 13,30.  This is a compound marker;  that 2nd number is the sum, so this mutation is in the longer repeat chain;  P modal 17, Pa value 18.  All the 18’s are 13,31;  there are no 14,32 or 12,30 in the Polish Project P type data at this time;  my analysis files will capture any future such samples as Pa.  That 31 value by itself does not capture the Pa cluster because there are several 14,31 in P type, which I’m calling a different cluster because they are not mutated at the longer repeat chain;  the 14 refers to the shorter chain.
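
           The delta arithmetic for this compound marker can be sketched as follows.  The function names are mine, chosen for illustration;  the pairs tested are the ones mentioned in the paragraph above.

```python
def dys389_delta(dys389_first, dys389_sum):
    """Length of the longer repeat chain: the reported sum minus the shorter chain."""
    return dys389_sum - dys389_first

def is_pa(dys389_first, dys389_sum):
    """Pa is defined by delta = 18 (P modal delta is 17)."""
    return dys389_delta(dys389_first, dys389_sum) == 18

for pair in [(13, 30), (13, 31), (14, 31), (14, 32), (12, 30)]:
    print(pair, dys389_delta(*pair), is_pa(*pair))
```

           This shows why 14,31 is not Pa (delta 17, mutated only at the shorter chain) while 14,32 or 12,30 would be captured as Pa if they ever appear.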

           Pa is briefly mentioned in my publication at page 172.  Pa was the first bimodal marker to catch my attention in 2007 because that 31 value produces the 3rd most common haplotype in Polish data that differs by only one step from P modal values using the old standard 12 marker set;  see the table in my publication at page 162.  Such a common haplotype at 12 is evidence that Pa is an old sub clade of P.  However, the evidence is not convincing yet.

           Bimodal evidence:  Only 4 samples (value 16) with values other than 17 or 18 for the longer chain.

           3 Pa are available at 111 markers.

           I have more discussion about Pa in the Pg topic below.

           Pb Bimodal Marker.  DYS19=16.  27 samples.  P modal value 17.  This one is of interest because 16 is the ancestral R1a value, modal for both N and K types.  The large size of Pb is a bit of a surprise, because Pb is only 5th largest at 12 markers, and those should be a mix of P and K because Pb differs from both P and K by only 1 marker out of the 12.  Those 27 are not K because they have 67 markers and do not fit K type, which differs by multiple signature markers.  The large size of Pb might mean there is one large P sub clade that represents the oldest P node, before the mutation to 17, so it is quite old with lots of STR variation.  That makes sense, because the proportion of Pb samples that match the Pb modal at 12 markers is not much different than the proportion of P samples that match the P modal at 12.

           On the other hand, Pb might be 2 or more clades with unrelated nodes;  only one of those might be the oldest, the others being back mutations to 16 by coincidence.  Alternatively, that 16 might be a back mutation for most or all samples, as far as we know with the data available today.

           Bimodal evidence:  Only 2 samples (value 15) with values other than 16 or 17.
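
           A small sketch of this kind of "bimodal evidence" count:  find the two most common values of a marker and count the samples outside them.  This is only an illustration, not the definition of bimodal introduced on 25 Aug 2011.  The DYS19 list is reconstructed from the counts above (27 at 16, 2 at 15, and the remaining 61 of 90 assumed to be at the modal 17).

```python
from collections import Counter

def two_modes_and_outliers(values):
    """Return the two most common values and the number of samples at other values."""
    counts = Counter(values)
    top_two = [value for value, _ in counts.most_common(2)]
    outside = sum(n for value, n in counts.items() if value not in top_two)
    return sorted(top_two), outside

dys19 = [17] * 61 + [16] * 27 + [15] * 2  # 90 P type samples, reconstructed counts
modes, outliers = two_modes_and_outliers(dys19)
print(modes, outliers)
```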

           5 Pb are available at 111 markers.

           Pab bimodal marker pair would have both Pa and Pb defining mutations.  There are only 2 such samples (out of 90 at 67 markers).

           Pc Bimodal Marker.  DYS439=11.  17 samples.  P modal 10.  Also discussed in my publication starting on page 171.

           Bimodal evidence:  Only 2 samples (value 12) with values other than 10 or 11.

           One Pc is available at 111 markers.

           The combination markers produce Pac and Pbc clusters with 3 and 6 samples.

           The Pc that I discussed in my publication is what I now call Pch, discussed below.  I changed the nomenclature to avoid getting myself confused.

           Pg Bimodal Marker.  DYS572=11.  25 samples.  P modal 12.  Also discussed in my publication page 172.  Like Pb, this one is of interest because the 11 value is ancestral; the discussion is similar to the discussion for Pb.

           572 is the 4th from the last of the standard 67 markers.

           Bimodal evidence:  Only 2 samples (one each at 12 and 13) with values other than 11 or 12.

           3 Pg are available at 111 markers.

           The combinations Pag and Pbg each have 8 samples.  Two Pc combinations (above) have 3 or more samples.  All other combinations of a, b, c, g have fewer than 3 samples each.

           Those two combinations with 8 samples, Pag and Pbg, are instructive.  They provide a reason why Pg has not worked as a proposed type in the past.  Pg might be comprised of two sub clusters.  Pag has the P modal 17 for all 8 samples at the “b marker”.  Pbg has the P modal 17 for the long 389 chain for all 8 samples at the “a marker”.  9 Pg samples belong to neither Pag nor Pbg.

           For most haplogroups, a cluster of 8 samples with two markers that differ from the haplogroup modal is impressive.  However, P type is large and relatively homogeneous.  In this case I have tried many combinations;  some are bound to come up impressive just by luck;  I am discussing only the impressive ones.  I suppose if your sample falls into either Pag (or Pbg) there may be 50% confidence that you belong to a clade including more than 5 of those samples defined by the two corresponding mutations, but I personally do not consider the confidence anywhere near 80%.

           Even if Pag and Pbg are shown in the future to correspond to two haplogroups, it does not follow that they will be sub clades of Pg;  they may be independent branches of the P tree that both received the DYS572=11 mutation independently.  Or one of them could be an old node with the ancestral value.

           DYS572 is ranked in the Chandler list as 40th, not very slow.  In the 2010 version of this web page, I presented evidence that 572 is indeed a slowly mutating marker, at least in R1a.  I still stand by that prediction.  That would make it reasonable that most of the Pg samples belong to the oldest node in the P tree (but still less than 80% confident for 80% of the samples).  Also, we wonder if Pbg is the oldest node in the Pg branch, or if Pbg is a more recent back mutation at the “b marker” DYS19 to the ancestral value?  In other words, are the apparently ancestral 572=11 and 19=16 both older than P type, or both younger, or is one older and one younger?  We don’t know yet.

           H type also has the 572 = 12 value.

           Ph Bimodal Marker.  DYS534=14.  34 samples.  P modal 13.

           Bimodal evidence:  Only 2 samples (value 15) with values other than 13 or 14.

           One Ph is available at 111 markers.

           There are several combinations;  the ones with 3 or more samples:  Pah, Pbh, Pch, Pgh, Pagh, Pbch, Pbgh have 4, 11, 12, 14, 3, 5, 4 samples.

           My published Pc is really Pch, defined by those two markers that differ from the P modal.

           The best 3:  Pbh, Pch, Pgh, have 11, 12, 14 samples.  These are instructive, particularly if they are viewed along with the previous two “instructive” combinations, Pag and Pbg above.  These cannot all be valid clades because the same markers are used in different combinations.  This is an explicit demonstration of how interesting clusters will always come up if enough combinations are tried.  However, if we assume one particular cluster to be valid, that means some of the others are not valid.

           Pd, Pe, Pf, Pi, ….  My Haplotypes.xls file, near the middle of the “Haplotypes” sheet, has a longer list of bimodal markers in P type.

           Plap Cluster.  Includes Lapinski samples.  This cluster has 8 samples that match perfectly at 14 of the 67 markers.  Two of those 8, plus two more at step 1 out of the 14, belong to the Lapinski family set.  This is an example of selection bias, because Lapinski recruited the other 3 distant relatives, so the cluster is not as large as it seems.  The cluster does not form a type;  I mention it here as an example of a tentative cluster.

           The Plap modal differs from the P modal at what I call the Pr marker, DYS607 = 17 for Plap vs 16 for P modal.  DYS607 is highly variable in P type;  there are more 17 samples than 15 samples -- a mildly bimodal distribution.  However, those 8 Plap samples, all with the 17, just about account for the excess 17’s, so 607 is no longer bimodal after adjusting for Plap.

           Pz Cluster.  DYS565=14.  Only 4 samples.  DYS565 is the last of the 67 set.  There are 5 DYS565=14 samples -- these 4 plus another that does not fit.  The Pz modal differs from the P modal at 12 markers, so this one is promising for the future.  SBP comes out over 50% because of the penalty for small sample statistical correction built into SBP.  This one may improve as more data accumulates in a year or so.  On the other hand, I studied about 20 P clusters to come up with this best example of a new promising cluster, so the most obvious explanation is luck.  If you study STR data randomly generated by a computer you may find a good cluster if you examine enough candidates.
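
           The point about luck can be illustrated with a small simulation.  This is only a sketch:  the sample count, the number of bimodal markers, and the 20% off-modal rate are illustrative assumptions, not Polish Project values, and the markers are generated independently so that no marker pair corresponds to a real clade.

```python
import itertools
import random


def pair_cluster_counts(n_samples=83, n_markers=9, p_off_modal=0.2, seed=1):
    """Count samples that are off-modal at both markers, for every marker pair.

    All parameters are illustrative assumptions, not Polish Project values.
    Markers are simulated independently, so no pair is a real clade.
    """
    rng = random.Random(seed)
    # One row per sample; True means "differs from the modal at this marker".
    samples = [
        [rng.random() < p_off_modal for _ in range(n_markers)]
        for _ in range(n_samples)
    ]
    counts = {}
    for i, j in itertools.combinations(range(n_markers), 2):
        counts[(i, j)] = sum(1 for row in samples if row[i] and row[j])
    return counts


counts = pair_cluster_counts()
expected_per_pair = 83 * 0.2 ** 2  # chance expectation for any single pair
best_pair, best_count = max(counts.items(), key=lambda item: item[1])
print(f"pairs tried: {len(counts)}")
print(f"expected per pair: {expected_per_pair:.1f}")
print(f"largest chance cluster: pair {best_pair} with {best_count} samples")
```

           The largest of the 36 pair counts can never be smaller than the average pair count, and it usually comes out well above it.  Reporting only that best pair therefore overstates the evidence.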

 

           R.  Remainder.  Updated 2 Jul 2010.  This is not a haplogroup or a type.  This is a category for samples that are distant in STR values from all the R1a1a types I have defined so far.  If you are in this category, I highly recommend that you get all 67 markers plus the M458 test.  More markers will help me define a new type for you.  Your M458 test is unlikely to come out positive, but if it does that means you would be the first member of a new type within M458.

           I also recommend that you test for all the several SNPs that FTDNA considers equivalent to R1a1 (called R1a1a by others).  Your unusual STR values make you a candidate for an unusual small clade that has a very old node within the R1a tree.  Each SNP is unlikely to come out negative.  In fact, all such tests most likely will come out positive.  But if one comes out negative that’s excellent, because you will join a very rare group, perhaps even define a new haplogroup.  If you cannot afford all these tests, OK, just hope for people with STR values close to yours to do the tests and watch this web page for your sample to move into a new category.

           R is equivalent to a paragroup.  Just like R1a1a* means only R1a1a samples that are negative for all known SNP subdivisions, my R category extends that to mean only samples that do not match any of my known types.  At 67 markers, R also means that the sample does not qualify for one of my borderline categories.  I have a policy not to use the U category for samples with all 67 markers, so in some cases I need to make a close call on a sample that is on the edge of a borderline category - some R samples are right at my cutoff at 67 markers.

           For a sample with 37 or fewer markers, I require 80% probability that the sample would not match one of my types if all 67 markers were obtained.  There used to be quite a few R at 37 markers back when I had only a few types, but there are none right now (July 2010) because there are none that have STR markers so unusual that they are far from all types.

           The 80% rule does not apply to R.  If a sample has 30% probability of belonging to its best fit type it would be assigned to R.  That means it only has 70% probability of being a true R.  R samples still have their FTDNA assignment which is either 100% (green) or 99% (red).

           When I started this hobby a couple years ago, R was the 2nd biggest category after U.  I now have enough types that R is small.

           In June 2010 I subdivided R into two categories.  R (M458-) is those tested negative for M458.  R (needs M458) is those not tested for that SNP.  If an R sample would test positive it would be moved to the NR category.

 

           U.  Unassigned.  Updated 3 Jul 2010.  This is not a cluster, but a holding place for samples with less than 80% probability for assignment.  I use U in the Polish Project for R1a uncertain samples with fewer than 67 markers.  Samples with all 67 standard markers are not assigned to U, but instead are assigned to the R (remainder) category, or into “Borderline” categories such as N Borderline or K Borderline.  U is 0% in the Results Table, which is samples with 67 markers, but considering all samples U is the largest category in the Polish Project, with 200 members on 25 May 2010 - 15% of the project, 35% of R1a.  If you are classified U you can be promoted to another category by obtaining results for the remainder of the 67 markers.

           The 80% rule does not apply to U.  If a sample has 70% probability of belonging to its best fit type it would be assigned to U.  That means it only has 30% probability of being a true U.  Many U have >30% probability of belonging to two or more different types.  U samples still have their FTDNA assignment which is either 100% (green) or 99% (red).

           Probabilities include estimates, so they are not exact.  I tend to be strict for samples with fewer than 67 markers, using U for marginal situations.  At 67 markers I do not use U - I use R, and I’m not strict at 67.  Also, I concentrate my time on improving the assignment rules at 67 markers and have not yet found time for 37 marker rules for some of the newer small types.
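
           The assignment rules for types, R, and U described above can be summarized in a short sketch.  The function name, its arguments, and the “borderline” flag are hypothetical;  the flag stands in for my close call between a Borderline category and R at 67 markers.  Two further assumptions:  “80%” is read as “at least 80%”, and the 80% requirement for R stated for 37 or fewer markers is applied to every sample with fewer than 67 markers.

```python
THRESHOLD = 0.80  # the 80% rule; ">=" is an assumption about the boundary


def assign_category(n_markers, best_type, p_best_type, p_no_type=0.0,
                    borderline=False):
    """Return a category label for one R1a sample (illustrative sketch).

    p_best_type: estimated probability of belonging to the best-fit type.
    p_no_type:   estimated probability of matching none of the types if all
                 67 markers were obtained (consulted only below 67 markers).
    borderline:  hypothetical flag for the close call made at 67 markers.
    """
    if p_best_type >= THRESHOLD:
        return best_type
    if n_markers >= 67:
        # U is never used at 67 markers: Borderline or R instead.
        return f"{best_type} Borderline" if borderline else "R"
    if p_no_type >= THRESHOLD:
        return "R"
    return "U"


print(assign_category(37, "K", 0.70))                   # U
print(assign_category(67, "N", 0.30))                   # R
print(assign_category(67, "N", 0.60, borderline=True))  # N Borderline
```

           The first two calls reproduce the examples in the text:  a sample at 70% for its best fit type is held in U, and a 67 marker sample at 30% is placed in R.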

 

           Z280.  See K type.

 

           Z93.  New topic 31 Oct 2011.  This new SNP was recognized earlier this month by ISOGG as R1a1a1h.

           So far, all Z93 samples in the Polish Project are coming out L342.2+, and vice versa.

           A type, discussed here at this web page since origination, and mentioned in my 2009 publication, is a branch of Z93 (L342.2).  A type samples are coming out positive for both SNPs.

           I just today added L342T as a new cluster, a hypothetical branch of Z93 (L342.2).

           The Z93 category at the Polish Project web page has the samples that are Z93+ or L342.2+ and are not predicted A type or L342T cluster.  Z93 also includes samples that have not been tested for Z93 but are close STR matches to a sample that tested Z93+.

           I tried to come up with an STR definition for Z93 (L342.2).  I could not.  Z93 does not have good signature STR markers.  Or, there is a better way to say that:  The signature markers for Z93 are about the same as the signature markers for Z280 (previous topic), which is a large new haplogroup in R1a.  Lots of Polish Project samples are now coming out Z280+.  Z280 seems to be equivalent to what I have been calling K type.  Z93 and K type have similar STR values at the slower mutating STRs.  As a result, the modal haplotype for R1a as a whole is similar to the modal haplotype for Z93 (L342.2) samples, and similar to the modal haplotype for Z280 (K type) samples.

           A simple explanation:  Z280 and Z93 are “brother” haplogroups, and neither is particularly young.  The MRCAs of these two haplogroups apparently had very similar STR values.  Originally, both grew rapidly, before significant sub clades could form with STR mutations at slow mutating markers.  Over the years, both haplogroups diversified in STR values.  So many sub clades in Z280 and Z93 today have overlapping STR values.  Population bottlenecks eventually produced some sub clades with good STR signatures, such as A type for example, which is very well isolated in haplospace.  This paragraph is a simple explanation of why it is difficult to distinguish all Z93 samples;  other explanations are possible, including complicated explanations.

           Z93 is a good example of why calculating age of haplogroups is highly uncertain.  A type seems to be very young.  A type dominates Z93 in the Polish Project.  Maybe A type had a particularly vigorous population expansion; or maybe A type luckily avoided a severe population bottleneck;  or maybe the A type ancestors moved to Central Europe from distant lands;  whatever.  Age is calculated from STR variance, so the age of Z93 is dominated by the age of A, which is misleading and too young.  If A type samples are excluded, the age of Z93 still would come out too young, because the A type samples have a unique STR signature, which means significant STR mutations, which means the A type MRCA lived at a time when Z93 was already quite old, so the A data needs to be considered when estimating the age of Z93.  I’ll try to come up with an age estimate, for next time I update this topic.
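
           A toy calculation shows how a dominant young cluster can pull a variance-based age downward.  This is a minimal sketch, assuming the simplest form of the variance method (mean per-marker variance divided by a mutation rate).  The haplotypes and the rate of 0.002 per marker per generation are made up for illustration, and the tight cluster is placed at the modal of the other samples, so the sketch does not model the distinct STR signature of A type.

```python
from statistics import pvariance


def variance_age_generations(haplotypes, mutation_rate):
    """Mean per-marker variance of repeat counts divided by an assumed rate."""
    n_markers = len(haplotypes[0])
    per_marker = [
        pvariance([h[m] for h in haplotypes]) for m in range(n_markers)
    ]
    return sum(per_marker) / n_markers / mutation_rate


RATE = 0.002  # hypothetical mutations per marker per generation

# Four diverse samples (toy values at two markers).
diverse = [(15, 10), (16, 11), (14, 9), (15, 12)]
# Twelve identical samples standing in for a young, tight cluster.
young_cluster = [(15, 10)] * 12

age_diverse = variance_age_generations(diverse, RATE)
age_combined = variance_age_generations(diverse + young_cluster, RATE)
print(f"diverse samples only: {age_diverse:.1f} generations")
print(f"with young cluster:   {age_combined:.1f} generations")
```

           In this toy data the combined set comes out much younger than the diverse samples alone, only because the identical samples outnumber the rest.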

 

           End of R1a Categories.

 

R1b Haplogroup Types

 

           On 20 July I added the following three R1b Types to this web document (next three subtopics, L23EE, L47P, L47A).

           Mayka had already added these three to the Polish Project web page during the previous week, based on my recommendation, based on my SBP analysis.

           I independently found these three by analyzing the Polish Project R1b data, but Mayka pointed out they were previously known as clusters.  We judge that my analysis justifies adding them to our list of types.  Since I’m using 639 samples with 67 marker data as representative of Poland, a small type clade at 1% of the Polish population would be expected to have roughly 6 samples in the database (70% confidence interval 4 to 10).  These three small types are roughly 1% each.
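
           The expected count is simple arithmetic:  639 samples times 1% is about 6.4.  The sketch below also computes how often a 1% clade would show between 4 and 10 samples, assuming a simple binomial sampling model.  That model is an assumption for illustration;  it is not necessarily the calculation behind the 70% interval quoted above, so the two figures need not agree.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_samples = 639   # 67 marker samples, from the text
frequency = 0.01  # a type at 1% of the population, from the text

expected = n_samples * frequency
p_4_to_10 = sum(binomial_pmf(k, n_samples, frequency) for k in range(4, 11))
print(f"expected samples: {expected:.2f}")
print(f"P(4 <= count <= 10): {p_4_to_10:.2f}")
```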

           I’m following the current ISOGG codes for these types, which may be confusing compared to the current FTDNA codes.

           The STR definitions for these are available at Haplotypes.xls.  PolishCladesUpdate has a link to an Excel analysis file for each of these three types.

           Instructions for Ysearch comparison are below.  Here is the “UserIDs” bar for R1b comparison:

USEID, CX94E, MKM4R, 7HB9C

           Change USEID to your User ID.

           Reminder:  These three types are calibrated to Polish data.  The definition modal haplotypes may not be optimal for other regions.  If you have Polish ancestors, and if you have all 67 markers, and if you match one of these within a step distance of 10 there is more than 80% probability that you belong to the corresponding clade.  Up to step 15 there is lower probability that you belong.  You should test the appropriate SNPs (explained below) for higher confidence.  If your ancestors are not from Eastern Europe and you are a marginal match (step distance 5 to 15) for one of these, it is not very probable that you belong to the corresponding Polish clade, because each of these types has some overlap with other clades that are rare in Poland.
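
           For readers who want to check a match by hand, here is a minimal sketch of a step distance calculation.  It assumes step distance is the sum of absolute differences in repeat counts over the markers being compared;  multi-copy markers are not handled.  The marker values below are invented for the example and are not the modal haplotype of any type;  the real definitions are in Haplotypes.xls.

```python
def step_distance(haplotype, modal):
    """Sum of absolute repeat-count differences over the shared markers."""
    shared = haplotype.keys() & modal.keys()
    return sum(abs(haplotype[m] - modal[m]) for m in shared)


def match_label(distance, close_cutoff=10, marginal_cutoff=15):
    """Label a 67 marker match using the step 10 and step 15 limits above."""
    if distance <= close_cutoff:
        return "probable"
    if distance <= marginal_cutoff:
        return "marginal"
    return "no match"


# Hypothetical values at three markers, for illustration only.
toy_modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16}
toy_sample = {"DYS393": 13, "DYS390": 24, "DYS19": 14}

distance = step_distance(toy_sample, toy_modal)
print(distance, match_label(distance))
```

           The labels apply only to samples with all 67 markers and Polish ancestry, as explained in the reminder above.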

 

           L23EE.  20 Jul 2010 documentation:  This type is positive for the L23 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a.  This type is negative for L51, the only current known branch - R1b1b2a1 - of L23.

           Nordtvedt pointed out the cluster for this type some years ago, calling it R1b-EE (Eastern Europe).  Mayka suggested the L23EE code to me.

           There are only 6 samples in the Polish Project in this type (13 Jul 2010).  SBP = 10.7% using all 67 markers, which is excellent for such a small type.  The cutoff is 12, but if you match at step 10 through 12 I estimate your probability of belonging at slightly better than 80%, so you really should test for the L51 SNP - a negative result would boost the probability to about 95%.  In the Polish Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of the samples from step 17 to 20 are L51+.  So this type is very well isolated in haplospace in Poland.

           On Ysearch (code CX94E) there are also 6 samples in this type (13 Jul 2010), but 3 are the same as in the Polish Project.  There are 7 samples at step 12 (vs zero in the Polish Project) and only 2 of those 12 are East European - one each in Germany and Russia.  That means this type is not well isolated world wide, meaning samples near the cutoff are highly uncertain.  I interpret this as evidence that my definition of L23EE type is really a Polish subtype within a larger L23EE cluster.

           This type has evidence of structure.  A number of markers are bimodal with no obvious correlation.  To me, that means there are probably at least 3 sub-clades that may become evident as data accumulates.

           If you match this type closely at 37 markers I highly recommend getting the full 67, because the statistics for assignment are not convincing at 37 markers.  Even at 67 markers, I recommend the L51 test;  a negative result confirms membership in this hypothetical clade, and a positive result means you are not a member.  We do not know the probability of outsiders matching L23EE in STR values, particularly outside Poland, so there is still a slim chance of a surprise - a close match to the definition but with L51+.

 

           L47P.  20 Jul 2010 documentation:  This type is positive for the L47 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1.  This type is probably negative for L44, the only current known branch - R1b1b2a1a1d1a - of L47, but that L44 negative indication is based on only one sample so far, so it is not certain.

           Mayka announced the cluster corresponding to this type on the web in March 2009.

           There are only 4 samples in the Polish Project in this type (13 Jul 2010).  SBP = 9.3% using 64 markers, which is excellent for such a small type.  The cutoff is 7 and the gap is 10.  There are no samples from step 7 to 16.  Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace in Polish data.

           This type is very robust;  the same 4 samples are selected using any number of markers from 10 to 67 with SBP <25%.

           Actually, this type is even better than the SBP = 9.3% indicates, because some of the samples at step 17 and beyond have tested negative for the SNPs in the R1b trunk leading to L47 so they clearly do not belong to this L47P hypothetical clade.

           Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but 3 of them are the same as the Polish Project.  Ysearch has 8 samples at steps 8 to 12, so the type is not as well isolated worldwide.

           The “P” in the code L47P represents my hypothesis that this type is Polish.

           Members of this type should test for L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b, which is equivalent to P312, an “uncle” haplogroup, that is L47 negative.  That means there may be some interference in STR matching, probably less than 10% in Polish data, but I do not know what the exact percent interference will be until more data accumulates.

           See the last paragraph of L47A, next topic, for more comments.

 

           L47A.  20 Jul 2010 documentation:  This type is positive for the L47 SNP, hence this type is another hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1.  I do not know yet if this type is negative for L44, a known branch of L47.

           Mayka suggested the “A” code, since this type is obviously Ashkenazi, based on family names (see also Ysearch results, a few paragraphs down).  I presume this one is known to the administrators of Jewish DNA projects, although I did not do the research to find a first web publication at 67 markers;  I would appreciate an email of a reference to add here, even if it does not exactly match my definition.  It’s OK if an international modal haplotype differs by a few markers from a haplotype determined in Poland, particularly if the difference is at markers that are bimodal, indicating subtype structure.

           There are only 5 samples in the Polish Project in this type (13 Jul 2010).  SBP = 7.6% using all 67 markers, which is excellent for such a small type.  The cutoff is 10 and the gap is 9.  There are no samples from steps 9 to 18.  Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace.

           This type is very robust;  the same 4 samples are selected using any number of markers from 30 to 67 with SBP <10%.

           This type is better yet on Ysearch (code 7HB9C), with 18 samples (13 Jul 2010) for better statistics;  SBP = 4.6%, which is remarkable.  It might be even better with an optimized definition;  I used the modal haplotype that I extracted from the 4 Polish Project samples.

           This one does not seem as Polish as L47P, although those 18 Ysearch samples are concentrated in "Greater Poland" including Lithuania.

           So far (see ISOGG), L47 and L148 are the only two known branch haplogroups of L48.  In the Polish Project so far (20 July), no one has tested yet for L148, and all L48 so far at 67 markers are either L47P (previous topic) or L47A.  SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that do not match either L47P or L47A and so belong to yet other clades.  I also have not searched the web for the STR values expected for L148.  (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.)