Polish Y-DNA Clades
26 Jan 2015
26 Feb 2015 news: Lots of new SNPs were discovered during 2014, and the pace of discovery continues. Most of the discoveries are coming from Big Y data. This web page of mine needs lots of updating to reflect all the new SNPs, most of which are not yet mentioned here. Most of my STR types are being validated as haplogroups defined by new SNPs. A few of my predictions are being proven invalid. (My goal has always been about 80% validation - mildly aggressive predictions.) One good place to find new SNPs arranged into the Y-DNA tree is http://www.yfull.com/tree/.
22 Feb 2014 new table of clades that are most concentrated in Poland:
Click on Poland Concentration Index for an explanation of this index, which I propose as an objective measure of geographical concentration in Poland. Larger index value means relatively more statistical confidence of concentration in Poland. Of the types that I have identified and studied so far, I present these eight as the most likely Y-DNA clades with significant concentration in Poland.
division of Z92
These two tables are based on data from the Polish Project, taken to be representative of Historical Poland, with caveats explained in my Publication. The Freq. % column shows the percentage frequency for each clade in the Polish Project.
In the Short Code Name column, click on the link to jump down to a discussion of that clade. Most of the discussions have not been updated lately, so the discussions may have different ISOGG codes, because those codes change often. Also, the discussions may have older calculations of the % of the Polish Project; % concentrations in these two tables are based on Polish Project data as of 14 Nov 2013.
Update 22 Feb 2014. Other clades of interest:
(part of N type)
division of Z92
division of CTS1211
R1b EE type
R1b A type
R1b P type
The two tables above are a brief summary, with some of the most interesting results, and some recent results. For more detailed results, please click on the following links:
Short overview of all major haplogroups in the Polish Project
More details, more proposed clades, more Ysearch links
Abstract rewrite 20 Oct 2012. Minor edit 9 Aug 2013.
The Polish Project has assignments of men (samples) to haplogroups and to proposed subdivision clades based on their Y-DNA data. Lawrence Mayka, administrator of the Polish Project, provides data for this web site of mine. I help Mayka with statistical methods for assignment of samples. This web document is for explanation, details, and update news.
The Results Table has a summary of assignment statistics. Some assignment categories have a link to more detailed discussion. If you know your assignment you can click on the link in the right column of the Results Table to read more about your assignment category.
Many of the assignments are to well established haplogroups, based on SNPs. Some assignments are to hypothetical haplogroup branches, based on STRs. Such branches are proposed by many people, including Mayka and me. In addition, I hypothetically subdivide haplogroups into types when division can be done with 80% confidence. With less than 80% confidence, my assignment categories are tentative, not called types, usually called clusters.
This Abstract is for people reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you might prefer to read the Introduction first.
This web document has three purposes: 1. More detailed explanations for the sample assignments in the Polish Project. 2. Summary of my published results. 3. Update with recent results.
The topic is common Polish Y-DNA clades - identification of male line Y-DNA clades that are concentrated in the region of Historical Poland.
I use the word type to mean an STR cluster with statistical confidence as established by my Mountain Method. Many of my types have been validated by discovery of new SNPs that qualified the corresponding clades as official haplogroups. I expect more than 80% of my types to be validated some day, but my method is intended to be slightly aggressive, so I do not expect 90% validity. I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters. All types have associated clusters but not all clusters qualify as types. In my publications and web pages I make it clear which types I have discovered in web data and which types were suggested to me by others, with references. Often when I discover a type I later find out someone else had mentioned it earlier on the web; let me know if you the reader have more clues and references for me.
Most types that I discuss seem to be 1,000 to 5,000 years old, so all the men in each type seem to be descended in direct male lines from one man (MRCA) who lived that long ago (TMRCA). A few of my types might be younger or older than that range.
I use phrases like “seem to be” over and over because the methods are statistical.
Click here for a slightly longer abstract-style summary of Polish Project assignments.
The Polish Project is considered representative of Historical Poland, with caveats explained in my Publication.
I am interested in Polish origins. This web document, however, is not for historical analysis and conclusions, except for occasional comments to remind us of the goal. This document is dedicated to identifying haplogroups and types and clusters concentrated in Poland, with detailed explanations. I am aware that some people object to the use of Y-DNA for historical analysis, so I try to mention caveats along with my comments.
Comments added 31 Aug 2014. There has been a flood of new SNPs discovered over the past few months. That means there are lots of new haplogroups. Genetic genealogists, including me, have been working to assign men (their Y-DNA samples) to these new branches on the Y-DNA tree. Most of these new branches are very small (I call them twigs), with fewer than 5 known samples. For the new SNPs, it is not known how many samples belong to the corresponding new haplogroup; samples that match closely with STRs need to be tested for each new SNP to see which are positive and which are negative. So there is a time delay between discovery and determination of the scope for most new SNPs. It looks like my STR types all have roughly equivalent new SNPs. I say “roughly” because the types might be somewhat older or younger than the SNP, and because the exact scope of the new SNPs is not yet known. STRs are useful to recommend SNP tests, and I’ll be figuring out STR type definitions corresponding to new SNPs. However, it is looking like there will no longer be a big incentive to predict new Y-DNA branches based on STRs, because new SNPs can now be easily found. See Big Y for one way to find new SNPs. See SNP ordering information. I need to rewrite this entire web page, but my current priority is analyzing the new data. My next priority will be to update the “type” topics on this web page, one at a time, associating them with SNP definitions.
Update minor changes 28 Feb 2014.
Abstract rewrite 8 Aug 2013.
When I originally posted this web page in December 2007, no significant haplogroup subdivision of R1a was available, so this page started with hypothetical subdivisions of R1a. A major division, roughly 50-50, based on the SNP M458, became available in November 2009. Now, 2013, there are many haplogroup branches known in R1a, and this page continues with proposed further division of Y-DNA clades common in the region of Historical Poland.
In 2010 I expanded this page to include clades from other haplogroups.
Actually, the largest category in the Polish Project is the R1a-U category, for “Unassigned” samples without sufficient data for confident assignment to R1a branches. The Results Table is based upon the samples with sufficient data. If you are in this R1a-U category, you can promote yourself into one of the branches by purchasing the full 67 marker STR set, since all R1a samples with 67 markers get a detailed assignment.
There are two large categories in the Polish R1a data. Since 2007, I have been calling them P type and N type. P type is now known to be more than 95% equivalent to the haplogroup R1a1a1b1a1a(L260). N type is more than 95% equivalent to the haplogroup R1a1a1b1a1b(CTS11962). P type is concentrated in Poland, rare with increasing distance from Poland. N type seems to be mostly Slavic, widespread in eastern Europe.
R1a1a1b1a3 (Z284) is a large R1a haplogroup, common in Scandinavian countries, rare in Poland. There are only two Z284+ samples in the Polish Project.
Thanks go to Lawrence Mayka, Polish Project administrator, for extensive email information and assistance.
Reminder: I am concentrating on Poland. The statistics of STR clusters depend a lot on the database. For example, P type stands out dramatically in Polish data. In other countries far from Poland P type is rare. If you belong to an R1a cluster that is rare in Poland, I’m sorry, but I’m not covering you. Check out the R1a Project.
Edited 9 Aug 2013.
This Introduction is for people unfamiliar with the jargon of genetic genealogy.
There are quite a few web sites with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. Back issues of JOGG are good general references. The Y Chromosome Wikipedia article is about male line DNA, also called Y-DNA.
The following several paragraphs are a brief introduction to genetic genealogy for Y-DNA, providing some definitions of jargon needed to read my web pages. The definition words are boldface. I often use links to those definitions when I use a jargon word for the first time in a topic. There are more boldface definitions in the summary of my Methods.
The Y chromosome gets passed from father to son, so it works just like a male family name. Men are divided into haplogroups based on known rare mutations (most of them are called single nucleotide polymorphisms, or SNPs) in the Y chromosome. Division into haplogroups is done in a manner that has virtually 100% confidence. I say “virtually” because your confidence in your DNA result from your DNA testing company might be 98% or 99% or 99.9%; the confidence for haplogroups is better than that. In other words, we are more confident in the validity of the haplogroups than in the accuracy of individual DNA tests. We can be virtually certain that all the men in a haplogroup descend in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup. The MRCA corresponds to a node, or branching point, in the Y-DNA tree of male line ancestry. Time of the Most Recent Common Ancestor (TMRCA) is an estimate of how long ago he lived - the age of the node.
Lots of people, including me, are working to discover more SNPs on the Y chromosome so that the haplogroups can be divided further into smaller haplogroups.
Haplogroups have alphanumeric codes, like R1a1a. A paragroup is a haplogroup considered without its known haplogroup branches. An asterisk is often used in paragroup codes, like R1a1a*. When a new branch is discovered within a paragroup, it gets removed from the definition; that changes the meaning of that paragroup. The meaning of a paragroup varies, depending upon which branches are used in the associated database, and depending upon which known branches are excluded.
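The way a paragroup changes meaning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four samples, their SNP results, and the function name paragroup_members are made up, and a sample that lacks an SNP in the data is treated as negative, whereas in real databases many samples are simply untested.

```python
# Hypothetical illustration: a paragroup is a haplogroup without its known branches.
# Each sample maps to the set of SNPs for which it tested positive (made-up data).
samples = {
    "A": {"M198"},
    "B": {"M198", "M458"},
    "C": {"M198", "M458", "L260"},
    "D": {"M198", "Z284"},
}

def paragroup_members(samples, defining_snp, known_branch_snps):
    """Samples positive for defining_snp and positive for none of the known branch SNPs."""
    branches = set(known_branch_snps)
    return sorted(
        name
        for name, positives in samples.items()
        if defining_snp in positives and not (positives & branches)
    )

# With only M458 counted as a known branch, the paragroup holds samples A and D.
before = paragroup_members(samples, "M198", ["M458"])
# Once Z284 is also counted as a known branch, sample D leaves the paragroup.
after = paragroup_members(samples, "M198", ["M458", "Z284"])
print(before, after)
```

The same samples give a different paragroup as soon as the list of excluded branches changes, which is why a paragroup code means different things in different databases.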
Many people, including me, try to “stay ahead” of the haplogroups by analyzing other mutations that are not so rare (called STR) on the Y chromosome. Men submit their Y-DNA data to various web sites. There are lots of STR data available on the web. Men are divided into STR clusters as hypothetical subdivisions of the haplogroups, based on similarities of STR values. All such clusters are hypothetical. Some will be validated in the future by new SNP discoveries. There are various statistical methods for estimating the confidence of STR clusters. I recently published a method that I developed. That publication has references to other methods. There is a brief summary of my method below.
A few STR clusters are small family clusters, with the same family name.
Y-DNA is biologically accurate, so some men discover that their Y-DNA does not match the DNA of their male line cousins identified by genealogy research, due to secret adoptions, illegitimacies, etc. This is one of the reasons some people prefer to avoid genetic genealogy.
The male line associated with the Y-chromosome is only one ancestral line. Anyone who tries to make a family tree going back 300 years has more than a thousand root tips to be filled by names of ancestors who lived back then; the one man at the tip of the male line root is only one of those thousand. That is another reason some genealogists avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of many. That said, many people enjoy the challenging hobby of figuring out to which ancient extended male line they belong.
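The “more than a thousand” figure follows from doubling the number of ancestor positions each generation. A minimal sketch, assuming about 30 years per generation (that figure is an assumption for illustration, not taken from the text above):

```python
# Assumption (not from the text): roughly 30 years per generation.
YEARS_PER_GENERATION = 30

def ancestor_positions(years_back, years_per_generation=YEARS_PER_GENERATION):
    """Number of ancestor positions in the generation that lived years_back ago."""
    generations = years_back // years_per_generation
    # Each generation back doubles the number of positions in the family tree.
    return 2 ** generations

positions = ancestor_positions(300)
print(positions)  # 10 generations back; the male-line ancestor fills exactly one position
```

These are positions in the tree, not necessarily distinct people, since the same ancestor can appear in more than one position.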
Most STR based clusters have an MRCA who lived thousands of years ago, before family names were common, so most men assigned to a typical cluster do not have the same family name.
Many SNP based haplogroups have an MRCA who lived more than ten thousand years ago, so these span multiple ethnic groups and nationalities. For example, the R1a haplogroup is of interest to me. R1a is most common in Slavic countries but calling R1a Slavic is misleading because it is found throughout Europe and west Asia. The MRCA lived so long ago that he may have spoken a language that we would not consider Slavic if we could hear it. It is possible that he did not even live in what is now the Slavic region of Europe; maybe his descendants moved there in a massive migration from the Asian steppes, or from India. No one knows for sure. Even if he was proto-Slavic in language and culture, some of his descendants long ago moved to other parts of Europe and Asia. One of the appeals of genetic genealogy is trying to figure out ethnic descent and migration from the statistics of haplogroups. Some people object, pointing out that ethnicity cannot be defined genetically because of all the moving and mixing of people over the millennia, and because the Y chromosome is only one of our 24 chromosomes. True enough. Some individuals and some web sites go too far with genetic genealogy claims based on DNA. That said, statistical analysis of haplogroup data provides many clues on human origins.
Again, some people try to stay ahead of haplogroups, using statistical analysis of STR based clusters to gain insight into more recent human origins. I am one of those people. My interest is Polish origins. This web document, however, is not for historical analysis and conclusions, except for occasional comments to remind us of the goal. This document is dedicated to Y-DNA data and analysis, both SNP and STR, identifying haplogroups, types, and clusters concentrated in Poland, with detailed explanations.
The bottom of my Method section has more definitions for a number of genetic genealogy terms.
There are many organizations and commercial companies on the web where you can order a cheek swab kit to mail in for genetic genealogy analysis, for example FTDNA. I am not associated with the company FTDNA; I mention them because I make extensive use of their data; check Google for competitors. At FTDNA, click on Products for cheek swab kits. DNA results are confidential unless you register the data at a database; at FTDNA, click on Projects to register your data into one of the many databases; for example, most of my analysis is from the data in the FTDNA Polish Project.
I use the FTDNA standard set of 67 STR markers (plus a few non-standard ones occasionally). I do some analysis using the standard FTDNA 12, 25, 37, or 111 STR marker sets. Other companies use standard marker sets that may not overlap with all the FTDNA markers.
Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services. I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your data with Ysearch. Or you can type your Y-STR data into Ysearch.
New topic 17 Oct 2012.
L1029 was a new SNP last March. L1029 provides a branch of M458, added to the ISOGG tree this year. The other branch is L260 (update next topic). L260 was discovered in 2010. Most M458+ L260- samples are coming out L1029+. I have been calling M458+ L260- samples N type (very few exceptions - next topic). It is now clear that L1029 is a major branch, capturing more than 90% of N type (more than 90% of M458+ L260-).
In this topic I present preliminary evidence that Np corresponds to a Y-DNA clade concentrated in Poland. I also explain why all Polish N type samples (tested or predicted M458 and not L260) would benefit from the L1029 test, because Np cannot be predicted precisely, and because there is a low fraction of L1029- outlier samples, not fitting Np.
So far (10 Oct data) there are 20 results L1029- (including a few samples that are not M458+) and 42 results L1029+. N type requires 67 or more of the standard markers for confident assignment. Using samples with those 67 markers the numbers are 114 N type, of which there are 12 L1029- and 41 L1029+. Of the 61 remaining N type samples (at 67 in the Polish Project) not tested for L1029, I estimate only about 5 might come out L1029-, because testing has been concentrated on STR predictions, discussed below in this topic.
Two of those 12 L1029- samples differ significantly from the other 10, so I am predicting these two as outliers, with M458 nodes older than the main Np hypothetical clade.
Np Cluster Definition: I constructed an STR definition for the remaining 10 samples with similar STR values and L1029- result. The definition uses 37 of the 67 markers. The cutoff is 2 (steps less than 2 are considered matches). I uploaded this definition to Ysearch, code CHFXB. My analysis file is L1029Study.xls.
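How a cutoff works can be sketched as follows. This is an illustration only, not the actual CHFXB definition: it uses three markers instead of 37, the marker named marker_3 and its value are made up, and it assumes that the step is the sum of absolute differences in repeat counts over the definition markers. The values DYS460 = 10 and CDYa = 33 are the Np signature values discussed in this topic.

```python
CUTOFF = 2  # from the text: steps less than 2 count as matches

# Hypothetical three-marker definition; the real definition uses 37 markers.
# "marker_3" and its value 15 are invented for this sketch.
np_definition = {"DYS460": 10, "CDYa": 33, "marker_3": 15}

def step(sample, definition):
    """Assumed distance: sum of absolute repeat-count differences over the definition markers."""
    return sum(abs(sample[marker] - value) for marker, value in definition.items())

def matches(sample, definition, cutoff=CUTOFF):
    """A sample matches when its step from the definition is below the cutoff."""
    return step(sample, definition) < cutoff

exact = {"DYS460": 10, "CDYa": 33, "marker_3": 15}     # step 0
one_off = {"DYS460": 10, "CDYa": 34, "marker_3": 15}   # step 1
marginal = {"DYS460": 11, "CDYa": 34, "marker_3": 15}  # step 2, right at the cutoff

for name, sample in [("exact", exact), ("one_off", one_off), ("marginal", marginal)]:
    print(name, step(sample, np_definition), matches(sample, np_definition))
```

A sample at step 2 is what this topic calls marginal: it sits exactly at the cutoff and is not counted as a match.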
On this basis, 3 of the untested N type samples fit the definition and are predicted L1029- members of the hypothetical Np clade. Two more are marginal, so perhaps there are 14 Np samples among the 114 N type. N type is 8.8% of the Polish Project, so that means 14 / 114 * 8.8% = 1.1% Np samples in the Polish Project. The statistical uncertainty is wide, so my estimated 80% confidence range is 0.5% to 2%. Insofar as the Polish Project is representative of Historical Poland, it seems the Np hypothetical clade has roughly 1% frequency in the region of Historical Poland. Of the 10 confirmed Np samples, 8 provide “Poland” as origin, one “Russian Federation” and one “Lithuania”. The 3 predicted Np samples have two “Poland” and one “Belarus”. There is no need to subtract the samples without “Poland” because the Polish Project as a whole has a similar frequency of samples not “Poland”; such samples come from men with evidence of male ancestry from Historical Poland.
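The frequency estimate is simple proportional scaling, reproduced here as a short Python check. All numbers are the ones quoted in this topic; only the variable names are introduced for the sketch.

```python
np_samples = 14        # 10 confirmed + 3 predicted + roughly 1 from the marginal samples
n_type_samples = 114   # N type samples with 67 markers in the Polish Project
n_type_percent = 8.8   # N type as a percentage of the Polish Project

# Scale the N type frequency by the fraction of N type samples that are Np.
np_percent = np_samples / n_type_samples * n_type_percent
print(round(np_percent, 1))
```

The point estimate says nothing about the uncertainty; the 0.5% to 2% range quoted above is a separate judgment that this sketch does not reproduce.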
DYS460 = 10 is a very strong signature marker for Np. All 13 of the confirmed and predicted Np samples have this value. Those two outlier samples also have this value. Among those 41 L1029+ samples, only 6 have this 10 value; 3 have 12 and the 32 others all have the N type modal 11 value. The statistics of this paragraph are misleading because DYS460=10 was used to encourage L1029 testing in the Polish Project. I would expect a few Np to show up in the future with 460 value other than 10 (mutated from the Np ancestral value), and I would expect in the long run a lower fraction (less than 6 / 32) L1029+ to have the 10 value (independent mutations). Among the 49 N type samples not confidently assigned to sub-categories, only 5 have the 10 value, and 1 of these is a marginal Np sample mentioned above.
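The contrast in DYS460 values can be tabulated directly from the counts given in this topic. A minimal sketch; the function name share_with_value is introduced here for illustration, and the shares carry the same selection bias described above, because DYS460=10 was used to choose which samples to test.

```python
# DYS460 repeat values among the 41 L1029+ samples, as counted in the text.
l1029_plus = {10: 6, 11: 32, 12: 3}
# All 13 confirmed and predicted Np samples have DYS460 = 10.
np_cluster = {10: 13}

def share_with_value(counts, value):
    """Fraction of the samples in counts that carry the given repeat value."""
    total = sum(counts.values())
    return counts.get(value, 0) / total

print(f"L1029+ with DYS460=10: {share_with_value(l1029_plus, 10):.1%}")
print(f"Np with DYS460=10:     {share_with_value(np_cluster, 10):.1%}")
```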
CDYa = 33 is another good signature. These two markers alone with cutoff 1 (that means both markers match) capture 9 of the 13 Np samples (Np defined as 13 captured by 37 markers cutoff 2). These two markers also capture 2 marginal samples (at the step 2 cutoff of Np at 37), plus only one other N type, plus a few D type (D are not members of the M458 clade, but DYS460=10 is modal in D). CDY is a fast mutator, so it is unusual for it to serve as a signature marker. I ran into this on one other occasion, where I postulated that a mutation disabled CDYb; see my discussion at http://www.gwozdz.org/L540.html#CDYb. Actually, another reasonable explanation is that this CDYa=33 signature is just luck, because using only 10 samples we should not be too surprised that one of the rapid mutators looks like a signature, by the luck of random mutations. Yet a third explanation: Np might really be 2 or more clades where the ancestors (MRCAs) of each clade had the CDYa=33 value by luck, but those ancestors differed at other markers; this explanation is discussed more below.
There are no more good Np signature markers. Np modal values differ from N modal values at only 4 of the 67 markers. There are only two Np samples at 111 markers, and they do not seem to differ from N at those additional 44 markers. On this basis, I am not confident that my definition is very precise, because it takes as little as 2 mutations in the male line history for a sample to be incorrectly predicted, using any STR definition.
There is another reason for my uncertainty about my 37 marker Np definition: I worked harder than usual to construct this definition, so there is selection bias. Markers that just happen to have no mutations in those 10 samples are all in the definition. Any marker got dropped if it produced 2 or more mutations in any sample of those 10. Surely as more samples show up I’ll need to modify my definition. Those 37 markers are only a “good bet” definition for Np prediction today.
I published my SBP method of quantifying confidence in clade predictions based on Y-DNA STRs. Lower SBP means higher confidence. I reserve the word type for clusters with SBP < 20%. I consider SBP meaningless for SBP > 50%. Np comes out with SBP = 64%. This does not necessarily mean that Np is invalid as a clade prediction. My SBP method gives larger values for SBP with few samples, so valid clades improve with more data (SBP becomes smaller). A clade with modal STR values close to the father clade (N is the father of Np) necessarily comes out with large SBP. Concentration in Poland is evidence of validity for Np. That 460=10 is also evidence of validity. In my estimation, Np has about 80% confidence of validity, all evidence considered, but only 50% confidence of being a unique clade. Np might be primarily one clade with interference from other independent small clades with similar STR values. Or, Np might be 2 or more clades, about the same size, all concentrated in Poland, but distantly related. Clarification: two clades with very close nodes to the father branch might be considered a single clade; here I mean that Np might be 2 clades with nodes that are not close in the tree, perhaps with other small clade nodes between them that do not fit Np STRs (by the luck of random mutations in the ancestor). More discussion below on this idea.
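The naming thresholds can be written as a small rule. This sketch only encodes the two thresholds stated above (20% and 50%); it does not compute SBP, which is defined in my publication, and the function name and label strings are made up for illustration.

```python
def classify_by_sbp(sbp_percent):
    """Label a cluster from its SBP value (lower SBP means higher confidence)."""
    if sbp_percent < 20:
        return "type"
    if sbp_percent <= 50:
        return "cluster"
    # Above 50% the SBP value itself is not considered meaningful.
    return "cluster (SBP not meaningful)"

# SBP values quoted in this topic: Np in the Polish Project and in the R1a Project.
for label, sbp in [("Np, Polish Project", 64), ("Np, R1a Project", 95)]:
    print(label, "->", classify_by_sbp(sbp))
```

A label of “SBP not meaningful” does not say the clade prediction is invalid; as explained above, few samples and modal values close to the father clade both push SBP up.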
In the R1a Project, my 37 marker definition captures 11 samples with SBP = 95% (data at 67 markers, download 14 Oct). Eight of the 11 have L1029- result and the others are not tested yet. Seven of the 11 are of “Poland” origin. Two L1029- are N type that do not match Np. There are 38 L1029+ that do not match Np. Summary: L1029- are rarer in the R1a Project (compared to the Polish Project) and the L1029- predominantly match Np. SBP is worse (higher) because of interference at the cutoff by more R1a samples from outside Poland. This paragraph is not conclusive, however, because the administrators of both projects work together; many of the samples come from men who joined both projects. Both projects worked hard on getting L1029 results this year, using 460=10 fit as a guide for emphasis.
As an independent test, I checked (11 Oct) the “RussiaDNA” Project (another FTDNA project). Of 260 R1a total, only 12 have been tested for L1029, and only 2 of these 12 came out L1029-: one Poland and one Russian Federation. This is preliminary evidence that Np is rare in the Russian Federation, although N is common in all Slavic countries.
More projects checked (14-15 Oct):
Russian_impire: 4 L1029 tests, one negative, not Poland
LituaniaPropria: 4 L1029 tests, two negative, both “Lithuania” origin, one L1029- also in the Polish Project, both also in the R1a Project
in addition, both L1029+ are also in the Polish Project, and one in the R1a Project, so these are not independent data
Scottishdna: no L1029 tests
Finland: 1 L1029 positive
BritishIsles: 1 L1029 positive
Other projects are not concentrating on L1029 tests. I hesitate to encourage them, because M458+ L1029- seem to be mostly from Poland.
I have an R1a database at 67 markers with 1816 samples from 15 FTDNA projects. I collected this 20 June, when there were fewer L1029 results. My 37 marker definition captures 13 samples, but 12 of these are in the Polish Project, and the other is in the R1a project. No additional samples fit Np. There are 10 more marginal samples at the cutoff step 2; only 2 of them are in the Polish Project, and only one is from Poland. This is my strongest evidence that the Np cluster is concentrated in Poland.
Ysearch: 9 samples are captured by my Np definition CHFXB. Only 2 are from Poland. Only 2 of the 13 Polish Project Np joined Ysearch (one Poland and one Lithuania). SBP is poor for Np at Ysearch because there are 6 samples at the step 2 cutoff, none from Poland. In addition, 2 “Central European” modals fall at step 2 (37 markers used), emphasizing how hard it is to separate Np. A simple explanation for these Ysearch results is that there are 1 or more other clades concentrated outside Poland, which might be L1029- or L1029+.
At the top of this topic, I reported “more than 90% of N type” (M458+ L260-) are L1029+. Since L1029- are concentrated in Poland, it may actually be more than 95% worldwide. However, there is a reasonable possibility of one or more small clades showing up L1029- from outside Poland when more samples are tested.
Age of Np: It is too soon to estimate the age (TMRCA) of L1029, and age based on STR variation is uncertain because of known caveats. However, L1029 is probably not much younger than N type because L1029 includes almost all of N type. N type is surely older than 2,000 years. Indeed, variation of L1029 STRs is looking similar to N type variation. The L1029- node is necessarily the same age as or older than the L1029 node, so Np has an old node. However, the age of the node is almost always older than the age of the clade (TMRCA). Np seems very young, as evidenced by the unique 460=10 value discussed above. On the other hand, other markers have significant variation within Np; that may mean Np is not so young; or, that may mean Np is composed of 2 or more clades, each of which is young.
Speculation: Np reminds me of P type (L260 update, next topic). In my 2009 publication, and at this web page, I have speculated that L260 may have a very old node, but the P type ancestor (MRCA) may have lived more recently, perhaps not long before formation of the tribes that led to the Polish nation. It seems to me that M458 is quite old, but not many M458 individuals survived over the millennia, and a few of the M458 survivors were lucky enough to found clades during the population expansion of the last 3 millennia. Perhaps the Np ancestor, with L1029- and 460=10, also lived long ago and left few survivors; most of those few formed what are today very small clades, and one was (or perhaps 2 or more, all with 460=10, were) lucky enough to found the medium sized cluster today apparent as Np. I find it interesting to consider the men who lived 1,000 to 2,000 years ago in the region that is now Poland (and / or maybe in another region from which there was a migration to Poland). Due to the statistics of Y-DNA inheritance, most men do not form clades that last long, and very few men form large clades. Human behavior may perhaps broaden the statistical spread of clade size, allowing rare men to produce relatively larger clades. I speculate that among those proto-Polish men who founded clades that survive today, most were R1a, and many of those were M458, and one or a few of those were Np, and one was P.
This topic was updated 5 Jul 2010.
If you are P type or N type you would likely come out positive in the SNP test for M458 (M458+). If you are P type you are likely L260+. N type is likely L260-. If you have not already tested you can pay the small fee for these SNP tests to confirm that you belong to the corresponding haplogroup.
If you are assigned to P borderline or to N borderline you would benefit more from the M458 and L260 tests, because that would provide for you a definite assignment within R1a.
The assignment rules are designed to give assignments with high probability, so if you are unassigned (category U) there is a low probability that you would test positive for M458, with probability that decreases with your step (genetic mutation distance) from P or N.
If you have fewer than the standard 67 STR markers it is generally better to purchase the remaining markers. That way, you are more likely to get an assignment, because the statistics for STRs improve with more markers. Nevertheless, if you are not many steps from P or N you might consider doing the M458 test even with fewer than 67 markers.
There is a slight chance that you might test positive for L260 or M458 even if you do not match P or N. The haplogroup corresponding to M458 is old enough that there may be small clades with STR markers very different than P or N. I have not seen one yet, but there is no way to estimate this probability. I hesitate to recommend the M458 SNP test for men whose samples are distant from both P and N in STR values. I admit you can just wait to see if anyone with STR values similar to yours matches an SNP, then test for that SNP. However, we all benefit when some men test for all the new SNPs within an established haplogroup, because that way we find out the size and rough age of the corresponding new haplogroup branches. FTDNA offers “deep clade” test packages to test for all possible haplogroup branches, but my understanding is that L260 and M458 are not yet included in the R1a deep clade test. You need to purchase them separately from the advanced markers menu. No doubt FTDNA will add them soon to the deep clade package.
The Fall 2009 issue of the Journal of Genetic Genealogy has my publication split into two parts:
Part I is my “mountains in haplospace” method for evidence that certain “types” of STR clusters correspond to clades.
Part II is the application of that method to Common Polish Clades. That article has a lot more detail than this web page, but that article was published in the Fall of 2009, so this web page serves as an update.
PolishCladesUpdate is my folder for updates of the Excel analysis files for those two articles.
This web page will continue as an introduction and summary, without as much jargon and detail as the articles and update folder.
The Fall 2010 issue has my publication announcing the L260 SNP.
Wikipedia has a nice R1a entry with primary contribution by Andrew Lancaster.
11 Jan 2011 update: There is a lot of activity these days in the discovery of new SNPs for dividing R1a into branch haplogroups. You can follow the activity at the R page of the ISOGG Y-DNA tree, and also at the FTDNA Draft tree.
The new SNP named L365 includes what I have been calling G type, based on preliminary data. It is too early to say if other samples in addition to G type are positive for this new SNP.
The new SNP named M417 excludes what I have been calling C type, based on preliminary data. So far very few R1a samples are negative for this new SNP, but it is too early to estimate the rarity of M417-.
In early 2011 FTDNA released some new SNPs for commercial testing, including the following for R1a: L365, M417, L366, L291, and others.
L260 and M458 are discussed below.
There are other new experimental SNPs discussed on the web. I’m not trying to list everything here, just the ones that are of interest for discriminating new R1a haplogroup branches.
This topic is an example of the confusion of haplogroup names. I’ll not update this “Confusion” topic because this 2010 discussion is a good example. This topic is technically out of date, but it still serves as an example. This confusion applies to all haplogroups, not just R1a.
In my fall 2009 publication I used the notation that was well known at the time, where more than 95% of R1a was known to be paragroup R1a1. The R1a1 samples with one of four very rare SNPs that had been known for a few years were called haplogroups R1a1a through R1a1d. Ysearch still (25 Oct 2010) uses the notation described in this paragraph. FTDNA Projects still use this notation for automatic assignment of samples. Individual samples are not actually assigned to a paragroup because most have not been tested for all SNPs. Most R1a samples are listed as R1a1. Many samples are listed as just R1a but almost all of those would come out R1a1 if tested for the appropriate SNP (the well known M17 or M198, or one of the new ones that all seem to be equivalent). I mentioned in my publication that all Polish Project R1a were coming out R1a1. Since then only one sample (out of 1441 R1a total in the Polish Project) has come out M198-.
New SNPs were discovered equivalent to SRY10831.2, the original R1a SNP. Subsequently, rare samples were found positive for some of these new SNPs but negative for SRY10831.2. I’ll use L62 to represent these; there are others that seem to be equivalent. Those define two small paragroups, R1a(L62, SRY10831.2-) and R1a1(SRY10831.2, M198-). That previous R1a1 paragroup becomes R1a1a(M198). Accordingly, when Underhill announced the M458 SNP, he called that haplogroup R1a1a7. L260 was called R1a1a7b when first discovered. In spring 2010 I rewrote this entire web page using the notation described in this paragraph.
The recent new SNPs change the notation again. I shall not attempt to rewrite this entire web page. As I update topics, I’ll use the current notation. For clarity, I’ll add the defining SNP in parenthesis when I do updates.
The choice of which SNP to put in parenthesis is arbitrary for haplogroup notation. For example, R1a1a1(M17), R1a1a1(M198), and a few others, all seem to be equivalent. But any day now someone might announce a few samples that test negative for one of those SNPs and positive for all the others, which would define a new paragroup and force the renaming of all branches beyond that new node in the tree.
There is ambiguity in assignment of samples. For example, a sample that tests negative for M198 might be called R1a(M198-), but it is not clear if this sample belongs to the paragroup R1a(L62) or to the paragroup R1a1(SRY10831.2) if it has not been tested for the latter.
My types have an uncertainty similar to SNPs. For example, I said N type is equivalent to R1a1a1g(M458, L260-). Recently two samples showed up in the Polish Project that are M458, L260- but just beyond N type as defined by STR fit. We can think of these two as a new “paratype”, although I’ll not use that word. We classify these two in the Polish Project as “M458+R”, the Remainder in M458 excluding N type and P type. Actually, as I discuss in the N type topic, it is not statistically certain where to place the cutoff for N type, so you could argue that the M458+R category has more than two samples in the Polish Project.
Full rewrite 16 Jan 2015; minor edits 17 Jan 2015.
M458 is an SNP that was published by Underhill. It has been available as an SNP test since early November 2009 at FTDNA. M458 is common in Eastern Europe and is found throughout Europe and Western Asia. M458 is a branch of R1a.
P type is the main topic in my previous 2009 publication, Part II. P type is significantly concentrated in Poland, and in the Czech Republic. It is found at lower frequency in other Eastern European countries, and in eastern Germany. About 9% of Polish males carry P type Y-DNA. M458 was published after I submitted my article, during the referee process, so I added a comment about M458. After my publication, L260 was discovered, found to be equivalent to P type, confirming my prediction that P type corresponds to an SNP haplogroup.
My P type topic has lots of discussion that applies to L260.
CTS11962 has been available as an SNP test since early 2013 at FTDNA. CTS11962 is a branch of M458. Almost all M458 samples are either L260 or CTS11962; exceptions are very rare. CTS11962 is equivalent to what I had been calling N type.
L1029 has been available as an SNP test since March 2012 at FTDNA. L1029 is the main branch of CTS11962.
See the Results Table for an updated overview of M458 and L260 in Poland.
M458 has recently been divided into a large number of branch haplogroups defined by newly discovered SNPs.
The current ISOGG code for M458 (16 Jan 2015) is R1a1a1b1a1. L260 is R1a1a1b1a1a. CTS11962 is R1a1a1b1a1b.
See also http://www.yfull.com/tree/R-M458/, which has more recently discovered branches.
Both P type and N type are code names published by me before the corresponding SNPs were discovered. At this web page, I have been saying that L260 seems almost equivalent to P type, and that M458 seems almost equivalent to N type plus P type. I say “almost” because there have always been borderline samples. In Feb 2012, a few samples turned up that are exceptions: M458+ samples that fit neither P nor N. Not even close with STR values.
Technical Reminder: There is a logical distinction between an SNP haplogroup and an STR type. I use the word “type” for clusters of samples where I have 80% or higher confidence that the samples of the type correspond to a unique clade. I use the word “borderline” for samples that seem to have 50% to 80% confidence of belonging to that clade. P type and N type are very well isolated in haplospace, with relatively few borderline samples. P type is particularly well isolated. The age of a type is the MRCA (ancestor at the hypothetical node for the type). A new SNP may be younger than a type, capturing only part of the type. A new SNP may be older than a type, capturing all of the type plus additional samples that are not predicted into the type. Even if a new SNP captures all the samples of a type, future samples may show up that are positive for the SNP but do not fit the type, not even Borderline. These might be members of older branches (branches with nodes older than the type), or they might be statistical outliers (members of branches within the type, where these particular samples have significantly more mutations than statistically expected due to luck). Also, there may be some samples that fit the type with STR values but do not test positive for that new SNP. These might be members of the oldest branches of the type, older than the new SNP, or they might be outliers from other clades with distant nodes.
Both P type and N type are code names published by me before the corresponding SNPs were discovered. At this web page, I have been saying that P type seems almost equivalent to L260, and that M458 seems almost equivalent to N type plus P type. I say “almost” because there have always been borderline samples. Rare outlier samples have shown up. Sometimes at this web page I say that a type is equivalent to an SNP, particularly in summary paragraphs; please realize that by “equivalent” I mean almost but not exactly the same.
Example, Ry type: There is a family set (five samples with the same family name, very close STR match to each other) where one of them tested M458+ L260- CTS11962-. These five are clearly not N or P, not even close in STRs. These five samples are now categorized in the Polish Project as “Ry type”. They were independently noticed by Lapinski, the administrator of the R1a Project, who created a new category for them in that project also. Since these were recruited as a group, they should count as only one sample for statistical purposes. There is one other independent sample that tested M458+ L260- CTS11962-. So these two are rare examples of M458 that are not P type or N type, nor do they belong to the corresponding two known haplogroup divisions of M458.
MRCAs for N and P must have lived long after the node for these two branches in the Y-DNA tree, because the STR definitions for N and P are very different (compared to definitions of other haplogroups, with yet older nodes). I say “must have” because this is a statistical conclusion; it is possible but very unlikely that N and P have a node not much older than the two MRCAs, and that those two men had unusually divergent STR values due to the random luck of mutations.
An article was published online, 4 Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.
I call this article “Underhill” for short, because his is the lead name in the list of 34 authors for this major work.
This web page about Polish Clades was completely rewritten using this new information. Recent L260 and M458 test results are consistent with (albeit not full proof of) my previous R1a subdivision into “types” here on this web page about Polish Clades.
Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-). See R1a Subdivision for a brief summary of other groups, and for a clarification of what R1a1a* means.
R1a1a7 is the new M458 haplogroup. R1a1a7 includes what I have been calling P type and N type here on this web page, even before M458 was available.
R1a1a* is a new paragroup. This is M458 negative. It includes all my other R1a types.
This Underhill article has data for 158 “Poland” samples (Table 2):
R1a1a*: 71 samples 44.9%
R1a1a7: 87 samples 55.1%
The 70% confidence interval for R1a1a7 is about 50% to 60% in the Underhill Poland data.
Worldwide 77% of the Underhill data is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
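The 70% interval quoted above for the Poland data can be roughly reproduced with a normal approximation to the binomial. This is only a minimal sketch, not necessarily the calculation used for this page; the function name proportion_interval is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(count, total, confidence=0.70):
    """Normal-approximation confidence interval for a proportion."""
    p = count / total
    # Two-sided z value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Underhill Table 2, Poland: 87 of 158 samples are R1a1a7 (M458+).
low, high = proportion_interval(87, 158)
print(f"R1a1a7 in Poland: {87 / 158:.1%}, 70% interval {low:.1%} to {high:.1%}")
```

With these counts the interval comes out at roughly 51% to 59%, consistent with the "about 50% to 60%" stated above.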
Up to here, I have tried to write this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity with genetic genealogy jargon has understood. If you read this top to bottom, it gets progressively more detailed, with more and more jargon. I’m sorry about that, but the audience is also readers with genetic genealogy experience who want to know how I came to my conclusions. If you cannot follow some of this, it is written in a manner that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.
If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location.
Rewrite 21 Jan 2015:
Lawrence Mayka is the administrator of the Polish Project. Click on the Polish Project web link to see how Larry assigns samples (men) to categories. The Polish Project has sections for mtDNA and for Y-DNA. This web document of mine is restricted to Y-DNA. I help Larry with assignments to types. Lukasz Lapinski also helps with assignments.
Haplogroups are defined by SNP mutations. The goal is to assign samples to their proper terminal haplogroups. When the terminal haplogroup cannot be assigned with reasonable confidence, assignment may be to one of the branches leading to the terminal haplogroup. Terminal haplogroups are changing often recently because of the rapid rate of discovery of relevant new SNPs.
STR mutations have been easier to test, so many samples have STR data without SNP data. Predicted assignments (for samples without up to date SNP data) are based on STR correlations, by comparison to samples that have both STR and SNP data. The men with such assignments can verify their assignments by ordering the corresponding SNP test.
The assignment guideline is minimum 80% probability and minimum 80% confidence for each individual sample. Using an 80% minimum, most assignments are better than 80%, of course. So the average probability for predicted assignments is higher than 80%. Indeed, over the years, Polish Project predicted SNP assignments, when tested later with the predicted SNP, have been verified more than 80% of the time. In other words, if you are assigned to an SNP category without qualification you have higher than 80% expectation that your sample will test positive for the corresponding SNP.
The 80% guideline does not apply to categories with a qualification, such as “Borderline” or “Cluster” or “Needed” or “Recommended” or “Consider”. These qualifiers are for assignments with less than 80% confidence due to insufficient data for prediction, but in all cases the confidence is greater than 50%. Also, for such SNP recommendations, even if the SNP test comes out negative, such a negative result would be useful because that means another SNP can be predicted, perhaps with higher confidence.
I introduced the types as categories based on STR analysis, before corresponding SNPs were available. My guideline has been minimum 80% confidence that the type corresponds to a future SNP. Indeed, the verification rate of types by SNPs has been higher than 80% over the years as new SNPs are being discovered. Many types that have been verified by SNPs now use both the SNP and the type code letter for assignments. Some types still do not have SNPs, so assignments to those types are still based on STRs. For some types, “Borderline” is used for samples with 50% to 80% confidence.
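The thresholds described above can be summarized in a few lines of code. This is a sketch only: the label "Assigned" and the function name are invented here, "Borderline" stands in for any of the qualified categories, and the hard part (estimating the confidence for a sample from its STR fit) is not shown.

```python
def assignment_category(confidence):
    """Map an estimated confidence (0 to 1) to a category label.

    Thresholds follow the 80% and 50% guidelines in the text; the
    label names are illustrative, not official project categories.
    """
    if confidence >= 0.80:
        return "Assigned"    # unqualified assignment
    if confidence > 0.50:
        return "Borderline"  # qualified assignment
    return "Unassigned"      # too uncertain to predict


for c in (0.95, 0.80, 0.65, 0.40):
    print(c, assignment_category(c))
```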
Sometimes a Remainder category is used for paragroups, which means the remaining samples from a haplogroup that do not belong to any of the known subdivision branch categories of that haplogroup.
The Polish Project also includes FTDNA assignments, which use a color code; green text means assignment based on an SNP test; red text means assignment based on STR prediction. I do not know the FTDNA computer algorithm for those red STR based predictions, but it is conservative; I notice they have more than 95% probability - less than 5% of those end up in different haplogroups when they are eventually SNP tested. However, that means most of the newer SNPs are not used, because there is not enough data for such high probability predictions.
The Polish Project assignments are more aggressive with the 80% guideline; the intention is to provide more STR based assignments, accepting the risk that some might later be found incorrect.
Most of the larger Y-DNA branches have multiple equivalent SNPs that define the same branch. Since early 2003, the FTDNA computer seems to be selecting different equivalent SNPs as assignments for various samples from the same such branch. The Polish Project assignments consistently use only one of the equivalent SNPs - the one historically used the most for that branch. So there may be some confusion when comparing the FTDNA vs Polish Project assignments.
Comment added 20 Oct 2012: This is a long topic with many short subsections, each for a category. Many of these subsections are out of date and need to be rewritten. The subsections without a date on the first line may be a few years old.
This large topic has descriptions for many of the Y-DNA categories at the Polish Project. Some of these are haplogroups, some are types, some are clusters. Types and clusters are high confidence hypothetical haplogroups. Borderline categories are lower confidence. There is also the Unassigned category for uncertain samples.
Click the Ysearch web links in the Results Table for modal haplotypes, which are my best fits of web data to groups of men with similar STR data.
Please don’t get confused. The following capital letter names are my codes for R1a categories. Capital letters are also used for the large official haplogroups, but that’s different.
Some of the following categories are discussed in my November 2009 publication, and may have archive copies of my 2009 Excel analysis files stored in the Supplementary folder. Many of the following types have my update Excel analysis at PolishCladesUpdate.
A type is a hypothetical clade of L342, which is a branch of Z93. A type does not correspond to a haplogroup yet, because there are L342+ samples that do not match the A type definition.
I have consistently expressed more than 98% confidence that A is a valid clade, not just because of my work, but because the modal haplotype closely matches the various versions of the most common Ashkenazi haplotype, which has been widely studied and reported on the web. It should be emphasized that not all Ashkenazi match this type, and some men in this type may not be descended from Ashkenazi. This type is not restricted to Poland. Levy-Coffman wrote an article about Ashkenazi genetic genealogy; I noticed discussion in a recent Science article. I expect an SNP to show up someday equivalent to what I have been calling A type.
Between 2008 and 2011 I predicted that A type was a subtype of K type, but I never had more than 80% confidence in that prediction, which is now seen to be wrong, because K type is in Z283, a brother SNP to Z93. See the R1a Project for a recent SNP tree. The match of A type to K type at the first standard set of 12 markers is now seen to be a coincidence. Older publications call that 12 marker haplotype, very common in Eastern Europe, the “Ashkenazi” haplotype, but we now know that only a small fraction of men who match at 12 markers are Ashkenazim.
B. Update 8 Mar 2012. A hypothetical subtype of K type, identified by Mayka. Concentrated in Poland. I have more than 90% confidence that B type represents a clade that will be verified some day with a new SNP discovery. My confidence is only about 80% that it is a subtype of K; the node for B type in the R1a tree might be slightly younger or slightly older than the K definition node. Individual assignments to B type have 80% or higher confidence, depending on how closely each fits.
C. Update 10 Mar 2012. This type code name was dropped from the Polish Project in early 2011. The two C type samples are both now listed as R1a1a (M198+,M417-), and they are the only samples in this paragroup, so that is a better label. These are the only two R samples in the Polish Project with the signature (385a,455) = (13,10). C was added to the Polish Project in Dec 2009 by Mayka, who noted that Didier Vernade first pointed out the unusual DYS392=13 value in 2007. DYS392=11 is almost universal in R1a. C type is very small. There are only 2 Polish Project samples in C type, only 1 at 67 markers, but this type is well isolated on Ysearch, with 4 different samples with 67 markers. I calculated SBP = 7% using only 37 markers with Ysearch data (in early 2010). None on Ysearch are identified as “Poland”. C type differs very much in STR values from the rest of R1a1. That is because C type has a very old node in the R1a tree.
D. This type was added to the Polish Project in Jan 2010. The cluster was brought to my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some time ago, based on DYS462=12.
Signature (460,481,462,560) = (10,<22,12,18). Any one of these four markers by itself can distinguish D type with high probability from other R1a1a1i (Z280) samples, but those values can be found individually as independent mutations in other R1a clades. D type cannot be distinguished using the 25 FTDNA standard markers. At 37 markers, only 460 is available.
At 67 markers, 481<22 is an effective signature: 16 total D type: 13 D have 481=21, and only one other R1a sample has the 21 value. 2 D have <21, with no other R1a samples. One D has the 22 value along with several other R1a. 481=25 is modal for R1a.
DYS462 is a standard STR marker at Sorenson, and has been available for years at Ysearch; 462 is now available at FTDNA with the 111 marker set. In Nov 2011 I noticed that DYS560=18 is another marker for D type from the 111 set, but that is not available at Ysearch (Nov 2011).
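As an illustration of how the D type signature can be checked when only some of the four markers have been tested, here is a small sketch. The signature values come from the text above; the function name, the dictionary layout, and the example sample are assumptions made for illustration.

```python
# D type signature from the text: (460, 481, 462, 560) = (10, <22, 12, 18).
D_SIGNATURE = {
    "DYS460": lambda v: v == 10,
    "DYS481": lambda v: v < 22,
    "DYS462": lambda v: v == 12,
    "DYS560": lambda v: v == 18,
}


def signature_matches(sample, signature=D_SIGNATURE):
    """Return (matched, tested) counts over the signature markers.

    Markers missing from the sample are treated as not tested.
    """
    tested = [m for m in signature if m in sample]
    matched = [m for m in tested if signature[m](sample[m])]
    return len(matched), len(tested)


# Hypothetical 67-marker sample: only DYS460 and DYS481 are available.
sample = {"DYS460": 10, "DYS481": 21}
matched, tested = signature_matches(sample)
print(f"{matched} of {tested} tested signature markers match D type")
```

A signature match is only a hint, since each of these values can also turn up as an independent mutation in other R1a clades; the full STR fit still decides the assignment.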
That DType.xls analysis file provides SBP = 5.3%, although I did manual editing of the definition to improve SBP, providing some selection bias. On the other hand, isolation of D type is even better than indicated by SBP for two reasons: Samples just beyond D type, steps 12 and 13, all have solid assignments to other types. Most of the D samples have 462=12 and a few have 560=18, and those samples beyond step 11 with data have other values at those 2 markers, so a future definition using all 111 markers should provide even better (lower) SBP. Only 3 D type have 111 markers; most of the DYS462 data was obtained some time ago by purchasing that marker separately.
D type seems to be Z280+ Z92-, based on only 1 sample (10 Nov 2011 - columns BW and BX in that analysis file). Z92 is a new SNP, so not much data is available; confirmation should be available soon.
D type is clearly a Polish type: In the Polish Project 10 of the 16 D type at 67 markers indicate “Poland” ancestry; the exceptions are 2 “Unknown” (one with an obvious Polish name and one with a name that might be Polish), 2 Slovakia, 1 Germany, and 1 Czech Republic.
On Ysearch, there are 32 samples below the D type cutoff, and 11 of them (34.4%) indicate Poland Origin, which is quite high for Ysearch. SBP is 15% on Ysearch, implying there are clades near the cutoff that are rare in Poland; indeed none of the 5 samples in the gap at steps 9 and 10 indicate Poland. For details see the “Ysearch” sheet in DType.xls.
Age (ASD sheet cell N12) comes out 1,385 years using all 67 markers. Old human Y-DNA clades have age older than the raw ASD calculation because of population bottlenecks and because of other statistical adjustments. However, D type is not very old, so this correction may not be needed. On the far right of that ASD sheet I sorted markers by age, and I added notes about problem values, and suggested four markers that should be masked out, but the age with these 4 masked out (ASD sheet cell N29) is not much different, 1,216 years. I see evidence of subclades, so D type might be composed of younger subclades that might be identified with more data.
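For readers unfamiliar with ASD (average squared distance) ages, the sketch below shows one common textbook form of the calculation, including the effect of masking out markers. It is not necessarily the formula used in the ASD sheet, and it applies none of the corrections mentioned above. The marker names, repeat counts, mutation rates, and the 25 years per generation are all made-up toy values.

```python
def asd_age_years(samples, modal, rates, years_per_generation=25, mask=()):
    """Rough clade age from the average squared distance (ASD) to a modal.

    samples: list of dicts, marker name -> repeat count
    modal:   dict, marker name -> repeat count assumed for the founder
    rates:   dict, marker name -> assumed mutations per generation
    mask:    markers to leave out (for example, suspected problem markers)
    """
    markers = [m for m in modal if m not in mask]
    # Mean squared difference from the modal, summed over the markers used.
    asd_sum = sum(
        sum((s[m] - modal[m]) ** 2 for s in samples) / len(samples)
        for m in markers
    )
    generations = asd_sum / sum(rates[m] for m in markers)
    return generations * years_per_generation


# Toy data only: marker names, repeat counts, and rates are all made up.
modal = {"M1": 10, "M2": 12, "M3": 15}
rates = {"M1": 0.002, "M2": 0.003, "M3": 0.004}
samples = [
    {"M1": 10, "M2": 12, "M3": 15},
    {"M1": 11, "M2": 12, "M3": 15},
    {"M1": 10, "M2": 13, "M3": 17},
    {"M1": 10, "M2": 12, "M3": 15},
]
print(round(asd_age_years(samples, modal, rates)))
print(round(asd_age_years(samples, modal, rates, mask=("M3",))))
```

In the toy data the marker M3 carries a two-step difference, so masking it lowers the estimated age; that is the kind of sensitivity check described above.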
I noted three markers (on the far right of the ASD sheet) that I consider hints for subclades. Last year in this topic I mentioned Da, with the signature (458,576,444) = (16,20,14), and that still looks promising, but not convincing. One of the three D samples with 111 markers fits Da, and provides a hint that markers 463 and 715 from the 111 extension might help to resolve Da, so it will be interesting to see what happens as more D men order the 111 extension.
E. Update 8 Mar 2012. V. Rudich entered a modal for this cluster into Ysearch as ID MW7DP, named “North Eurasian”. Mayka modified it slightly for the modal used here by me, GNYBG, named “Belarus”. 67 markers. It’s an excellent type; on 25 May 2010 it had 16 samples at 67 markers in the Polish Project, with SBP = 14%. In late 2011 E type samples tested positive for the new Z92 SNP, corresponding to the R1a1a1g2d haplogroup (ISOGG early 2012). However, not all Z92+ samples fall into types.
FH Clade. F and H types were suggested by Mayka. They have the signature (439,511,452) = (11,11,28). They differ from each other, so I could not make a combined FH type. I can make a reasonable FH cluster, but it is not necessary, since the FH clade can be better defined as the combination of the three types Fa, Fb, and H. The original F type (introduced Jun 2010) was split into Fa and Fb in Dec 2010. DYS452 is not one of the FTDNA standard markers, so not many Polish Project members have this marker evaluated. Mayka and I helped most of the Polish Project members in FH, and members just beyond FH, to get 452 evaluated. Samples beyond FH have 452=30. My analysis files do not use 452 for determination of SBP. 452 would not significantly lower SBP because most of the background near the cutoff for each type are samples from the other two. In other words, Fa, Fb, and H are very well isolated from the rest of R1a, but not so well isolated from each other. These three FH types do not seem to be specifically concentrated in Poland (per Ysearch) although they are concentrated in Slavic countries including Poland. All three types seem quite young, with relatively low STR variance (see the ASD sheets in the analysis files).
FH Borderline. The borderline samples from Fa, Fb, and H are combined into a single FH Borderline category in the Polish Project, because these clearly belong to the FH clade but have less than 80% probability of belonging to any one of the 3 types.
Fa. Ysearch YQ6D2. 66 markers, cutoff 9, gap 2. SBP = 27%. See FH clade, above.
Fb. Ysearch EFQM7. 56 markers, cutoff 5, gap 4. SBP = 23%. These samples were the original F type, before Fa was split off. See FH clade, above.
H. Ysearch 559EE. 58 markers, cutoff 7, gap 3. SBP = 14.5%. See FH clade, above.
G. This type was suggested to me by Mayka, who calls it the Pomeranian cluster. Pomerania is the name of the region on the south shore of the Baltic Sea including regions of both Germany and Poland. Marcin Wozniak found the G modal haplotype (at 12 markers) to be very common among Kashubians. Kashubians consider themselves an ethnic group or nationality within Poland. It will be interesting to determine if Kashubians in Poland have a higher % concentration of G type than German Pomeranians. Meanwhile, “Pomeranian” is a convenient neutral name, suggests Mayka.
G type is mentioned only briefly in my publication because not much data was available to me at that time. My GType.xls update analysis file with June 2010 data has excellent results: There are 12 samples in a nice type with SBP = 11.2%. There is preliminary evidence of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in Ysearch; see Haplotypes.xls for a list including hypothetical working modals.
11 Jan 2011 news: Mayka informs me that one of the new SNPs, L365, is positive for all of 5 G type samples that were tested so far. A few samples from other types all tested negative for L365. It seems like G type is included in the new haplogroup defined by L365. One of those 5 is in that tentative Ga subtype.
Of course, this is very preliminary. It is possible, if unlikely, that some of the G type samples still might turn out negative for L365. It is quite possible other samples not matching G type might be found L365 positive. I’ll provide updates here.
Those 5 samples are positive for M417, negative for M458, and negative for a few other new SNPs.
L365 is one of a few new SNPs that look like they will receive the notation R1a1a1x, where x = i, j, k, etc.
This type should not be confused with another G type in the N haplogroup.
14 May 2011 comment: Sorry I have not taken the time to update this G type topic. Recent data continues to verify that G type seems the same as the haplogroup defined by L365, now called R1a1a1i.
For the latest status check http://www.yfull.com/tree/R-S18681/.
Rewrite 9 Nov 2014:
I type seems to be about equivalent to the new SNP S18681.
In other words, if you test positive for the SNP S18681, that places you in the S18681 haplogroup, which is closely equivalent to what I have been calling I type.
Samples that match I type at 111 STR markers are coming out positive for S18681. Samples that do not match I type at 111 markers are coming out negative for S18681, with only one exception.
At 67 markers, there are three S18681+ outlier samples that do not match I type. I type is defined by STRs, so future S18681+ outliers may not match I type, and a few samples beyond the I type cutoff may come out S18681+ in the future. I have been slightly adjusting the definition of I type as more 67 marker data accumulates, so the definition has been improving with time.
At less than 67 markers the probabilities of outliers are higher. More discussion about this below.
Most but not all I type samples in the Polish Project are also coming out positive for the new SNP YP331. There are two newer SNPs, YP314 and YP315, that are located between S18681 and YP331. The most recent SNP finding is Y5973.2.
The “father” of S18681 is CTS8816, with the two “brothers” L1280 and Y2902.
The SNP sequence is R1a > Z280 > CTS3402 > CTS8816 > S18681 > YP315 > YP314 > YP331 > Y5973.2.
This recent work on new SNPs is being done by Stanaszek, Milewski, Lapinski, and Mayka. Łukasz Stanaszek has a document R1a_S18681.doc with a listing of I type samples from both projects, along with discussion of the possible origin of the S18681 haplogroup. Michał Milewski has a tree chart for Z280, which includes S18681, at the forum: http://eng.molgen.org/viewtopic.php?f=77&t=1464&start=120 Check that forum topic for the most recent update.
My definition for I type is published at Ysearch EKVHX, uses 58 of the 67 STR markers, cutoff 8, SBP 16.2%. My analysis file is available as IType.xls. My Aug 2011 definition, which used 62 markers, still works quite well, as demonstrated in that xls file.
My I type definition works OK in the R1a Project and at Ysearch, but not as well as in the Polish Project. One obvious difference is that there are only four samples in the Polish Project confirmed or predicted into the paragroup YP314+ YP331-, while this paragroup at the R1a project is almost as large as the haplogroup YP331+. This paragroup does not seem to be as concentrated in Poland as YP331+, which explains why my I type definition, tuned to the Polish Project, does not work quite as well in the R1a project. Those four paragroup samples in the Polish Project are highlighted in that file IType.xls.
SNP confirmed data: So far (14 Oct 2014, both projects) there are only 3 samples confirmed with SNP tests in the paragroup S18681+ YP315- (plus one cousin assumed). There is only one sample confirmed in the paragroup YP315+YP314-. There are only three samples confirmed in the paragroup YP314+YP331-. There are 10 samples (59% of 17 fully confirmed samples) confirmed in the haplogroup YP331+ (plus one cousin assumed). Three of those 10 YP331+ are in the new haplogroup Y5973.2, and one of them is confirmed Y5973.2-.
There are about 20 samples confirmed or predicted S18681 that need testing for the branches; we guess that most of these will come out YP314+, and most of those will come out YP331+.
In other words, the largest subdivision of I type is looking like the haplogroup YP331+. The second largest looks like the paragroup YP314+YP331-.
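The paragroup labels used above (such as YP314+YP331-) follow mechanically from the SNP sequence and a sample's test results. The sketch below shows one way to derive such a label. The branch order is taken from the text; the function name, the label wording for untested SNPs, and the example results are assumptions for illustration.

```python
# Branch order below S18681, as given in the text.
CHAIN = ["S18681", "YP315", "YP314", "YP331", "Y5973.2"]


def branch_label(results, chain=CHAIN):
    """Label a sample by its deepest positive SNP and the next SNP's status.

    results maps SNP name -> True (positive) or False (negative);
    SNPs missing from the dict are treated as untested.
    """
    deepest = None
    for i, snp in enumerate(chain):
        if results.get(snp) is True:
            deepest = i  # a positive downstream SNP implies the upstream ones
    if deepest is None:
        return "not confirmed " + chain[0] + "+"
    label = chain[deepest] + "+"
    if deepest + 1 < len(chain):
        nxt = chain[deepest + 1]
        if results.get(nxt) is False:
            label += " " + nxt + "-"  # confirmed paragroup
        else:
            label += " (" + nxt + " untested)"
    return label


print(branch_label({"S18681": True, "YP315": False}))
print(branch_label({"YP314": True, "YP331": False}))
print(branch_label({"YP331": True}))
```

A sample labeled with an untested next SNP is exactly the kind that needs further testing to settle its paragroup or haplogroup.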
The I type samples not yet SNP tested with low step at 67 or more markers are predicted S18681 with high confidence. There are “Borderline” samples close to the cutoff for I type, and samples close to I type at <67 that might be S18681, but cannot be predicted with high confidence.
The Polish Project and the R1a Project categorize all samples and recommend which SNPs should be purchased for those interested in determining the paragroup or haplogroup for their sample. See Stanaszek (link above) for the combined recommendations.
History of I type: This name was introduced by me in my Fall 2009 publication, Part II, page 178. I named it after my Polish Iwanowicz grandfather, who carried this type. Later, I was informed that Russian web sites had been calling this STR cluster “Northern Carpathian”.
The best ranked signature marker for I type is DYS578=9. The ancestral value is 8. DYS578 has the second slowest mutation rate of the 67 standard markers per the Chandler rates. This marker is in the 37 set, but not in the 25 set. So the 37 set is a reasonable predictor for I type, while the 25 set is not. The 9’s are colored orange in that analysis file IType.xls. Three other good signatures are in the standard 111 set, but not in the 67 set: DYS463=24; DYS532=12; DYS504=14. Another fairly good signature, available in the 25 set, is DYS458=14, again orange in the file. This is a rapid mutator, so there is more variance.
DYS511>10, available in the 67 set, highlighted in IType.xls, seems to be a marker for paragroup YP314+YP331-, while the ancestral value 10 dominates the rest of S18681. However, one STR marker does not provide very confident assignments.
Previous subdivision attempts of I type: At this web page, I have reported that I type has a particular structure: I type has always formed a well isolated mountain in STR haplospace, with relatively few samples near the cutoff. However, the mountain is broad, not tall, more like a range of foothills, in the graph of number of samples vs step (STR mutation count). I type has a few bimodal STR markers. All this implies subtypes. Over the years, I have tried various subtype definitions within I type, and I have seen attempts by others on the web. But the subtypes have never provided statistical confidence; they are not well isolated in STRs.
Now we know why. Those four paragroups and that one haplogroup all seem to be old branches. It seems the S18681 tree is more like a bush. Indeed, the confirmed samples do not come out together when grouped by STRs at 67 markers. (Not many samples have 111 markers yet.) In fact, most of the confirmed samples from small paragroups seem closer using 67 STRs to samples in other groups than to members of their own group. That’s because there are many more samples in the database from other groups, and some are bound to match more closely just due to the luck of random mutations.
In other words, I type is a broad mountain in haplospace because it has many relatively old branches.
What does this mean? It’s difficult to say with certainty, but here is what I consider the simplest speculative explanation: S18681 I type originally grew quickly and prospered while other related clades died out, leaving this clade well isolated in STR haplospace. Over the years, a number of I type clades survived, all with approximately the same age.
J. This type was suggested by Mayka. Only 6 members in the Polish Project, but this type is well isolated at SBP=13%.
Since 2007, I had been using the name “K type” for a large R1a Polish category. Over the years I had subdivided K into several smaller types and clusters, although I did not have high confidence that all of them in fact belonged to a single unique clade, as discussed at this web page over the years. As types were subdivided, the remaining samples that fit the general K definition did not form a type. My K group is now known to be a mix of independent haplogroups, so the Polish Project stopped using K as a category in Oct 2012, although quite a few small clusters with names such as Kx and Kz are still predicted, because the confidence in the clusters has always been higher than the confidence in K. The various K categories are now clusters, types, and confirmed haplogroups within the two major haplogroup branches R1a1a1b1a2 (Z280) and R1a1a1b2 (Z93).
Most of the samples originally classified as K are now in B type, D type, I type, and J type, all significant branches of Z280. If an SNP shows up that captures many of these branches, I’ll be inclined to re-use the short code name K to discuss that branch.
I have been using the subscripts “z”, “y”, “x”, etc backwards through the alphabet because I am running out of letters for new clusters and types. These small hypothetical clades seem to be subclades of K, although I do not have high confidence about the subclade status.
Kt, Ku, Ky. Cluster with STRs similar to K type. These came up Z92+, so their match to K type is a coincidence. Need documentation as a new topic at this web page.
That KyType.xls file demonstrates that the same 5 samples are extracted using any number of markers from 11 to 67, although at some of those definitions one or two other samples are also extracted. The full 67 markers work best, SBP=23%.
Ky was more isolated last year; a few samples showed up in the gap, reducing SBP.
I’m using a hand edited definition, Ky63, using 63 markers, for the following reasons:
Ky is unusual in that 4 of the 5 samples have an unusual value for at least one marker. I highlighted these values in red in that file. Notice also the high step values for those four, 8 through 11, using all 67 markers (column BX), although SBP came out 23%, which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 5 samples seems to be a representative of a branch of this hypothetical clade, where each of the 5 branches has a node not much younger than the TMRCA.
Hand editing like this does introduce some selection bias, so the calculated SBP=13.6% for Ky63 is misleading. Countering the selection bias, some if not all of those 4 markers that I masked out might represent small tribal sized subclades, so future prediction of new Ky samples should work better using Ky63 with those 4 removed.
The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out. You can see that my selection is a bit arbitrary; I could have masked less than 4, or more than 4.
ASD age using all 67 markers comes out 917 years, cell N12. ASD age using the 63 markers not masked out comes out 878 years, cell N29, not much less. ASD age has a number of caveats, and 4 samples are not significant, so this age is highly uncertain. Ky seems young, as haplogroups go.
Ky does not have a prominent signature.
That KzType.xls file demonstrates that the same 6 samples are extracted using any number of markers from 2 to 67, so the definition is not critical for this well isolated type.
Kz is effectively more isolated than the SBP values (row 12 in that file) indicate, because the samples just beyond Kz are all confidently assigned to other clades and types. For this reason, those SBP values are moot.
I’m using a hand edited definition, Kz59, using 59 markers, for the following reasons:
Kz is unusual in that 5 of the 6 samples have an unusual value for at least 2 markers. I highlighted these values in red in that file. Notice also the high step values for those 6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%, which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 6 samples seems to be a representative of a branch of this hypothetical clade, where each of the 6 branches has a node not much younger than the TMRCA.
Hand editing like this does introduce some selection bias, so the calculated SBP=10.7% for Kz59 is misleading (but moot). Countering the selection bias, many if not most of those 8 markers that I masked out might represent small tribal sized subclades, so future prediction of new Kz samples should work better using Kz59 with those 8 removed. Again, this is moot, because any number of markers extract the same samples.
The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out. You can see that my selection is a bit arbitrary; I could have masked less than 8, or more than 8.
ASD age using all 67 markers comes out 724 years, cell N12. ASD age using the 59 markers not masked out comes out 704 years, cell N29, not much less. ASD age has a number of caveats, and 6 samples are not significant, so this age is highly uncertain. Kz is clearly young, as haplogroups go.
Additional information supplied to me by Mayka: Three of the Kz type samples are from non-Polish men who suspect they have Polish male line ancestry, so it is not certain Kz type is Polish. Kit number 152824 in Kz is from a man who purchased WTY and found the new SNP L399, but that SNP appears to be private, restricted to his family. Insofar as that man recruited 3 more Kz samples into the Polish Project, Kz seems proportionally twice as large. My next edit of the Results Table will reduce the percent size of Kz.
Kz has the prominent signature DYS459b=18. Mayka points out the additional signature DYS461=12, not one of the 67 marker set; most of the samples in Kz have been verified with this 12 value. Since the Polish Project neighbors (step at or beyond cutoff of Kz) are all assigned to other hypothetical clades, we do not know if the signature markers define a larger father clade.
L. This cluster is highly hypothetical. It is rare in Poland, but second in size to K in European R1a1. Larry Mayka suggested this cluster to me. It is a well known Scandinavian cluster. I checked it briefly, and it seems to be a “type” by my definition. However, no Polish Project sample matches at 80% probability yet, so I am not yet using it for classification here. More documentation about L will be available here when I find time to study it.
L1080. New SNP needs documentation here.
L342.2. New topic 30 Oct 2011. This SNP was recognized as a new haplogroup by ISOGG during the summer of 2011. This was an L342 haplogroup category at the Polish Project for a short time in the summer and fall of 2011, but it has been replaced by Z93, because it seems all the L342.2+ samples are also Z93+ in the Polish Project. Apparently there are very few men elsewhere in the world found to be Z93+ L342.2-.
Z93 is a more reliable SNP than L342.2, so it is recommended that men first test for Z93. L342.1 is the same mutation as L342.2, discovered earlier in the E haplogroup. L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests together are more reliable. These 4 mutations are in the same segment, which is apparently a segment that mutates relatively rapidly. Z93 is recommended as the better test for R1a samples that do not fit STR definitions of other R1a haplogroups; the Z93+ samples can do the L342.2 test. This information about L342.2 was supplied to me by Mayka.
The Z93 category has the samples that do not fit the two known subdivisions: A type and L342T cluster (next topic).
L342T. New topic 30 Oct 2011. Based on 26 Oct 2011 Polish Project data. Analysis file: L542TCluster.xls. I just noticed this cluster.
Seven samples at 67 markers fit my new 48 marker definition for L342T. There are 19 A type samples, which should all be in the same L342.2 (Z93) haplogroup, but those A samples do not fit L342T; the closest A’s are at step 8, where the cutoff is 6. There are 5 more L342.2 (Z93) samples at 67 markers, and those 5 also do not fit L342T, falling at steps 11 through 21. In other words, L342T is well isolated from the other L342.2 (Z93) samples, including the A type branch. The one background sample (STR values fit the L342T definition) and the four samples beyond the cutoff, are assigned to K type and to subtypes of K; Z280 has recently become available for K type; as those background samples get tested in the future for Z280, my L342T cluster will start looking better. Let me say that another way: a cluster should be analyzed with data from its own haplogroup, so L342T should be compared only to L342.2 (Z93) data. But there is very little L342.2 (Z93) data available, so I used the full R1a database in that xls file. That means L342T is likely more isolated than it seems right now, so it is more likely to correspond to a valid haplogroup.
Mayka pointed out to me that some of the L342T samples have Tatar ancestors. That’s why I used the “T” in the code name. Of course, Tatars may belong to only a branch of L342T; I have no idea what fraction of L342T in Poland are Tatar. And of course Tatars are expected to be a mix of multiple haplogroups.
Three of the L342T samples, with the name Muchla, are apparently a family set, so they count statistically as only one sample, reducing the current count from 6 to 4, so SBP as calculated in that xls file should be increased (not as good). This is evidence against L342T being valid.
M. Needs documentation. M type was brought to my attention by Larry Mayka, who informs me others have called this haplotype the Viking haplotype because of its concentration in northwest Europe.
N. N type. Comments 7 Nov 2013: N type is almost equivalent to M458+ L260-. The only exceptions in the Polish Project are a family set of a few samples (called Ry type). N type also seems equivalent to the new SNP CTS1192. See the M458 topic for discussion of L1029, which divides N type. See also L1029 and Np News.
N type is concentrated in Slavic countries. N type is discussed in my publication, page 179.
According to Ysearch and Yhrd N type seems to be spread all around the Slavic lands and central Europe, common from East Germany to Russia. Within Poland N type seems to be about the same size as P type, both about 9% of men. Worldwide, N is much larger than P. N type should be properly studied in a database that is not restricted to Poland. However, there seem to be subtypes of N that are concentrated in Poland. See the discussions on N subtypes below. I’ll continue to watch the Polish Project, because it will be interesting if more data provide more Polish subtypes within N.
During review of my publication in 2009, the SNP called M458 was published. I added notes about this to my publication on page 184. The corresponding haplogroup is now called R1a1a1g. This haplogroup seems to be equivalent to what I have been calling P type (M458+ L260+) plus N type (M458+ L260-). M458+ samples may turn up someday that do not fit either N type or P type, but I have not noticed any yet.
My current definition for N type, N46, is a modal haplotype using 46 of the 67 standard markers. The cutoff is 8, which means all samples less than step (genetic distance) 8 from N46 are predicted N type (predicted M458+ L260-). That definition is available in the NType.xls analysis file, in my Haplotypes.xls file, and at Ysearch as 3SEJK.
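For readers who want to try the step test outside Excel, here is a minimal sketch in Python. It assumes that step is the sum of absolute repeat-count differences over the markers in the definition; the spreadsheet may treat multi-step mutations or compound markers differently. The modal values and the sample are invented for illustration and are not the real N46 values.

```python
# Hypothetical modal haplotype using three markers (NOT the real N46 values).
MODAL = {"DYS19": 16, "DYS390": 25, "DYS391": 10}
CUTOFF = 8  # samples at a step below the cutoff are predicted to be in the type


def step(sample, modal):
    """Genetic distance: total repeat-count difference over the definition's markers."""
    return sum(abs(sample[m] - modal[m]) for m in modal)


def predicted_in_type(sample, modal=MODAL, cutoff=CUTOFF):
    """True when the sample falls at a step strictly less than the cutoff."""
    return step(sample, modal) < cutoff


# Invented sample; markers outside the definition (DYS385a here) are ignored.
sample = {"DYS19": 15, "DYS390": 25, "DYS391": 11, "DYS385a": 11}
print(step(sample, MODAL), predicted_in_type(sample))
```

Only the markers named in the definition contribute to the step, which is why a 46 marker definition can be applied to any sample tested at 67 markers.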
N type age (age means TMRCA) is about 2,000 years. That’s highly uncertain, but I’m 80% confident that age of 2,000 years is not off by more than a factor of 2 - age 1,000 to 4,000 years. The M458 mutation is likely much older than the age of N type.
I’m suspicious that N type includes many younger clades that just happen to have similar STR values, difficult to resolve into clusters or types. I offer some speculation along these lines in the hypothetical subclade topics below.
I highly recommend that someone from N type purchase WTY, a commercial product for discovering SNPs. No sample from N type has been submitted for WTY. That means there is a good chance that the first N man to submit his sample to WTY will discover one or more SNPs - perhaps an SNP that captures all of N type - or perhaps an SNP that captures about half of N type - or perhaps an SNP that captures a small subclade - or perhaps multiple such SNPs. My WTY was the first in a long time in my haplogroup, so I found 14 new SNPs.
It’s interesting to wonder why R1a1a1g seems to be composed of two types that differ substantially in STR values (N and P are separated in haplospace). I speculate about this in the P type topic. Much of my P type discussion is also related to N type, so I avoided repeating all the details here; please read my P type discussion if you are interested in more about N type.
N seems to be older than P. I wonder if there are subtypes of N about the same age as P. I avoid too much speculation in this web page - just enough to indicate my motivation. I’m wondering if there are clades in various haplogroups, mostly P and N, associated with the origin of the Polish nation - a few centuries more than a millennium ago.
I have only identified 4 small subclades of N so far: I am quite confident of Ng type, but less confident of N-Ashk type. The Nt and Ns clusters are hypothetical; I have about 70% confidence in them. These 4 are used for assignments at the Polish Project web page. I also identified a few clusters with roughly 50% confidence; these are too speculative for formal assignments. All are discussed below. I made speculative assignments based on all these types and clusters within N type, in column CD of that file NType.xls, Calculator sheet. My file NClusterAssignments.xls has lots of details. If you are N type, you can find your row with your kit number, and see your speculative assignment. For the “clusters”, I estimate a 50-50 chance an assignment will need to be changed in the next year or so, as more data becomes available.
In addition, N type has many bimodal markers, hints at yet more subclades not discussed here. This is evidence that N type experienced population expansion when it was young (not long after the TMRCA). More discussion below.
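As a rough illustration of what a bimodal marker looks like in the data, the sketch below counts the repeat values at one marker and reports the second most common value when enough samples share it. The threshold of 3 samples and the marker values are assumptions made for this example; this page does not state a formal criterion for calling a marker bimodal.

```python
from collections import Counter


def second_mode(values, min_count=3):
    """Return (value, count) for the second most common allele, or None if it is too rare."""
    ranked = Counter(values).most_common()
    if len(ranked) < 2:
        return None  # every sample has the same value
    value, count = ranked[1]
    return (value, count) if count >= min_count else None


# Invented repeat counts at one marker for 12 samples.
marker_values = [11, 11, 11, 12, 11, 12, 11, 11, 12, 11, 12, 12]
print(second_mode(marker_values))
```

A second mode found this way is only a hint. Whether it marks a subclade depends on whether the same samples also correlate at other markers.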
The paragraphs up to here are a brief summary. The rest of this topic is a detailed discussion about N type and hypothetical subclades:
This Sep 2011 analysis includes only data from the Polish Project. I’ll wait a few months before reviewing data outside the Polish Project. My last analysis including data from outside the Polish Project for P type, N type, L260, and M458 was Jan 2011. For those last results, see the following topics, which have not been updated for several months:
For the size of N type, please see the table at the top of this page, where N has only 4 more samples than P (87 vs 83 - 5 Aug 2011 data). In my 2009 publication N had one less than P (28 vs 29, Table 6 page 169). The 70% confidence interval for 87 samples is 77 to 98 (8.4% to 10.6%) so N and P are equal in the Polish Project (and by implication in Poland) within statistical sampling accuracy, at about 9%.
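The sampling interval for a count can be approximated as follows. This is a sketch under a simple assumption (the count behaves like a Poisson variable and a normal approximation is adequate), not necessarily the calculation behind the numbers quoted above.

```python
from math import sqrt
from statistics import NormalDist


def count_interval(count, confidence=0.70):
    """Approximate two-sided interval for a sampled count, treated as Poisson-like."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.04 for 70%
    half_width = z * sqrt(count)
    return count - half_width, count + half_width


low, high = count_interval(87)
print(round(low, 1), round(high, 1))
```

For 87 samples this approximation gives roughly 77 to 97, close to the 77 to 98 quoted above; a more exact, asymmetric interval would account for the small difference at the upper end. Either way, the count of 83 for P falls inside the interval.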
My 2009 published definition for N type, N45, still works very well. I did not change that definition at my Jan 2011 update and analysis here in this topic. This Sep 2011 N46 update is just a tweak, adding and subtracting a few markers to better fit the M458+ L260- SNP data that has accumulated over the past year. Both definitions are compared in that analysis file NType.xls, Calculator sheet, columns BZ to CC.
Tweaking the definition like this, to better fit SNP data, introduces some selection bias. I discuss this issue in the P type topic, where I did a similar tweak; please read that topic if you are interested in the statistical justification. The justification is not as good for N type, so I’ll return to this issue in the “old branches” paragraph below.
This new N46 definition fails to capture only one M458+ sample, which falls at the cutoff step 8. This new N46 definition captures only one foreigner, L540+, at step 7, the last step of the type. The other samples at step 8 have tested either M458- or L260+, except one that fits D type well, so they are all confirmed as not N type. Similarly, 7 of the 20 samples at step 9 have been SNP tested, 11 of the 20 are good fits to other types, with only 2 that are Borderline fits to other types. In other words, the N46 definition captures the M458+ L260- samples with apparent 98% accuracy. However, my confidence is about 80% for step 7, about 90% for step 6, and 95% or better for step <6. Again, please see the P type discussion about confidence for a general explanation. P and N are similar in this regard. I have related discussion about N type confidence in the “old branches” paragraph below.
Almost all the samples near the cutoff for the previous N45 definition have been SNP tested. This high testing rate is not a coincidence; Mayka and I have been encouraging men with marginal samples to do the M458 and L260 tests. (We paid if cost was a problem.)
The NType.xls analysis file has 10 columns (CF to CO in the Calculator sheet) using from 2 to 67 markers as tentative N type definitions, with automatic selection of the best markers. For each column, I colored the step count violet for samples captured by that definition. You can see at a glance that any definition using 2 to 67 markers captures more than 80% of the N type (M458+ L260-) samples, and not many foreigners, so just about any definition works surprisingly well. In other words, N type is very well isolated in haplospace.
For the two best automatic definitions, I used boldface to highlight the N type samples missed by that definition, and also boldface to highlight the foreign samples captured by that definition. I used boldface similarly for my prior N45 definition, using 3 columns (BZ to CB) to demonstrate the effect of 3 different cutoff choices.
You might try resorting the sheet by column (select everything from cell A14 to the end) to better compare the results.
The issue of SBP is moot for N type now that the SNPs M458 and L260 are available, but an analysis is instructive: That NType.xls file has automatic marker selection of N type, and automatic calculation of SBP, disregarding the SNP data. The best automatic definition, N61, has SBP=13.2%, vs N46 with SBP=14.1%. However, N46 is a better definition because N61 captures only 80 of the 87 N type plus that same one foreigner. But still, 8 misses out of 87 is not bad for N61, better than the 13.2% SBP (SBP is a high estimate for statistical confidence).
I considered calling N46 a definition for M458+ L260-, with a different definition for N type as a slightly smaller subtype, leaving out some samples that do not fit the N type definition with lowest SBP. I could not come up with a convincing definition for such a smaller subtype. So at least for now, I am considering N type as the same as M458+ L260-, with the understanding that may change in the future.
The summary conclusion for all those columns of trial definitions: My preferred N46 definition (column CC) does the best job of capturing N type (M458+ L260-). Most of the other columns are trying to define N type as slightly smaller, leaving out a few of the samples (not always the same samples). Most definitions for N type have many samples at or near the cutoff. My explanation is in the next paragraph:
Old branches: A “type” is a hypothetical unique clade. Of course, every clade is composed of subclades - branches in the Y-DNA tree. Here is a simple explanation for the previous few paragraphs of discussion: N type seems to have a few small old sub-clades, where the ancestors (MRCAs) of those small clades differed from the main N type MRCA at a few STR values from the standard 67 set. Those old branches have many younger branches (twigs) that differ at yet more STRs. In other words: the N tree might have a few small branches near the ground. Those small old clades provide samples in the database with large step, but each sample is from a different twig, so these do not correlate into obvious clusters. Any clade has statistical outliers with large step; a few small old branches would provide more outliers for N.
Those old branches may not be small world wide. One possibility - a large subclade of N concentrated outside Poland might have one small branch in Poland, corresponding to a man or tribe that moved to Poland long ago. I am watching for evidence along these lines, but so far this paragraph is speculative.
In addition, there might be additional large old subclades that seem young. I consider this possibility in the discussions below. The age of a clade can be much younger than the node. I discuss this in another topic, where I call such clades smooth branches. The N tree might have a number of small smooth trunks with nodes near the ground - that would not necessarily be evident as STR correlations. On the other hand, the N tree might have only one main trunk, almost smooth, with only a few small branches near the ground. The actual situation might be more complicated, with multiple trunks of various sizes, at various distances from the ground. I can’t tell yet from the STR data. Perhaps another year of additional STR data may help.
Why am I speculating about N type smooth branches? I see plenty of hints for more branches in the N type data, but little statistical confirmation. In the discussion below for subclades, I offer evidence (not definitive proof) for many more significant sub clades within N type.
This discussion is personal. It is my opinion, based on my statistical analysis. Someone might send me an email any day now pointing out a convincing cluster or type in N that I missed. Someone else might disagree with my analysis about particular hypothetical N subclades.
Reminder: This discussion is limited to Poland, as represented by the Polish Project. Outside Poland there is additional probability of M458 branches showing up someday that fit neither N type nor P type. Outside Poland I expect yet more N type branches.
Regarding concentration in Poland, I use percent of samples in Ysearch with “Origin” Poland as an objective measure. This is discussed in my publication, where Table 1 shows P12 (the P type modal haplotype using only the original standard 12 markers from the Polish Project) with 42%, while N12 has only 14%. Those numbers 42% vs 14% are not calibrated (because of the unknown concentration of men with Poland origin in Ysearch) but those numbers are a relative indication of concentrated in Poland vs not particularly concentrated in Poland. My file NYsearch.xls has an update with data from 5 Aug 2011, with N12 at 17%, a reasonable drift due to more data. That same file has the N46 definition at 24%. This is evidence that N type, defined using 46 of 67 markers, is only slightly more concentrated in Poland than the 12 marker equivalent. The simplest explanation: There are probably large M458 clades outside Poland that match N12 and also match N46 at less than the cutoff, but the Polish samples are only twigs on those branches, descended from one man or family or tribe that moved to Poland a millennium or so ago. It makes sense that clades within M458 might be regionally concentrated. That 24% concentration for N46 is of course an average; there are subclades of N with higher and lower concentration. I found a few, discussed below; that file NYsearch.xls has a sheet for each subclade analysis.
Age: N type comes out 2,340 years old using all 67 markers. See cell N12 in the ASD sheet in NType.xls.
Because of recLOH issues, the compound markers 464, YCA, and CDY present difficulties estimating age in the N type data. Other compound markers are OK. The ASD sheet allows a mask, row 21, where I masked out the 8 markers for these recLOH difficulties. The result, using 59 markers, cell N29, is 2,010 years. That’s my best guess for the age.
On the far right of the ASD sheet I sorted the markers by apparent age. YCAb comes out 20,704 years, demonstrating the recLOH problem.
The second (apparently) oldest marker is DYS454, at 18,744 years. This old age is due to only 5 mutations in this slow mutator. DYS454 is clearly bimodal. In my notes, I use the Nj code for the 2nd mode with these 5 samples, defined by 454>11. This is evidence of a subclade, but the statistics are not convincing yet. Maybe with more data in the near future I might call some of these samples the Nj cluster. It’s not fair to exclude this “old” marker, DYS454, because there are 7 markers with zero age (no mutations in the 87 samples) and there are 7 more markers with less than 1,000 years apparent age. The reason for averaging markers is that apparently old markers should be averaged out with apparently young markers. Anyway, you can go ahead and mask out DYS454 by deleting the mask number at cell AE21, and the new age (58 markers) without 454 is 1,990 years, only a 20 year decrease. I offer this paragraph of discussion as one example of preliminary evidence of an N type subclade, based on 454>11.
The third oldest marker is DYS531, at 14,319 years; at this bimodal marker I use the code Np for the 2nd mode value. Again, I’m waiting for more statistical evidence for a subclade.
That far right side of the ASD sheet has more notes about markers with old apparent age.
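To make the effect of masking concrete, here is a minimal sketch of an ASD (average squared distance) age estimate with an optional marker mask. It assumes one common variant of the method: sum the per-marker average squared differences from the founder haplotype, divide by the sum of the mutation rates, and convert generations to years. The ASD sheet may differ in details. The marker names, mutation rates, samples, and the 25 years per generation are all invented for illustration.

```python
YEARS_PER_GENERATION = 25  # assumption; this page does not state the value used


def asd_age(samples, founder, rates, masked=()):
    """Average squared distance from the founder haplotype, converted to years.

    samples: list of dicts, marker -> repeat count
    founder: dict, marker -> assumed ancestral (modal) repeat count
    rates:   dict, marker -> mutation rate per generation
    masked:  markers to leave out (for example, those affected by recLOH)
    """
    markers = [m for m in founder if m not in masked]
    total_asd = 0.0
    for m in markers:
        total_asd += sum((s[m] - founder[m]) ** 2 for s in samples) / len(samples)
    generations = total_asd / sum(rates[m] for m in markers)
    return generations * YEARS_PER_GENERATION


# Invented data: three markers, four samples, made-up mutation rates.
founder = {"A": 10, "B": 12, "C": 20}
rates = {"A": 0.002, "B": 0.002, "C": 0.004}
samples = [
    {"A": 10, "B": 12, "C": 20},
    {"A": 11, "B": 12, "C": 20},
    {"A": 10, "B": 12, "C": 22},  # a two-step jump, as recLOH can produce
    {"A": 10, "B": 12, "C": 20},
]
print(asd_age(samples, founder, rates))
print(asd_age(samples, founder, rates, masked=("C",)))
```

Because differences are squared, a single multi-step jump at one marker dominates the total, which is why masking such a marker in this toy data set lowers the estimate so much.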
Age estimation from STR variance is highly uncertain. At another of my web pages, I use M458 as an example of age caveats. I have more discussion about age estimation methods in the P type topic; please read those two topics if you would like more discussion; N is similar to P in this regard.
I’m not too concerned about getting the age of N type correct in Polish data because I suspect in less than a year there will be enough evidence to subdivide N - new SNPs and / or more STR data for better statistical significance. I suspect there will be younger subclades. Furthermore, M458+ L260- is not really a tree; it seems to be a branch of the Y-DNA tree that is well isolated - a long smooth segment near the node; but I mentioned above my suspicion that the main branch might not be really smooth - there might be significant old branches concentrated outside Poland; if this is true I’ll need to soon redefine N type as younger, excluding any such significant branches. I’ll leave it for someone else to estimate the age of M458+ L260- from worldwide data; I’ll concentrate on N type, and hypothetical sub clades in Poland.
There are 12 samples from N type available with the new 111 STR marker set (18 Jul 2010). Only DYS532=12 is an obvious signature marker for N type from the 44 new markers; 10 of the 12 have this value. Modal for R1a is 532=11. P type also has the 532=12 value, also 10 of 12 samples, so this marker also provides a signature for M458 with good statistical significance. I type also has the 532=12 value; see the I type discussion below.
The following topics are my proposed subclades for N type in the Polish Project. Please consider reading the section P Type Bimodal Markers, if you would like more discussion of how I use bimodal markers as hints for subclades; that same discussion applies here for N type. If you are curious about my code names, like Na, Nb, etc, check out Haplotypes.xls. Near the bottom of the “Haplotypes” sheet is a list of 70 code names for signatures that I considered for N type subdivision. I discuss only a few of these here. I spent a lot of time studying tentative subclades of N because I’m anxious to find significant subtypes that are concentrated in Poland. I uploaded a total of 17 Excel analysis files associated with N and tentative subclades, all discussed above and below.
Ng. Rewrite finished 22 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NgType.xls. Ng is a small subtype of N type, but it has highest confidence.
This is a very small subtype, only 3 samples, but it is very well isolated. The definition uses 56 markers, cutoff 4, gap 9. There are no samples in the gap, from step 4 to 12. SBP = 15.8%.
These same 3 samples are present in Ysearch, where the gap with no samples is from 4 to 11. Two samples at step 12 are from Germany and Unknown. There are none at step 13 and 11 samples at step 14. It seems Ng is concentrated in Poland.
The signature is (537, 492) = (10, 14). These are the only 3 Polish Project samples in N type that have any mutation from the 12 value at 492, and they have a 2-step mutation. 492 is ranked 18th of 67 in the extended Chandler mutation rates. The 10 value at 537 is also rare - only these 3 plus 2 other samples have it in N type in the Polish Project. The same 3 Ng samples are extracted from N type using 1 to 67 markers. They are well isolated using as few as 7 markers because they have little variation from each other in the rapidly mutating markers, so those rank well for the Ng definition. ASD age comes out 619 years using all 67 markers but of course that is a very rough estimate.
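Checking a list of samples for the Ng signature is a simple filter. The sketch below uses the signature values given above; the kit names and their marker values are invented, and only the two signature markers are shown.

```python
NG_SIGNATURE = {"DYS537": 10, "DYS492": 14}  # signature values from the text above


def has_signature(sample, signature):
    """True when every signature marker is present and has the signature value."""
    return all(sample.get(marker) == value for marker, value in signature.items())


# Invented kits for illustration.
kits = {
    "kit1": {"DYS537": 10, "DYS492": 14},
    "kit2": {"DYS537": 10, "DYS492": 12},  # has the 537 value only
    "kit3": {"DYS537": 11, "DYS492": 12},
}
matches = [kit for kit, sample in kits.items() if has_signature(sample, NG_SIGNATURE)]
print(matches)
```

A sample that lacks one of the signature markers (for example, one tested at fewer markers) is treated here as not matching, which is a deliberately conservative choice.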
The simplest explanation is that the MRCA of Ng type lived in Poland less than a millennium ago and passed on those 2 unusual mutations.
The 3 Ng samples fall at steps 4, 5, 6 with the N45 definition of N type, a hint that the Ng node is near the center of the N type branch, not one of those old branches I speculated about, but this is just a preliminary hint.
I introduced Ng type in Oct 2010; there have been no new 67 marker data in the STR neighborhood of Ng type, so SBP has been 15.8% since, with the same definition.
Ng also has what I call the Na signature, discussed below.
The “g” is only my arbitrary code name that I have been using for the DYS492=14 signature.
N-Ashk. Rewrite finished 25 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NashkType.xls. N-Ashk is a small subtype of N type. Only 4 samples.
These seem to be Ashkenazi samples. Mayka pointed out to me that the names seem Ashkenazi, per his experience. The samples beyond the cutoff are apparently not Ashkenazi.
Signature (19,385a,594) = (15,12,11)
I introduced this type in Jan 2011, with SBP 23%, slightly more than my stated 20% limit for using the word type. Two reasons: First, the Ashkenazi names are independent evidence of a clade. Second, the N-Ashk modal haplotype differs from the N modal at 6 markers, which is evidence of a fairly old node in the N branch of the Y-DNA tree.
I introduced this type as Nca type, because of what I have been calling the Nc signature, DYS19=15. The “a” meant Ashkenazi, but that was confusing because the samples do not match what I have been calling the Na marker. Nc is large; I doubt N-Ashk is a twig in a large Nc branch; the Nc mutation more likely arose independently in the N-Ashk hypothetical clade.
This Sep 2011 reanalysis makes a cleaner cluster of data, although still small with only 4 samples. The 594=11 marker is very clean; these 4 samples are the only R1a samples in the Polish Project with this value. SBP increased to 47%, so it is a stretch to call this a type, but the Ashkenazi connection is improved now and the 594=11 marker seems to be strong evidence. Also, I avoid making changes in classification names without significantly more data, so I’ll continue to call this a “type” for now. There are no longer any N-Ashk Borderline samples at 67 markers; the Borderline category is used for apparent Ashkenazi samples that match well with only 37 markers.
The improved definition uses 58 markers, cutoff 3, no samples in the gap at steps 3 and 4. (The previous definition used 59, cutoff 5.) The improvement: I masked out CDY. The previous definition used CDYb, missing an Ashkenazi sample that fits the type well, but has recLOH, providing a misleading step of 5 at this one marker. With that new sample the ranking of markers came out slightly differently, so a few other markers were added or removed from the definition. The old and new definitions are available in NashkType.xls. The new definition is also available at Ysearch as 2TZKF, and in my Haplotypes.xls file.
The ASD age comes out only 668 years, cell N29 in the ASD sheet in NashkType.xls. Age calculated from only 4 samples is highly speculative, but N-Ashk seems young because of little variation in marker values. The ASD should use (4-1) in the denominator instead of the total 4 samples (although most genetic genealogists do not do this for small sample sizes); with that adjustment the age comes out 890 years, but that is still highly speculative. That cell N29 is using 61 markers; CDY and 464 are masked out. (The mask is row 21, which you can easily edit.) All 67 markers yield 1,024 years, cell N12, because of CDY. DYS464 has no mutations in the set of 4, so including those 4 markers reduces the age, but I left 464 out because most people routinely exclude the 464 set from ASD.
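The (4-1) adjustment can be checked with a short calculation. This sketch assumes only that the age is proportional to ASD, so that changing the denominator from n to (n-1) scales the age by n/(n-1); the variable names are illustrative and nothing here reproduces the spreadsheet itself.

```python
n_samples = 4
age_divide_by_n = 668.0  # years, ASD age quoted above with n in the denominator

# Dividing the summed squared distances by (n - 1) instead of n
# scales the result by n / (n - 1).
age_divide_by_n_minus_1 = age_divide_by_n * n_samples / (n_samples - 1)
print(int(age_divide_by_n_minus_1))  # 890
```

For large sample sizes the factor n/(n-1) approaches 1, which is why the correction only matters for very small clusters like this one.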
N-Ashk is quite young, but the node seems old because of the 6 marker distinction from N type. The simplest explanation: N-Ashk has a long smooth branch, having an old node with N, but no further branching near that main node. The samples in the Polish Project all seem to come from twigs with young nodes. I speculate that there may actually be some branches of N-Ashk outside Poland. Perhaps the Ashkenazi ancestor of N-Ashk moved to Poland somewhat less than a millennium ago. More data will eventually confirm or refute this speculation.
2TZKF is the modal haplotype at Ysearch, where only two of these samples are present, and where there are 2 additional samples in the gap, from Russia and Belarus; the simple explanation is that N-Ashk is concentrated in Poland, although there is too little data for confidence. See NYsearch.xls for my Ysearch analysis.
N-Ashk has what I call the Nb signature, discussed below.
Nt. Edited 25 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NtCluster.xls.
With 17 samples, Nt cluster is my largest speculative subclade of N type identified so far.
SBP = 27%; this cluster is close to the 20% maximum SBP for Polish Project assignments as a type. I am suspicious of this Nt cluster due to selection bias: I considered 70 signatures for N type during the summer of 2011, and carefully analyzed more than 30 of them. With that many attempts, a false positive is likely. One of the clusters I analyze will necessarily have the lowest SBP, but that might be just the luck of the data. No one knows how to calculate the statistical confidence in such a case. I discovered Nt at the end of this major effort. If SBP improves with more data for Nt I’ll upgrade it to a type, but if SBP gets worse (bigger) as data accumulates I’ll lose interest in Nt.
If Nt is valid, it is probably concentrated in Poland. See NYsearch.xls. See my Ysearch method discussed above. I consider this additional evidence that Nt corresponds to a clade, boosting my estimated confidence to about 70%. We don’t always use 70% confidence for assignments, but everyone is anxious for more subdivision of N type in the Polish Project, so we started using Nt in Sep 2011.
Nt is based on the signature DYS442<14. However, there are 29 samples with that signature, and 5 of the 17 Nt cluster samples have the N modal 14 value at this marker. My simple speculative explanation: the 442 mutation from 14 to 13 occurred independently in the Nt clade after the node with the main N type branch. Other speculative explanations are possible - those 14’s might be a back mutation within a much larger “father” clade that carries the Nt signature on most but not all samples.
One Nt cluster sample has the 12 value at 442, which could be another mutation or an independent double mutation.
If we subtract the 12 Nt signature samples with <14, that leaves 17 more samples (not included in my Nt cluster) with this second modal value at 442. There are only 3 samples at 15 in all of N, and we expect step up to be more common than step down for a slow mutator (see my publication for references), so that still leaves an excess of samples with <14, implying yet another hypothetical clade with an independent mutation, or a larger “father clade”, but this paragraph is getting highly speculative. I have more speculation like this about independent clades vs large clades in the Na, Nb, and Nc topics below; similar speculation applies to Nt.
Thirteen Nt samples match what I call the Na signature, discussed below, but two samples match the alternate mode Nb; the last two samples are one step away from Na. This is evidence of an even larger Na father clade, but as discussed below the Na vs Nb signatures may have arisen multiple times independently, so I’m not confident to speculate further along these lines.
See also NclusterAssignments.xls.
Ns. Edited 23 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NsCluster.xls. Ns cluster is a speculative subclade of Nt cluster.
With 6 samples and SBP = 27%, this cluster is close to the 20% maximum SBP for Polish Project assignments as a type. I am suspicious of this Ns cluster for the same reasons given above for Nt. On the other hand, Ns looks like a credible subclade of Nt, which adds credibility to both of them.
If Ns is valid, it is probably concentrated in Poland. See NYsearch.xls. The 67% concentration is the best I have seen so far, but this % is highly uncertain because it is based on only 2 Ns samples at Ysearch. Such as it is, I consider this additional evidence that Ns corresponds to a clade, same as my confidence for Nt.
The definition uses 47 markers, cutoff 2, no samples in the gap at steps 2 and 3. The definition is available at Ysearch as A5NSG.
Ns is based on two signatures. Ns is my code for DYS446=12, 9 samples, vs 446=13 modal for N type. Nt is my code for DYS442=13, 5 samples, vs 442=14 modal for N type. The 6 Ns samples are all at steps 0 and 1 with the 47 marker definition; the other 3 with that signature are at steps 9 and 10, so it is reasonable to suppose the Ns mutation happened twice independently in the N type clade. Five of the 6 Ns samples have the Nt signature, but that 6th one has the value 12, two steps from the N modal 14, so it should be considered Nt also.
See also NclusterAssignments.xls.
All 6 Ns have what I call the Na signature, discussed below.
Nd. Edited 24 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: Nd53Cluster.xls.
Based on the signature DYS389I = 14, vs N modal 389 = (13,29). Nine samples have the Nd signature. Only 3 of these 9 fit Nd53. My confidence is only about 50% that these 3 samples really belong to the same clade; I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic.
DYS389II has the value 30 for Nd but this is not a mutation at 389II. See compound markers for an explanation.
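A quick way to see why the 30 value is not a separate mutation: the reported DYS389II count includes the DYS389I repeats, so subtracting 389I isolates the remaining segment. The helper name below is hypothetical, and the sketch only illustrates this arithmetic for the two haplotypes quoted above.

```python
def dys389_segments(dys389i, dys389ii):
    """Split the reported 389 pair into 389I and the extra 389II-only segment."""
    return dys389i, dys389ii - dys389i

n_modal = dys389_segments(13, 29)  # N type modal, 389 = (13, 29)
nd = dys389_segments(14, 30)       # Nd signature, 389 = (14, 30)
print(n_modal, nd)  # (13, 16) (14, 16)
```

Both haplotypes have 16 in the second segment, so the only difference between Nd and the N modal is the single step at 389I.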
I call this Nd53 because the 53 marker definition is somewhat arbitrary - there is no very likely definition. It is likely I’ll need to change the definition soon, when more STR data becomes available. Also, “Nd53” makes it clear that this is not the same as the cluster formed using only the Nd signature.
Nd53 is not used for assignments in the Polish Project; see NclusterAssignments.xls for speculative assignments.
The 3 samples do not have Poland as origin, although I suppose those men suspect Polish ancestry, because that is usually the case for Polish Project samples. On the other hand, Nd53 might be representative of a clade that is concentrated outside Poland.
Ne. Edited 24 Sep 2011. New topic 23 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: Ne40Cluster.xls.
Based on the signature DYS390 = 24, vs N modal 25. Twelve samples have the Ne signature. Only 3 of these 12 fit the Ne40 cluster. My confidence is only 50% that these 3 samples really belong to the same clade; I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic. Nd and Ne have similar status.
I call this Ne40 because it is likely I’ll need to change the 40 marker definition soon, when more STR data becomes available.
Ne40 is not used for assignments in the Polish Project; see NclusterAssignments.xls.
Only one of the 3 samples has Poland as origin, although I suppose the other two Ne men suspect Polish ancestry, because that is usually the case for Polish Project samples. On the other hand, Ne40 might be representative of a clade that is concentrated outside Poland.
I introduced Na and Nb in my publication, page 179 and Table 3. I have been updating the discussion for Na and Nb here at this web page. I consistently emphasize that these are speculative subclades. In retrospect, I should have avoided the word “type” for these because more data over the years has convinced me that the explanation for what is going on is not two subtypes of N. It will take me a few paragraphs to explain the issue of Na and Nb:
One way to split the N type data, obvious at a glance, is by the number of markers for 464. Some samples have 4 values, some have 6, just a few have 5 or 7.
I understand that the 464 set is the most prone to genetic testing evaluation errors, so this or any categorization using 464 will have uncertainties. If 464 is taken in combination with other markers that means some statistical uncertainty due to possible evaluation errors at 464. Specifically, a sample in a database with 4 values at 464 might really have 5 or more values, and vice versa.
Follow my links if you wish to read more about compound markers and recLOH issues, which introduce confusion for the 464 marker set. Briefly, copy mutations can increase the number of 464 markers, but recLOH mutations might reduce the number. A single copy mutation can change more than one value in the set. Copy mutations and recLOH mutations are rare, about the same frequency as very slowly mutating STR markers. Net mutations in the 464 set are common, with frequency among the fastest in the standard 67 set. For the Chandler rates, each of the four markers 464a to 464d is assigned a rate 1/4th the net rate for single mutations for the set of 4.
I use Na as my code for the signature 464 = (12,12,15,15,15,16) - the most common value set for 464. 28 of the 87 samples. My Nb signature is the next most common, 464 = (12,15,15,16). 16 samples. I say 464 is multimodal because there are also two sets with 4 samples each; that’s why I’m using Na as a signature even though it is the modal value for N type as a whole. This is for the 87 N type samples in my 5 Aug 2011 download of the Polish Project; the proportions change every few months as data accumulates due to the statistics of small sample sizes.
Na and Nb differ by 2 steps following the Ysearch method, but that is misleading because Na can turn into Nb in a single recLOH mutation, which might have happened more than once in the past in this N type database. Nb can turn into Na with a single copy mutation. I may not be exactly correct in this paragraph if my assumption of the structure of 464 in N type is incorrect, but this paragraph is certainly a brief example of the kind of confusion that arises with 464.
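The relationship between the two value sets can be made concrete by comparing them as multisets. This is not the Ysearch step-counting algorithm, only an illustration that every Nb value is also present in Na and that Na carries two extra copies.

```python
from collections import Counter

NA = (12, 12, 15, 15, 15, 16)  # Na signature at 464
NB = (12, 15, 15, 16)          # Nb signature at 464

extra_in_na = Counter(NA) - Counter(NB)      # values Na has beyond Nb
missing_from_na = Counter(NB) - Counter(NA)  # values Nb has that Na lacks
print(sorted(extra_in_na.elements()), sorted(missing_from_na.elements()))  # [12, 15] []
```

Because nothing in Nb is absent from Na, the two sets differ only in copy number, which is consistent with a single duplication or loss event accounting for what is counted as 2 steps.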
It is easy to construct clusters using 464 in N type. Too easy. Too many choices for clusters, as I discuss in the following. I could not come up with clusters with good statistical confidence. My Excel analysis files allow setting maximum step, so I also tried using maximum 1 for the 464 set - 1 step for any variation of a sample from a trial definition; still I found no clusters with confidence.
My analysis files allow an alternate method, treating the 464 markers as individual markers. This is the method I used in my 2009 publication, still no clusters with confidence.
My default is to follow the Ysearch method for counting step at 464, although this method is obviously less than perfect.
My list of code names is available in Haplotypes.xls.
When trying individual markers, DYS464b is best. In my notes I use Na1 - 464b<14, and Nb1 - 464b>13; these two signatures neatly split all the N type data. Na1 captures all the Na samples plus mostly samples with more than 4 markers; Nb1 captures all the Nb plus mostly samples with 4 markers; there are exceptions. Using Na1 vs Nb1 I come to the same conclusions as using Na and Nb, discussed below.
DYS464e provides another way to split the data. In my notes I use Nx - any value for 464e, and Ny - no value for 464e. Nx captures all the samples with more than 4 markers including the Na samples; Ny captures all the samples with 4 markers including the Nb samples. Using Nx vs Ny I come to the same conclusions as using Na and Nb, discussed below.
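The two single-marker splits can be sketched as simple rules. This assumes the 464 values are listed in the reported order (464a, 464b, ...), so that the second value is 464b and a fifth value, when present, is 464e; the function names are hypothetical.

```python
def split_464b(values):
    """Na1 if 464b < 14, otherwise Nb1 (464b > 13)."""
    return "Na1" if values[1] < 14 else "Nb1"

def split_464e(values):
    """Nx if a fifth value (464e) is present, otherwise Ny."""
    return "Nx" if len(values) > 4 else "Ny"

NA = (12, 12, 15, 15, 15, 16)
NB = (12, 15, 15, 16)
print(split_464b(NA), split_464e(NA))  # Na1 Nx
print(split_464b(NB), split_464e(NB))  # Nb1 Ny
```

Each rule assigns every sample to exactly one side, which is why these signatures split all of the N type data with no leftovers.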
Consider my definitions Na45 and Nb32, with 45 and 32 markers. See those two Excel files for details. My choices for 45 and 32 are arbitrary. Those files show columns with trial definitions using a wide range of markers, automatically chosen by rank. A wide range of marker counts seems roughly equivalent. It is remarkable how many samples fit very well using up to 50 markers for trial definitions: Na has 16 samples at step zero using 11 markers, and 15 samples at step less than 2 using 45 markers; Nb has the same 14 samples at step zero using from 11 to 32 markers. When the 464 set is excluded from the definition, some Na samples fit the Nb definition, and some Nb samples fit the Na definition. One simple explanation: Na45 and Nb32 might correspond to two very young clades. However, there is an alternate explanation: Na45 might correspond to two or more young clades, and Nb32 might correspond to two or more young clades, and they may be a “bushy” set of branches where some Na45 clades are connected by nodes to some Nb32 clades. I see no way to be confident that most of the Na samples are in a branch distinct from a branch with the Nb samples. I suppose if your sample matches Na45 at step zero or one, there might be better than a 50-50 chance that you and others who match at <2 belong to a unique clade that may someday have an SNP definition, but such a clade will surely exclude some of the step <2 samples, and include some samples from steps 2 and 3, so Na45 does not provide a definition. The same can be said if you match Nb.
Some samples that fit the Na signature at 464 = (12,12,15,15,15,16) come out at high step using more markers. Similarly, some samples that fit the Nb modal at 464 = (12,15,15,16) come out at high Nb step using more markers. You can see this at a glance in those two files. Two opposite simple explanations come to mind: Na and Nb may have independently arisen more than once, followed by population expansion - multiple branches in the N tree. The opposite explanation: Na and Nb sets might be signatures for two old clades that each have a few old subclades - two main N branches that have a few old branches and where both Na and Nb have a bushy clump of branches at the ends. More complicated explanations also come to mind. That second explanation, two main branches, is attractive, but I see no proof that is true, or even highly likely.
In the file NclusterAssignments.xls, I make speculative assignments. Most of the Na45 and Nb32 samples fit other more believable types and clusters. I went ahead and assigned the few leftovers to Na and Nb, but these are just speculative assignments, meant to show you which of my clusters you best fit.
Summary: There is not enough evidence to consider Na and Nb to be two unique subclades of N. Maybe Na45 and Nb32 do correspond to the top of two main branches of the N tree, with most of the samples that fit Na45 or Nb32 belonging to the corresponding clades. Maybe not. I see no way of ruling out multiple independent clades (branches far apart in the tree) for both Na45 and Nb32, or for any other definitions based on the 464 set. Perhaps in a year or so more STR data will provide convincing subclades along these lines. Perhaps in a few years SNPs will be discovered to subdivide N type.
I have more discussion along these lines below, in the Nc topic.
At all 67 standard markers, the Na and Nb modal haplotypes are essentially the same for STR markers other than 464. I say “essentially” because the rapid mutators, particularly the CDY pair and DYS576, typically vary modally from month to month due to the statistics of small samples. At CDYb, Na type signatures with multiple markers are typically modal 40, while Nb are typically modal 39, but this marker always ranks poorly for definitions because of the wide range of values. In Nb less than 1/3 of the samples typically have the modal value at CDYb.
The Russian site independently came up with this same haplotype distinction. Two modal haplotypes are available on Ysearch, from the Russians. Each uses 78 markers and each matches my Na and Nb types at 67 markers, including that 39 value for CDYb in Nb. Central European-1 Modal GTAVR corresponds to my Nb, using only 4 values, 464a-d. Central European-2 Modal 495M5 corresponds to my Na, using 6 values, 464a-f.
Nc. New topic 25 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: Nc32Cluster.xls.
My Nc code is for the signature DYS19 = 15, compared to the modal value of 16. Similar to Na and Nb, my publication and previous versions of this web page proposed Nc as a tentative subdivision cluster of Nb. Last year the samples with the 15 value were mostly Nb, but this year that correlation is not significant.
My opinion of Nc is very similar to my opinion of Na vs Nb: No confident conclusion. Nc might correspond to a single large clade. Then again, Nc might correspond to independent unrelated clades where the Nc mutation arose independently.
My Nc analysis complements my Na and Nb analysis: If you look at Nc32Cluster.xls, you see at a glance that the best fit samples are a mix of Na and Nb. If you look at Na45Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16. If you look at Nb32Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16. If Nc32 vs modal 16 is a valid division of N type, then Na vs Nb cannot be valid. If Na vs Nb is valid, Nc vs modal 16 cannot be valid. All three files have, at the bottom, at large step, some Na, Nb, and Nc samples.
Next, let me consider the 4 combinations using DYS464 and DYS19:
Nbc42Cluster.xls is my analysis file using both the Nb and Nc signatures together.
Nac32Cluster.xls is my analysis file using both the Na and Nc signatures together. This is very different than Nc32; the latter has a mix of Na and Nb; the former is a new analysis using the additional restriction to Na match. They both have 32 markers by coincidence. As in Na45 and Nb32, the number of markers is my arbitrary choice; there is no obvious best choice; the number of markers will likely change as data accumulates for all these definitions where I specify the number of markers in the code name.
Nb5_37Cluster.xls is my analysis file using my Nb5 signature, which is the 4 Nb DYS 464 markers plus the modal value at DYS19.
Na7_26Cluster.xls is my analysis file using my Na7 signature, which is the 6 Na DYS 464 markers plus the modal value at DYS19.
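The 4 combinations amount to a two-by-two table: the 464 set (Na or Nb) crossed with DYS19 (15 for Nc, 16 for the modal). The sketch below checks only these signature markers; the actual cluster files use many more markers plus a cutoff, and the function name and the short labels without marker counts are illustrative.

```python
NA = (12, 12, 15, 15, 15, 16)
NB = (12, 15, 15, 16)

def combination(dys464, dys19):
    """Return the 464/19 combination label, or None if the sample fits neither pair."""
    if dys19 not in (15, 16):
        return None
    is_nc = dys19 == 15  # Nc signature; 16 is the N type modal
    if tuple(dys464) == NA:
        return "Nac" if is_nc else "Na7"
    if tuple(dys464) == NB:
        return "Nbc" if is_nc else "Nb5"
    return None

print(combination(NA, 15), combination(NA, 16))  # Nac Na7
print(combination(NB, 15), combination(NB, 16))  # Nbc Nb5
```

Samples with any other 464 set or DYS19 value fall outside all four combinations and return None.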
In the file NclusterAssignments.xls, I make speculative assignments to these 4 clusters, but samples that fit one of the more confident types (Ng and N-Ashk) and clusters (Ns and Nt) get that more confident assignment if they also fit these 4 combinations.
The 3 Ng samples are all Na, but they are a mix of values at DYS19. The neighborhood (just beyond the Ng cutoff) is all Na. This is a tantalizing hint of a “father” clade with the Na signature.
The 4 N-Ashk samples are all Nb, but in this case the neighborhood is a mix of Na and Nb. This is a hint of an independent mutation to Na somewhat older than N-Ashk. Three of the 4 N-Ashk are Nc, as are most of the neighborhood. The other has the modal DYS19=16 value. This is a hint of a father clade with the Nc signature, DYS19=15, plus recent back mutations to the modal value.
The 6 Ns samples are all Na, with a neighborhood mostly Na but some Nb. The Ng, N-Ashk, and Ns samples are all very far from each other. You can see this in the file NclusterAssignments.xls, where each type and cluster has a column, with the step value for each sample. I consider this strong evidence against a large Na clade; it seems more likely that the Na (464=12,12,15,15,15,16) set arose independently by copy mutation 3 times in these three hypothetical clades.
Nt, the purported father of Ns, has 17 samples; 13 Na signature, 2 Nb, 2 one step away from Na. It is reasonable to speculate that those 2 Nb are due to an independent recLOH in Nt, and that the father clade has the Na signature. Unfortunately, it is also reasonable to speculate that there were multiple mutations to the Na signature within Nt, making the 464 set irrelevant.
The 3 Nd samples match Nb but again the immediate neighborhood is a mix of Na and Nb, again evidence for independent mutations at 464.
Ne is another example of a mixed Na Nb neighborhood. In this example, 2 of the 3 match Na. That third one, 464=(12,13,14,14,15,16) is 3 steps away from Na but those two 14 values are a hint at another copy mutation.
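The 3-step count for that third sample can be reproduced with a position-by-position comparison. This assumes a simple stepwise count, which only makes sense when both haplotypes report the same number of 464 values; it is not meant as a general rule for the 464 set.

```python
NA = (12, 12, 15, 15, 15, 16)
third_ne_sample = (12, 13, 14, 14, 15, 16)

# Sum the absolute difference at each of the six positions.
steps = sum(abs(a - b) for a, b in zip(third_ne_sample, NA))
print(steps)  # 3
```

The three steps come from the 13 (vs 12) and the two 14 values (vs 15).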
NYsearch.xls has a sheet with Ysearch data analysis for each type or cluster. The Polish percent, in boldface, is my important result. Although this analysis is based on very little data for each of those 4 combination clusters here is the tentative finding: Nbc42 is not concentrated in Poland. The other 3 seem to be concentrated in Poland; that is evidence that each of those 3 clusters (Nac32, Nb5_37, and Na7_26) harbors one or more clades that are concentrated in Poland.
Ns seems related to Na7_26, because 4 of the 6 Ns samples match at step zero, but the other 2 are at steps 2 and 3, so this technique of 4-way combination is good for hints, but not conclusive.
Summary; Na, Nb, and Nc clusters: 25 Sep 2011. That was a lot of analysis to justify my opinion that Na, Nb, and Nc, although tantalizing, cannot be trusted without correlation to more markers. N type probably experienced population expansion not long after the TMRCA, whereby the main N branches come out today with similar STR distributions. DYS464 is multimodal; DYS19 is bimodal; the 4 main combination modes based on 464 and 19 provide evidence of twigs that are concentrated in Poland. I bet there are many more small Polish clades based on Na, Nb, and Nc waiting to be discovered in N type. I’ll continue to watch the STR data. New SNP markers within N type someday will be even better.
P. Update 26 Feb 2015. Lots of new SNPs have been discovered in the past few months. For the latest status check http://www.yfull.com/tree/R-L260/.
Complete rewrite 17 Jan 2015:
P type is the main topic in my publication, Part II. P type is significantly concentrated in Poland, and in the Czech Republic. It is found at lower frequency in other Eastern European countries, and in eastern Germany. About 9% of Polish males carry P type Y-DNA. L260 was discovered shortly after my publication, found to be equivalent to P type, confirming my prediction that P type corresponds to a haplogroup.
My current definition for P type, P50, is a modal haplotype using 50 of the 67 standard markers. The cutoff is 9, which means all samples less than step (genetic distance) 9 are predicted P type (predicted L260+). That definition is available in the PType.xls analysis file and at Ysearch as 8U92G. That file and definition are from an analysis done in Feb 2014 using Polish Project data downloaded 20 Jan 2014.
As data accumulates, my statistical definition often changes by a few markers. Lately, the P type definition changes have not been very significant with regard to capturing almost all L260+ testing results. That Feb 2014 PType.xls file shows 148 Polish Project samples predicted P type, none of which have tested L260-. At step 8, the last step of the type, 9 of the 13 samples have tested L260+; however, 2 of the 4 that have not been tested fit other types much better, so those 2 might be outliers from other clades. Two L260+ samples are outside P type, at steps 10 and 11. One other sample, at step 9, does not fit any other type so should be considered a P type Borderline. Summary: the cutoff 9 captures 2/148 = 1.4% probable false L260+, excludes 2/148 = 1.4% L260+, and in addition excludes 1/148 = 0.7% Borderline.
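The percentages in that summary follow directly from the counts. A minimal check, with illustrative variable names:

```python
predicted_p_type = 148  # Polish Project samples predicted P type

def percent(count, total=predicted_p_type):
    return round(100 * count / total, 1)

likely_outliers_inside = percent(2)  # untested step-8 samples that fit other types better
l260_plus_outside = percent(2)       # L260+ samples at steps 10 and 11
borderline = percent(1)              # the step-9 sample
print(likely_outliers_inside, l260_plus_outside, borderline)  # 1.4 1.4 0.7
```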
Also on line is my previous analysis PType2011.xls, with my previous 2011 definition P43, for comparison.
That previous analysis has a sheet “ASD” for calculating age (age means TMRCA) of STR clusters using Average Squared Distance. P type age is about 1,600 years. That’s highly uncertain, but I’m reasonably confident that the age of 1,600 years is not off by more than a factor of 1.5 - age 1,100 to 2,400 years. The L260 mutation may well be quite a bit older than the age of P type. In my opinion, age calculation from mutations is highly uncertain; I wrote a long discussion about this at another of my web pages, Age Caveats.
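The quoted range is just the central estimate divided and multiplied by the factor of 1.5, rounded to the nearest century:

```python
age_years = 1600
factor = 1.5

low, high = age_years / factor, age_years * factor
print(round(low, -2), round(high, -2))  # 1100.0 2400.0
```

A multiplicative factor gives an asymmetric range, with more room above the estimate than below it.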
It’s interesting to wonder if the age of P type is associated with the historical appearance of Poland somewhat more than 1,000 years ago. It’s also interesting to wonder why P type is so isolated in haplospace - why there are so few men alive today with STR values slightly different than P type. I added a bit of speculation along these lines to my publication, but frankly, no one knows the answers. I offer a little more speculation at the end of this topic.
Pawlowski originally noticed what I now call P type. My publication has more about this history. The isolation of P type in the Polish Project is now even more impressive than at the time of my publication due to the accumulation of more data.
Ludvik Urban pointed out to me that P type is common in the Czech Y-DNA Database. FTDNA also has a Czech Y-DNA Project. Karen Melis, administrator of the FTDNA Zamagurie Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on the border of Slovakia with Poland. I’m not sure of the concentration in Slovakia.
I added a “Ysearch” sheet to that PType.xls analysis file, with update analysis from Ysearch. The Western Slavic Modal haplotype, Ysearch 28WGP, matches P type perfectly at all 43 markers used in my new definition. That Western Slavic Modal uses 76 markers, but many of those are highly variable due to high mutation rate. That modal is one of the Russian site modals.
Let me finish this P type topic with brief speculation about the origin of P type:
What does P type isolation mean? One simple explanation: The M458 father haplogroup for P type and N type seems to have experienced a severe population bottleneck. The evidence: P type and N type are very easily separated by STR values. Both are isolated in haplospace. No overlap. They are so far apart that the nearest neighbors (just beyond the cutoff) for P type include outlier samples (from other R1a haplogroups) in addition to N type samples, and nearest neighbors for N include samples other than P. Apparently, the father haplogroup was quite old at the time of the bottleneck, with lots of variation in STR values. The bottleneck wiped out most of that population, so today men in that father haplogroup descend from just two ancestors, the MRCAs of P type and N type.
Why is P type so large and concentrated in Poland? One obvious explanation is a rapid population expansion not long after the TMRCA. Evidence: Subtypes cannot be defined with confidence based on STRs. Apparently, the major bimodal markers are due to mutations that happened early in the population expansion, so the branches of P type have similar statistical spread of STR values.
There are other explanations to these questions: P type may represent a huge migration of a single paternal tribe during the dark ages from far away to the region that is now Poland. Perhaps the related haplogroups in that far away place got wiped out by subsequent famines and wars. Or maybe they did not get wiped out. If people in that far away place did not tend to migrate to North America in the past, and today do not tend to get DNA tests, then perhaps there are isolated pockets of L260 clades there waiting to be discovered - some with STRs very similar to P type - some with STRs very different than either P or N. Maybe in the mountains of western Asia.
Also, the standard “null” explanation should be considered unless there is strong evidence otherwise. The null explanation is statistical: No significant bottleneck or expansion. Just the luck of random growth of clades in a small human population over the millennia. The MRCAs of P and N perhaps were far apart in STR values just by luck - both being outliers. No one knows how to calculate the probability that a large P and a larger N clade can be sole survivors of the statistics of clade growth in the Y-DNA tree. To me it seems highly unlikely. But I don’t know how to rule this null model out in a convincing way.
I can think of more complicated models as explanations. I’m sure you can, too.
Z93. New topic 31 Oct 2011. This new SNP was recognized earlier this month by ISOGG as R1a1a1h.
So far, all Z93 samples in the Polish Project are coming out L342.2+, and vice versa.
I just today added L342T as a new cluster, a hypothetical branch of Z93 (L342.2).
The Z93 category at the Polish Project web page has the samples that are Z93+ or L342.2+ and are not predicted A type or L342T cluster. Z93 also includes samples that have not been tested for Z93 but are close STR matches to a sample that tested Z93+.
I tried to come up with an STR definition for Z93 (L342.2). I could not. Z93 does not have good signature STR markers.
Z93 is a good example of why calculating age of haplogroups is highly uncertain. A type seems to be very young. A type dominates Z93 in the Polish Project. Maybe A type had a particularly vigorous population expansion; or maybe A type luckily avoided a severe population bottleneck; or maybe the A type ancestors moved to Central Europe from distant lands; whatever. Age is calculated from STR variance, so the age of Z93 is dominated by the age of A, which is misleading and too young. If A type samples are excluded, the age of Z93 still would come out too young, because the A type samples have a unique STR signature, which means significant STR mutations, which means the A type MRCA lived at a time when Z93 was already quite old, so the A data needs to be considered when estimating the age of Z93. I’ll try to come up with an age estimate, for next time I update this topic.
End of R1a Categories.
Instructions for Ysearch comparison are below.
On 20 July 2010 I added the following three R1b Types to this web document (next three subtopics, L23EE, L47P, L47A).
I independently found these three by analyzing the Polish Project R1b data, but Mayka pointed out they were previously known as clusters. We judge that my analysis justifies adding them to our list of types. Since I’m using 639 samples with 67 marker data as representative of Poland, a small type clade at 1% of the Polish population would be expected to have roughly 6 samples in the database (70% confidence interval 4 to 10). These three small types are roughly 1% each.
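The sample-count expectation above can be checked with a short sketch. It assumes a plain binomial sampling model (each of the 639 samples independently belongs to the clade with probability 1%), which is an assumption made here for illustration only. The interval quoted above may have been derived in a different way, so the probability this prints need not equal 70%.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k clade members among n samples."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_count_between(lo: int, hi: int, n: int, p: float) -> float:
    """Probability that the member count falls in the inclusive range lo..hi."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))

n_samples = 639     # 67-marker samples, taken from the text
clade_share = 0.01  # a clade at 1% of the population, taken from the text

expected = n_samples * clade_share
coverage = prob_count_between(4, 10, n_samples, clade_share)
print(f"expected count: {expected:.1f}")
print(f"P(4 <= count <= 10): {coverage:.2f}")
```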
Instructions for Ysearch comparison are below. Here is the “UserIDs” bar for R1b comparison:
USEID, CX94E, MKM4R, 7HB9C
Change USEID to your User ID.
Reminder: These three types are calibrated to Polish data. The definition modal haplotypes may not be optimal for other regions. If you have Polish ancestors, and if you have all 67 markers, and if you match one of these within a step distance of 10, there is more than 80% probability that you belong to the corresponding clade. Up to step 15 there is lower probability that you belong. You should test the appropriate SNPs (explained below) for higher confidence. If your ancestors are not from Eastern Europe and you are a marginal match (step distance 5 to 15) for one of these, it is not very probable that you belong to the corresponding Polish clade, because each of these types has some overlap with other clades that are rare in Poland.
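As a rough illustration of the step-distance comparison described above, here is a minimal sketch. It assumes that step distance is the sum of absolute repeat-count differences over the markers being compared, which is an assumption of this sketch and may not match the Ysearch calculation in every detail (multi-copy markers, for example, are not treated specially). The marker values are hypothetical, not a real modal haplotype, and the thresholds of 10 and 15 from the text are only meaningful at the full 67 markers.

```python
def step_distance(sample: dict, modal: dict) -> int:
    """Sum of absolute repeat-count differences over markers present in both."""
    shared = sample.keys() & modal.keys()
    return sum(abs(sample[m] - modal[m]) for m in shared)

def classify(distance: int, likely_max: int = 10, marginal_max: int = 15) -> str:
    """Apply the step thresholds quoted in the text (valid for 67 markers)."""
    if distance <= likely_max:
        return "likely member"
    if distance <= marginal_max:
        return "marginal match"
    return "no match"

# Hypothetical values for illustration only
modal = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
sample = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10}

d = step_distance(sample, modal)
print(d, classify(d))
```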
L23EE. 20 Jul 2010 documentation: This type is positive for the L23 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a. This type is negative for L51, the only currently known branch - R1b1b2a1 - of L23.
There are only 6 samples in the Polish Project in this type (13 Jul 2010). SBP = 10.7% using all 67 markers, which is excellent for such a small type. The cutoff is 12, but if you match at step 10 through 12 I estimate your probability of belonging at slightly better than 80%, so you really should test for the L51 SNP - a negative result would boost the probability to about 95%. In the Polish Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of the samples from step 17 to 20 are L51+. So this type is very well isolated in haplospace in Poland.
On Ysearch (code CX94E) there are also 6 samples in this type (13 Jul 2010), but 3 are the same as in the Polish Project. There are 7 samples at step 12 (vs zero in the Polish Project) and only 2 of those 7 are East European - one each in Germany and Russia. That means this type is not well isolated world wide, meaning samples near the cutoff are highly uncertain. I interpret this as evidence that my definition of L23EE type is really a Polish subtype within a larger L23EE cluster.
This type has evidence of structure. A number of markers are bimodal with no obvious correlation. To me, that means there are probably at least 3 sub-clades that may become evident as data accumulates.
If you match this type closely at 37 markers I highly recommend getting the full 67, because the statistics for assignment are not convincing at 37 markers. Even at 67 markers, I recommend the L51 test; a negative result confirms membership in this hypothetical clade, and a positive result means you are not a member. We do not know the probability of outsiders matching L23EE in STR values, particularly outside Poland, so there is still a slim chance of a surprise - a close match to the definition but with L51+.
L47P. 20 Jul 2010 documentation: This type is positive for the L47 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1. This type is probably negative for L44, the only currently known branch - R1b1b2a1a1d1a - of L47, but that L44 negative indication is based on only one sample so far, so it is not certain.
Mayka announced the cluster corresponding to this type on the web in March 2009.
There are only 4 samples in the Polish Project in this type (13 Jul 2010). SBP = 9.3% using 64 markers, which is excellent for such a small type. The cutoff is 7 and the gap is 10. There are no samples from step 7 to 16. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace in Polish data.
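The cutoff and gap bookkeeping used in these topics can be sketched as follows. The convention assumed here is inferred from the numbers in the text: samples below the cutoff are members, and the gap is the count of empty steps starting at the cutoff. The list of step distances is hypothetical, chosen so that it reproduces the cutoff 7 and gap 10 quoted above.

```python
def members_and_gap(distances, cutoff):
    """Return (member count, gap length) for a list of step distances.

    Members are samples below the cutoff. The gap is the number of
    consecutive empty steps starting at the cutoff, or None when no
    sample lies at or beyond the cutoff.
    """
    members = sum(1 for d in distances if d < cutoff)
    beyond = [d for d in distances if d >= cutoff]
    if not beyond:
        return members, None
    return members, min(beyond) - cutoff

# Hypothetical step distances: 4 members, nearest outsider at step 17
distances = [2, 3, 5, 6, 17, 18, 21, 25]
members, gap = members_and_gap(distances, cutoff=7)
print(members, gap)
```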
This type is very robust; the same 4 samples are selected using any number of markers from 10 to 67 with SBP <25%.
Actually, this type is even better than the SBP = 9.3% indicates, because some of the samples at step 17 and beyond have tested negative for the SNPs in the R1b trunk leading to L47 so they clearly do not belong to this L47P hypothetical clade.
Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but 3 of them are the same as the Polish Project. Ysearch has 8 samples at steps 8 to 12, so the type is not as well isolated worldwide.
The “P” in the code L47P represents my hypothesis that this type is Polish.
Members of this type should test for L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b, which is equivalent to P312, an “uncle” haplogroup that is L47 negative. That means there may be some interference in STR matching, probably less than 10% in Polish data, but I will not know the exact percent interference until more data accumulates.
See the last paragraph of L47A, next topic, for more comments.
L47A. 20 Jul 2010 documentation: This type is positive for the L47 SNP, hence this type is another hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1. I do not know yet if this type is negative for L44, a known branch of L47.
Mayka suggested the “A” code, since this type is obviously Ashkenazi, based on family names (see also Ysearch results, a few paragraphs down). I presume this one is known to the administrators of Jewish DNA projects, although I did not do the research to find a first web publication at 67 markers; I would appreciate an email of a reference to add here, even if it does not exactly match my definition. It’s OK if an international modal haplotype differs by a few markers from a haplotype determined in Poland, particularly if the difference is at markers that are bimodal, indicating subtype structure.
There are only 5 samples in the Polish Project in this type (13 Jul 2010). SBP = 7.6% using all 67 markers, which is excellent for such a small type. The cutoff is 10 and the gap is 9. There are no samples from steps 10 to 18. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace.
This type is very robust; the same 4 samples are selected using any number of markers from 30 to 67 with SBP <10%.
This type is better yet on Ysearch (code 7HB9C), with 18 samples (13 Jul 2010) for better statistics; SBP = 4.6%, which is remarkable. It might be even better with an optimized definition; I used the modal haplotype that I extracted from the 4 Polish Project samples.
This one does not seem as Polish as L47P, although those 18 Ysearch samples are concentrated in "Greater Poland" including Lithuania.
So far (see ISOGG), L47 and L148 are the only two known branch haplogroups of L48. In the Polish Project so far (20 July), no one has tested yet for L148, and all L48 so far at 67 markers are either L47P (previous topic) or L47A. SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that match neither L47P nor L47A and so belong to yet other clades. I also have not searched the web for the STR values expected for L148. (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.) All this will quickly become visible when FTDNA updates their haplotree. As of 20 Jul 2010, L48 is a terminal branch at FTDNA, so only administrators have visibility of SNP test results beyond L48, including L47 and L148. Mayka provided the SNP data that I have documented here.
Update 25 Mar 2012.
At the end of July 2010 I added two types from the I haplogroup to this web document. I independently found these two by analyzing the Polish Project I data. Mayka informed me that they were previously known as clusters, hypothetical clades, discussed some time previously by Nordtvedt. Mayka added these two to the Polish Project web page in July 2010, based on my recommendation, which was based on my SBP analysis. One is a branch of what has previously been called I2-CE, and seems to represent a Polish collection of M223 branches, so we named it M223CE type, discussed in the next topic. The other seems to be a Polish branch of I1-M253, so we named it M253P type, discussed in a topic below. I am now also using the short code names I-CE and I-P for these. I am now splitting I-CE into I-C, I-D, and I-E, topics below.
Instructions for Ysearch comparison are below. These types are calibrated to Polish Project data. The I-P definition WC8JD forms a type in the Ysearch database, so it seems to be reasonably valid world wide. The I-C definition SB6YK, and the I-E definition QUXE3, are probably not valid at Ysearch for a sample with origin remote from Historical Poland, because of interference by other clades with similar STR values, particularly from Russia.
All the I-CE samples in the Polish Project fall into one of the 3 branches discussed in the following topics.
The M223 clade is very well isolated in STR haplospace. FTDNA is able to predict I2b1(M223) with high confidence using only the first 12 standard markers, for more than 90% of the samples. Using 67 markers, I found that any reasonable definition does a good job of extracting M223 samples from Y-DNA STR data. A good definition is available on Ysearch, code 4H6C9, using 62 of the 67 standard markers plus 8 additional markers (Mar 2012).
STR isolation in the Polish Project is generally evidence of a single Polish clade. It is possible that two or more clades with distant nodes in the Y-DNA tree might have similar STR values by coincidence. In the case of Polish I-CE, since the larger I-CE world-wide clade is well isolated, my Polish I-CE type might well be a collection of multiple clades, perhaps including some clades that are not particularly concentrated in Poland. My original M223CE type used 4 of the 8 I-CE samples back in 2010. There are now 12 I-CE samples, and they form two types plus one cluster. It may seem silly to split these into 3 branches, but there are new SNPs, discussed below, that justify the split as valid haplogroups. These small types are interesting because they are preliminary evidence of small Polish clades.
CE stands for Continental Europe, including Britain.
The M223-Y-Clan project has lots of data; I used this project data for reference.
A good signature is (392, 437, 450) = (12, 14, 9), which distinguishes almost all M223 samples from others, allowing one mutation step. (594) = (11) is also an excellent signature for M223, with the value 10 dominant outside M223, but this one is strange in the Polish Project, where 4 of the 12 samples have value >11; this is evidence that I-C might comprise two clades.
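A signature check of this kind can be sketched in a few lines. The signature values come from the text. Reading "allowing one mutation step" as "total difference across the signature markers is at most one step" is an assumption of this sketch, and the two test samples are hypothetical.

```python
def matches_signature(sample: dict, signature: dict, max_steps: int = 1) -> bool:
    """True if the sample is within max_steps total of the signature values."""
    total = 0
    for marker, value in signature.items():
        if marker not in sample:
            return False  # cannot judge a sample that lacks a signature marker
        total += abs(sample[marker] - value)
    return total <= max_steps

# Signature from the text: (392, 437, 450) = (12, 14, 9)
M223_SIGNATURE = {"DYS392": 12, "DYS437": 14, "DYS450": 9}

# Hypothetical samples for illustration only
one_step_off = {"DYS392": 12, "DYS437": 15, "DYS450": 9}
two_steps_off = {"DYS392": 11, "DYS437": 15, "DYS450": 9}

print(matches_signature(one_step_off, M223_SIGNATURE))
print(matches_signature(two_steps_off, M223_SIGNATURE))
```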
At Ysearch, the percent Polish samples for I-M223 is low. The following 3 STR definitions, my proposed Polish branches, capture a small fraction of M223 at Ysearch.
My Excel file I-CE.xls has analysis of this type and also analysis of the following three branches. That file has ASD analysis, but ASD age is very misleading when calculated from samples that are a collection from multiple large old clades. The three branches have too few samples to attempt age estimates.
I-C type includes all 4 samples assigned to I-CE last year, plus one that was missed last year, plus 3 new ones, for 8 total at 67 markers in the Polish Project. SBP has improved from 19% to 2.6% over the past year, so this is a clade with high confidence due to the excellent isolation, although there is a chance it may be two or more independent clades as discussed above.
My Excel file I-CE.xls has analysis of this type in column CJ, SBP=2.6%. My definition uses 67 markers, cutoff 20, gap 14. There are no Polish Project samples in the gap from step 20 through 33, so this type is very well isolated. This definition also isolates I-E type, 4 samples, steps 34 to 42, but there is a better definition for I-E, see the next topic.
There are no Polish Project samples at step 43 or 44. There is only one I2b2 sample (not M223) as step 45. Then there are no further samples at steps 46 through 52. So this I-C definition also captures all of the broader I-CE (M223), although surely a better I-CE world wide definition could be constructed.
A good signature is (406, 487) = (10, 12), which itself distinguishes the 8 I-C samples in the Polish Project.
Two of the I-C samples are I-D samples, discussed below. Two other I-C samples have the same family name, very close in STR values. The remaining 4 samples in I-C are not particularly close to each other in STR values. The SNP data for each sample is included in column BX of the “Calculator sheet”; 4 of the samples tested negative for all 4 known haplogroup branches of I-M223. So I-C seems to capture M223* plus P95 (below) in the Polish Project.
My definition is also available at Ysearch, SB6YK. On Ysearch there are plenty of samples from step 20 through 33, so this definition does not work world-wide. The closest fits are not concentrated in Poland, so if I-C truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.
My Excel file I-CE.xls has analysis of I-E type in column CM, SBP=13%. My definition uses 67 markers, cutoff 19, gap 7. There are no Polish Project samples in the gap from step 19 through 25, so this type is very well isolated. All of the I-C samples, and only those, fall at steps 26 to 44, so this definition also nicely separates I-C from I-E in the Polish Project.
A good signature is (393, 459a, 446) = (15, 9, 10), allowing one mutation step, which distinguishes the four P78 samples in the Polish Project.
Both the P78+ Polish Project samples are in the M223-Y-Clan Project, and there are 13 others, but there are many more P78- in M223-Y-Clan, so this is not a particularly large subdivision of M223.
The other two I-E samples in the Polish Project have not been tested for SNPs, but both have P78+ close matches on Ysearch, and no close matches from the other 3 branches of M223, so those are likely also P78+.
There are two other known haplogroup branches of M223: M379 has no positives in M223-Y-Clan, and plenty of negatives, so it is very rare. M284 has plenty of positives in M223-Y-Clan; that branch is a large subdivision with a couple of known branches of its own, but no samples in the Polish Project.
My I-E definition is also available at Ysearch, QUXE3. The Ysearch closest matches are I2b1c, so my definition is good at extracting P78 samples, but I suppose a better definition could be constructed for the world-wide P78 data. On Ysearch there are plenty of samples from step 19 through 25, including some I2b1c beyond step 25, so this definition does not work world-wide. The closest fits are not concentrated in Poland, so if I-E truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.
There are only 3 samples P95+ in the M223-Y-Clan Project, and many P95-, so this is a small haplogroup. Those 3 include one but not both of the Polish Project I-D. Two of those 3 have Poland listed as origin, and the third has no origin listed, so this may be a Polish clade, but it is too soon to tell. It is possible that I-C has a larger subdivision Polish branch, of which this I-D may be a branch, but this is just speculation until we get more data.
(640) = (13) seems to be a signature for I-D, but one STR marker should not be very reliable for prediction.
I did not enter a definition into Ysearch. The two I-D samples are highlighted bold blue in column CI of I-CE.xls. Only one sample is P95+ in the Polish Project - the one that is also in the M223-Y-Clan Project, so I used that sample as the definition. There is a sample at step 10, and none others out to step 22, so I tentatively assigned that step 10 sample to I-D, forming a cluster of two samples, SBP=25%, well isolated from others but not a type.
On 26 July 2011, I added this Polish type for I1 haplogroup to this web page. This type has been known as a cluster for a few years. Mayka pointed out to me that Nordtvedt listed it on the web. Marek Skarbek Kozietulski has studied this cluster quite a bit, since he’s a member. I mentioned this type briefly in my publication, where I was previously calling it Y type, considering it not high confidence based on the data available then in 2009. I am now very confident that I-P type corresponds to a valid clade, concentrated in Poland, to be verified someday with a new SNP discovery.
I have also called this M253P type, because I-P samples test positive for I1 (M253) and negative for the branches of I1, although new SNPs for I1 are being rapidly discovered, and the newest have not yet been tested for I-P. So this is a type within the paragroup I1*, although a low fraction of samples from I1* are members of this M253P type. Marek has done the WTY (March 2011) without finding an SNP for I1.
SBP came out 6.4% for the 9 samples in M253P in July 2011 in the Polish Project at 67 markers. There are now (Mar 2012) 11 samples with SBP = 5.0%. Marek informs me that he had identified 4 men who matched at 12 markers and actively recruited them to obtain all 67 markers and to join the Polish Project. That means only 7 of these 11 samples should be used for statistical purposes. SBP calculated on the basis of 7 samples is 8.7%, which is excellent evidence of a clade that is isolated in haplospace.
I used all 11 samples in my analysis file in order to best estimate the definition, which is also available at Haplotypes.xls.
A good signature for M253P is (391, 392, 447) = (11, 12, 24), although this signature alone is not foolproof for distinguishing I-P from all other I haplogroup samples.
Nordtvedt's I1 Tree has this I-P type as I1*-P1, with related clusters I1*-P2 and AS4.
Here is some interesting speculation for which I do not have convincing statistical evidence: Marek points out that a sample at step 4 on Ysearch is Danish, which adds to his evidence that there might be a related clade in Denmark, perhaps with a node in the I1 tree slightly older than the node for the I-P Polish clade. I do not know where that Danish sample falls in Nordtvedt’s tree.
Ysearch provides evidence of concentration in Poland. My definition is WC8JD. 73% of the samples that come up in Ysearch (8 of 11) have Poland as origin. Although this is a small statistical sample, this is the most Polish concentrated type I have seen so far. SBP=22.1% on Ysearch, due to that single Danish sample at step 4, so although statistically less confident at Ysearch, my definition can suggest samples from Ysearch for the hypothetical I-P clade, albeit with lower confidence than samples with Polish origin.
The age comes out only 567 years using all 67 markers. See cell N12 of sheet ASD in my file. There are many caveats associated with age calculation based on ASD, and this is a small statistical sample. Insofar as Marek may have recruited with a bias toward close matches, the ASD age is biased low. That said, it is clear that I-P type represents a young clade.
Instructions for Ysearch comparison are below.
N-G. (N-L551). (N-G Type). Update 22 Mar 2012. Introduced on 17 Oct 2010 as “N1c1(M178)-G type”. The latest ISOGG code is N1c1d1a (L551).
Mayka suggested this one, based on a suggestion by Andrzej Bajor, from his Rurikid Dynasty Project. This type is concentrated in Lithuania, and Andrzej suggests that at least one member might be a male line descendant of Gediminas, the medieval Lithuanian Duke. Hence the “G” code.
This type has 9 samples at 67 markers very well isolated in the Polish Project with SBP = 8.9%. See N-GType.xls. The definition is also available at Haplotypes.xls and at Ysearch as RGE95, using 51 markers, cutoff 3 (samples < step 3). All but one of the N-G samples can be extracted from the Polish Project using only the signature (392, 607, 557) = (15, 14, 13).
This type should not be confused with another G type in the R1a haplogroup.
That new L551 SNP verifies our prior prediction that G type corresponds to a clade. All 9 of the predicted G type samples at 67 markers have tested L551+, and samples predicted just beyond G type are coming out L551-. Of course, there will probably be a few exceptions as more data accumulates, but so far N-G type (STR match) is equivalent to L551 in the Polish Project.
At Ysearch, N-G type is not as well isolated; the SBP is 22% with cutoff 4, due to interference by what might be a Russian clade. There are many Lithuanian samples matching my N-G definition (RGE95), including Lithuanian samples beyond the cutoff (step 3). 46% of the Ysearch samples below step 9 indicate Lithuanian origin. L551 is too new to be included in Ysearch, so this paragraph refers to N-G type as defined by STRs.
I do not know if the Polish Project N-G samples are an independent Polish sub-clade of a larger Lithuanian clade; or if the Polish Project samples are just a random sample of individuals from a larger clade(s). I have not taken the time to search other projects for STR matches to my N-G definition, or to search for more L551+ samples. Someone might inform me before I get a chance to search. Watch this topic for updates.
The age of N-G type seems to be less than 1,000 years, perhaps only 500 years. Check the “ASD” sheet in my analysis file. ASD age is highly uncertain, particularly for such a small sample, but G type has little STR variance, so surely G represents a clade younger than 2,000 years old. Isolation is evidence of an old node, with TMRCA much younger than the node. The age of the L551 mutation can be anywhere in the time span older than the TMRCA of G type and younger than the node. N-G type is well isolated in Lithuania and Poland, but N-G may have a relatively young node with those other clades world-wide with similar STR values. Those other clades can be used to better constrain the age of the L551 mutation.
N-M. (N-L591). (N-M Cluster). Update 22 Mar 2012. Mayka suggested this one also, introducing it at the Polish Project in Jan 2011, as “N1c1(M178)-M Cluster”. The latest ISOGG code is N1c1d1b (L591). Includes Mickevius (Mickewicz) descendants. Hence the “M” code. Also concentrated in Lithuania. These two, N-G and N-M, are a small fraction of the M178 clade.
I call this a cluster because it does not meet my criterion SBP<20% to be called a type. Actually, the original proposed cluster is equivalent to what I am now calling Ma cluster, discussed below. The recent new SNP named L591 is coming out with about twice as many samples, so we have adopted the “M” short code name for the STR data for L591; this larger N-M cluster is therefore considered equivalent to N-L591.
My analysis is available, N-MCluster.xls, 10 samples at 67 markers. My best automatic definition for N-M, column CL, SBP=25%, is 80% accurate, missing one sample that is obviously L591 and predicting one sample that came out L591-, out of 10 predicted. Actually, this result is a nice confirmation of my SBP method, because although the data has only 10% background (false positives captured by the definition), my SBP formula has an increase to account for statistical confidence; hence 25% is a better upper confidence estimate of the background for so little data. I bet as more data accumulates my best N-M definition will drift below SBP=20%, qualifying as a type. Anyway, this is moot, because L591 is a better criterion for the clade, and there is a logical distinction between the N-M cluster (samples with STR correlation) and the L591 haplogroup. My definition serves as a guide for priority for L591 testing. Testing should be concentrated near the cutoff.
Accordingly, I came up with an improved STR definition for L591, using a mask to manually adjust marker selection. I’ll still call it by the short code N-M. Column CC in that file. SBP=50%, but SBP does not matter here, because the purpose of the definition is not to discover a hypothetical clade, but to predict samples for a known clade. Most clades do not produce low SBP because most clades are not well isolated. Let me elaborate with discussion of the statistical issues for N-M:
Obvious issue: There are three N-M samples with a very rare 6 step mutation at DYS446, from 17 to 11. Without DYS446, two of these three marginally fit the N-M cluster (based on STRs). These three seem to represent a subclade of L591 with modal STRs slightly drifted since their node. I marked them as “Mb” in that Excel file. Only one of these has actually tested L591+. Another one of these is that “obviously L591” sample that I mention above, the “obviously” based on this 6 step mutation, which is almost as good a marker as an SNP. That “obviously” sample is an STR outlier at other markers, which need to be excluded from the L591 definition, assuming more samples like this will show up. This seems obvious, but it needs verification with more data over the near future.
Speculative issue: There are two other outliers, which I labeled Mc and Md. Both tested L591+. These may represent two clades with nodes only slightly younger than the TMRCA for L591, with independent modal drift. Highly uncertain. They might just be statistical outliers, due to the luck of random mutation. Again, more data will tell. For now, I adjusted the N-M definition to capture them, on the assumption that some future samples might come up with similar STR values.
Another issue: That one sample, mentioned above, fitting the M cluster very well but L591-, probably represents a clade with a node slightly older than L591, but similar STRs by coincidence; there may be other such clades. Again, this is speculative, but I adjusted my definition to exclude this one.
Statistical speculation summary: L591 does not seem very well isolated in haplospace, albeit more isolated than most young Y-DNA clades. It seems the L591 tree has nodes close to the SNP age, both younger and older.
This L591 clade seems to be concentrated in Lithuania. The evidence is Ysearch - Lithuanian concentration of the N-M cluster. L591 test data is not available yet at Ysearch. My Ysearch analysis (data in a sheet in that Excel file) is similar to the G type analysis: SBP not as good because of apparent interference from clades world-wide. Using the N-M definition at Ysearch, there is Lithuanian concentration at steps well beyond the cluster cutoff, so there seems to be a larger Lithuanian clade.
In the Polish Project, I spotted evidence of such a larger STR type, about double the size of N-M, including all the N-M samples as a sub-clade. I colored these samples green in column BX of N-MCluster.xls, using all 67 markers. I dubbed this one N-L type. That 67 marker evidence is not satisfactory because it captures a couple of N-G samples. In another file, not posted on-line, I came up with a satisfactory definition for N-L; I provide it in the “Haplotypes & Masks” sheet, row 21, of N-MCluster.xls. Mayka advises me that there are two new SNPs, L1025 and L1027, that are currently candidates for a haplogroup larger than L591. We are waiting to see how those come out before introducing N-L. That N-L definition cutoff provides a suggestion of where to prioritize SNP testing.
The age of N-M (L591) comes out similar to the age of N-G type, probably less than 1,000 years; see that short paragraph in the N-G topic above. My comments about isolation of N-G in the Polish Project do not apply to L591. For N-M, it is important to exclude DYS446, because that one marker triples the age as calculated using ASD (STR variance), due to that 6-step deletion mutation mentioned above. You can see this by editing cell BV21 in my mask in my “ASD” sheet in that file. Another way to see this is to edit the 446 value, to make the mutation count one or two, which is more representative of the age. This is a good example of one of the caveats associated with age calculation based on STR variance.
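The effect of a single multi-step mutation on a variance-based age can be illustrated with a minimal sketch. It assumes a generic ASD calculation (mean squared step difference from the modal value, averaged over markers, divided by a per-marker mutation rate), which is not necessarily the calculation in the “ASD” sheet. The mutation rate, the years per generation, and all sample values are hypothetical placeholders. With only three markers the inflation is far larger than it would be with 67.

```python
from statistics import mode

def asd_age_years(samples, markers, rate=0.002, years_per_gen=30):
    """Rough ASD age: mean squared step from the modal value / mutation rate.

    rate and years_per_gen are hypothetical placeholders, not calibrated values.
    """
    total = 0.0
    for m in markers:
        values = [s[m] for s in samples]
        modal = mode(values)  # modal value stands in for the founder value
        total += sum((v - modal) ** 2 for v in values) / len(values)
    asd = total / len(markers)
    return asd / rate * years_per_gen

# Hypothetical data: 3 of 10 samples carry a single 6-step change at DYS446
dys446 = [17] * 7 + [11] * 3
dys390 = [23] * 9 + [24]
dys391 = [11] * 9 + [10]
samples = [{"DYS446": a, "DYS390": b, "DYS391": c}
           for a, b, c in zip(dys446, dys390, dys391)]

age_all = asd_age_years(samples, ["DYS446", "DYS390", "DYS391"])
age_masked = asd_age_years(samples, ["DYS390", "DYS391"])  # DYS446 masked out
print(f"all markers: {age_all:.0f} years, DYS446 masked: {age_masked:.0f} years")
```

Because the squared difference counts one 6-step event as 36 single steps, masking the marker (or recoding the event as one or two steps) gives a figure that better reflects the number of mutation events.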
N-Ma. New topic 20 Mar 2012. This is the original “N1c1(M178)-M Cluster”, explained in the previous topic. Only 3 samples when introduced Jan 2011, SBP=36%. Now there are 5 Ma samples, SBP=30%. Although still not qualified as a type, there is better than a 30% chance this will improve over the next couple of years as data accumulates. Lithuanian concentration, same as N-G and N-M. Again, I do not expect validity world-wide for N-Ma because of interference from other clades world-wide, but this might grow into a nice small, young Lithuanian clade. Analysis is in N-MCluster.xls, where the 61 marker definition for Ma is in column CG.
New Topic 22 Feb 2014.
Please look at the top of this web page for a very brief definition of what I mean by “Poland Concentration Index”, and two tables of data that include this index, as examples.
Let me use “PCI” for an abbreviation here.
I introduced a preliminary version of this index in my 2009 Publication, page 161, with results in the far right column of Table 1, page 162. In 2009 I defined this preliminary index as the percent of samples, of a given haplotype, that have the word “Poland” included in the “Origin” field for male line ancestry, for Y-DNA data at Ysearch.
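The preliminary 2009 index described above amounts to a simple count, sketched here. The origin strings are hypothetical, and the case-insensitive match is a choice made for this sketch rather than a documented detail of the original calculation.

```python
def percent_poland(origins):
    """Percent of samples whose Origin field contains the word 'Poland'."""
    if not origins:
        return 0.0
    hits = sum(1 for origin in origins if "poland" in origin.lower())
    return 100.0 * hits / len(origins)

# Hypothetical Origin fields for one haplotype's matches
origins = ["Krakow, Poland", "Unknown", "Vilnius, Lithuania", "Poland"]
print(percent_poland(origins))
```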
In this 2014 version, I extend the index to types and clusters. In this topic, for brevity, let me use the general term “cluster” to mean a cluster, or a type, or a haplotype, or a haplogroup, or a paragroup, or a clade, or any other word for a category of Y-DNA STR data, although I make a technical distinction at the bottom of this topic.
In this 2014 version, the PCI is statistically adjusted for data sample size, as explained a few paragraphs below.
There are a number of reasons why % “Poland” from Ysearch is not an accurate measure of concentration in Poland. I mention some of these reasons in my 2009 Publication. I plan to add a longer discussion to this web page, with more detail about objections to statistics from Ysearch.
However, the objections are not a very serious problem if we are interested in relative concentration in Poland. Comparing clusters, we expect the clusters with higher % Poland at Ysearch to likely be more concentrated in Poland than those with lower % Poland. This is one reason I do not use a % sign for PCI. The other statistical reasons are explained below in this topic.
Caveat: You may use my PCI for another country, for example England, with a new “ECI” related to % “England” at Ysearch. It would not be valid to compare the PCI numbers to the ECI numbers, because we do not expect equal joining probabilities. Men with English male line ancestors do not necessarily join Ysearch in proportion to men with Polish male line ancestors. However, we do not expect significantly different joining probabilities for men with Polish male line ancestors in different clusters. There are exceptions, which I leave for future expansion of this discussion here (for example Polish Ashkenazi clusters).
The problem of false positives: We expect “false positives” - clusters with high % Poland just due to statistical probabilities (the luck of how many Polish and non-Polish men that would fit that particular cluster just happened to join Ysearch). The more clusters we study the more false positives we expect to find. The more clusters we study the more likely we will find a false positive that seems very highly concentrated in Poland.
For small samples of data, the statistical uncertainty is larger, so we expect more false positives. Suppose we check a large number of clusters for % Poland at Ysearch (or at any database), and suppose many of those clusters have fewer than 10 samples at Ysearch, and suppose some of those clusters have fewer than 5 “Poland” samples. We will surely find false positives. I discuss this sample size uncertainty in my 2009 Publication, where I used the lower bound of confidence range as a method to compensate for this statistical problem, particularly in small clusters.
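The false-positive problem can be made concrete with a small calculation. This sketch assumes every cluster has the same true % Poland and that clusters are statistically independent. Both are simplifying assumptions, and the 20% background rate is a hypothetical number chosen only for illustration.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float) -> float:
    """Chance that n samples include at least k_min 'Poland' samples."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def prob_any_false_positive(num_clusters: int, k_min: int, n: int, p: float) -> float:
    """Chance that at least one of num_clusters independent clusters
    reaches k_min 'Poland' samples purely by luck."""
    q = prob_at_least(k_min, n, p)
    return 1 - (1 - q) ** num_clusters

background = 0.20  # hypothetical true share of 'Poland' samples
single = prob_at_least(5, 10, background)
any_of_5 = prob_any_false_positive(5, 5, 10, background)
any_of_50 = prob_any_false_positive(50, 5, 10, background)
print(f"one cluster: {single:.3f}, any of 5: {any_of_5:.3f}, any of 50: {any_of_50:.3f}")
```

The chance of seeing at least one apparently concentrated small cluster grows quickly with the number of clusters examined, which is the motivation for adjusting the index by sample size.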
My PCI uses the lower bound of the 95% confidence range. For example, a PCI = 25 means 95% confidence that the “true population % Poland” is 25% or greater. By “true population % Poland” I mean the % value that would show up in a much larger database drawn from the same population in the same way (in the distant future at Ysearch, for example, if Ysearch is still popular in the distant future).
If you are knowledgeable about statistical methods, you may quickly understand the details of my method from my file Ysearch.xls; check the “Summary” sheet and the “Instructions” sheet. There is a technical statistical explanation of PCI near the center of the “Instructions” sheet. Check the other sheets for specific cluster results. My automatic procedures use macros; if you are concerned about macros your browser should allow you to open my file in “View Only” mode.
Even if you are not knowledgeable, you might try following the “Instructions” sheet to evaluate your own clusters.
The “Summary” sheet in the Ysearch.xls file has 5 example rows labeled “50% Tests”: one row shows that a cluster with 100 “Poland” samples out of 200 Total has 50% Poland, and has a lower 95% confidence limit of 44.8%, so the PCI is 45. However another row shows that a cluster with 5 “Poland” samples out of 10 Total also has 50% Poland, but has a lower 95% confidence limit of 25.3%, so the PCI is only 25. In other words, smaller clusters get more downgrading to compensate for small sample statistics, but small clusters are allowed.
This is the main reason I leave out the % - to avoid confusion - probability vs confidence - PCI is a mix of both. PCI allows small clusters to be compared to large clusters, where the smaller clusters get adjusted to a lower index number.
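The effect of sample size on a lower confidence bound can be sketched in a few lines. The exact formula lives in the “Instructions” sheet of Ysearch.xls and is not reproduced here; the sketch below substitutes a one-sided Wilson score bound, which is my assumption and not necessarily the spreadsheet's method. The function names are hypothetical. The results therefore do not match the 44.8 and 25.3 quoted above exactly, but they show the same behavior: equal % Poland, lower index for the smaller cluster.

```python
from math import sqrt
from statistics import NormalDist


def lower_bound_percent(poland: int, total: int, confidence: float = 0.95) -> float:
    """One-sided Wilson score lower bound for a proportion, in percent.

    Assumption: a stand-in for the spreadsheet calculation, not a copy of it.
    """
    if total <= 0 or not 0 <= poland <= total:
        raise ValueError("need total > 0 and 0 <= poland <= total")
    z = NormalDist().inv_cdf(confidence)  # about 1.645 for 95% one-sided
    p = poland / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return 100 * (centre - half_width)


def index_from_counts(poland: int, total: int) -> int:
    """Round the lower bound to a whole-number index (hypothetical helper)."""
    return round(lower_bound_percent(poland, total))


# Both clusters are 50% Poland, but the small one is downgraded much more.
large = lower_bound_percent(100, 200)
small = lower_bound_percent(5, 10)
print(f"100 of 200: lower bound {large:.1f}")
print(f"5 of 10:    lower bound {small:.1f}")
```

Any other lower-bound formula for a binomial proportion could be dropped into `lower_bound_percent` without changing the rest.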
There is another issue: selection bias: In defining clusters, we chose the number of markers, and the cutoff, to best define the cluster. The cutoff should be adjusted to best capture Poland samples, as demonstrated in the example sheets in Ysearch.xls. Selection bias effectively reduces confidence, because we may be selecting parameters based on statistical flukes.
Those “objections to Ysearch”, mentioned above but not listed, also reduce confidence. Many of the objections cause variation of results, with more variation expected for smaller clusters.
So my net confidence is not 95%, but less. This is my style here: calculate PCI using 95% sample size confidence because sample size confidence can be easily calculated in an Excel sheet. All those objections, including selection bias, reduce confidence, with larger effect expected for smaller clusters, but smaller clusters get more reduction in PCI. I suppose my net confidence is about 80%, although this estimate is based on subjective experience - difficult to document with words. I expect about 80% of my PCI predictions to slowly increase over the years as more data accumulates at Ysearch, with smaller clusters increasing faster in PCI. I expect roughly about 20% of my PCI predictions to drop over the years, meaning roughly 20% overestimation of Poland concentration confidence. On the other hand, this PCI method automatically ignores small clusters with lesser concentration in Poland, many of which will significantly increase in PCI as data accumulates.
My Excel sheet automatically removes “Modal” Ysearch data, and excess “family set” data, as explained in the “Instructions” sheet, where the user is invited to manually edit the results, because human judgment is an improvement over automatic editing.
The Polish Project is representative of the historical Commonwealth of Poland, a geographic area much larger than modern Poland. PCI is weighted toward men with ancestry self-described as from Poland. One of these days I might add a discussion of this complex subject here. Search the web if you want a discussion now.
My PCI index can be used for haplogroups, but technically I do not do this, because the haplogroup assignments are not up to date at Ysearch. I like Ysearch because of the huge amount of STR data at 67 markers. Many of my STR based types are “almost equivalent” to SNP based haplogroups. For example, P type is almost equivalent to L260. “Almost equivalent” means a few haplogroup samples are STR outliers and a few samples from other haplogroups marginally fit the type just below the cutoff. In my tables this technical distinction between P type vs L260, and other equivalents, may not be obvious to you.
Click on the Create A New User tab, where you can upload your Y-DNA STR data from a number of testing services. Or, you can type in your data. You end up with a “User ID”.
Ysearch has a Research Tools tab to click, where you can type in other User ID’s for comparison.
Cluster Genetic Distance Method; for Haplogroup R1a: P - Pc - Pg - N - K - A - I - B - D - E - Fa - Fb - H - M - G:
Click here: Research Tools
Copy the following line into the “UserIDs” bar at the Research Tools page:
USEID, 8U92G, RQK32, 92HEK, 3SEJK, MN8R3, FCUFG, EKVHX, RU8Z8, K49NZ, GNYBG, YQ6D2, EFQM7, 559EE, 24MB4, ZD29Z
Change USEID to your User ID.
You need to type the Captcha puzzle for access.
Click on “Show genetic distance report”. You get a table of results.
Results: If there is a small genetic distance result (3 or less) for one of these types, you have a high probability of belonging to that type. There are more detailed rules available, see the “Polish Project Rules” sheet in the R1a Assigner.xls file. For haplogroups I, N, R1a, and R1b, see also Haplotypes.xls.
Reminder: This web page concentrates on the region of Eastern Europe associated with Historical Poland. If your male line is not from this region, the results of this Ysearch comparison may be misleading if there are unrelated clades, rare in Historical Poland, with haplotype range that overlaps one of these. Search for my discussion, in this web page, for your best match type; in some cases I have evidence for interference world-wide (significant matches by unrelated clades). Many men of Polish male line ancestry do not match any of these types; this web page is a work in progress. For non-Polish there is a higher probability of not matching any of these types.
Cluster Genetic Distance Method; for Haplogroup I: I-P, I-C, I-E:
Follow the R1a instructions above, except copy the following line into the “UserIDs” bar at the Research Tools page:
USEID, WC8JD, SB6YK, QUXE3
Cluster Genetic Distance Method; for Haplogroup N: N-G, N-M:
Follow the R1a instructions above, except copy the following line into the “UserIDs” bar at the Research Tools page:
USEID, RGE95, S4RUG
Cluster Genetic Distance Method; for Haplogroup R1b: R1b-P, R1b-EE:
Follow the R1a instructions above, except copy the following line into the “UserIDs” bar at the Research Tools page:
USEID, MKM4R, CX94E
Comment 18 Jul 2012: See the L260 and M458 News topic for an update.
This topic needs a complete rewrite with new test results.
This topic was completely rewritten during Dec 2010 & Jan 2011; last update edit 15 Jan 2011.
Lawrence Mayka is the administrator of the Polish Project. SNP results are not posted on the web. Most of my SNP data comes from Mayka. Some of my data comes from Cyndi Rutledge, the administrator of the R1a Project. Many men join both projects, but of course many men purchase the L260 or M458 test and do not join either. If you are an administrator of an FTDNA project (or a project at another database) you may send me the L260 and M458 results for your project for merging into my analysis, if you wish. Karen Melis, the administrator of the Zamagurie Project, also sent me a few M458 results.
Data with the 67 standard markers is most common in the SNP results because Mayka and I selected these for the initial tests. In addition, men who have purchased fewer than the standard 67 markers are less likely to purchase SNP tests. This discussion is limited to the 67 marker data with only brief comments about those with <67.
Mayka and I purchased many L260 and M458 tests for Polish Project members, so test results available to me are biased toward Polish data. Also, I suppose men who notice my publication and web pages about Polish types are more likely to purchase the L260 and M458 tests, so even data not available to me might be biased toward Polish data. At first we were concentrating on samples that match P type and N type very well, so much of the data available to me are biased toward P type and N type, of course. Later we concentrated on borderline samples that just barely match P type and N type, in order to better define the borders in STR haplospace. If there are clades from outside R1a1a1g (M458) that just happen to have STR values that match P type or N type we will discover them quickly, but not if they are concentrated far from Poland, and particularly not if they are concentrated in any Eurasian lands where men do not tend to get DNA tests. If there are M458 clades with STR values very different than P type or N type it will take some time to discover them all, because those will require “deep clade” tests by men without an M458 prediction to do the M458 test anyway. I have many such “wildcat” results; so far I have no L260+ or M458+ with STR values very distant from P and N type. I have comments below in this topic about the few outlier results a few steps beyond P and N types.
The SNP results do not provide estimates of population frequency because we are selecting the most interesting samples for SNP tests. However, since the SNP tests verify my type classification, my STR types provide credible frequency estimates. My Results Table is still the best estimate of frequencies in Poland: P type for M458+ L260+; N type for M458+ L260-.
My types are defined by STR values following my mountain method. For samples with all 67 standard STR markers my P type definition uses 46 of those markers; N type uses 45. The cutoff for both P and N is step 7, which means samples less than 7 genetic distance (step mutations) from the definition are predicted as belonging to the corresponding type.
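The cutoff rule can be illustrated with a small sketch. It assumes that step is the sum of absolute differences in repeat counts over the markers in the definition; the real Ysearch calculation has special handling (multi-copy markers, for example) that is not modeled here. The marker names and values are hypothetical placeholders, not a real type definition.

```python
def step(sample: dict, modal: dict) -> int:
    """Genetic distance from a type definition.

    Assumption: a plain stepwise count, summing |sample - modal| over the
    definition markers only. The sample must contain every definition marker.
    """
    return sum(abs(sample[marker] - modal[marker]) for marker in modal)


def fits_type(sample: dict, modal: dict, cutoff: int = 7) -> bool:
    """A sample is predicted to belong to the type if its step is below the cutoff."""
    return step(sample, modal) < cutoff


# Hypothetical four-marker definition and two hypothetical samples.
modal = {"marker_a": 10, "marker_b": 23, "marker_c": 12, "marker_d": 15}
near = {"marker_a": 10, "marker_b": 24, "marker_c": 12, "marker_d": 14}
far = {"marker_a": 13, "marker_b": 26, "marker_c": 11, "marker_d": 15}

for name, sample in (("near", near), ("far", far)):
    print(name, step(sample, modal), fits_type(sample, modal))
```

Extra markers in a sample are ignored, which mirrors a definition that uses only 46 (or 45) of the 67 standard markers.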
To be fair, I should point out that I was a bit more conservative with my P and N predicted assignment rules 2 years ago, before the M458 and L260 SNPs were available, and when there were not as many samples with all 67 markers. Also, there were fewer known types 2 years ago. Half of today’s P and N outliers would be missed using my rules from 2 years ago and the others would be placed into “PK Borderline” and “NK Borderline” categories because 2 years ago I was more concerned about distinguishing P and N from K type, now known to be M458-. I no longer use K type, or those PK and NK categories. With recent data, my current STR based assignment rules are much more accurate for P and N outliers. I changed the P type definition October 2011.
I cannot define P type as exactly equal to L260, nor can I define N type as exactly M458 minus L260, because the types are defined by STR correlations. The outliers may be statistical, due to the luck of random mutations, particularly for P type with only 2 outliers so far (15 Jan 2011). I find that unlikely for N type, because the N branch STR distribution seems to have a non random tail extending to many outliers. It is possible that N branch outliers represent very small clades (perhaps only one clade) with old nodes in the Y-DNA tree. However, any particular outlier at or beyond the N cutoff cannot be assigned with confidence to a subclade of N. This is the reason I use the word “branch” instead of type for outliers, because I cannot be confident they all belong to the same young clade, as opposed to multiple young clades with old branches - with old nodes in the Y-DNA tree.
However, those N type outliers provide confident assignment rules. At the N cutoff step N=7 all 4 samples in the Polish Project have been tested M458+ confirming N branch. At the next step N=8, 3 of the 6 in the Polish Project fit well for prediction into one of the M458- types, and 1 of those has been tested M458-; the other 3 N=8 do not fit any of the other types and indeed have been tested M458+ confirming N branch. This analysis is continued below in the next topic; the result is that samples without SNP results that have STR values at the cutoff or 1-2 steps beyond P or N type can be predicted with 100% probability (not 100% statistical confidence) to belong to the corresponding branch, for those samples that do not fit another type. At 3 steps beyond the cutoff probability is still about 50% for belonging to the branch.
P type and N type are very well separated from each other. Within P type, there is only 1 sample with steps N=P+5; all others are N>P+5. N type is more diffuse in STR values than P type. For N<6 there are 3 with P=N+5. The most ambiguous N type sample has N=7 (cutoff) P=8, and that one has been evaluated M458+ L260- confirming that samples marginally N type are really N branch. There are 3 others with N=6 or 7 and N<P<N+4; 2 of them are confirmed M458+ L260- and the other is M458+ but not tested for L260 yet. The most distant sample has N=10 P=9 and it is confirmed in the N branch, M458+ L260-, again providing the insight that distant STR samples with P step about equal to N step tend to fall into the N branch. (Again, this is for Polish Project samples that do not fit another known type). Of course, we expect someday to see exceptions, just due to the luck of random mutations.
There is one sample with P=9 N=11, but that one has a recLOH mutation that scores 4 steps at the DYS464 set. This is really only one mutation, so I manually adjusted the step to P=6 on this one.
For P type, the closest M458- sample has P=7 (cutoff); it fits I type; this is the sample that originally sparked my interest in P type. A P=8 M458- sample is assigned to K Borderline. A P=9 N=9 sample is the closest M458- sample that does not fit any known type, so is assigned to the Remainder category.
Borderline comments: In the Polish Project we use borderline categories for samples that have 50% to 79% confidence of belonging to a haplogroup or type. For P and N type samples with 67 markers, borderline means the SNP test has not been performed. With SNP results, samples are placed in the corresponding P or N type, with the understanding that outliers may in fact belong to closely related clades, as explained above.
Remainder comments: I use remainder categories for samples that have less than 50% estimated probability of belonging to any known type. Until recently we distinguished between the Rx458 category for samples not tested for M458 (and not positive for L260) vs the R458- category for samples that have been tested negative for M458. Today, all samples distant from all known types have been coming out R458-, so the Rx458 data has been merged into the R458- category.
During 2010 I used a R458+ category for N branch outliers, to distinguish outliers, which might not be true N type members. However, the distribution of N STR values is continuous, with no objective cutoff for N type vs N branch, so the R458+ distinction was dropped for now.
This discussion concentrates on samples with 67 markers for clarity. There are 31 with only 37 markers and 2 with only 12 that have SNP results. I watch these for obvious anomalies; none yet. Analysis has lower confidence with fewer markers.
Summary of results: P type and N type are very well isolated in STR haplospace. They are well isolated from M458- samples and even more isolated from each other. Roughly 90% of the M458+ samples cluster into the two STR types within which I can make future SNP predictions based on new STR data with virtually 100% confidence. The roughly 10% remainder have STR values near the cutoffs for the types, mostly N type. Future STR predictions for these can be made with more than 50% statistical confidence (up to 100% probability based on the few data available so far) because most of these that do not fit one of the other known types do come out L260+ if closer to P type and M458+ otherwise. It is possible that some of these outliers belong to small clades (perhaps only two or three) that have older nodes in the Y-DNA tree.
Age (TMRCA) of haplogroups is uncertain due to a number of caveats. That said, N type seems to be about 2,000 years old and P type seems to be about 1,500 years old. Those estimates can be up to a factor of 2 incorrect, as discussed in my caveat topic. The ages of L260 and M458 are particularly uncertain because the calculated ages are dominated by P and N types, which are quite young. The SNPs may be much older, for all we know. The outliers in the P branch are too few to have significant effect on the calculated age of P type. It is possible that the N branch is really two (or more) types that are just as young as P; the calculated N age in such a situation would come out older. Ng type provides preliminary evidence of a hypothetical subtype of N, but Ng is too small and too close to N to affect the calculated age of N.
What does all this mean? There are a number of explanations. Here is the explanation that seems simplest to me: The R1a1a1g (M458) clade seems to be thousands of years old. It may have expanded into a large population long ago. The members of this clade diffused into a wide distribution of STR values over the millennia. Then there was a severe population bottleneck followed by a rapid population expansion, or multiple bottlenecks followed by multiple expansions. The living members of M458 descend from only a few men who each lived near the beginning of the most recent population expansion. Almost all living M458 men descend from just two of those men: the N type MRCA and the P type MRCA. A low percentage of living M458 men perhaps descend from other MRCAs who lived at roughly the same time as those two, as evidenced by the outliers in the N branch SNP data available to me today.
Comment 18 Jul 2012: See the L260 and M458 News topic for an update.
This topic needs a complete rewrite with new test details.
This topic was completely rewritten during Dec 2010 & Jan 2011; last update edit 15 Jan 2011.
Number of samples:
51 L260 Polish Project
8 L260 R1a Project, counting only those not in the Polish Project
59 L260 Total
154 M458 Polish Project
43 M458 R1a Project, counting only those not in the Polish Project
7 M458 Zamagurie Project, not in lists above
204 M458 Total (sum from above)
263 Total SNP test results (sum from above)
213 unique samples (210 different men)
180 have all 67 standard STR markers
31 have the 37 standard set
0 have the 25 standard set
2 have the 12 standard set
P type summary at 67 markers:
34 P<6 predicted P type all confirmed
7 P=6 predicted P Borderline all confirmed P type
1 P=9, but P=5 or 6 if corrected for recLOH, so predicted P type; counted as P<6; confirmed L260+
42 P type; so far, all samples below the cutoff 7 came out L260+, confirmed P type
1 P=7 (cutoff for the definition) P branch outlier confirmed L260+
1 P=8 P branch outlier confirmed L260+; this one from Czech Rep. is not in the Polish Project
2 P branch outliers; so far, all SNP data samples with P<9 are either P type or fit well to another type
so far, no L260+ with P>8
N type summary at 67 markers:
25 N<6 predicted N type all confirmed
6 N=6 predicted N Borderline all confirmed N type
31 so far fit N type, all confirmed
4 N=7 (cutoff) predicted N Borderline all confirmed M458+ N branch
There are no samples from other types at N<8
So far, all samples with N<8 came out M458+ L260-, confirmed N branch
4 N=8 N branch outliers; all confirmed M458+
So far, all SNP data samples with N<9 are either N type or fit well to another type
2 N=9 N branch outliers
1 N=10 N branch outlier
3 with 50% predictability but no confidence, explained below
P type Details; samples that have all 67 markers sorted by P step:
34 P<6 solid P type (P<6 means <6 mutation steps from P type definition)
3 N=9 all solid P type well isolated from N
16 of the 34 are M458+, L260+; confirming P type not N type
8 are L260+ not tested for M458, assumed to be positive, confirming P type
10 M458+ not tested for L260 - predicted positive
7 P=6; 1 step below cutoff; would be predicted P Borderline prior to SNP evaluation; all 7 are M458+
4 are L260+, confirming P type
3 not yet tested for L260; most of these will probably be positive; now predicted P type
These represent all the Polish Project samples at step 6, 1 step below the cutoff, because these were selected for M458 evaluation soon after M458 was discovered. So step 6 is not as common as it seems in this SNP analysis.
2 P=7; cutoff; first step just beyond P type predictions
1 P branch outlier predicted P Borderline; confirmed L260+. From Bohemia.
1 predicted I type, verified M458-, not in the P or N branches
P=7 is very rare. By the way, this sample is my maternal grandfather.
5 P=8; 1 step beyond cutoff; 2 steps beyond P type
1 M458+ L260+ P branch outlier; not Polish Project; R1a project from Hostacov CR
1 predicted K Borderline; result M458- confirms not P or N type; still predicted K Borderline
3 N<=P considered N branch; details below
9 P=9; 3 steps beyond P type
1 N=11; M458+ L260+ This one has recLOH at 464, contributing 4 steps, so I consider this equivalent to P=6, so I count it as predicted P type, not an outlier. This is marginal, since it could be argued that the recLOH mutation may have happened after a 1 step mutation at 464 for all we know, making 2 steps, placing this sample an outlier at the cutoff 7, so my decision to predict him P type is arguable.
1 N=12; K=1; predicted Fa type (subtype of K) confirmed M458-
7 N<(P+2) considered N branch; details below
123 P>9 none are L260+; 18 are L260-
180 total with 67 markers, sorted here by P step
N type Details; samples that have all 67 markers sorted by N step:
25 N<6 solid N type
all 25 are M458+
3 of the 34 are M458+, L260-; confirming N type not P type
3 P=9; N<5; match N type much better than P type
6 N=6; 1 step below cutoff; were predicted N Borderline a couple years ago
all 6 are M458+
2 are L260-; the other 4 are predicted L260-
1 of the 6 is P=8 just beyond cutoff, but P is a tighter cluster, so this would not be predicted P, and this one came out L260- as expected
1 of the 6 is P=9, L260-
so all 6 are well isolated from P type
4 N=7; cutoff. These represent all the Polish Project samples at step 7, because these were selected for M458 evaluation soon after M458 was discovered. So step 7 is not as common as it seems in this SNP analysis.
all 4 are M458+
2 are L260-; the other 2 are predicted L260-
These are predicted N Borderline prior to SNP evaluation
5 N=8; 1 step above cutoff; 2 steps beyond the original N type definition
4 M458+; predicted N Borderline, now classified N type
1 P=8,10,10,13; K>6; fit no other known type
1 is Austria, not Polish Project
1 M458-, P=11, K=3 predicted K type, M458- result confirms K not N
9 N=9; 2 steps beyond N cutoff
3 predicted N Borderline do not fit any known type
2 M458+ N Branch outlier; P=11,16
1 M458- Assigned to R Remainder category; P=9, K=7
even at N=9, 2/3=67% probability N branch for samples that do not fit other known types
3 predicted D & G types verified M458- not M458 branch
3 P<5 P type analyzed above all 3 are L260+
10 N=10; 3 steps beyond N cutoff
2 do not fit other types
1 M458+ L260- N branch outlier; P=9; DYS573 null - discussed in previous topic
1 M458- Assigned to R Remainder category; P=16, K=8
3 fit other types; D, DB, & K, predicted M458-; all confirmed M458-
5 P<5 P type analyzed above 4 are L260+ 1 not L260 tested yet
36 P type discussed above
3 <(N-2) other types discussed with P sort above
75 M458- predicted other types, neither P nor N
7 N>17 M458- do not fit any type; R Remainder category
180 total with 67 markers, sorted here by N step
For background explanation of L260 and M458, see the topic L260 and M458.
This topic was written 17 Jan 2011.
DYS385a. The single STR marker called 385a is by far the best signature for predicting P type vs N type. All 34 samples with L260+ result so far have the value 10. All 25 samples with L260- result so far have the value 11.
Mayka also independently noticed this and mentioned it to me.
Usually, a signature with more STR markers predicts better. In this case, discriminating P (L260+) from N (L260-), 385a=10 predicts best by itself. No signature with 2 or more markers discriminates better. In fact, just 385a=10 works as well as the 46 marker P type definition.
This seems amazing, but is not entirely unexpected. STR markers have lower mutation rates at lower values, and step down mutations are less frequent than step up. Since N type has mostly 385a=11, step down to 10 should be less often than step up to 12.
The mutation rate of 385a=10 in P type (L260+) seems very low. At another of my web pages I postulate a rare SNP in the middle of a long STR chain to explain a low mutation rate, but such a postulate does not seem necessary in this 385a case because of the short STR chain value. For the lower rate at lower STR values, I provide a reference to Whittaker (2003) in my publication.
We can predict that future M458+ samples will be L260+ if 385a=10 and L260- otherwise. The probability is 100%. Exceptions are zero out of 59 L260 results so far. I figure the confidence of this prediction at 94%: Poisson 94% confidence interval for zero is the interval zero to 3.5; (1-3.5/59) = 94%. In other words, I am 94% confident that 3 or fewer samples out of the next 59 L260 measurements in the Polish Project will be exceptions to this new rule - that 385a=10 means L260+. Exceptions will be found eventually, of course, due to rare independent mutations from 11 to 10.
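The 94% figure can be checked with a few lines of arithmetic. This is one reading of the calculation, assuming the “interval zero to 3.5” is the exact two-sided Poisson interval for zero observed events, whose upper end is -ln(alpha/2). The function name is hypothetical.

```python
from math import log


def poisson_upper_for_zero(confidence: float) -> float:
    """Upper end of the exact two-sided Poisson interval when zero events are observed.

    With zero counts, P(0 events | mean) = exp(-mean), so the upper limit
    solves exp(-mean) = alpha / 2.
    """
    alpha = 1 - confidence
    return -log(alpha / 2)


results_so_far = 59  # L260 results with zero exceptions
upper = poisson_upper_for_zero(0.94)
share_without_exception = 1 - upper / results_so_far

print(f"upper limit on expected exceptions: {upper:.2f}")
print(f"1 - upper/59 = {share_without_exception:.3f}")
```

The confidence level and the resulting share agree at about 94%, which is why that particular level appears in the text.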
In the Polish Project, all 96 samples assigned to P type and all 15 samples assigned to P Borderline have the value 10 for 385a. There are 89 samples assigned to N and N Borderline. Only 7 of these have the value 12 for 385a; the other 77 have the value 11. In this case, predicting P type based on 385a=10, zero exceptions out of 100 samples, provides 97.8% confidence.
I postulate that 385a has only a slightly higher mutation rate in the N branch, at value 11. I postulate that those 7 N branch samples with 385a=12 belong to 2 or 3 subtypes in the N branch, 2 or 3 independent instances of a mutation from 385a=11 to 12. Most of these belong to a hypothetical Ncm type. The data is not sufficient yet to provide statistical evidence along these lines.
385a does not work quite that well for discriminating P type from all of R1a. Among the 91 M458- samples not tested for L260 there are 2 with 385a=9 and 4 with 385a=10. None of those are expected to be L260+ because L260 is a subhaplogroup of M458. The 385a marker is still the best single marker for extracting P type from a full R1a database, including M458- samples from outside the M458 (P+N) haplogroup. However, in this case, using 2 or 3 markers works better, and of course the definitions (46 markers for P, 45 markers for N) work much better than any short signature.
A few samples with 385=(10,10) represent a hypothetical subtype within P. I call this Pk. I’ll discuss it more if and when there are enough samples for statistical significance.
Other signatures. Table 3 of my publication provides other signature markers. DYS572=12 continues to be 2nd best for P type. DYS537 continues to be best for N type.
My R1a page has a handy 3 marker signature table. I announced this more than a year ago, as a handy prediction signature for the dominant types in R1a, using only the first 25 markers most common on the internet. It still works well. That signature uses (385a, 439, 447). The values for P type (L260+) are (10,10,23). The values for N type (M458+ L260-) are (11,11,23). The values for K type (M458-) are (11,10,24).
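As a rough illustration, the three-marker signature can be written as a lookup table. The three value sets come from the paragraph above; the function name, the SNP-status labels, and the “no signature match” fallback are my own illustrative choices. An exact match on three markers is only a quick hint, and the full definitions remain the better test.

```python
# Signature values for (DYS385a, DYS439, DYS447), labeled by SNP status.
SIGNATURES = {
    (10, 10, 23): "M458+ L260+",
    (11, 11, 23): "M458+ L260-",
    (11, 10, 24): "M458-",
}


def predict_from_signature(dys385a: int, dys439: int, dys447: int) -> str:
    """Return the predicted SNP status for an exact signature match.

    Hypothetical helper: anything that is not an exact match is reported
    as unmatched rather than guessed.
    """
    return SIGNATURES.get((dys385a, dys439, dys447), "no signature match")


print(predict_from_signature(10, 10, 23))
print(predict_from_signature(11, 10, 24))
print(predict_from_signature(12, 11, 23))
```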
37 Marker Network
Lawrence Mayka (independently, March 2007) constructed a “median joining network” Network for the 37 marker samples of the Polish Project. This network supports the definitions of the P & N clusters, and of the A subcluster. The P cluster is the left side of Mayka’s network; N is the top branch, and A is a small branch on the lower right.
29 March 2010 correspondence: I mentioned Russian sites for R1a clusters in my publication. It’s not easy for me to figure out which of those clusters correspond to my types. Mayka worked out a correspondence on 29 March, warning me that the correspondence is not exact. Some of the Russian clusters are broader than my types; some are narrower. Here are Mayka’s findings:
My Type code vs Russian cluster name:
A Ashkenazi Jewish
B Western Eurasian
C Old European
D Baltic - Carpathian
E Northern Eurasian
F Central Eurasian
G Northern European
H Western Carpathian
I Northern Carpathian
N Central European
P Western Slavic
19 Sep 2010 update: A nice tree display of the Russian subdivision of R1a is at www.r1a.org. Robert Sliwinski brought this site to my attention.
My opinion: R1a cannot be highly subdivided with confidence based on STR data. This web site of mine is dedicated to estimating the confidence of each type that I study. I try to indicate which types are speculative. Even for the types with high confidence, the location of the nodes in the R1a tree will be uncertain until corresponding SNPs are discovered. These Russian clusters, apparently by Klyosov, have plus / minus values for accuracy of TMRCA ages that are far too small, because there are serious caveats associated with systematic statistical uncertainties.
Edited 7 Aug 2013. Rewrite 15 Sep 2011 - the bottom half of this topic.
Here is a summary of terms (in boldface) that I defined for my “Mountains in Haplospace” method. For more explanation, see the Fall 2009 issue of JoGG. By haplospace I mean multidimensional sets of STR values; each haplotype is a point in haplospace.
A cluster qualifies as a type if the graph of step frequency (number of samples at that step) vs step looks like an isolated mountain. The step is the genetic distance (mutation count) from the modal haplotype of the cluster. I use the method of Ysearch to calculate step. The cutoff is the next step just beyond the mountain. A good type has low step frequency in a “gap” of step values including the cutoff (only the cutoff for a gap of 1). In other words, the cluster forms a mountain at step values less than the cutoff, separated by a gap from the rest of the database (the parent haplogroup usually) at higher step numbers.
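The mountain picture can be sketched as a tally of step frequency. The step values below are hypothetical, chosen only to show a mountain, a gap, and the rest of the database; how low the gap must be to count as “low” is a judgment call, and the function name is also hypothetical.

```python
from collections import Counter


def mountain_summary(steps, cutoff: int, gap: int = 1):
    """Tally step frequency and count samples inside the mountain and in the gap.

    inside: samples with step below the cutoff (the mountain).
    in_gap: samples at the cutoff and the following gap - 1 steps.
    """
    frequency = Counter(steps)
    inside = sum(count for s, count in frequency.items() if s < cutoff)
    in_gap = sum(frequency.get(s, 0) for s in range(cutoff, cutoff + gap))
    return frequency, inside, in_gap


# Hypothetical steps from a modal haplotype for 16 samples.
steps = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 10, 10, 11, 12, 12]
frequency, inside, in_gap = mountain_summary(steps, cutoff=7)

# Crude text graph of step frequency vs step.
for s in range(min(steps), max(steps) + 1):
    print(f"step {s:2d} {'#' * frequency.get(s, 0)}")
print("inside mountain:", inside, "in gap:", in_gap)
```

A wider gap can be examined by passing a larger `gap` value, which counts the samples over several steps starting at the cutoff.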
The Statistical Background Percent (SBP) is an objective measure of the quality of the type. Low SBP is taken as evidence that a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP (yet to be discovered). Larger types with lower gaps have lower SBP. SBP is intended as an estimate of the background percent of samples in a type that really do not belong to the corresponding hypothetical clade. SBP is increased to account for the estimated probability of outliers from other clades. An outlier is a sample that has very unusual STR values due to the luck of mutations. SBP is also increased to account for the estimated probability of small foreign clades that just happen to have the same STR values but are not closely related to the type. The SBP is also increased to provide the rough equivalent of the maximum in a confidence interval. Small sample counts have wide confidence intervals. So larger types (more samples) automatically get lower SBP. For a valid clade, SBP should decrease with time as data accumulates in a database. A very well isolated clade will have a low SBP even with only a few samples. SBP < 5% is very rare - a very well isolated type, very likely to be a clade. SBP < 25% is good enough to be announced on the web. SBP > 25% is a cluster worth watching as data accumulates with time, although I avoid using the word type for SBP > 25%. SBP > 50% is not statistically meaningful although such clusters might improve as data accumulates. The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look like mountains. The number of markers in the definition should be chosen to provide as small an SBP as possible; my Excel tools provide automatic rank of markers as an aid; human judgment can be used to include or exclude markers with obvious problems.
A signature is a small set of markers that rank best, convenient for publication of a type, and for simple demonstration of the correlation of STR values.
I use the word “type” to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes that differ from the modal haplotype by step less than the cutoff. The definition of a type is the modal haplotype plus cutoff. The definition uses only those STR markers that provide the lowest SBP, but the definition uses as many STR markers as possible if there is a tie. The definition of a valid type may change slightly as data accumulates.
Here are some common terms (in boldface) for genetic genealogy. I did not define these, although I use them in a restricted sense: A marker (also “locus”, plural loci) is a DNA location for an SNP or STR or other kind of mutation. A haplotype is a set of gene values at any number of markers, here restricted to Y-DNA STR values. I use the word sample (plural samples or data or database) for the Y-DNA STR values from one man. A sample is also commonly called a haplotype, but I avoid calling a sample a haplotype to make it clear that a haplotype may or may not be present in a particular database of samples. A clade is a general term for common descent, so an SNP haplogroup is one kind of clade. I use the word clade in general, meaning a Y-DNA clade that may or may not be a defined official haplogroup. All types have associated hypothetical clades, but most clades cannot be isolated as types with low SBP. A cluster is a set of samples with similar STR values. All types have associated clusters but not all clusters are associated with types. The modal value for a marker is the most common value in the cluster. The modal haplotype is the set of most common values, usually the most common haplotype in a cluster. Many people use the adjective “modal” as a noun, meaning “modal haplotype”; so do I; I tried to avoid that in this web document.
The rest of this topic provides discussions and more definitions that are not part of my Mountain Method. These are discussions and terms that I use often, so I provide them here for easy link reference from my web pages. Some of these terms are not common in genetic genealogy. Some of these I do not recall seeing used in documents at all, so they might be my inventions, although I suppose other writers may have used these terms with similar meaning:
A bimodal marker has a second STR value with many samples - more than expected statistically - in addition to the most common modal value. A multimodal marker is possible if there are more than two common values for the marker and if those common values are not distributed more or less symmetrically on both sides of the most common value. (A Bessel distribution is statistically expected for a low fraction of random independent mutations at an STR marker. A Bessel distribution is close to a Gaussian distribution for a high fraction of independent mutations. A Bessel for a low fraction looks like a tent; a Gaussian looks like a bell.) Step up mutations are more common than step down for short STRs, so for example a modal 8 plus a few more 9 values than 7’s does not necessarily mean the 9’s are statistically significant; experience helps to judge. RecLOH and other issues at compound markers also cause confusion in this regard. A bimodal marker is a hint that there may be a clade associated with that 2nd value, so genetic genealogists study clusters defined by one or a few such bimodal 2nd values. The main modal value also sometimes makes a good signature at a bimodal marker. In other words, a set of values using one or more bimodal or multimodal markers makes a good signature for a hypothetical cluster.
In the past, I have sometimes called such clusters hypothetical types. I now prefer to reserve the word type for < 20% SBP, which Mayka and I take as evidence for 80% confidence that more than 80% of the samples belong to a clade that will someday be confirmed as a haplogroup by a newly discovered SNP. Sometimes we make exceptions above 20%, for example when a cluster is regionally concentrated, or associated with an ethnic group.
I had sometimes used “bimodal marker” for that second STR value, but I try to avoid that confusion. It’s the STR marker that is bimodal, with two common values.
There is no known way to calculate the % confidence that a cluster corresponds to a clade, but an experienced genetic genealogist can roughly estimate confidence based on experience. I developed SBP so that 100% minus SBP expresses my confidence, but only for clusters with less than 30% SBP; SBP breaks down around 50%. I avoid publishing clusters in which I estimate less than 50% confidence, although I may mention some as speculative.
Not all Y-DNA STR data separates into types because the distribution of STR values tends to be continuous. A type corresponds to a clade that experienced a population bottleneck - isolation or migration or very rapid population growth.
A main branch of the Y-DNA tree is old, with data on the web for thousands of samples that belong to it, and with many known further branching divisions. I like to use the word twig for a small young branch of the Y-DNA tree. A terminal branch is the smallest known division of the tree; a terminal branch might be a haplogroup, or a type, or a hypothetical cluster; a terminal branch at one web site might not exist at another web site; a terminal branch might be very small (one or only a few samples) or very large (many samples).
By the age of a clade (haplogroup or type or hypothetical cluster) I mean the TMRCA. By definition, a TMRCA corresponds to a node in the Y-DNA tree, where two clades branch. (Sometimes more than two clades meet at one node, but we expect future SNPs might resolve that node into multiple nodes with two clades each.) An SNP is probably older than the TMRCA of the haplogroup it defines, and the node for two SNPs is probably older than either SNP, because there are usually many generations between old nodes, due to the statistical pruning of the Y-DNA tree (Y-DNA clades tend to die out statistically). The probability is very low that an old SNP mutation happened in exactly the same generation as the TMRCA. (An exception would be a recent private SNP found in an extended male line family.) I call the segments between nodes smooth branches, where there are no known nodes in that segment of the Y-DNA tree. A long smooth branch in the Y-DNA tree is one way to visualize isolation in haplospace. Any type, because it is isolated, probably has a long smooth branch older than the type. A smooth branch is necessarily a statistical estimate, because it is not possible to be sure a branch is smooth; the evidence is multiple equivalent SNPs, or less than usual STR variation. In addition, there may be small branches with living men who have not registered Y-DNA data on the web. So a “smooth” segment really includes the possibility of very few small branches. The metaphor of a tree is appropriate, because a large branch with very few twigs looks smooth from a distance; a smooth branch in an old tree was not smooth many years ago, but the twigs in that segment have died and fallen off the tree over the years. A Y-DNA branch can be smooth in one database (like the Polish Project) and not smooth in a larger database (like Ysearch, if significant branches in that segment are rare or absent in Poland). All this paragraph applies to hypothetical clusters, but with lower confidence.
Edited 23 Jan 2015:
Sometimes one individual recruits male line relatives to submit data to a Y-DNA database, for example to the Polish Project. I call these family sets. I count these together as one sample when compiling statistics on frequency. By statistics on frequency I mean the number of samples per clade. By clade I mean a haplogroup or type or cluster. I do this adjustment for family sets because otherwise a small clade might get reported as too large.
My Results page is an example of statistics on frequency where I adjust for such recruitment bias.
I do not discourage such recruitment; it is a great research technique. I recruited my third cousin. I don’t mind the effort of adjusting for such recruitment bias.
My adjustment method: I sort databases by name, and automatically flag name repetitions. Then I examine the flagged data to see if the STR data is a very close match, which is a sign of recruitment. Often I make contact by email when it is not obvious if the samples have been recruited. Actually, even with email discussion, the actual correction may not be obvious; for example it may be difficult to say if a particular recruited distant relative may have later joined the project anyway independently, in which case he should be counted. So I may estimate 2 or more effective “independent” samples for some family sets.
I also sort data by close STR matches and look for evidence of recruitment. Recruitment can also be by close STR matches even with different family names. For example, I recruited a man with a last name different than mine, where his daughter noticed that his 12 marker data (at ancestry.com) matched my 12 marker data (at familytreedna.com). I paid for his 111 marker data which I submitted to multiple projects. We match STRs very closely at 111. I determined that his male line ancestors lived in Poland only 10 miles away from the village where my ancestors lived. We are obviously distant male line relatives. I don’t count him in frequency data because I recruited him.
I also sort by email address, again looking for samples with very close STR matches. Most samples that have the same email address are not in the same main haplogroup branch, even when the family name matches, which means most recruitment by family name brings in samples from different male lines. I do not adjust for these, because I assume the recruited samples fall randomly into haplogroups according to frequency in Poland, so such recruited data is OK. It is difficult to judge what to do when a pair of recruited samples are in different terminal branches that branch from a common larger branch.
Sequential kit numbers, or nearly sequential, are additional evidence of recruitment.
I’m not trying to make perfect adjustments. I’m mainly trying to catch all the large family sets. I don’t bother people with emails about sample pairs that may or may not be due to recruitment; I make my own judgment. If I miss a few pairs, or if I discount a few pairs that are really independent, that just adds a little noise to the frequency data.
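The first pass of this adjustment (sort by name, flag repetitions, check for very close STR matches) can be sketched as follows. This is a minimal sketch, not my actual spreadsheet procedure: the record fields, the surnames, the kit labels, and the max_step threshold are all hypothetical, and the flagged pairs still need human judgment.

```python
from collections import defaultdict
from itertools import combinations

def step(a, b):
    """Assumed distance: sum of absolute repeat differences, marker by marker."""
    return sum(abs(x - y) for x, y in zip(a, b))

def flag_family_pairs(records, max_step=1):
    """Flag pairs that share a surname and match very closely at STRs."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["surname"].lower()].append(rec)  # group name repetitions
    flagged = []
    for group in by_name.values():
        for a, b in combinations(group, 2):
            if step(a["strs"], b["strs"]) <= max_step:
                flagged.append((a["kit"], b["kit"]))
    return flagged

# Hypothetical records; kit labels, surnames, and STR values are made up.
records = [
    {"kit": "A1", "surname": "Kowalski", "strs": (13, 25, 16, 10, 11)},
    {"kit": "A2", "surname": "kowalski", "strs": (13, 25, 16, 10, 12)},  # close match
    {"kit": "A3", "surname": "Kowalski", "strs": (14, 23, 15, 11, 13)},  # different male line
    {"kit": "B1", "surname": "Nowak", "strs": (13, 25, 16, 10, 11)},     # different surname
]
flagged = flag_family_pairs(records)
print(flagged)
```

The same function could be run with the email address or a kit-number range as the grouping key instead of the surname; that variation is left out to keep the sketch short.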
Rewrite 21 Jan 2015:
See Polish Project Assignments for a brief overall explanation of how assignments are done.
The assignment guideline is minimum 80% probability and minimum 80% confidence for each individual sample.
Probability and confidence can sometimes be calculated.
Example: Consider a situation where 10 samples match a type with an STR prediction method. Suppose there is a definitive SNP test available, and 9 of those 10 samples test positive for the SNP, and 1 negative. That means 9 of the 10 really belong to the haplogroup and that 1 mismatch must come from a different haplogroup that matched the STRs by the luck of mutations. Next, consider a new sample that matches that same STR prediction method. What is the probability that the new sample will pass the SNP test for the haplogroup? The probability is 90% because we know that 9 out of 10 previous samples like this matched the SNP. However, 1 mismatch out of 10 is a very small sample. As explained in my publication, I use Poisson statistics for quick calculation of confidence interval. Poisson statistics is simple to calculate in Excel. My tool Type.xls has an “SBP” sheet with a set of cells for quick Poisson calculations.
The 80% confidence interval for a count of 1 is 0.11 to 3.89, which is 1.1% to 38.9% out of 10. Subtracting from 100%, the 80% confidence interval for a match, compared with the observed 9 out of 10, is 61.1% to 98.9%. The lower number, 61.1%, means the 80% confidence interval extends below 80%, so the minimum confidence is lower than 80%, although the probability is 90%.
In other words, statisticians are not confident of probability calculated from too few samples.
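For readers without Excel, the Poisson interval in the example above can be reproduced with the Python standard library. This sketch assumes the exact two-sided Poisson interval (often called the Garwood interval), solved by bisection; the cells in Type.xls may use a different formula, and the function names here are hypothetical.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def solve_decreasing(g, target, hi):
    """Bisection for g(mu) = target, where g decreases as mu grows on [0, hi]."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > target:
            lo = mid  # g still too large, so the root is at larger mu
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_interval(k, confidence=0.80):
    """Exact two-sided interval for the Poisson mean, given an observed count k."""
    alpha = 1 - confidence
    hi = 10.0 * (k + 10)  # generous search bound (an assumption)
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda mu: poisson_cdf(k - 1, mu), 1 - alpha / 2, hi)
    upper = solve_decreasing(lambda mu: poisson_cdf(k, mu), alpha / 2, hi)
    return lower, upper

low, high = poisson_interval(1)   # 1 mismatch observed
n = 10                            # out of 10 SNP-tested samples
worst_case_match = 1 - high / n   # lower end of the match interval
print(round(low, 2), round(high, 2), round(100 * worst_case_match, 1))
```

The lower end of the match interval comes out near 61%, below the 80% guideline, which is the point of the example.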
Confidence can also be lowered by biases; it is generally difficult to calculate corrections for biases. An example is selection bias: suppose we analyze 100 candidate clusters looking for STR correlations, and one of the 100 clusters has excellent correlation with a few selected STR markers; we might be tempted to have 99% confidence in the validity, since this one is the best one of 100. However, we will always find a best one when we study 100 candidate clusters. Even if we generate data for 100 STR clusters on a computer using a random generation method with no real correlation, some of those 100 will look like valid clusters, and one of those 100 will look best. In other words, if we do a lot of STR analysis, we are sure to turn up some false STR clusters.
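The selection-bias argument is easy to demonstrate with a small simulation. In this sketch, 100 clusters are generated purely at random, with no real correlation, and each gets a crude score (the average share of samples carrying the most common value at each marker). The score, the cluster size, and the allele values are assumptions chosen only for illustration; this is not my SBP calculation.

```python
import random
from collections import Counter

def modal_share(values):
    """Fraction of samples that carry the most common value."""
    return Counter(values).most_common(1)[0][1] / len(values)

def random_cluster_score(rng, n_samples=12, n_markers=3, alleles=(14, 15, 16)):
    """Average modal share over markers for a cluster of purely random STR values."""
    shares = []
    for _ in range(n_markers):
        values = [rng.choice(alleles) for _ in range(n_samples)]
        shares.append(modal_share(values))
    return sum(shares) / n_markers

rng = random.Random(0)  # fixed seed so the run is repeatable
scores = sorted(random_cluster_score(rng) for _ in range(100))
typical = scores[len(scores) // 2]  # a middle-of-the-pack random cluster
best = scores[-1]                   # the one we would be tempted to announce
print("typical score:", round(typical, 2), "best of 100:", round(best, 2))
```

The best of the 100 random clusters always scores at least as well as the typical one, usually noticeably better, even though none of them is real.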
My publication has more detailed discussion of my statistical method for types. Briefly, I use SBP as a measure of confidence for predicting types based on STRs, including an adjustment for estimated biases.
It gets more complicated: Let’s go back to that 9 out of 10 SNP matching example above. When matching samples to a type, each sample has a step number (number of STR mutations from the modal, also called genetic distance). Now consider that new sample that has not yet been SNP tested: If that new sample has a low step, lower than most of those 9 validated samples, the probability and confidence can be estimated as higher than 90%. Conversely, a new sample at relatively high step can be estimated to have probability and confidence much lower than 80%. To calculate probability and confidence vs step we need several samples in that type with the same step that have already been SNP tested; but such data is almost never available.
One more simple example: If a new sample without SNP data matches STRs very closely to a previous sample with SNP tests, there is high probability and confidence the new sample will test positive for the same SNPs. In fact, if the family name is the same the two samples may have come from relatives. This simple observation is difficult to quantify, but there is a simple qualitative test: With an STR database, it is easy to compare one new sample to the entire database, and sort for closest matches. If 3 or 4 of the closest matches test positive for a particular SNP, and none of the other closest matches test negative for that SNP, we conclude that new sample can be predicted into the haplogroup for that SNP with high confidence.
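The qualitative closest-match test can be sketched like this. The record layout, the number of closest matches examined, and the threshold of 3 positives are assumptions for illustration; the simple step function ignores compound marker issues.

```python
def step(a, b):
    """Assumed distance: sum of absolute repeat differences, marker by marker."""
    return sum(abs(x - y) for x, y in zip(a, b))

def predict_snp(new_strs, database, n_closest=4, min_positive=3):
    """Predict positive only if enough of the closest matches tested positive
    and none of them tested negative. Untested samples (snp=None) are ignored."""
    ranked = sorted(database, key=lambda rec: step(new_strs, rec["strs"]))
    results = [rec["snp"] for rec in ranked[:n_closest] if rec["snp"] is not None]
    return results.count(True) >= min_positive and False not in results

# Hypothetical database: STR values and SNP results are made up.
database = [
    {"strs": (13, 25, 16, 11), "snp": True},
    {"strs": (13, 25, 16, 10), "snp": True},
    {"strs": (13, 24, 16, 11), "snp": True},
    {"strs": (13, 25, 17, 11), "snp": None},   # not yet SNP tested
    {"strs": (12, 23, 14, 12), "snp": False},
    {"strs": (14, 22, 15, 13), "snp": False},
]
near_positive_group = predict_snp((13, 25, 16, 11), database)
near_negative_group = predict_snp((12, 23, 14, 12), database)
print(near_positive_group, near_negative_group)
```

A prediction of False here only means the test is not passed; it does not mean the sample is predicted negative for the SNP.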
The gist of all this discussion: We STR predictors occasionally calculate probabilities and confidence, but usually we estimate based on experience. The verification comes with time: as some of our predictions later get shown to be wrong we tend to get tighter with future estimates; if almost all our predictions get verified we may loosen up a bit.
A person who assigns samples to haplogroups based on STR values acts like a bookie who provides advance estimates for gambling odds, using a combination of calculations, experience, educated guesses, and intuition. A bookie’s estimates are usually tested by reality very quickly. Probabilities of an STR estimator used to require years for verification or refutation by a newly discovered SNP; the time now is months.
You need to be skeptical of STR based predictions. In the past, a number of STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you can read as much as you wish about our methods in the Polish Project, judging for yourself the reliability of our assignments.
Update 21 Jan 2015:
I introduced P, N, and K in the Fall of 2007, publishing this web page 6 Dec of that year. I did not predict that P and N were brother clades, in fact it looked to me like P was closer to K. I did not make predictions about the P, N, K structure because the statistics did not justify such predictions. I assigned samples to P and N with 80% probability, remarking that my overall confidence that P and N were valid (confidence at step zero) was 95% in 2008. I stated my overall confidence in the subtypes of K type as only 80%, and without high confidence that the various subtypes of K actually belonged to a single unique K clade.
K, which never qualified as a type, represents the R1a modal haplotype. Today it is clear that K is not a clade. Many of my predicted subtypes of K have been verified by SNPs, and so far none of them have been shown to be invalid.
In Fall 2007 I also introduced R (Remainder) as the 4th division of Polish R1a, for those samples that do not fit P, N, or K. R was never intended as a clade. The R category is no longer used because there are many branches of R1a known today, so that each sample with sufficient STRs can be confidently predicted into a branch.
This topic uses R1a as an example, but the same discussion applies to other haplogroup assignments.
My publications have several references of general interest and relevance to my web documents.
My Tools and data for STR analysis are Excel files. These are available at the JoGG publication site as Supplementary Data: www.jogg.info/52/files/cpcindex.htm.
Polish Clades Update. This folder is for update of Tools and for new data: www.gwozdz.org/PolishCladesUpdate
Pawlowski (2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish). This reference is listed in my publications. I specifically mention it here because this is where I originally found the common Polish haplotype that I now call P type. Link to English abstract: Pawlowski 2002.
Lawrence Mayka is the Administrator of the Polish Project. Larry helped me to get started when I was new to genetic genealogy, providing helpful criticism & suggestions. He reviewed & approved my 80% probability rule for assignments on the Polish Project web page. He also reviewed the original drafts of my publications. A number of my types were originally suggested to me as STR clusters by Larry. Larry continues to provide data for this web page. Many of my references to other websites in this document were suggested to me by Larry.
Lukasz Lapinski is an administrator of the R1a Project.
Anatole Klyosov published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of publications. Some of the STR types that I independently discovered I later found as 25 marker modal haplotypes in Klyosov’s web documents (before his publication in JoGG - some in Russian). It was encouraging to me seeing independent identification of clusters by different methods. He emailed to me an English version of one of his 2008 publications. His Fall JoGG articles have references to his other publications. Here is a web link: Klyosov Home.
Russian web sites: Semargl, http://www.r1a.org; http://www.rodstvo.ru; http://dnatree.ru/; http://molgen.org/. These have been active analyzing R1a, brought to my attention by others, particularly by Mayka, who worked out a correlation with my types. These sites clearly have proposed subdivisions of R1a based on STR data, but I cannot quickly understand these due to the language barrier. Klyosov seems to be active at these sites. The sites make use of the FTDNA projects and Ysearch.
Kenneth Nordtvedt published an article about calculating TMRCA in the Fall 2008 issue of JoGG. His Excel files of data and tools are available at his web site. Ken has been active in web discussions, suggesting many STR based clusters.
FTDNA link: www.familytreedna.com. This is a commercial DNA testing company. I make extensive use of the project databases maintained by FTDNA. These are my primary sources of data. Click on the “Projects” tab at the home page to look for projects. Also, the project name can be substituted for /polish/ in the Polish Project link, below. I do not work for FTDNA; many other companies offer DNA tests; I recommend FTDNA because I like the convenience of most DNA data being available at the projects, particularly the Polish Project.
WTY. “Walk Through the Y”. This is an obsolete commercial product by FTDNA, for reading more than 200,000 base pairs of your Y chromosome, in a search for new SNPs in your branch of the Y-DNA tree. You can read about my WTY at another of my web pages. WTY has been replaced by Big Y.
Big Y: https://www.familytreedna.com/learn/y-dna-testing/big-y/. A commercial product by FTDNA for reading about 12 million base pairs of the DNA of the Y chromosome, which has about 60 million base pairs total. New SNPs are being discovered in the Big Y data provided by customers. SNP ordering information.
Polish Project link: www.familytreedna.com/public/polish. One of many FTDNA projects. This is my primary source for Polish data. The Polish Project tracks both Y-DNA and mtDNA; click on “Y-DNA Results” on the left to see the data that I use.
R1a Project link: http://www.familytreedna.com/public/R1a. Newer R1a project, with multiple co-administrators, active in subdividing R1a data into hypothetical haplogroups. The project home page has a summary chart of R1a SNP subdivision, and other reference links.
R1aY-Haplogroup Project link: www.familytreedna.com/public/R1aY-Haplogroup. Original R1a project. Has not been updated for a few years.
Ysearch link: www.ysearch.org. Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services. I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your data with Ysearch. Or you can type your Y-STR data into Ysearch. I am not associated with the company FTDNA. I have Instructions for comparing your STR data to my types (modal haplotypes) that I have entered into Ysearch.
Yhrd link: www.yhrd.org. A forensic Y-DNA database. Data is separated by city, with many Polish cities. I relied on Yhrd to figure out the geography of the various haplotypes. I wrote Yhrd Reminders for myself so that I won’t forget how to navigate the Yhrd web site; click on that link if you need some hints.
Semargl. R1a site by Vladimir Tangankin. R1a tree in pie chart format using 111 marker data, Oct 2012: http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009tmb700.png
Sorenson link: http://www.smgf.org/. Another DNA testing company.
ISOGG link: http://isogg.org/tree/ Y-DNA tree with the most recent SNPs and corresponding alphanumeric codes.
FTDNA Draft Tree link: http://ytree.ftdna.com/index.php?name=Draft another Y-DNA tree with SNPs, but not updated in more than a year.
recLOH: A technical detail discussed in many publications, for example http://en.wikipedia.org/wiki/RecLOH. I discuss this and other compound marker issues, and how step is calculated, in the “Documentation” sheet for my Calculator.xls tool.
DYS389: Another technical detail, also discussed on the web and in my Calculator.xls. Briefly, 389II is the sum of 389I plus another STR, so 389II should be figured in terms of the delta value.
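The DYS389 point can be shown with a short worked example. This sketch assumes the step is counted on 389I and on the delta (389II minus 389I) separately; the function name and the values are hypothetical, and the exact handling in Calculator.xls may differ.

```python
def dys389_step(sample, modal):
    """Count 389I and the delta (389II minus 389I) as two separate markers."""
    step_389i = abs(sample["DYS389I"] - modal["DYS389I"])
    sample_delta = sample["DYS389II"] - sample["DYS389I"]
    modal_delta = modal["DYS389II"] - modal["DYS389I"]
    return step_389i + abs(sample_delta - modal_delta)

def naive_step(sample, modal):
    """Treat 389I and 389II as independent markers (double counts one mutation)."""
    return (abs(sample["DYS389I"] - modal["DYS389I"])
            + abs(sample["DYS389II"] - modal["DYS389II"]))

# Hypothetical values: one extra repeat in the 389I segment raises both numbers.
modal = {"DYS389I": 13, "DYS389II": 30}
sample = {"DYS389I": 14, "DYS389II": 31}
print(naive_step(sample, modal), dys389_step(sample, modal))
```

With these values the naive count is 2 steps, but the delta method counts the single mutation once, for a step of 1.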
I’m a very rare type in Poland - E-L540. My maternal 1st cousins are R1a1a. That means my late maternal grandfather was R1a1a. I became interested in Y-DNA in 2004. My maternal family name is Iwanowicz. I discovered a family with that name in my maternal grandfather’s home town in Poland. They are the only Iwanowicz family within 50 miles, so I suspected they might be my 3rd or 4th cousins. I brought a cheek swab kit when I visited them the second time in 2006. Sure enough, the son was a perfect 25 STR marker match to my 1st cousin. I didn’t get around to checking the web for a year. I was shocked to discover that these maternal cousins matched 80 people in the FTDNA database, for a perfect 12 out of 12 STR markers. That’s a hell of a lot of matches in the summer of 2007. Most of these matches are Polish. I did some research and found an article by Pawlowski (reference in my publication) about this most common Polish haplotype, which I now call P type. That got me interested in doing more research, leading to this web page for others to see my results. My experience, however, is a reminder that statistics can be misleading. I was confident that my grandfather’s haplotype was P type, based on a perfect match at the first 12 markers. In June 2010 I realized that the probability was really about 93%, because 13 out of the 14 then current Polish Project members who had 67 markers and who also matched P type perfectly at 12 markers were in fact P type as judged by all 67 markers. My grandfather does not match P type at 67 markers. My grandfather is that 14th one. He matches the small clade that I named I type, after Iwanowicz. I type has since been verified as haplogroup S18681, which is also concentrated in Poland. That’s how an outsider ended up studying P type and R1a1a, and writing web pages and articles about common Polish Y-DNA clades.
This web page was originally called “R1a”; it got so many hits from Poland that I eventually renamed it to include all common Y-DNA clades.
2007 Dec 6 First web posting of this file
2007 Dec Two revisions
2008 8 revisions
2009 33 revisions
2010 36 revisions
2011 26 revisions
2012 18 revisions
2013 Aug 10 rewrite of Polish Project Assignments. Edit of a few other topics
2013 Nov 6: News: update of Summary Table
2013 Nov 21 update of the Results Table at the top of this page; removal to topic “Explanation of the Results Table”
2014 Feb 22 new: Poland Concentration Index
2014 Feb 28: K type dropped, except for a brief K type explanation; reference to K type dropped from a number of other topics without update note. Update R1a Abstract, including drop of K type paragraph.
2014 Aug 30: Update with a comment added to the Abstract, about all the new SNPs
2014 Oct 12: Update of I type S18681
2014 Oct 14: Update rewrite I type S18681
2014 Nov 4: Update I type S18681 with links
2014 Nov 9: Rewrite I type; definition update from I62 to I58
2015 Jan 15: Rewrite of M458 topic and a few other updates
2015 Jan 17: Rewrite of P type
2015 Jan 19: update a few topics
2015 Jan 21: update of Polish Project Assignments and associated Probability topics
2015 Feb 26 add more links to Yfull Tree