Polish Y-DNA Clades
19 Sep 2016
19 Sep 2016 update of frequency % in the PCI Table.
18 Sep 2016 Major update of the Results Table.
5 Jul 2016 rewrite of D Type (Y2613).
30 Jun 2016 rewrite of B Type (Y2902).
12 Jun 2016 edit of P type (L260).
10 Jun 2016 rewrite update of multiple topics - no significant news.
New topic 24 Apr 2016.
The PCI Table lists the haplogroups that are particularly concentrated in Poland.
Abstract rewrite 24 Feb 2016. Edit 24 Apr 2016.
Lawrence Mayka is the primary administrator of the Polish Project. Paul Stone is also an administrator, with emphasis on the I1 haplogroup. I am also an administrator, helping Mayka with statistical methods for assignment of samples. This web document is for explanation, details, and update news.
The topic is common Polish Y-DNA clades - identification of male line Y-DNA clades that are concentrated in the region of Historical Poland.
This Abstract is for people reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you might prefer to read the Introduction first.
Many of the assignments are to established haplogroups, based on SNP test results. Many samples lack sufficient SNP data; if their STR data closely match samples that do have SNP data, they are assigned to the corresponding SNP haplogroups.
Some assignments are to hypothetical haplogroup branches, based on correlations. Such branches are proposed by many people, including Mayka and me. In addition, I hypothetically subdivide haplogroups into types when division can be done with 80% confidence. With less than 80% confidence, assignment categories are tentative, not called types, usually called clusters.
Before 2014, it was expensive to discover new SNPs, so emphasis was on STRs, which were much less expensive. That was true of this web page, the Polish Project, most other web based projects, and most published articles about genetic genealogy. 2014 was a transition, where the cost of discovering new SNPs was greatly reduced. As of 2016, there is a continuous flood of new SNP discoveries, and testing for SNPs has become inexpensive. Emphasis is now on SNPs. Still, most on-line samples have STR data without test data for the most recent SNPs, so STR matching of samples continues to be needed for assignments.
Many of the new SNP branches are very small (I call them twigs), with fewer than 5 known samples.
I am interested in Polish origins. This web document, however, is not for historical analysis and conclusions, except for occasional comments to remind us of the goal. This document is dedicated to identifying haplogroups and types and clusters concentrated in Poland, with detailed explanations. I am aware that some people object to the use of Y-DNA for historical analysis, so I try to mention caveats along with my comments.
Update rewrite 22 Sep 2015. Edit 24 Apr 2016.
In that table, R1a is represented by R1a1 (R-M459) because all R1a samples in the database are R1a1. R1a1 is divided into 3 main haplogroups: M458, Z280, and Z93. Only 1.25% of R1a1 samples do not belong to these 3 main haplogroups.
Worldwide, R1a is more complex. A graphic representation of the known branches of the R1a tree is available at the R1a Project. ISOGG has an R1a tree that is not up to date. Yfull has a continuously updated R1a tree including all the recent new SNPs for which data have been submitted to Yfull; direct link:
Reminder: I am concentrating on Poland. The statistics of STR clusters depend a lot on the database. For example, P type stands out dramatically in Polish data. In other countries far from Poland P type is rare. If you belong to an R1a cluster that is rare in Poland, I’m sorry, but I’m not covering you. Check out the R1a Project.
When I originally posted this web page in December 2007, no significant haplogroup subdivision of R1a was available, so this page started with hypothetical subdivisions of R1a. In 2010 I expanded this page to include the clades from other haplogroups.
Edit 10 Jun 2016.
This Introduction is for people unfamiliar with the jargon of genetic genealogy.
There are quite a few web sites with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. The Y Chromosome Wikipedia article is about male line DNA, also called Y-DNA.
This is a brief introduction to genetic genealogy for Y-DNA, providing some definitions of jargon needed to read my web pages. The definition words are boldface. I often use links to those definitions when I use a jargon word for the first time in a topic. If you want more detail on those boldface words, consider a web search. There are more boldface definitions in the summary of my Methods.
The Y chromosome gets passed from father to son, so it works just like a male family name. Mutations (changes in the DNA coding) in the Y chromosome are inherited by sons. Men are divided into haplogroups based on known rare mutations. These mutations are called single nucleotide polymorphisms - SNP - a change at only one specific location in the Y chromosome. The human Y-DNA haplogroups, representing all men, can be arranged as branches on a tree. Diagrams of the tree usually depict the tree branching down (an upside down tree), or sideways. Example of a sideways image of the Y-DNA tree: wiki tree. Usually it is more convenient to arrange the tree upside down as a list, with tabs for the branches; examples: Yfull tree and ISOGG tree. These examples have links allowing you to browse through the thousands of known branches of the human Y-DNA tree. We don’t really know the full human Y-DNA tree; all trees are based on current data in a particular database; new branches are continuously discovered.
All the men in a haplogroup descend in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup. The MRCA corresponds to a node, or branching point, in the Y-DNA tree of male line ancestry. Time of the Most Recent Common Ancestor (TMRCA) is an estimate of how long ago he lived - the age of the node.
Lots of people, including me, are working to discover more SNPs on the Y chromosome so that the haplogroups can be divided further into smaller haplogroups.
SNPs used to be difficult to discover and expensive to test. Costs have been coming down. SNPs are now discovered relatively easily; SNP testing is inexpensive. Since about 2013 there has been an increasing flood of new SNPs and corresponding newly discovered haplogroups.
SNPs have alphanumeric code names (for example CTS3402), assigned by the people who discover them, and registered at on-line databases. The corresponding Y-DNA haplogroups have alphanumeric code names assigned by ISOGG (for example the haplogroup for CTS3402 is R1a1a1b1a2b3). Since 2014, with the flood of new SNPs, ISOGG is not keeping up, and the ISOGG codes have become too long. ISOGG codes are still used for the main branches (the oldest branches with thousands of samples). For smaller (younger) branches, sometimes only the SNP code is used (for example haplogroup CTS3402) or a main branch code followed by the SNP code for the smaller subdivision branch of interest (for example haplogroup R-CTS3402 or R1a-CTS3402).
Clarification: A haplogroup (or a branch of the Y-DNA tree) includes all the subdivision haplogroups (all the subdivision branches). For example, R-S18681 is a branch of R-CTS3402, and R-CTS3402 is a branch of R1a. R1a is a branch of R1, which is a branch of R. So if a man belongs to the haplogroup R-S18681 he also belongs to the haplogroup R-CTS3402 and he also belongs to the haplogroups R1a, R1, and R. In this example, R is the oldest and R-S18681 is the youngest branch.
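The nesting in this example can be sketched in a few lines of Python. This is only an illustration: the `PARENT` table is a hypothetical, heavily abbreviated tree holding just the branches named above (real trees have many intermediate branches), and the function name `lineage` is invented for this sketch.

```python
# Hypothetical, heavily abbreviated tree: younger branch -> the older branch
# it belongs to. Only the branches named in the text are listed.
PARENT = {
    "R-S18681": "R-CTS3402",
    "R-CTS3402": "R1a",
    "R1a": "R1",
    "R1": "R",
}


def lineage(haplogroup):
    """Return the haplogroup followed by every older haplogroup that contains it."""
    chain = [haplogroup]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


# A man in R-S18681 also belongs to every haplogroup in this list.
print(lineage("R-S18681"))
```

The list runs from the youngest branch to the oldest, so the last entry is always the root of the abbreviated tree.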
If you purchase a DNA test for R-CTS3402 and you have this SNP, the result comes out “positive”, or CTS3402+, and you belong to that haplogroup. If you don’t have this SNP the result comes out “negative”, or CTS3402-.
Upstream means older (SNPs, haplogroups, branches, etc) within the same branch; for example R1a is upstream of CTS3402. Downstream means younger within the same branch; for example CTS3402 is downstream of R1a. Upstream haplogroups generally are larger - more samples in a database. It may be confusing because of the mixing of metaphors, and because most streams and rivers have smaller branches upstream, which is opposite the situation for Y-DNA trees the way the words are used by genetic genealogists. I suppose upstream and downstream could be visualized as a river delta, where a large river breaks up into smaller rivers downstream, for the Y-DNA metaphor. Upstream and downstream should be visualized with the Y-DNA tree as a list, with older nodes up and younger nodes down and indented; see Yfull tree.
SNPs are individually very rare. But the Y has about 60 million locations, and about 1/3 of them are suitable for accurately measuring SNPs inexpensively, so there are thousands of known Y-SNPs. A man who belongs to R-S18681 inherits from his father, on the Y chromosome, not only the SNP mutation S18681, but also CTS3402, and also the mutations for haplogroups R, R1, and R1a, and also the SNPs for other branches that I am not mentioning in this example. It is unlikely that a particular man in an entirely different haplogroup might have the S18681 SNP mutation, but it’s like a lottery; some man somewhere outside S18681 probably has that mutation. However, it is extremely unlikely one man outside S18681 has many of the mutations from the upstream branches leading to S18681; it’s like winning a lottery many times. In other words, it is almost impossible for one man to carry the sets of SNPs for two different haplogroups. A mistake in SNP testing, or a mix up of SNP data, or someone cheating with DNA samples, are each more likely than valid haplogroup confusion.
Many haplogroups have an MRCA who lived thousands of years ago, so these span multiple ethnic groups and nationalities. For example, the R1a haplogroup is of interest to me. R1a is most common in Slavic countries but calling R1a Slavic can be misleading because it is found throughout Europe and west Asia. The MRCA lived so long ago that he may have spoken a language that we would not consider Slavic if we could hear it. It is possible that he did not even live in what is now the Slavic region of Europe; maybe his descendants moved there in a massive migration from the Asian steppes, or from India, or from somewhere else. No one knows for sure. He may have been proto-Slavic in language and culture, but we don’t know for sure. If he was proto-Slavic, by now some of his descendants long ago moved to other parts of Europe and Asia. One of the appeals of genetic genealogy is finding such clues about ethnic descent and migration from the statistics of haplogroups. Some people object, pointing out that ethnicity cannot be defined genetically because of all the moving and mixing of people over the millennia, and because the Y chromosome is only one of our 24 chromosomes. True enough. Some individuals and some web sites go too far with genetic genealogy claims based on DNA. That said, statistical analysis of Y haplogroup data provides clues on human origins.
Most known haplogroups have a TMRCA of thousands of years ago, before family names were common, so most men in such a haplogroup do not have the same family name.
Some relatively young haplogroups have been discovered that correspond to families, where most men in that haplogroup do have the same family name.
Y-DNA is biologically accurate, so some men discover that their Y-DNA does not match the DNA of their male line distant cousins identified by genealogy research, due to secret adoptions, illegitimacies, cuckoldry, etc. Such a situation is called an NPE, non paternal event. This is one of the reasons some genealogists prefer to avoid genetic genealogy.
The male line associated with the Y-chromosome is only one ancestral line. Anyone who tries to make a family tree going back 300 years has more than a thousand root tips to be filled by names of ancestors who lived back then; the one man at the tip of the male line root is only one of those thousand. That is another reason some genealogists avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of many. That said, many men enjoy purchasing a DNA test to find out which ancient Y-DNA haplogroup is theirs. Many people enjoy the challenging hobby of discovering new SNPs in their male line (including women studying their father’s or maternal grandfather’s male lines), thereby defining their younger branches of the Y-DNA tree.
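The "more than a thousand" figure is simple doubling arithmetic. A worked sketch, assuming about 30 years per generation (an assumption made here for illustration; no generation length is stated above) and ignoring the fact that marriages between distant cousins make some ancestors fill more than one slot:

```python
def ancestor_slots(years_back, years_per_generation=30):
    """Number of ancestor slots in the oldest generation of a family tree."""
    generations = years_back // years_per_generation
    # Every person has two parents, so the slots double each generation.
    return 2 ** generations


print(ancestor_slots(300))  # 10 generations -> 1024 slots
```

Only one of those 1024 slots is the male line ancestor who carried the Y chromosome.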
A paragroup is a haplogroup considered without its known haplogroup branches. An asterisk is often used in paragroup codes, like R1a1a* or CTS3402*. R1a1a* usually (not always) means all the samples that belong to R1a1a but do not belong to any of the known branches. When a new branch is discovered, samples positive for that new SNP get assigned to the new branch; that changes the meaning of the corresponding paragroup. The meaning of a paragroup varies between databases, because different databases have different samples with different SNP test results. As a simple example of the idea of paragroups, “apes” is a biological clade that includes humans, but for many discussions we talk about “apes” as all apes except humans, in which case “apes” is temporarily a paraclade in that discussion.
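A paragroup can be thought of as a set difference. The sketch below is an illustration only: the kit names are made up, `NEW1` is a hypothetical SNP name standing in for a newly discovered branch, and a sample that was never tested for a branch SNP is treated the same as a negative result (one reason the meaning of a paragroup differs between databases).

```python
def paragroup(samples, haplogroup_snp, known_branch_snps):
    """Samples positive for the haplogroup SNP but for none of the known branch SNPs."""
    branches = set(known_branch_snps)
    return sorted(
        name
        for name, positives in samples.items()
        if haplogroup_snp in positives and not (positives & branches)
    )


# Hypothetical samples: kit name -> set of SNPs that tested positive.
samples = {
    "kit1": {"CTS3402", "S18681"},
    "kit2": {"CTS3402"},
    "kit3": {"CTS3402", "NEW1"},  # NEW1 is a made-up SNP name
}

# CTS3402* while S18681 is the only known branch.
print(paragroup(samples, "CTS3402", ["S18681"]))          # ['kit2', 'kit3']
# After the hypothetical branch NEW1 is discovered, CTS3402* shrinks.
print(paragroup(samples, "CTS3402", ["S18681", "NEW1"]))  # ['kit2']
```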
Many SNPs were independently discovered more than once and listed with multiple names. For example, CTS3402 has two other code names, so it can be listed as CTS3402 / V2670 / S3361, or any of those names can be used alone. Those three codes are all the same mutation, at the same location on the Y chromosome.
Most haplogroups have multiple phyloequivalent SNPs. For example, CTS3402 has two other phyloequivalent SNPs: Y32 and Y2194. These are not the same; they are different mutations at different locations. So far, every sample that is positive for CTS3402 is also positive for those other two. This may change; for example in the future a sample may show up that is positive for CTS3402 but negative for the other two SNPs, in which case all previous branches of CTS3402 will be assigned to a new haplogroup, branching from CTS3402, defined by either of those other two SNPs, while that hypothetical new sample will be assigned into a new branch of CTS3402.
R1a has more than 100 phyloequivalent SNPs.
It’s OK to say “equivalent SNPs” instead of “phyloequivalent” if the meaning is clear, but “equivalent” has other unrelated meanings.
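Whether SNPs look phyloequivalent depends on the samples in hand. The following sketch groups SNPs that are positive in exactly the same set of samples. It is an illustration under stated assumptions: the kit names are made up, every sample is assumed to have been tested for every SNP, and `kitC` is the hypothetical future sample described above.

```python
from collections import defaultdict


def phyloequivalent_groups(samples):
    """Group SNPs that are positive in exactly the same set of samples."""
    carriers = defaultdict(set)  # SNP -> names of samples positive for it
    for name, positives in samples.items():
        for snp in positives:
            carriers[snp].add(name)
    groups = defaultdict(list)   # set of carriers -> SNPs sharing that set
    for snp, who in carriers.items():
        groups[frozenset(who)].append(snp)
    return sorted(sorted(snps) for snps in groups.values())


# Hypothetical samples: kit name -> set of SNPs that tested positive.
samples = {
    "kitA": {"CTS3402", "Y32", "Y2194", "S18681"},
    "kitB": {"CTS3402", "Y32", "Y2194"},
}
print(phyloequivalent_groups(samples))
# [['CTS3402', 'Y2194', 'Y32'], ['S18681']]

# A hypothetical new sample positive for CTS3402 but negative for the other two
# splits the group: CTS3402 now stands alone, upstream of Y32 and Y2194.
samples["kitC"] = {"CTS3402"}
print(phyloequivalent_groups(samples))
# [['CTS3402'], ['S18681'], ['Y2194', 'Y32']]
```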
For clarity let me offer a tree analogy for “phyloequivalent”. Between branching nodes, a real tree has branch segments, which are smooth, without any branching. The older branches of a tree have long smooth segments. When that tree was younger, there were lots of branches along that segment, but those branches died and fell off the tree as it grew over time. Similarly, the Y-DNA tree has segments between nodes; the older branches of the Y-DNA tree correspond to haplogroups that have initial segments that are long - sometimes thousands of years in time, so there were many generations along that segment, with multiple independent SNPs, that now seem to be phyloequivalent, because all the other branches became extinct (no surviving males). I say “seem to be phyloequivalent” because a new sample may show up from a man in a newly discovered branch, thereby splitting that segment into two segments. So a “smooth” segment really includes the possibility of small twigs along that apparently smooth segment. The metaphor of a tree is appropriate, because a large branch segment with very few twigs looks smooth from a distance. A Y-DNA branch can be smooth in one database (like the Polish Project) and not smooth in a larger database (like Ysearch, if significant branches in that segment are rare or absent in Poland).
Extinction: Over the life of a real tree, most branches die and fall off. Similarly, due to the statistics of male descendancy, most male lines become extinct over time. That seems surprising to many people, but it is a well known statistical result. If you want verification, search the web for the theorem called “gambler’s ruin”, whereby a gambler with a fixed stake almost always loses everything when playing at a casino, even if the odds are neutral. The number of males in a haplogroup fluctuates up and down due to statistics over the generations, almost always eventually fluctuating to zero, similar to a gambler’s stake. A haplogroup that survives for thousands of years is like a very rare lucky gambler in a casino. In a casino the odds are usually fixed in favor of the casino; in the Y-DNA tree the odds were favorable for male lines during population expansions, and the odds were unfavorable for male lines during population declines. Extinction of male lines is faster during population bottlenecks. Although almost all male lines become extinct, there are those very rare lucky male lines that grow, forming a major branch of the Y-DNA tree, with many new male lines, whereby most of those new male lines eventually become extinct, repeating the process.
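The extinction effect is easy to see in a small simulation. This is a toy sketch, not a model of any real population: it assumes each man has 0, 1, or 2 sons with probabilities 1/4, 1/2, and 1/4 (exactly one son on average, that is, neutral odds), and the founder count, generation count, and seed are arbitrary choices.

```python
import random


def line_survives(generations, rng):
    """Follow one founder's male line; True if any male descendant remains."""
    males = 1
    for _ in range(generations):
        # Each man has 0, 1, or 2 sons with probability 1/4, 1/2, 1/4,
        # so the average is exactly one son per man (neutral odds).
        males = sum(rng.choice((0, 1, 1, 2)) for _ in range(males))
        if males == 0:
            return False
    return True


def extinct_fraction(founders=2000, generations=100, seed=1):
    """Fraction of founders whose male line has died out after `generations`."""
    rng = random.Random(seed)
    survived = sum(line_survives(generations, rng) for _ in range(founders))
    return 1 - survived / founders


print(f"{extinct_fraction():.1%} of male lines extinct after 100 generations")
```

Even with neutral odds, the great majority of the simulated lines die out, while the few survivors tend to have many living males.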
It is common to call haplogroups and corresponding SNPs “nodes”, particularly when discussing the Y-DNA tree represented as an upside down tree, as a list, with haplogroups indented to indicate branching. In such a list the haplogroups and their corresponding SNPs do appear as nodes, not distinguished from the true nodes which are MRCAs. I avoid using “node” for SNPs, although such use is common in genetic genealogy.
Actually, phyloequivalent SNPs are almost always spread out in time, within the segment that is older than the MRCA. I have more discussion about this in the age topic, below.
There is another kind of mutation, in a microsatellite, which is also called a short tandem repeat, STR. Briefly, an STR is like a necklace. Each bead of the necklace is the same short sequence of DNA, repeated multiple times. An STR can mutate such that the number of repeats in the necklace changes. So an STR mutation is expressed as the number of repeats after the mutation. STRs are not used for haplogroups because they are not rare enough. You can read more on line, for example wiki STR.
Here are some common terms (in boldface) for genetic genealogy. A marker (also “locus”, plural loci) is a DNA location for an SNP or STR or other kind of mutation. A haplotype is a set of gene values at any number of markers. In Y-DNA genetic genealogy “haplotype” is usually used to mean a set of numbers, for the values of a particular set of Y-DNA STR markers. The word sample (plural samples or data or database) refers to the Y-DNA STR and SNP values from one man. A sample is also commonly called a haplotype, but I avoid calling a sample a haplotype to make it clear that a haplotype may or may not be present in a particular database of samples. A clade is a general term for common descent, so an SNP haplogroup is one kind of clade.
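The necklace picture translates directly into counting beads. A minimal sketch, using a made-up DNA sequence and the motif GATA purely for illustration (real STR values come from laboratory tests and follow each marker's own counting conventions):

```python
import re


def repeat_count(sequence, motif):
    """Length, in repeats, of the longest unbroken run of `motif` in `sequence`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


before = "TT" + "GATA" * 10 + "CC"  # made-up sequence with 10 repeats
after = "TT" + "GATA" * 11 + "CC"   # the same STR after gaining one repeat

print(repeat_count(before, "GATA"), repeat_count(after, "GATA"))  # 10 11
```

The STR value reported for a marker is this repeat number, so the mutation in the example changes the value from 10 to 11.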
Many people, including me, in the past, worked to “stay ahead” of the SNP haplogroups by analyzing STR mutations. That’s because SNPs used to be difficult to discover and expensive to test, while STR data was relatively inexpensive. That’s changed since about 2013; SNPs are now discovered relatively easily; SNP testing is inexpensive. So STR analysis is no longer as popular as it was. STRs still have value for genetic genealogy. I have more about STRs and haplotypes in my Method topic below.
Although SNPs are more important than STRs for Y-DNA genetic genealogy, STRs are still valuable because on-line databases have thousands of samples with STR data but not as much SNP data (yet). You can search for statistical matches to STR data.
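One simple way to compare two STR haplotypes is to count the markers at which their repeat values differ. The sketch below is only an illustration of that idea; it is not the matching method used by FTDNA or Ysearch, nor the statistical method of this document. The marker names are real Y-STR names, but the kit names and values are made up.

```python
def marker_differences(a, b):
    """Return (number of differing markers, number of markers both samples have)."""
    shared = a.keys() & b.keys()
    return sum(a[m] != b[m] for m in shared), len(shared)


# Made-up STR values (repeat counts) for three samples.
kit1 = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10}
kit2 = {"DYS393": 13, "DYS390": 25, "DYS19": 15, "DYS391": 10}
kit3 = {"DYS393": 13, "DYS390": 24, "DYS19": 15}  # tested on fewer markers

print(marker_differences(kit1, kit2))  # (1, 4)
print(marker_differences(kit1, kit3))  # (2, 3)
```

Reporting the number of shared markers alongside the difference count matters, because samples tested on different marker sets can only be compared where they overlap.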
There are many organizations and commercial companies on the web where you can order a cheek swab kit to mail in for genetic genealogy STR and SNP testing and matching and analysis, for example FTDNA. I am not associated with the company FTDNA; I mention them because I make extensive use of their data; check Google for competitors. At FTDNA, click on Products for cheek swab kits. DNA results are confidential unless you register the data at a public database; at FTDNA, click on Projects to register your data into one of the many databases; for example, most of my analysis is from the data in the FTDNA Polish Project.
For STR analysis, I prefer the FTDNA standard set of 67 STR markers. I do some analysis using the standard FTDNA 12, 25, 37, or 111 STR marker sets. Other companies use standard marker sets that may not overlap with all the FTDNA markers.
As a first Y-DNA test, I recommend the FTDNA standard STR 37 set, because the result will automatically place you in one of the main large haplogroup branches of the Y-DNA tree, and because FTDNA provides you with matches to other men with similar STR haplotypes. If cost is not an issue, the 67 set is better for accurate matches and the 111 set is best. The 12 and 25 sets are no longer available in the Product list at FTDNA.
Once you know your haplogroup you can follow the on-line tree and purchase SNP tests to determine your younger branches on the Y-DNA tree. I have instructions available for SNP ordering.
If you already purchased a DNA test your result probably already has your main Y-DNA haplogroup branch (for men), so you can proceed with SNP testing.
Ysearch is the largest web database for Y-STR data, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services. I use Ysearch often for analysis so of course I encourage you to register your STR data at Ysearch. From the FTDNA site, you can register your data with Ysearch. Or you can type your Y-STR data into Ysearch.
Data sharing: Thousands of men are sharing their Y-DNA data by making it available on the web for analysis. Most DNA testing companies give you choices: you can keep everything private, in which case nothing is shared; you can allow sharing of everything; you can be selective, for example sharing STR and SNP data associated with your test kit ID number but withholding your name; and other options. The Y chromosome has relatively few genes, and none of those few have been correlated with health issues or significant human characteristics. Nevertheless, some people feel private about their DNA. One issue is non paternal events in the past, whereby your DNA matches may provide surprises. Although I encourage people to submit DNA data to databases, I understand if you are reluctant. I use public databases, mostly FTDNA projects and Ysearch, so my analysis is based on data that is already public. I use names of people if those names are already on the web. I ask permission before using people’s names as references for new information or analysis.
Many people are using statistical analysis of Y-DNA data to gain insight into human origins and migrations. I am one of those people. My interest is Polish origins. This web document, however, is not for historical analysis and conclusions, except for occasional comments. This document is dedicated to Y-DNA data and analysis, both SNP and STR, identifying haplogroups, types, and clusters concentrated in Poland, with detailed explanations.
My Method topic has more definitions, but that topic is more advanced, intended for readers with some experience in genetic genealogy.
The Fall 2009 issue of the Journal of Genetic Genealogy has my publication split into two parts:
Part I is my “mountains in haplospace” method for evidence that certain “types” of STR clusters correspond to clades.
Part II is the application of that method to Common Polish Clades. That article has a lot more detail than this web page, but that article was published in the Fall of 2009, so this web page serves as an update.
PolishCladesUpdate is my folder for updates of the Excel analysis files for those two articles.
This web page continues as an introduction, summary, and update.
The Fall 2010 issue has my publication announcing the L260 SNP.
New topic 6 Nov 2009. Last edit 27 May 2010.
An article was published online, 4 Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.
I call this article “Underhill” for short, because his is the lead name in the list of 34 authors for this major work.
This web page about Polish Clades was completely rewritten using this new information. Recent L260 and M458 test results are consistent with (albeit not full proof of) my previous R1a subdivision into “types” here on this web page about Polish Clades.
Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-).
R1a1a7 is the new M458 haplogroup. R1a1a7 includes what I have been calling P type and N type here on this web page, even before M458 was available.
R1a1a* is a new paragroup. This is M458 negative. It includes all my other R1a types.
This Underhill article has data for 158 “Poland” samples (Table 2):
R1a1a*: 71 samples 44.9%
R1a1a7: 87 samples 55.1%
The 70% confidence interval for R1a1a7 is about 50% to 60% in the Underhill Poland data.
Worldwide 77% of the Underhill data is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
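These percentages and the interval can be checked with a short calculation. The sketch below uses the normal approximation for a proportion, which is an assumption: the method behind the 70% interval quoted above is not stated, so this is one plausible reconstruction rather than the original calculation.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, total, confidence):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    margin = z * sqrt(p * (1 - p) / total)
    return p - margin, p + margin


low, high = proportion_interval(87, 158, 0.70)
print(f"R1a1a7 in Poland: {87 / 158:.1%}, 70% interval {low:.1%} to {high:.1%}")
print(f"R1a1a* worldwide: {222 / 290:.1%}")
```

Under this approximation the interval comes out at roughly 51% to 59%, consistent with the "about 50% to 60%" quoted above.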
M458 results are coming in now for this new SNP test, and the Polish Project R1a is splitting about evenly, with a few percent more R1a1a7 than R1a1a*, although the latter is more common worldwide.
Up to here, I have tried to write this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity with genetic genealogy jargon has understood. If you read this top to bottom, it gets progressively more detailed, with more and more jargon. I’m sorry about that, but the audience is also readers with genetic genealogy experience who want to know how I came to my conclusions. If you cannot follow some of this, it is written so that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.
If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location; this serves as an index.
Rewrite 5 Jul 2016.
Lawrence Mayka is the administrator of the Polish Project. Click on the Polish Project web link to see how Larry assigns samples (men) to categories. The Polish Project has sections for mtDNA and for Y-DNA. This web document of mine is restricted to Y-DNA. I help Larry with assignments to types. Lukasz Lapinski and Paul Stone also help with assignments.
Haplogroups are defined by SNP mutations. The goal is to assign samples to their proper terminal haplogroups. Your terminal haplogroup corresponds to the youngest known SNP haplogroup in your male line of the human Y-DNA tree. When the terminal haplogroup cannot be assigned with reasonable confidence, assignment may be to an upstream haplogroup (one of the older branch segments) leading to the terminal haplogroup. New terminal haplogroups (new younger segments) are continuously discovered because of the recent rapid rate of discovery of relevant new SNPs.
Samples are grouped by assignment categories on the Y-DNA STR data pages, where the category titles appear as horizontal colored rows, followed by rows of samples assigned to that category.
When appropriate, the assignment category names include a recommended SNP for further testing, in order to confidently determine the terminal haplogroup for those samples.
You can always save money by waiting. DNA testing costs are coming down as better testing methods are developed. More detailed SNP packs will surely be available in the future. As more data accumulates, a sample may show up eventually that matches your current STR data very closely, and if that sample has recent new SNP data with positive results, your sample will probably be assigned (or predicted) to that same corresponding new haplogroup at no cost to you. Assignment categories in the Polish Project are provided to help men who are doing male line genetic genealogy research (and women doing research on the male lines of their husbands and fathers and maternal grandfathers), and who would like results soon, as detailed as possible.
If you are planning to purchase Big Y (next topic), there is no need to purchase SNP tests, because Big Y includes just about all the commercially known Y SNPs.
If you are new to Y-DNA testing, you should join an FTDNA haplogroup project corresponding to one of your main haplogroup branches. The haplogroup project administrators are usually up to date on the latest SNPs for that haplogroup, and are often eager to help beginners figure out where they fall in their branch of the Y-DNA tree. The Polish Project administrators can also help out, although we may not be quite as knowledgeable about your specific haplogroup. To find your main haplogroup branches, from your FTDNA home page (Dashboard), click on “Haplotree and SNPs”. The tree should come up, indicating your FTDNA assigned haplogroup. The tree should have the main branches indented in rows above your FTDNA assigned SNP haplogroup (upstream haplogroups indented to the left). To find haplogroup projects, from your FTDNA home page, under the “Projects” tab, click on “Join a Project”, then scroll down to the header “Y-DNA Haplogroup Projects”, then click on the first letter of your haplogroup assignment; check out the projects that come up.
In the past, STRs were easier to test than SNPs, so many samples have STR data without recent SNP data. Predicted assignments (for samples without up to date SNP data) are based on STR correlations, by comparison to samples that have both STR and SNP data. The men with such predicted assignments can verify their assignments by ordering the corresponding SNP test that is named in the assignment.
The Polish Project also includes FTDNA computer generated assignments in a column labeled “Haplogroup”, which uses a color code; green text means assignment based on a positive SNP test result; red text means assignment based on STR prediction. I do not know the FTDNA computer algorithm for those red STR based predictions, but it is conservative; I notice they have more than 97% probability - less than 3% of those red predictions end up in different haplogroups when they are eventually SNP tested. However, that means most of the SNPs for recently found branches of the Y-DNA tree are not predicted by FTDNA, because there is not enough data for 97% probability predictions. Most of the newer SNPs are for younger branches, where STR prediction simply cannot be done with such high confidence, because those younger branches do not all have unique STR signatures.
The Polish Project assignments are more aggressive. The assignment guideline for predicted assignments based on STRs is better than 80% probability of future validation by an SNP. The intention is to provide more STR based predictions, accepting the risk that some might later be found incorrect. In practice, over the years, the average validation rate in the Polish Project is much better than 80% because most predicted samples fit much better than 80%, and relatively fewer samples turn up in the approximate 80% range.
We avoid recommending SNP tests when a sample has more than 95% confidence of testing positive, because testing money is better spent testing for the branches - younger SNPs. A negative SNP test result is OK because by eliminating a haplogroup, predicted assignment to another branch can be made with higher confidence.
Some Polish Project SNP prediction categories have qualifiers, such as “Predicted, Recommended, or Needed” to indicate relatively higher confidence, or qualifiers such as “Credible, Consider, or Borderline” to indicate relatively lower confidence. The Borderline category has the least confidence; this category is an exception to the 80% guideline; estimated probabilities as low as 50% have been used, although rarely, for samples that do not fit any other category.
Many haplogroups have multiple phyloequivalent SNPs. Polish Project Assignment categories select one of those phyloequivalents to be used consistently in the category name. Usually one of the first to be discovered is used, and then changes are made only when new data causes previous phyloequivalents to be split into different branch segments; so a new code name is used only when necessary. I do not fully understand the FTDNA computer assignments in this regard; samples in the same haplogroup often use different phyloequivalent SNP code names in the FTDNA “Haplogroup” column.
Clusters and types are hypothetical haplogroups, used as assignment categories based on STR analysis. There were more of these a few years ago, before the flood of new SNPs. Many of those have been validated (equivalent to an SNP haplogroup) or shown to be invalid (STR matches not having high probability of belonging to one haplogroup). In most cases the original cluster or type name is still used for assignments, along with the corresponding equivalent SNP. There are still some clusters and types that have not been validated or invalidated.
To order a recommended SNP, from your FTDNA home page, click on “Haplotree and SNPs”. The tree should come up, indicating your FTDNA assigned main haplogroup. The tree should have the SNP branches indented in rows below (downstream - younger than) your FTDNA assigned SNP.
FTDNA is not particularly fast to add new SNPs. Other companies are available; some of them honor requests for the newest SNPs. If you are new to SNP testing, ask for help from your haplogroup project administrator in selecting new SNPs appropriate for you. I generally recommend waiting for FTDNA because I like the convenience of all SNP data for the Polish Project in one place, but I sometimes purchase elsewhere. For SNPs that are not in the FTDNA tree, see SNP ordering information.
SNP packs are available and sometimes recommended in the Polish Project assignment name. SNP packs include many SNPs at a very low cost per SNP. SNP packs are recommended for samples that have no SNP tests, or have not been SNP tested in a long time, and where the assigned haplogroup has quite a few known younger SNPs. You can usually find your recommended SNP pack in the FTDNA tree, above your assigned position in the tree.
Sometimes a Remainder category is used for paragroups, which means the remaining samples from a haplogroup that have sufficient data to conclude that they do not belong to any of the known subdivision branch categories of that haplogroup.
STRs: Until a couple years ago I (and most experienced genetic genealogists) recommended purchasing the maximum number of STR markers. The FTDNA maximum standard set is 111; smaller standard sets have 67, 37, 25, and 12. More markers increased the confidence of assignments; some types and clusters required 67 or 111 markers for assignment; more samples with 111 markers allowed discovery of more types and clusters. Today, this is all still true, but SNPs are more important than STR markers for assignments. Assignments that need more markers have phrases like “SNPs or Markers Needed” in the assignment name. A sample with only 12 markers can be assigned to a main haplogroup branch based on only 12, and from there an SNP Pack can be purchased to identify the appropriate downstream haplogroups, and more SNPs can be individually tested in sequence to determine the terminal haplogroup for that sample. If cost is not an issue, Big Y (next topic) is better than purchasing SNP tests.
Another purpose for STRs: Finding male line best matches from the large number of samples without sufficient SNP data. The FTDNA site automatically lists close STR matches. For this purpose, more markers are better. 111 markers is best if cost is not an issue. However, even with 111 markers tested, the FTDNA site does check for matches at fewer markers because there are many samples in the FTDNA database that have only sets with fewer.
New Topic 6 Oct 2015. Edit 8 Oct 2015.
“Big Y” is a commercial project by FTDNA for reading about 12 million base pairs of the DNA of the Y chromosome, which has about 60 million base pairs total. New SNPs are being discovered in the Big Y data provided by customers.
The FTDNA home page for your DNA kit has a link for ordering Big Y, and for later viewing the results.
If you are new to Y-DNA genetic genealogy, you might ask for help on Big Y, from an administrator of your haplogroup project, because Big Y results are not easy to understand. Yfull does an excellent job of analysis of Big Y results, at a modest price; I highly recommend joining Yfull by submitting to Yfull your “BAM”, which is the very large data file in Big Y results.
The cost is currently $575. Other companies offer similar tests; I recommend FTDNA because I like the convenience of Polish Project data being available at one place.
If you are planning to purchase Big Y, there is no need to purchase SNP tests, because Big Y includes just about all the commercially known Y SNPs. In addition, Big Y lists your private SNPs, corresponding to recent Y chromosome mutations in your male line ancestors. There is a good chance one of your private SNPs will match up with a private SNP of someone who previously purchased (or later purchases) Big Y, in which case that private SNP will define a new relatively young terminal haplogroup branch for just the two of you; more men can later test for only that SNP to see if they belong to that new haplogroup.
If you purchase Big Y you probably won’t have to purchase any more SNPs for quite a few years. When a test reading more of the Y chromosome at a lower price becomes available, men in your male line using that newer test might discover additional SNPs not in your Big Y results, but those SNPs may be phyloequivalent to your Big Y SNPs.
Other testing companies provide similar tests. I don’t work for FTDNA. If I seem to be pushing their tests it’s because I encourage men with Polish ancestry to take up to date tests and to join the Polish Project, where FTDNA data is readily available for analysis.
Edit 25 Apr 2016.
This is a long topic with many short subsections, each for a category. Many of these subsections are out of date and need to be rewritten. The subsections without a date on the first line may be a few years old.
This large topic has descriptions for many of the Y-DNA categories at the Polish Project. Some of these are haplogroups, some are types, some are clusters. Types and clusters are high confidence hypothetical haplogroups.
Please don’t get confused. The following capital letter names are my codes for R1a categories. Capital letters are also used for the large official haplogroups, but that’s different.
Some of the following categories are discussed in my November 2009 publication, and may have archive copies of my 2009 Excel analysis files stored in the Supplementary folder. Many of the following types have my update Excel analysis at PolishCladesUpdate.
Y2619 is a branch of Z93, which is a branch of R1a. In the Results Table, Z93 is 2.6%, mostly A type at 1.5% (early 2016). That table is restricted to samples from the Polish Project that indicate “Poland” for male line ancestry. Z93 is unusual in that only 37% indicate “Poland”, perhaps because this haplogroup seems to be dominated by Ashkenazim, who join the Polish Project looking for close Y-DNA matches. The Polish Project welcomes descendants from Historical Poland, and in the case of Z93 many of the samples even give origin from countries outside the historical Poland borders. The frequency of Z93 in the full Polish Project is about 7%.
A type is one of the more dramatically "isolated in haplospace" clades that I have analyzed using my type method. With cutoff 10 and gap 5, there are no samples in the Polish Project at steps (STR mutation count from my A type definition) 10 through 14.
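The cutoff and gap test can be sketched in Python. This is only a minimal illustration, not the method used in the analysis files: it assumes step is the sum of absolute repeat count differences from the type definition, and the function names, marker selection, and sample values are hypothetical.

```python
from collections import Counter

def step(sample, definition):
    """Count STR mutation steps between a sample and a type definition.

    Assumption: each marker contributes the absolute difference in repeat
    count, and every marker in the definition has been tested in the sample.
    """
    return sum(abs(sample[m] - definition[m]) for m in definition)

def gap_is_empty(steps, cutoff, gap):
    """True if no sample falls at steps cutoff .. cutoff + gap - 1."""
    histogram = Counter(steps)
    return all(histogram[s] == 0 for s in range(cutoff, cutoff + gap))

def assign(steps, cutoff):
    """Samples with step below the cutoff are assigned to the type."""
    return [s < cutoff for s in steps]

# Hypothetical three-marker definition and samples (not real data).
definition = {"DYS393": 13, "DYS390": 25, "DYS19": 16}
samples = [
    {"DYS393": 13, "DYS390": 25, "DYS19": 16},
    {"DYS393": 13, "DYS390": 24, "DYS19": 15},
    {"DYS393": 14, "DYS390": 22, "DYS19": 14},
]
steps = [step(s, definition) for s in samples]
```

With cutoff 10 and gap 5, `gap_is_empty` checks steps 10 through 14, which is the range quoted above for A type.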
Explanation of A type isolation (see also Population Bottlenecks): At the Yfull tree, Y2619 has 16 phyloequivalent SNPs. Y2619 is the only known (at Yfull) branch of CTS6, which has 11 phyloequivalent SNPs. There is only one sample on the twig CTS6* (CTS6+ Y2619-), so not counting that one exception, those 27 total SNPs are spread out in time on a long segment of the tree. Yfull estimates the length (in time) of that segment as from 3400 ybp to 1450 ybp. That’s 2 millennia of isolation (early 2016 estimate). That's a good explanation for why A type is so well isolated in STRs - a long time for unique STR mutations.
On Ysearch, the A type definition (ID FCUFG) is fairly well isolated, although not as dramatically as in the Polish Project.
There are no CTS6* samples in the Polish Project, nor in the R1a Project (early 2016).
The SNP sequence leading to A type is Z93 > Z94 > Z2124 > Z2122 > F1345 > CTS6 > Y2619. Other Z93 samples are sparsely spread among the Z93 paragroups excluding A type in that sequence. There is no significant evidence of any concentration in Poland. Possible exception: The Polish Project has 5 samples with Poland male line origin that are Z93+ Z94-. Two of them have Big Y. Perhaps these represent a very small haplogroup, to be discovered in the future, that is concentrated in Poland.
A type is discussed in my publication, Part II. The definition, using 67 markers, has been available since 2008 at Ysearch as FCUFG. This web page has consistently predicted that an SNP would be discovered that would seem equivalent to A type, which is the case today with Y2619.
I have consistently expressed more than 98% confidence that A is a valid clade, not just because of my work, but because the modal haplotype closely matches the various versions of the most common Ashkenazi haplotype, which has been widely studied and reported on the web. It should be emphasized that not all Ashkenazi match this type, and some men in this type may not be descended from Ashkenazi.
Between 2008 and 2011 I predicted that A type was a subtype of K type, but I never had high confidence in that prediction. I eventually dropped K type, but many of the hypothetical divisions of K type are coming out as valid clades of Z283, which is a “big brother” SNP to Z93. The match of A type to K divisions at the first standard set of 12 markers is now seen to be a coincidence. Older publications call that 12 marker haplotype, very common in Eastern Europe, the “Ashkenazi” haplotype, but we now know that only a small fraction of men who match at 12 markers are Ashkenazim.
In the Results Table, B type is 3.3% (early 2016). That table is restricted to samples from the Polish Project that indicate “Poland” for male line ancestry. For the full Polish Project, open to all of Historical Poland, 21 of 33 B type samples indicate ancestry from “Poland” implying the current borders of Poland. However, Y2902 is common throughout Eastern Europe. In the R1a Project Y2902 is called “Carpathian-Russian”, which is not particularly concentrated in Poland. Perhaps one or more branches of Y2902 are concentrated in Poland; we might soon see evidence for that as more B type samples get tested for terminal SNPs. At the Yfull tree (Jul 2016), Y2902 has 18 known branches.
At the Yfull tree (Jul 2016), Y2902 has 20 phyloequivalent SNPs, which demonstrate why B type is isolated in haplospace; for more discussion of this point see Population Bottlenecks. I don’t expect any of those 18 known branches to form a significant STR cluster or type, because none of them has more than 6 phyloequivalent SNPs.
Mayka suggested the B cluster to me in Sept 2009, at which time I verified that it qualified as a type, with SBP 12.4%, although there were only 11 samples at the time. I mentioned B type in my 2009 Publication.
In the Polish Project we have always used a “B Borderline” assignment for samples near the cutoff (a few steps below or above). We try to maintain 80% confidence for individual assignments, but this type is an example where “borderline” has been used for samples with lower confidence if no better assignment is possible. I have noticed that which samples make the cutoff is sensitive to exactly which markers are used in the definition, and I have tweaked the definition year to year as data accumulated. However, samples well below the cutoff are consistently predicted B type. You can see this in my analysis file, BType.xls, where the best fit 24 samples always come out B type using 30 to 67 markers (with my automatic marker selection method), and none of these have tested Y2902-. SBP is 20% in that analysis file, which is good but not great. At 111 markers, there are proportionally fewer borderline samples but it is still an issue.
I now realize the reason for this borderline issue: The Yfull tree shows a lot of immediate branches for Y2902 - a bushy tree - probably due to a rapid population expansion at about the TMRCA. Apparently they all branched again during or not long after the expansion. So STR outliers should be expected among the various branches, because there has been a lot of time for STR mutations to accumulate within the full Y2902 population.
My calculated Polish Concentration Index for B type is 13%, not good enough for the PCI table. Maybe a future branch of B type will do better.
In the Results Table, D type is 2.0% (early 2016). That table is restricted to samples from the Polish Project that indicate “Poland” for male line ancestry. For the full Polish Project, open to all of Historical Poland, 20 of 35 D type samples indicate ancestry from “Poland” implying the current borders of Poland. However, Y2613 is common throughout Eastern Europe. In the R1a Project Y2613 is called “Carpathian-Dalmatian”, which is not particularly concentrated in Poland. Perhaps one or more branches of Y2613 are concentrated in Poland; we might soon see evidence for that as more D type samples get tested for terminal SNPs. At the Yfull tree (Jul 2016), Y2613 has 8 known branches.
At the Yfull tree, only one sample is Y2613*; the other 12 samples are in Y2609, the only known branch of Y2613 (Jul 2016). In the Polish Project no samples have yet tested Y2613+ Y2609-. Only one of the 12 at Yfull is Y2609*; the other 11 are in Y2608. (Another branch of Y2609, YP4993, has no samples listed.) In the Polish Project only one Y2609+ sample has come out Y2608-.
Summary: D type seems also equivalent to the haplogroup branches Y2609 and Y2608.
This type was added to the Polish Project in Jan 2010, before the SNPs were discovered. The cluster was brought to my attention by Mayka, who pointed out that Nordtvedt mentioned the cluster in web discussions some time before that, based on DYS462=12.
Signature (460,481,462,650) = (10,<22,12,18). Those last two are only available in the 111 STR marker set, where DYS462 is the best signature marker. At 37 markers, only DYS460 is available. D type cannot be distinguished using the 25 markers.
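A signature like this can be checked mechanically. The sketch below is hypothetical: it assumes the four numbers are DYS marker numbers, treats untested markers as unknown rather than as failures, and uses a made-up sample.

```python
import operator

# D type signature as given in the text:
# DYS460=10, DYS481<22, DYS462=12, DYS650=18.
D_SIGNATURE = {
    "DYS460": (operator.eq, 10),
    "DYS481": (operator.lt, 22),
    "DYS462": (operator.eq, 12),
    "DYS650": (operator.eq, 18),
}

def check_signature(sample, signature=D_SIGNATURE):
    """Return (matched, failed, untested) marker lists for one sample."""
    matched, failed, untested = [], [], []
    for marker, (compare, value) in signature.items():
        if marker not in sample:
            untested.append(marker)      # marker not in the tested set
        elif compare(sample[marker], value):
            matched.append(marker)
        else:
            failed.append(marker)
    return matched, failed, untested

# Hypothetical sample with only two signature markers tested.
sample = {"DYS460": 10, "DYS481": 21}
matched, failed, untested = check_signature(sample)
```

Keeping untested markers separate shows how much of the signature a smaller marker set can actually confirm.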
That definition is also available in my analysis file DType.xls, where SBP = 9.1% (data from early 2016), which is a good indication of confident STR prediction. In fact there are no outliers so far (no predicted D type with Y2613- result and no Y2916+ not predicted D type based on STRs). I don’t recall any D type sample over the years being reassigned to another category.
Yfull (Jul 2016) lists 11 phyloequivalent SNPs for Y2613, explaining why it is so well isolated. I cannot construct STR based divisions of Y2613, and I doubt it will be possible because Yfull shows only 4 phyloequivalent SNPs for the branch Y2609; Y2608 has only 3.
My calculated Polish Concentration Index for D type is 14%, not good enough for the PCI table. Maybe a future branch of D type will do better.
E type seems equivalent to the haplogroup defined by the SNP YP569, a branch of Z92. YP569 is not particularly concentrated in Poland, but there are plenty of E type samples in the Polish Project from the region of the Historical Polish Commonwealth.
E type is well isolated in STR Haplospace, so it was confidently identified as a hypothetical clade in early 2010, well before Z92 was discovered. V. Rudich entered a modal for this cluster into Ysearch as ID MW7DP, named “North East European”. Mayka modified it slightly for the modal used here by me, GNYBG, named “Belarus”, 67 markers, which is still an excellent definition for E type in 2016. My June 2010 analysis EType.xls is still available on-line. In Jan 2016 E type had 6 YP569+ samples and no YP569- samples yet; there is only one YP569+ sample not captured by the E type definition, but that one is marginal, right at the cutoff; when a few more like this show up I’ll edit the E type definition to include them.
FH Clade. F and H types were suggested by Mayka. They have the signature (439,511,452) = (11,11,28). They differ from each other, so I could not make a combined FH type. I can make a reasonable FH cluster, but it is not necessary, since the FH clade can be better defined as the combination of the three types Fa, Fb, and H. The original F type (introduced Jun 2010) was split into Fa and Fb in Dec 2010. DYS452 is not one of the FTDNA standard markers, so not many Polish Project members have this marker evaluated. Mayka and I helped most of the Polish Project members in FH, and members just beyond FH, to get 452 evaluated. Samples beyond FH have 452=30. My analysis files do not use 452 for determination of SBP. 452 would not significantly lower SBP because most of the background near the cutoff for each type are samples from the other two. In other words, Fa, Fb, and H are very well isolated from the rest of R1a, but not so well isolated from each other. These three FH types do not seem to be specifically concentrated in Poland (per Ysearch) although they are concentrated in Slavic countries including Poland. All three types seem quite young, with relatively low STR variance (see the ASD sheets in the analysis files).
FH Borderline. The borderline samples from Fa, Fb, and H are combined into a single FH Borderline category in the Polish Project, because these clearly belong to the FH clade but have less than 80% probability of belonging to any one of the 3 types.
Fa. Ysearch YQ6D2. 66 markers, cutoff 9, gap 2. SBP = 27%. See FH clade, above.
Fb. Ysearch EFQM7. 56 markers, cutoff 5, gap 4. SBP = 23%. These samples were the original F type, before Fa was split off. See FH clade, above.
H. Ysearch 559EE. 58 markers, cutoff 7, gap 3. SBP = 14.5%. See FH clade, above.
This type was suggested to me by Mayka, who calls it the Pomeranian cluster. Pomerania is the name of the region on the south shore of the Baltic Sea including regions of both Germany and Poland. Marcin Wozniak found the G modal haplotype (at 12 markers) to be very common among Kashubians. Kashubians consider themselves an ethnic group or nationality within Poland. It will be interesting to determine if Kashubians in Poland have a higher % concentration of G type than German Pomeranians. Meanwhile, “Pomeranian” is a convenient neutral name, suggests Mayka.
G type is mentioned only briefly in my publication because not much data was available to me at that time. My GType.xls update analysis file with June 2010 data had excellent results: There are 12 samples in a nice type with SBP = 11.2%.
In Jan 2011 a new SNP, L365, seemed to include G type, which I mentioned here at this web page, based on only 5 samples at that time.
Now, Sep 2016, L365 has known branches YP389 and YP269; G type seems to be equivalent to YP389, and Gb type seems equivalent to YP269.
This type should not be confused with another G type in the N haplogroup.
13 Nov 2015 link to Stanaszek facebook page: https://www.facebook.com/R1aS18681/?fref=ts
Map showing location of S18681 samples (men): https://www.google.com/maps/d/viewer?hl=pl&authuser=0&mid=zIcwIZnt7lUg.kLMb_kZH4B3c
Update 26 Feb 2015. Lots of new SNPs have been discovered in the past few months, so my Nov rewrite is already out of date.
For the latest status check http://www.yfull.com/tree/R-S18681/.
Rewrite 9 Nov 2014:
I type seems to be about equivalent to the new SNP S18681.
In other words, if you test positive for the SNP S18681, that places you in the S18681 haplogroup, which is closely equivalent to what I have been calling I type.
Samples that match I type at 111 STR markers are coming out positive for S18681. Samples that do not match I type at 111 markers are coming out negative for S18681, with only one exception.
At 67 markers, there are three S18681+ outlier samples that do not match I type. I type is defined by STRs, so future S18681+ outliers may not match I type, and a few samples beyond the I type cutoff may come out S18681+ in the future. I have been slightly adjusting the definition of I type as more 67 marker data accumulates, so the definition has been improving with time.
At less than 67 markers the probabilities of outliers are higher. More discussion about this below.
Most but not all I type samples in the Polish Project are also coming out positive for the new SNP YP331. There are two newer SNPs, YP314 and YP315, that are located between S18681 and YP331. The most recent SNP finding is Y5973.2.
The “father” of S18681 is CTS8816, with the two “brothers” L1280 and Y2902.
The SNP sequence is R1a > Z280 > CTS3402 > CTS8816 > S18681 > YP315 > YP314 > YP331 > Y5973.2.
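Placing a sample on this sequence from its SNP test results can be sketched as follows. This is an illustration only: the function name and the label format are hypothetical, R1a is left out because it is the haplogroup at the top rather than a tested SNP on the list, and the results are assumed to be consistent (no positive SNP downstream of a negative one).

```python
# SNP path for I type as listed in the text, oldest to youngest.
PATH = ["Z280", "CTS3402", "CTS8816", "S18681",
        "YP315", "YP314", "YP331", "Y5973.2"]

def placement(results, path=PATH):
    """Label a sample by its youngest positive SNP on the path.

    results maps SNP name -> True (positive) or False (negative);
    untested SNPs are simply absent. If the next SNP downstream tested
    negative, the label is a paragroup such as "YP314+ YP331-".
    """
    positives = [i for i, snp in enumerate(path) if results.get(snp) is True]
    if not positives:
        return None
    deepest = max(positives)
    label = path[deepest] + "+"
    if deepest + 1 < len(path):
        downstream = path[deepest + 1]
        if results.get(downstream) is False:
            label += " " + downstream + "-"
        elif downstream not in results:
            label += " (" + downstream + " untested)"
    return label

# Hypothetical test results for one sample.
example = placement({"S18681": True, "YP315": True,
                     "YP314": True, "YP331": False})
```

A label ending in "untested" marks the samples for which a further SNP test would narrow the assignment.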
This recent work on new SNPs is being done by Stanaszek, Milewski, Lapinski, and Mayka. Łukasz Stanaszek has a document R1a_S18681.doc with a listing of I type samples from both projects, along with discussion of the possible origin of the S18681 haplogroup. Michał Milewski has a tree chart for Z280, which includes S18681, at the forum: http://eng.molgen.org/viewtopic.php?f=77&t=1464&start=120 Check that forum topic for the most recent update.
My definition for I type is published at Ysearch EKVHX, uses 58 of the 67 STR markers, cutoff 8, SBP 16.2%. My analysis file is available as IType.xls. My Aug 2011 definition, which used 62 markers, still works quite well, as demonstrated in that xls file.
My I type definition works OK in the R1a Project and at Ysearch, but not as well as in the Polish Project. One obvious difference is that there are only four samples in the Polish Project confirmed or predicted into the paragroup YP314+ YP331-, while this paragroup at the R1a project is almost as large as the haplogroup YP331+. This paragroup does not seem to be as concentrated in Poland as YP331+, which explains why my I type definition, tuned to the Polish Project, does not work quite as well in the R1a project. Those four paragroup samples in the Polish Project are highlighted in that file IType.xls.
SNP confirmed data: So far (14 Oct 2014, both projects) there are only 3 samples confirmed with SNP tests in the paragroup S18681+ YP315- (plus one cousin assumed). There is only one sample confirmed in the paragroup YP315+YP314-. There are only three samples confirmed in the paragroup YP314+YP331-. There are 10 samples (59% of 17 fully confirmed samples) confirmed in the haplogroup YP331+ (plus one cousin assumed). Three of those 10 YP331+ are in the new haplogroup Y5973.2, and one of them is confirmed Y5973.2-.
There are about 20 samples confirmed or predicted S18681 that need testing for the branches; we guess that most of these will come out YP314+, and most of those will come out YP331+.
In other words, the largest subdivision of I type is looking like the haplogroup YP331+. The second largest looks like the paragroup YP314+YP331-.
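The confirmed counts quoted above (14 Oct 2014, both projects, not counting the assumed cousins) can be tallied to check the 59% figure. The dictionary keys are just labels chosen for this sketch.

```python
# Confirmed SNP-tested samples per category, as quoted in the text.
confirmed = {
    "S18681+ YP315-": 3,
    "YP315+ YP314-": 1,
    "YP314+ YP331-": 3,
    "YP331+": 10,
}

total = sum(confirmed.values())

# Share of each category among fully confirmed samples, in whole percent.
share = {name: round(100 * n / total) for name, n in confirmed.items()}
```

The total is 17, and YP331+ comes out at 59% of the confirmed samples, matching the figure in the text.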
The I type samples not yet SNP tested with low step at 67 or more markers are predicted S18681 with high confidence. There are “Borderline” samples close to the cutoff for I type, and samples close to I type at <67 that might be S18681, but cannot be predicted with high confidence.
The Polish Project and the R1a Project categorize all samples and recommend which SNPs should be purchased for those interested in determining the paragroup or haplogroup for their sample. See Stanaszek (link above) for the combined recommendations.
History of I type: This name was introduced by me in my Fall 2009 publication, Part II, page 178. I named it after my Polish Iwanowicz grandfather, who carried this type. Later, I was informed that Russian web sites had been calling this STR cluster “Northern Carpathian”.
The best ranked signature marker for I type is DYS578=9. The ancestral value is 8. DYS578 has the second slowest mutation rate of the 67 standard markers per the Chandler rates. This marker is in the 37 set, but not in the 25 set. So the 37 set is a reasonable predictor for I type, while the 25 set is not. The 9’s are colored orange in that analysis file IType.xls. Three other good signatures are in the standard 111 set, but not in the 67 set: DYS463=24; DYS532=12; DYS504=14. Another fairly good signature, available in the 25 set, is DYS458=14, again orange in the file. This is a rapid mutator, so there is more variance.
DYS511>10, available in the 67 set, highlighted in IType.xls, seems to be a marker for paragroup YP314+YP331-, while the ancestral value 10 dominates the rest of S18681. However, one STR marker does not provide very confident assignments.
Previous subdivision attempts of I type: At this web page, I have reported that I type has a particular structure: I type has always formed a well isolated mountain in STR haplospace, with relatively few samples near the cutoff. However, the mountain is broad, not tall, more like a range of foothills, in the graph of number of samples vs step (STR mutation count). I type has a few bimodal STR markers. All this implies subtypes. Over the years, I have tried various subtype definitions within I type, and I have seen attempts by others on the web. But the subtypes have never provided statistical confidence; they are not well isolated in STRs.
Now we know why. Those four paragroups and that one haplogroup all seem to be old branches. It seems the S18681 tree is more like a bush. Indeed, the confirmed samples do not come out together when grouped by STRs at 67 markers. (Not many samples have 111 markers yet.) In fact, most of the confirmed samples from small paragroups seem closer using 67 STRs to samples in other groups than to members of their own group. That’s because there are many more samples in the database from other groups, and some are bound to match more closely just due to the luck of random mutations.
In other words, I type is a broad mountain in haplospace because it has many relatively old branches.
What does this mean? It’s difficult to say with certainty, but here is what I consider the simplest speculative explanation: S18681 I type originally grew quickly and prospered while other related clades died out, leaving this clade well isolated in STR haplospace. Over the years, a number of I type clades survived, all with approximately the same age.
This type was suggested by Mayka. I documented it here in June 2010, when there were only 6 members in the Polish Project, but with JType.xls at that time this type was well isolated at SBP= 13%. I entered J type into Ysearch as 743N9.
Now, Sep 2016, J type has 17 samples and seems to be equivalent to YP977, but only 4 of these have tested YP977+ (no other samples are YP977+); we need to wait for more testing for confidence in YP977 as equivalent.
I have been using the subscripts “z”, “y”, “x”, etc backwards through the alphabet because I am running out of letters for new clusters and types. These small hypothetical clades seem to be subclades of K, although I do not have high confidence about the subclade status.
Ky. Ky cluster is now called Z92y cluster.
That KzType.xls file demonstrates that the same 6 samples are extracted using any number of markers from 2 to 67, so the definition is not critical for this well isolated type.
Kz is effectively more isolated than the SBP values (row 12 in that file) indicate, because the samples just beyond Kz are all confidently assigned to other clades and types. For this reason, those SBP values are moot.
I’m using a hand edited definition, Kz59, using 59 markers, for the following reasons:
Kz is unusual in that 5 of the 6 samples have an unusual value for at least 2 markers. I highlighted these values in red in that file. Notice also the high step values for those 6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%, which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 6 samples seems to be a representative of a branch of this hypothetical clade, where each of the 6 branches has a node not much younger than the TMRCA.
Hand editing like this does introduce some selection bias, so the calculated SBP=10.7% for Kz59 is misleading (but moot). Countering the selection bias, many if not most of those 8 markers that I masked out might represent small tribal sized subclades, so future prediction of new Kz samples should work better using Kz59 with those 8 removed. Again, this is moot, because any number of markers extract the same samples.
The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out. You can see that my selection is a bit arbitrary; I could have masked less than 8, or more than 8.
ASD age using all 67 markers comes out 724 years, cell N12. ASD age using the 59 markers not masked out comes out 704 years, cell N29, not much less. ASD age has a number of caveats, and 6 samples are not significant, so this age is highly uncertain. Kz is clearly young, as haplogroups go.
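For readers unfamiliar with ASD (average squared distance) ages, here is a generic textbook sketch, not the calculation in the analysis file. It assumes one average mutation rate for all markers and a fixed number of years per generation; both numbers used below are placeholders chosen only to make the example run, and the marker names and haplotypes are made up.

```python
def asd(samples, founder):
    """Average squared difference from the founder value,
    taken over all samples and all markers in the founder haplotype."""
    squares = [
        (sample[m] - founder[m]) ** 2
        for sample in samples
        for m in founder
    ]
    return sum(squares) / len(squares)

def asd_age_years(samples, founder, rate_per_generation, years_per_generation):
    """ASD divided by the mutation rate gives generations; scale to years."""
    generations = asd(samples, founder) / rate_per_generation
    return generations * years_per_generation

# Hypothetical four-marker founder haplotype and four samples.
founder = {"M1": 10, "M2": 12, "M3": 14, "M4": 16}
samples = [
    {"M1": 10, "M2": 12, "M3": 14, "M4": 16},
    {"M1": 11, "M2": 12, "M3": 14, "M4": 16},
    {"M1": 10, "M2": 12, "M3": 13, "M4": 16},
    {"M1": 10, "M2": 12, "M3": 14, "M4": 16},
]

# Placeholder rate (0.005 per marker per generation) and 30 years per generation.
age = asd_age_years(samples, founder, 0.005, 30)
```

With so few samples the result swings widely when a single marker value changes, which is one reason an age from 6 samples is highly uncertain.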
Additional information supplied to me by Mayka: Three of the Kz type samples are from non-Polish men who suspect they have Polish male line ancestry, so it is not certain Kz type is Polish. Kit number 152824 in Kz is from a man who purchased WTY and found the new SNP L399, but that SNP appears to be private, restricted to his family. Insofar as that man recruited 3 more Kz samples into the Polish Project, Kz seems proportionally twice as large.
Kz has the prominent signature DYS459b=18. Mayka points out the additional signature DYS461=12, not one of the 67 marker set; most of the samples in Kz have been verified with this 12 value. Since the Polish Project neighbors (step at or beyond cutoff of Kz) are all assigned to other hypothetical clades, we do not know if the signature markers define a larger father clade.
L. This cluster is highly hypothetical. It is rare in Poland, but second in size to K in European R1a1. Larry Mayka suggested this cluster to me. It is a well known Scandinavian cluster. I checked it briefly, and it seems to be a “type” by my definition. However, no Polish Project sample matches at 80% probability yet, so I am not yet using it for classification here. More documentation about L will be available here when I find time to study it.
L1029 is the main branch of CTS11962. At 67 STR markers, L1029 samples can be predicted as those that fit N type (CTS11962) but do not fit Np cluster (YP515). Of course, an SNP test is preferred. L1029 has been available as an SNP test since March 2012 at FTDNA. For more discussion see the topics N type (CTS11962) and M458.
L1080. New SNP needs documentation here.
L342.2. New topic 30 Oct 2011. This SNP was recognized as a new haplogroup by ISOGG during the summer of 2011. This was an L342 haplogroup category at the Polish Project for a short time in the summer and fall of 2011, but it has been replaced by Z93, because it seems all the L342.2+ samples are also Z93+ in the Polish Project. Apparently there are very few men elsewhere in the world found to be Z93+ L342.2-.
Z93 is a more reliable SNP than L342.2, so it is recommended that men first test for Z93. L342.1 is the same mutation as L342.2, discovered earlier in the E haplogroup. L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests together are more reliable. These 4 mutations are in the same segment, which is apparently a segment that mutates relatively rapidly. Z93 is recommended as the better test for R1a samples that do not fit STR definitions of other R1a haplogroups; the Z93+ samples can do the L342.2 test. This information about L342.2 was supplied to me by Mayka.
The Z93 category has the samples that do not fit the two known subdivisions: A type and L342T cluster (next topic).
L342T. New topic 30 Oct 2011. Based on 26 Oct 2011 Polish Project data. Analysis file: L542TCluster.xls. I just noticed this cluster.
Seven samples at 67 markers fit my new 48 marker definition for L342T. There are 19 A type samples, which should all be in the same L342.2 (Z93) haplogroup, but those A samples do not fit L342T; the closest A’s are at step 8, where the cutoff is 6. There are 5 more L342.2 (Z93) samples at 67 markers, and those 5 also do not fit L342T, falling at steps 11 through 21. In other words, L342T is well isolated from the other L342.2 (Z93) samples, including the A type branch. The one background sample (STR values fit the L342T definition) and the four samples beyond the cutoff, are assigned to K type and to subtypes of K; Z280 has recently become available for K type; as those background samples get tested in the future for Z280, my L342T cluster will start looking better. Let me say that another way: a cluster should be analyzed with data from its own haplogroup, so L342T should be compared only to L342.2 (Z93) data. But there is very little L342.2 (Z93) data available, so I used the full R1a database in that xls file. That means L342T is likely more isolated than it seems right now, so it is more likely to correspond to a valid haplogroup.
Mayka pointed out to me that some of the L342T samples have Tatar ancestors. That’s why I used the “T” in the code name. Of course, Tatars may belong to only a branch of L342T; I have no idea what fraction of L342T in Poland are Tatar. And of course Tatars are expected to be a mix of multiple haplogroups.
Three of the L342T samples, with the name Muchla, are apparently a family set, so they count statistically as only one sample, reducing the current count from 6 to 4, so SBP as calculated in that xls file should be increased (not as good). This is evidence against L342T being valid.
M. Needs documentation. M type was brought to my attention by Larry Mayka, who informs me others have called this haplotype the Viking haplotype because of its concentration in northwest Europe.
Actually, the structure of the M458 tree is a bit complex, with haplogroups (Yfull, Apr 2016) PF7521, PF6188, Y2604, and others, but these haplogroups are defined by very few samples. L260 is a branch of Y2604, which along with PF6188 and CTS11962 is a branch of PF7521, the only named branch of M458. For simplicity, I just say that P type (L260) and N type (CTS11962) are the two main branches of M458.
In the Polish Project, there are only 9 samples assigned to M458+ L260- CTS11962-, neither P type nor N type, and 5 of those belong to the Ry family set, next paragraph.
Ry type: There is a family set (five samples with the same family name, very close STR match to each other) where one of them tested M458+ L260- CTS11962-. These five are clearly not P or N. Not even close in STRs. These 5 samples are now categorized in the Polish Project as “Ry type”. These were independently noticed by Lapinski, an administrator of the R1a Project, with a separate category for these in that project, also. Since 3 of these have been recruited to the family group, Ry counts as only two samples for statistical purposes in my Results Table.
N type (CTS11962) is concentrated in Slavic countries. N type is discussed in my publication, page 179.
According to Ysearch and Yhrd N type seems to be spread all around the Slavic lands and central Europe, common from East Germany to Russia. Within Poland (Polish Project database) N type seems a little smaller than P type. Worldwide, N is much larger than P, which is concentrated in Poland. N type should be properly studied in a database that is not restricted to Poland. However, there seem to be subtypes of N that are concentrated in Poland. See the discussions on N subtypes, topics below. I’ll continue to watch the Polish Project, because it will be interesting if more data provide more Polish subtypes within N. The R1a Project has more details, including rare samples that do not fall into the main branches.
My publication in 2009 introduced the names “N type” and “P type” before discovery of L260 (spring 2010) or CTS11962 (early 2013).
N is an STR type, equivalent but not exactly equal to the SNP haplogroup CTS11962, because of STR outliers (step near the cutoff). At 67 markers in the Polish Project, N type has worked quite well since 2008, because a sample close to the N cutoff usually either fits another type or cluster clearly better than N, or fits N clearly better than any other, so it gets assigned to that other cluster or to N, respectively. If it fits neither closely, that sample gets a Borderline assignment.
Using other than 67 standard markers: using 111 markers N type can be fully isolated with STRs. Using 37 markers many samples come out N Borderline; N type does not work satisfactorily for 25 or 12 markers.
Yfull (Apr 2016) roughly estimates the formation of the CTS11962 tree branch at 4500 years ago, with TMRCA at 3300 years ago.
It’s interesting to wonder why M458 seems to be composed of two main types that differ substantially in STR values (N and P are separated in haplospace). I speculate about this in the P type topic. Much of my P type discussion is also related to N type, so I avoided repeating all the details here; please read my P type discussion if you are interested in more speculation about N type.
My Type.xls files have a macro for automatically selecting the best STR markers for definitions. That file NType.xls demonstrates that any number of markers from 3 to 67 from the 67 STR set does a reasonable job of identifying N type samples. I did some manual editing in 2011 for that improved 46 marker definition, and that NType.xls file shows that definition is still very good now (Apr 2016). Of course, now that many SNP tests for branches of N type are available, we are more interested in those SNP branch test results. There are only 5 samples that have not been SNP tested near the N type cutoff at 67 markers (Apr 2016).
The signature for N type is (439,Δ389,537,413a,446) = (11,16,11,21,13). Δ389 = 389-2 minus 389-1, the second STR chain in the pair.
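As a minimal sketch of how this signature could be checked, the following Python assumes a sample is supplied as a dictionary of marker values, with DYS389 reported as the two values 389-1 and 389-2. The dictionary keys, the function names, and the example values are hypothetical, chosen for illustration; they are not Polish Project data.

```python
# N type signature from the text:
# (439, delta389, 537, 413a, 446) = (11, 16, 11, 21, 13)
N_SIGNATURE = {"439": 11, "delta389": 16, "537": 11, "413a": 21, "446": 13}


def delta_389(sample):
    """Delta389 = 389-2 minus 389-1, the second STR chain in the pair."""
    return sample["389-2"] - sample["389-1"]


def signature_mismatches(sample, signature=N_SIGNATURE):
    """Return the signature markers where the sample differs."""
    values = dict(sample)
    values["delta389"] = delta_389(sample)
    return [m for m, expected in signature.items() if values.get(m) != expected]


# Hypothetical sample, for illustration only.
example = {"439": 11, "389-1": 13, "389-2": 29, "537": 11, "413a": 21, "446": 13}
print(delta_389(example))             # 16
print(signature_mismatches(example))  # []
```

An empty list means the sample carries the full signature; a signature match alone is only a hint, and the full 67 marker definition or an SNP test is still needed for assignment.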
Over the years this N type topic had a very long speculative discussion about the STR structure of N type, and hints at possible subdivision based on STRs. All that is moot now with the flood of new SNPs due to Big Y. The near future should provide yet more SNP subdivision of N type (CTS11962). For updates, watch:
Signature STRs (19,385a,594) = (15,12,11)
I introduced this type in Jan 2011, with only 3 samples tightly isolated in STRs, and with SBP 23%, slightly more than my stated 20% limit for using the word type. Two reasons: First, the Ashkenazi names are independent evidence of a clade. Second, the N-Ashk modal haplotype differs from the N modal at 6 markers, which is evidence of a fairly old node in the N branch of the Y-DNA tree.
Mayka pointed out to me that the names seem Ashkenazi, per his experience.
I introduced this type as Nca type, because of what I had been calling the Nc signature, DYS19=15. The “a” meant Ashkenazi, but that was confusing because the samples do not match what I had been calling the Na marker. (Nc and Na were speculative, no longer documented here.)
SBP has increased now to 34% (Apr 2016), because additional samples have shown up in the gap; Nashk is not as well isolated in STRs. So my confidence is now a bit degraded that this represents a true clade. I’ll change the name to Nashk cluster if future data continues to increase SBP.
Only 2 of the 5 samples give Poland as male line origin.
The definition, unchanged since Nov 2011, uses 58 markers, cutoff 3, no samples in the gap at steps 3 and 4. I masked out CDY, because of recLOH.
One of the Nashk samples has tested L1029+. More important: the STR neighbors at step greater than the cutoff are all L1029+, so Nashk is assumed to be a branch of L1029, which is a branch of N type. No Nashk samples have been tested for Big Y, or for the known branches of L1029, so we don’t know the branch of L1029.
Ng type is very small, only 3 samples in the Polish Project, since 2010. SBP = 20%, marginally qualifying to be called a type, a rather confident branch of N type. My definition is Ng56, using 56 of the 67 STR Markers, actually 58 because DYS464 e & f are used. Signature markers DYS492 = 14 and DYS537=10. Analysis file: NgCluster.xls.
All 3 samples have Big Y results. All 3 samples belong to the SNP haplogroup YP1136. There is a 4th YP1136+ sample in the R1a Project, kit N25798, but this one does not fit my Ng56 definition, not even close, so on that sparse evidence it seems Ng type is a branch of YP1136. YP1136 is a branch of YP593, which is a branch of L1029.
All 3 give Poland as male line origin, but this type is too small to use my PCI. They differ from their common 67 STR modal haplotype by steps 4, 7, and 8, so it is possible this Ng type represents a Polish clade with a TMRCA of only a few centuries. They name 3 different ancestors from the 15th to 19th centuries.
The Ns cluster is quite small, only 5 samples in the Polish Project. SBP = 22.7%, just short of the 20% needed to qualify as a type, so I’m reasonably confident it will prove to be a valid branch of N type. My definition is Ns53, using 53 of the 67 STR Markers. Signature marker DYS446 = 12. I loaded my definition into Ysearch, ID A5NSG. There are actually 55 markers because DYS464 e & f are used, but those are not counted in the standard 67 set, so Ysearch calls it 53, as do I. Analysis file: NsCluster.xls.
I suppose Ns cluster is a branch of the SNP haplogroup YP445, because one of the Ns samples has tested YP445+, but other YP445 samples do not fit the Ns definition, not even close. YP445 is a branch of YP444, which is a branch of L1029. I tried to form an STR cluster for YP445 but could not come up with a credible signature or definition.
Ns is too small to provide evidence of concentration in Poland.
The Np Cluster is a subdivision of N type. N type seems equivalent to CTS11962. So far (Apr 2016), all CTS11962 are coming out either L1029+ or YP515+. There are few YP515 tests in the Polish Project, so far only 10 YP515+, so most of the Np cluster assignments are samples that are L1029- and also CTS11962+ or N type based on STR prediction. There are a few more YP515 results in the R1a Project, where also so far all CTS11962 are coming out either L1029+ or YP515+. In other words, I am not aware of any samples that are CTS11962+ L1029- YP515-.
Using 67 STR prediction, Np (YP515+) samples can be separated from L1029+ with better than 70% confidence in the Polish Project. I constructed an STR definition Np35 using 35 of the 67 markers. The cutoff is 2 (samples at a mutation step less than 2 are considered matches). I uploaded this definition to Ysearch, code CHFXB. My analysis file is NpCluster.xls. This definition captures 7 of the 8 Polish Project samples with YP515+ test result and 67 or more STRs. That one exception is due to a single recLOH mutation, mentioned again in the following paragraphs. This definition also captures 9 of the 11 N type L1029- samples at 67. So the net capture efficiency is 16 out of 19, or 84%. No false positive L1029+ samples are captured. Because of selection bias, future prediction accuracy should be a little less than 84%, and there is statistical uncertainty, so I estimate better than 70% confidence in the first sentence of this paragraph. I slightly modified my 2012 definition in this 2016 analysis, based on recent data; I’ll probably slightly modify CHFXB with future data.
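The capture efficiency quoted above is simple arithmetic on the two groups of known samples. A minimal Python sketch, using only the counts stated in the paragraph (the function name is my own, for illustration):

```python
def capture_efficiency(captured, totals):
    """Fraction of known samples captured, pooled over several groups."""
    return sum(captured) / sum(totals)


# Counts quoted in the text (Polish Project, 67 markers, Apr 2016):
# 7 of 8 YP515+ samples, and 9 of 11 N type L1029- samples.
captured = [7, 9]
totals = [8, 11]

efficiency = capture_efficiency(captured, totals)
print(sum(captured), sum(totals), f"{efficiency:.0%}")  # 16 19 84%
```

This is the efficiency on the samples used to tune the definition, which is why the text expects somewhat lower accuracy on future samples.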
I call Np a cluster because it does not qualify for a type. There are too many L1029+ samples at the cutoff value of 2 in my analysis at 67 markers. In other words, Np is not isolated in STR haplospace.
That NpCluster.xls file uses only N type CTS11962 data, 162 samples. Using all 1919 samples from the Polish Project at 67 markers: no samples from other haplogroups are captured by that 35 marker Np definition. At the cutoff step 2, there are only 3 samples from haplogroups outside N type (one each Z92 Credible, Z280 D type, and Z280+).
At Ysearch, using this definition CHFXB, there are proportionally more samples at step 2 (Step 2 samples / Steps 0 to 2 samples = 61% Polish Project vs 76% Ysearch). This is enough to be statistically significant, if not fully convincing. This can be taken as evidence that YP515 is rarer outside Poland; this can also be taken as evidence that there might be haplogroups common outside Poland with STR overlap with YP515. Either way, the Np definition is not quite as good at Ysearch as it is in the Polish Project.
In the Polish Project 9 of the 10 YP515+ samples give “Poland” as country of ancestry; the last gives Russia. The Np cluster is highly concentrated in Poland, per my PCI Index.
The Yfull tree (14 Apr 2016) has 5 YP515 samples, two of them in YP1182, thereby defining a haplogroup branch. The other three are in YP515*, meaning they are YP1182- and have not provided a common new SNP for any pair of these three. That means future Big Y data at Yfull should define at least 4 total branches of YP515 - a bushy node for the MRCA.
My Results Table estimates 2.0% for Np (YP515) Polish frequency.
The R1a Project has 4 YP1182+ samples, but only one of those was available in my analysis here. A second of those 4 recently joined the Polish Project after my data download. I tried but failed to develop a credible STR definition for YP1182 by adding the R1a Project YP1182+ samples to a copy of the Polish Project database. I see no obvious signature markers that would distinguish YP1182 from the other Np samples. Comment 18 Apr 2016: Milewski responds that there are now 5 YP1182 in the R1a Project, and a good signature is DYS439>11, DYS481>25 and DYS710>33.
DYS460 = 10 is best. In the Polish Project 9 of the 10 YP515+ samples have this value. Among the samples N type (CTS11962+ or predicted), with no YP515 test, assigned to Np cluster based on an L1029- result, all 14 have this 460=10 value. However, 8 of the 74 N type L1029+ samples also have this 460=10 value, and of course it is common in other haplogroups. In the Polish Project the YP515 test is encouraged for N type samples with 460=10.
CDYa = 33 is another good signature, present in 8 of the 10 YP515+ and in 9 of the 14 L1029-. One of those YP515+ exceptions has the homogeneous pair values CDYa,b = 39,39, along with DYS459 and DYS464 homogeneity, an obvious recLOH mutation, mentioned above.
The third good signature is the value 13 included in the DYS464 set, present in 7 of the 10 YP515+ and in 10 of the 14 L1029-. Again that recLOH is one of the exceptions. This signature cannot be used at Ysearch, where all the DYS464 markers are used together. My Np35 definition excludes the 464 set.
Using all 3 signature markers together, 5 of the 10 YP515+ samples are captured at step 0, 4 of them are captured at step 1, and only that one recLOH sample is missed at step 2, which is the cutoff that we use. For the 14 L1029- samples, 9 are captured at step 0, 1 at step 1, and 4 are missed at step 2. That’s 17 out of 22 = 77% capture efficiency, with the remaining 5 just missed at the cutoff 2. Considering selection bias and statistics, expected predictions might come out slightly below 70%. Among the 71 L1029+ samples, 1 is captured at step 1 and 19 at step 2; that’s the reason for using 2 as the cutoff - high false positives at step 2.
Conclusion: that 3 STR marker signature with cutoff 2 does a fairly good job of predicting YP515 samples among the N type (CTS11962+) samples. However, this signature with step 2 does not work in the full Polish Project because other haplogroups are captured, even at step 0. Within R1a, a few D type samples are captured at steps 0 and 1. Actually, that’s pretty good considering only the 37 STR set is needed.
History of Np: I introduced the Np cluster at this web page 18 Jul 2012, as a hypothetical clade, equivalent to N type with L1029-. “Np cluster” was used as an assignment at the Polish Project since Spring of 2012. Mayka introduced the “p” because it was already obvious that most Np Polish Project samples come from Poland. In Oct 2012, I included Npa, Npb, and other tentative divisions of Np in my Excel analysis files, but these are not used as assignments in the Polish Project. Exception: the Nmsv cluster has been used since 2012 as a division of Np to accommodate STR outliers with recLOH, mentioned above. I suggested “msv” because Lapinski had named this recLOH cluster “Masovian” in the R1a Project. L1029 SNP test results have been available since March 2012, but YP515 results have only been available since Sept 2014; L1029 has been tested much more frequently than YP515, so many samples are assumed YP515 based on N type with an L1029- result.
L260 has 21 known branches listed at Yfull (12 Jun 2016); for update see http://www.yfull.com/tree/R-L260/. The main branches of L260 in the Polish Project are YP414, Y2905, Y4135, and YP1337. I tried (Oct 2015) to construct STR types or clusters for a branch of P type, using SNP data to identify samples, but I did not come up with any significant signatures or definitions. Perhaps as more data accumulates it may be possible, but for now SNP testing seems needed for assignment of samples to branches of P type (L260).
P type is a major topic in my publication, Part II. P type is significantly concentrated in Poland, and in the Czech Republic. It is found at lower frequency in other Eastern European countries, and in eastern Germany. Roughly 10% of Polish males seem to carry P type Y-DNA. L260 was discovered shortly after my publication, found to be equivalent to P type, confirming my prediction that P type corresponds to a haplogroup. The L260 SNP test has been available at FTDNA since April 2010. I published an announcement about L260 in the Fall 2010 issue of JOGG.
My current definition for P type, P50, is a modal haplotype using 50 of the 67 standard STR markers. The cutoff is 9, which means all samples less than step (genetic distance) 9 are predicted P type (predicted L260+). That definition is available in the PType.xls analysis file and at Ysearch as 8U92G. That file and definition are from an analysis done in Feb 2014 using Polish Project data downloaded 20 Jan 2014.
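To illustrate how a modal haplotype with a cutoff turns into a prediction, here is a minimal Python sketch. It assumes that the step (genetic distance) is the sum of absolute differences over the markers in the definition, and that markers left out of the definition are simply ignored; my spreadsheets may treat some markers (multi-copy markers, for example) differently, so take this as an assumption. The toy modal below uses just three markers with P type values and is not the P50 definition.

```python
def step_distance(sample, modal):
    """Sum of absolute differences over the markers in the definition.

    Markers absent from `modal` are treated as masked and ignored.
    """
    return sum(abs(sample[marker] - value) for marker, value in modal.items())


def predicted_member(sample, modal, cutoff):
    """A sample is a predicted member when its step is less than the cutoff."""
    return step_distance(sample, modal) < cutoff


# Toy 3 marker modal (illustration only; the real P50 uses 50 markers).
toy_modal = {"385a": 10, "481": 25, "572": 12}
# Hypothetical sample; CDYa is not in the toy definition, so it is ignored.
toy_sample = {"385a": 10, "481": 24, "572": 12, "CDYa": 33}

print(step_distance(toy_sample, toy_modal))        # 1
print(predicted_member(toy_sample, toy_modal, 9))  # True
```

With the real definition the same comparison would run over all 50 markers, and a sample at step 9 or more would not be predicted P type.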
For an update, please see my file PType2015Oct.xls.
In that update, there are 176 predicted P type samples, with only one outlier at step 8 (the highest step of the type); that outlier tested CTS11962+, so it cannot be L260+. In addition, there are seven L260+ outliers that do not fit the type, at steps 9 through 13. So that’s 8 outliers out of 176 predicted = 4.5% rate - not bad for STR based prediction.
As data accumulates, my statistical definitions of types often change by a few STR markers. The P type definition has changed very little in the past few years. I could not significantly improve on that P50 definition in Oct 2015, so I left it unchanged. That file PType.xls has columns of various trial definitions for comparison, with indication of which STR markers are included in each. That file PType2015Oct.xls has only one column of data for P50, with blank columns where you can try other definitions.
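The outlier rate combines the two kinds of prediction error described above. A short Python sketch of the arithmetic, using only the counts from the text (the variable names are mine):

```python
predicted = 176      # samples predicted P type by the P50 definition
false_positive = 1   # inside the type at step 8, but tested CTS11962+
missed_l260 = 7      # L260+ samples beyond the cutoff, steps 9 through 13

outliers = false_positive + missed_l260
rate = outliers / predicted
print(outliers, f"{rate:.1%}")  # 8 4.5%
```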
Those files have a sheet “Haplotypes & Masks” with previous P type definitions going back to 2007.
Pawlowski originally noticed what I now call P type. My publication has more about this history. The STR isolation of P type in the Polish Project is now even more impressive than at the time of my publication due to the accumulation of more data.
Ludvik Urban pointed out to me that P type is common in the Czech Y-DNA Database. FTDNA also has a Czech Y-DNA Project. Karen Melis, administrator of the FTDNA Zamagurie Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on the border of Slovakia with Poland. I’m not sure of the concentration in Slovakia.
I added a “Ysearch” sheet to that PType.xls analysis file, with analysis from Ysearch. The Western Slavic Modal haplotype, Ysearch 28WGP, matches P type perfectly at all 50 markers used in my definition. That Western Slavic Modal uses 76 markers, but many of those are highly variable due to high mutation rate. That modal is one of the Russian site modals.
Age of P type: The Yfull tree for L260 (12 Jun 2016) estimates 4500 ybp as the formation date, and 2500 ybp as the TMRCA. In this case, the formation date is the node where the main M458 tree splits forming the two branches L260 (P type) and CTS11962 (N type). The TMRCA is the node where the L260 haplogroup splits into known branches. Subtraction gives 2000 years for the length (in time) of a smooth branch segment that includes L260 and the several phyloequivalent SNPs, spread out in time over that segment. Estimation of such ages is uncertain due to a number of caveats, and subtraction compounds the uncertainty, but this is a long estimated time with no known branches. This explains why P type is so well isolated; there was plenty of time for both SNP and STR mutations, shared by all P type samples, providing a significant STR signature. For general discussion of this point, see population bottlenecks.
In my 2009 publication I put the TMRCA of P type as 2,000 to 3,000 years ago. That estimate still stands.
Why is P type so common in Poland? One obvious explanation is a rapid population expansion in the region that is now Poland. In my 2009 publication I speculated about such an expansion perhaps less than 1,600 ybp. A rapid population expansion should provide a bushy tree. The current Yfull tree has 9 branches with formation 1750 ybp, plus 2 branches with formation 1550 ybp, plus 4 branches with formation 2000 ybp. These 15 branches together are roughly equivalent to my previous guess - a bit older as a whole. These 15 branches are roughly equivalent to historical estimates of formation of the Slavic tribes that ultimately formed the Polish nation. Four branches of L260 (still consulting the Yfull tree, 12 Jun 2016) have formation > 2,000 ybp; taking these at face value is evidence that the population expansion started slow and accelerated.
However, mutation analysis does not yet provide a definite explanation for why P type is concentrated in Poland. I suppose the simplest explanation is statistical luck. The ancestral male line leading to P type and L260, that smooth branch segment with a span of 2,000 years with no known branch nodes, is evidence that the line came very close to extinction. The MRCA - the very lucky sole male survivor of that male line to leave male descendants known to us today - may have just happened to have been living roughly 2500 ybp in the region that is now Poland.
In my publication I speculated about a migration, but I see no specific evidence yet. Evidence would be multiple new small SNP branches with formation roughly 2500 ybp, and with nodes close to the formation node for L260, and all from a unique region other than Poland.
P type Signature: DYS385a=10 is the best single STR marker for predicting P type. In that update file PType2015Oct.xls, there are 182 samples (from all of R1a at 67 markers) with this 385a=10 value; 175 of them are P type, including 6 of the 7 outliers missed by the P50 definition (above). Only 7 samples apparently not P type have this 385a=10 value. Only 6 P type samples plus one of the L260+ outliers have 385a not = 10. That one marker alone does almost as well at predicting L260 as the definition P50.
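Those counts can be summarized as two rates: how often a sample with 385a=10 is P type, and how often a P type sample has 385a=10. A minimal Python sketch, assuming the L260+ outliers are counted together with P type as the text does (the variable names are mine, for illustration):

```python
with_value_ptype = 175    # 385a=10 and P type (includes 6 L260+ outliers)
with_value_other = 7      # 385a=10 but apparently not P type
without_value_ptype = 7   # 385a not 10: 6 P type plus 1 L260+ outlier

# Of the samples with 385a=10, the share that is P type.
precision = with_value_ptype / (with_value_ptype + with_value_other)
# Of the P type samples, the share that has 385a=10.
recall = with_value_ptype / (with_value_ptype + without_value_ptype)

print(f"{precision:.1%} {recall:.1%}")  # 96.2% 96.2%
```

Both rates come out the same here only because the two kinds of exceptions happen to number 7 each.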
Two other good markers for P type are available in the 67 set: 481=25 and 572=12.
R1a. See R1a Abstract.
Worldwide, Z280 is more complex, with lots of branches. Tree: http://www.yfull.com/tree/R-Z280/.
Z92y type was suggested to me by Mayka on 21 Dec 2010. Z92u cluster and Z92t cluster were proposed in 2012. These have not been confirmed by unique SNPs as of June 2016. It is now (June 2016) clear they are not major clades, although one or more of these 3 might end up defining a very small Polish haplogroup (less than 1/2% of the Polish Project) in the future, when the appropriate SNPs are found. Z92y originally qualified as a well isolated type at 67 markers, but as more data accumulated it became marginal (not high confidence as a valid clade at 67).
At 111 STR markers, I constructed for Z92y a very good type, SBP 8.9%, with 6 samples well isolated from the rest of Z92 (37 samples from Z92 have 111 markers in the Polish Project, Jan 2016). Two of those 6 have CTS4648+ result. Only one of those two has Big Y, with YP4479+ result. The best signature marker is DYS513 = 11, which alone segregates these 6 from the other Z92 samples, although other samples in R1a have this 513=11 value. The signature (DYS513, DYS452) = (11,31) does segregate those 6 samples in R1a. (This signature also comes up in haplogroups I and J.) This signature is not available at 67 or fewer markers.
See the Results Table for the SNP branches of Z92 in the Polish Project. Since Z92 is not concentrated in Poland, not all branches of Z92 are represented; consult the R1a Project for world wide structure of Z92.
Z92y, Z92u, and Z92t were originally named Ky, Ku, and Kt, because the letter K was used for clusters and types with definitions that were not distinguishable from the R1a modal haplotype. Since then, many samples from these 3 have been tested Z92+. Other K clusters (such as Kx, Kz) have come out Z92- CTS1211+.
Z93. See A type
End of R1a Branches.
On 20 July 2010 I added the following three R1b Types to this web document (next three subtopics, L23EE, L47P, L47A).
I independently found these three by analyzing the Polish Project R1b data, but Mayka pointed out they were previously known as clusters. We judge that my analysis justifies adding them to our list of types. Since I’m using 639 samples with 67 marker data as representative of Poland, a small type clade at 1% of the Polish population would be expected to have roughly 6 samples in the database (70% confidence interval 4 to 10). These three small types are roughly 1% each.
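The expected sample count for a 1% clade follows directly from the database size. The Python sketch below also evaluates, under a simple Poisson model that I am assuming here for illustration, how likely a count from 4 to 10 would be; it is not necessarily the method behind the 70% interval quoted above, so the probability need not agree with it.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson distribution."""
    return exp(-mean) * mean**k / factorial(k)


samples = 639      # Polish Project R1b-era database size at 67 markers
frequency = 0.01   # a clade at 1% of the Polish population

expected = samples * frequency
# Probability that the observed count falls in the range 4 to 10 inclusive.
prob_4_to_10 = sum(poisson_pmf(k, expected) for k in range(4, 11))

print(round(expected, 2))  # 6.39
print(round(prob_4_to_10, 2))
```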
I’m following the current ISOGG codes for these types, which may be confusing.
Reminder: These three types are calibrated to Polish data. The definition modal haplotypes may not be optimal for other regions. If you have Polish ancestors, and if you have all 67 markers, and if you match one of these within a step distance of 10 there is more than 80% probability that you belong to the corresponding clade. Up to step 15 there is lower probability that you belong. You should test the appropriate SNPs (explained below) for higher confidence. If your ancestors are not from Eastern Europe and you are a marginal match (step distance 5 to 15) for one of these, it is not very probable that you belong to the corresponding Polish clade, because each of these types has some overlap with other clades that are rare in Poland.
L23EE. 20 Jul 2010 documentation: This type is positive for the L23 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a. This type is negative for L51, the only current known branch - R1b1b2a1 - of L23.
There are only 6 samples in the Polish Project in this type (13 Jul 2010). SBP = 10.7% using all 67 markers, which is excellent for such a small type. The cutoff is 12, but if you match at step 10 through 12 I estimate your probability of belonging at slightly better than 80%, so you really should test for the L51 SNP - a negative result would boost the probability to about 95%. In the Polish Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of the samples from step 17 to 20 are L51+. So this type is very well isolated in haplospace in Poland.
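The gap quoted here is the run of empty steps starting at the cutoff. A minimal Python sketch of that count follows; the function name is mine, and the individual step values are hypothetical, laid out only to match the description above (6 samples below the cutoff of 12, none from steps 12 through 16, and 6 samples from step 17 to 20).

```python
def gap_after_cutoff(steps, cutoff):
    """Number of consecutive empty steps starting at the cutoff.

    Returns None when no sample lies at or beyond the cutoff.
    """
    beyond = [s for s in steps if s >= cutoff]
    if not beyond:
        return None
    return min(beyond) - cutoff


# Hypothetical step distances from the L23EE modal haplotype.
steps = [3, 5, 6, 8, 9, 11, 17, 17, 18, 19, 20, 20]
print(gap_after_cutoff(steps, 12))  # 5
```

The same count applied to L47P (cutoff 7, nearest outside sample at step 17) gives the gap of 10 quoted for that type.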
On Ysearch (code CX94E) there are also 6 samples in this type (13 Jul 2010), but 3 are the same as in the Polish Project. There are 7 samples at step 12 (vs zero in the Polish Project) and only 2 of those 12 are East European - one each in Germany and Russia. That means this type is not well isolated world wide, meaning samples near the cutoff are highly uncertain. I interpret this as evidence that my definition of L23EE type is really a Polish subtype within a larger L23EE cluster.
This type has evidence of structure. A number of markers are bimodal with no obvious correlation. To me, that means there are probably at least 3 sub-clades that may become evident as data accumulates.
If you match this type closely at 37 markers I highly recommend getting the full 67, because the statistics for assignment are not convincing at 37 markers. Even at 67 markers, I recommend the L51 test; a negative result confirms membership in this hypothetical clade, and a positive result means you are not a member. We do not know the probability of outsiders matching L23EE in STR values, particularly outside Poland, so there is still a slim chance of a surprise - a close match to the definition but with L51+.
L47P. 20 Jul 2010 documentation: This type is positive for the L47 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1. This type is probably negative for L44, the only current known branch - R1b1b2a1a1d1a - of L47, but that L44 negative indication is based on only one sample so far, so it is not certain.
Mayka announced the cluster corresponding to this type on the web in March 2009.
There are only 4 samples in the Polish Project in this type (13 Jul 2010). SBP = 9.3% using 64 markers, which is excellent for such a small type. The cutoff is 7 and the gap is 10. There are no samples from step 7 to 16. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace in Polish data.
This type is very robust; the same 4 samples are selected using any number of markers from 10 to 67 with SBP <25%.
Actually, this type is even better than the SBP = 9.3% indicates, because some of the samples at step 17 and beyond have tested negative for the SNPs in the R1b trunk leading to L47 so they clearly do not belong to this L47P hypothetical clade.
Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but 3 of them are the same as the Polish Project. Ysearch has 8 samples at steps 8 to 12, so the type is not as well isolated worldwide.
The “P” in the code L47P represents my hypothesis that this type is Polish.
Members of this type should test for L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b, which is equivalent to P312, an “uncle” haplogroup, that is L47 negative. That means there may be some interference in STR matching, probably less than 10% in Polish data, but I do not know what the exact percent interference will be until more data accumulates.
See the last paragraph of L47A, next topic, for more comments.
L47A. 20 Jul 2010 documentation: This type is positive for the L47 SNP, hence this type is another hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1. I do not know yet if this type is negative for L44, a known branch of L47.
Mayka suggested the “A” code, since this type is obviously Ashkenazi, based on family names (see also Ysearch results, a few paragraphs down). I presume this one is known to the administrators of Jewish DNA projects, although I did not do the research to find a first web publication at 67 markers; I would appreciate an email of a reference to add here, even if it does not exactly match my definition. It’s OK if an international modal haplotype differs by a few markers from a haplotype determined in Poland, particularly if the difference is at markers that are bimodal, indicating subtype structure.
There are only 5 samples in the Polish Project in this type (13 Jul 2010). SBP = 7.6% using all 67 markers, which is excellent for such a small type. The cutoff is 10 and the gap is 9. There are no samples from steps 10 to 18. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace.
This type is very robust; the same 4 samples are selected using any number of markers from 30 to 67 with SBP <10%.
This type is better yet on Ysearch (code 7HB9C), with 18 samples (13 Jul 2010) for better statistics; SBP = 4.6%, which is remarkable. It might be even better with an optimized definition; I used the modal haplotype that I extracted from the 4 Polish Project samples.
This one does not seem as Polish as L47P, although those 18 Ysearch samples are concentrated in "Greater Poland" including Lithuania.
So far (see ISOGG), L47 and L148 are the only two known branch haplogroups of L48. In the Polish Project so far (20 July), no one has tested yet for L148, and all L48 so far at 67 markers are either L47P (previous topic) or L47A. SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that do not match either L47P or L47A and so belong to yet other clades. I also have not searched the web for the STR values expected for L148. (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.) All this will quickly become visible when FTDNA updates their haplotree. As of 20 Jul 2010, L48 is a terminal branch at FTDNA, so only administrators have visibility of SNP test results beyond L48, including L47 and L148. Mayka provided the SNP data that I have documented here.
Update 25 Mar 2012. Edit 1 Apr 2016.
At the end of July 2010 I added two types from the I haplogroup to this web document. I independently found these two by analyzing the Polish Project I data. Mayka informed me that they were previously known as clusters, hypothetical clades, discussed some time previously by Nordtvedt. Mayka added these two to the Polish Project web page in July 2010, on my recommendation, which was based on my SBP analysis. One is a branch of what has previously been called I2-CE, and seems to represent a Polish collection of M223 branches, so we named it M223CE type, discussed in the next topic. The other seems to be a Polish branch of I1-M253, so we named it M253P type, discussed in a topic below. I am now also using the short code names I-CE and I1a-P for these. I am now splitting I-CE into I-C, I-D, and I-E, topics below.
These types are calibrated to Polish Project data. The I1-P definition WC8JD forms a type in the Ysearch database, so it seems to be reasonably valid world wide. The I-C definition SB6YK, and the I-E definition QUXE3, are probably not valid at Ysearch for a sample with origin remote from Historical Poland, because of interference by other clades with similar STR values, particularly from Russia.
Input new topic 19 Aug 2015 by Paul Stone.
I1, defined by the SNP M253, is unique in that it has the signature STR value DYS455= 8 which is present in approximately 99% of all I1 samples. It is also useful that DYS455=8 is nearly non-existent outside of I1. I1 is unusual for the very large number of equivalent SNPs. See http://www.yfull.com/tree/I1/ for the list of about 300 SNPs that are equivalent to M253. Surely I1 must have had many male line branches in the past along that long smooth branch segment, but apparently they went extinct; it is possible a branch (a new node along this smooth segment) will be discovered in the future. Perhaps there was a population bottleneck. This is supported by the calculations done by Yfull, with the result listed in that tree reference: A formation date of I1 at 27,500 ybp but with a TMRCA of only 4,700 due to the large number of equivalent SNPs.
In the Polish Project, I1 is almost all I1a (DF29), which comprises about 6.4% (high confidence range 4% - 9%) of the Polish male population based on the statistically adjusted Polish Project data (see Results Table for update). The website http://www.ydna.eu/ lists the I1 population of Poland at 8.5% based on a sample size greater than 1,000.
I1a-P Type seems equivalent to Y6349. (Old name M253P.)
Rewrite 1 Apr 2016.
Input rewrite 19 Aug 2015 by Paul Stone.
In the Results Table (early 2016), this haplogroup is estimated at 1.4% of the Polish population. Although this is low, my PCI Table (2014) ranks I1a-P type third in Poland, because it seems very highly concentrated in Poland, and my PCI is based on a combination of frequency and concentration.
Yfull lists 17 SNPs as phyloequivalent to Y6349: Y6340 to Y6355 and Y6373. Yfull uses Y6340 as the representative name of this haplogroup. Only Y6349 is available for testing at FTDNA. At Yseq, Y6350 and Y6354 are available. See SNP ordering information. Actually, Big Y is a better test (if cost is not an issue), because more Big Y tests will further divide Y6349.
Sequence for locating Y6349/Y6340 at the Yfull tree (Mar 2015): IJK> I1> DF29> Z2336> Y3866> S4767> S7642> Y6340.
Age rough estimate for Y6349 at Yfull, based on the large number of equivalent SNPs: Formation 3,300 years before present; TMRCA 1,550 years before present.
Marek Skarbek Kozietulski has called this I1-Vistula, because most of the samples are from the basin of the Vistula river.
History: In the past, I have called this M253 P type, I P type, and I1 P type. Now that this type is clearly a subdivision of I1a, the name I1a P type seems best, or short code I1a-P. On 26 July 2011, I added this I-P Polish type for the I haplogroup to this web page. This type had already been known as a cluster for a few years. Mayka pointed out to me that Nordtvedt listed it on the web as I1*-P1, with related clusters I1*-P2 and AS4. Marek Skarbek Kozietulski has studied this cluster quite a bit, since he’s a member. I mentioned this type briefly in my publication, where I was previously calling it Y type, considering it not high confidence based on the data available then in 2009.
My analysis file for the STR type at 67 markers is I-PType.xls. That file was generated in 2012 with only 11 samples. My definition for I1a P type, from 2010, uses 54 markers, cutoff 4, gap 5, no samples in the gap from steps 4 through 8 in the Polish Project at 67 markers. SBP came out 5.0%. Marek informs me that he had identified 4 men who matched at 12 markers and actively recruited them to obtain all 67 markers and to join the Polish Project. That means only 7 of these 11 samples should be used for statistical purposes. (That 1.4% frequency in the Results Table was calculated excluding any known recruited samples.) SBP calculated on the basis of 7 samples is 8.7%. This low SBP along with that large gap of 5 was compelling evidence (in 2012 before Y6349 was discovered) that I1a-P is a clade that is isolated in haplospace. I used all 11 samples in my analysis file in order to best estimate the definition, which is also available at Ysearch as WC8JD.
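A minimal sketch of how a cutoff and gap such as "cutoff 4, gap 5" can be read off a list of step counts. The function names and the sample data are hypothetical, and the step count used here (a plain sum of absolute STR differences from the modal haplotype) is only an assumption; the counting rules in I-PType.xls may differ.

```python
def step_count(sample, modal, markers):
    """Sum of absolute STR differences from the modal haplotype over the chosen markers."""
    return sum(abs(sample[m] - modal[m]) for m in markers)


def members_and_gap(steps, cutoff):
    """Samples below the cutoff are members; the gap is the run of empty steps starting at the cutoff."""
    members = sorted(s for s in steps if s < cutoff)
    outside = [s for s in steps if s >= cutoff]
    gap = min(outside) - cutoff if outside else None
    return members, gap


# Hypothetical three-marker example (real definitions use dozens of markers).
modal = {"DYS391": 11, "DYS392": 12, "DYS447": 24}
sample = {"DYS391": 10, "DYS392": 12, "DYS447": 26}
print(step_count(sample, modal, modal.keys()))  # 1 + 0 + 2 = 3

# Hypothetical step counts for a whole project: nothing falls at steps 4 through 8.
steps = [0, 1, 2, 3, 3, 9, 12]
print(members_and_gap(steps, cutoff=4))  # ([0, 1, 2, 3, 3], 5)
```

With these made-up step counts, the first sample outside the type sits at step 9, so a cutoff of 4 leaves steps 4 through 8 empty, which is a gap of 5.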
Recent Polish Project data (early 2016) has 21 samples in Y6349 / I1a P type; 19 have Poland for male line ancestor, one Hungary, one Unknown, demonstrating that this clade is highly concentrated in Poland.
Four recent M253 P (I1a-P type) samples in the Polish Project have Big Y results with Y6349+, and a 5th sample has the SNP test result Y6349+.
A good signature for I1a-P is (391, 392, 447) = (11, 12, 24), although this signature alone is not foolproof for distinguishing I1a-P from all other I haplogroup samples. My definition gives better than 80% confidence of assignment to I1a-P for samples below step 8, and better than 90% confidence for samples below step 4.
Here is some interesting speculation for which I do not have convincing statistical evidence: Marek points out that a sample at step 4 on Ysearch is Danish, which adds to his evidence that there might be a related clade in Denmark, perhaps with a node in the I1 tree slightly older than the node for the I1a-P Polish clade. I do not know where that Danish sample falls in Nordtvedt’s tree. I do not know where that Danish sample falls in the recent flood of new SNPs.
All the I-CE samples in the Polish Project fall into one of the 3 branches discussed in the following topics.
The M223 clade is very well isolated in STR haplospace. FTDNA is able to predict I2b1(M223) with high confidence using only the first 12 standard markers, for more than 90% of the samples. Using 67 markers, I found that any reasonable definition does a good job of extracting M223 samples from Y-DNA STR data. A good definition is available on Ysearch, code 4H6C9, using 62 of the 67 standard markers plus 8 additional markers (Mar 2012).
STR isolation in the Polish Project is generally evidence of a single Polish clade. It is possible that two or more clades with distant nodes in the Y-DNA tree might have similar STR values by coincidence. In the case of Polish I-CE, since the larger I-CE world-wide clade is well isolated, my Polish I-CE type might well be a collection of multiple clades, perhaps including some clades that are not particularly concentrated in Poland. My original M223CE type used 4 of the 8 I-CE samples back in 2010. There are now 12 I-CE samples, and they form two types plus one cluster. It may seem silly to split these into 3 branches, but there are new SNPs, discussed below, that justify the split as valid haplogroups. These small types are interesting because they are preliminary evidence of small Polish clades.
CE stands for Continental Europe, including Britain.
The M223-Y-Clan project has lots of data; I used this project data for reference.
A good signature is (392, 437, 450) = (12, 14, 9), which distinguishes almost all M223 samples from others, allowing one mutation step. (594) = (11) is also an excellent signature for M223, with the value 10 dominant outside M223, but this one is strange in the Polish Project, where 4 of the 12 samples have value >11; this is evidence that I-C might comprise two clades.
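A signature test of this kind can be sketched as follows. This is only an illustration: the function name and the sample values are hypothetical, and "allowing one mutation step" is read here as at most one step of total difference across the signature markers, which is an assumption.

```python
def matches_signature(sample, signature, max_steps=1):
    """True if the sample differs from the signature by at most max_steps in total."""
    steps = sum(abs(sample[marker] - value) for marker, value in signature.items())
    return steps <= max_steps


# Signature quoted in the text for M223: (392, 437, 450) = (12, 14, 9).
M223_SIGNATURE = {"DYS392": 12, "DYS437": 14, "DYS450": 9}

# Hypothetical samples, not taken from any project.
exact = {"DYS392": 12, "DYS437": 14, "DYS450": 9}
one_off = {"DYS392": 12, "DYS437": 15, "DYS450": 9}
two_off = {"DYS392": 11, "DYS437": 15, "DYS450": 9}

for sample in (exact, one_off, two_off):
    print(matches_signature(sample, M223_SIGNATURE))
```

A signature check like this is a quick screen only; a full definition with many markers gives better confidence of assignment.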
At Ysearch, the percent Polish samples for I-M223 is low. The following 3 STR definitions, my proposed Polish branches, capture a small fraction of M223 at Ysearch.
My Excel file I-CE.xls has analysis of this type and also analysis of the following three branches. That file has ASD analysis, but ASD age is very misleading when calculated from samples that are a collection from multiple large old clades. The three branches have too few samples to attempt age estimates.
I-C type includes all 4 samples assigned to I-CE last year, plus one that was missed last year, plus 3 new ones, for 8 total at 67 markers in the Polish Project. SBP has improved from 19% to 2.6% over the past year, so this is a clade with high confidence due to the excellent isolation, although there is a chance it may be two or more independent clades as discussed above.
My Excel file I-CE.xls has analysis of this type in column CJ, SBP=2.6%. My definition uses 67 markers, cutoff 20, gap 14. There are no Polish Project samples in the gap from step 20 through 33, so this type is very well isolated. This definition also isolates I-E type, 4 samples, steps 34 to 42, but there is a better definition for I-E, see the next topic.
There are no Polish Project samples at step 43 or 44. There is only one I2b2 sample (not M223) at step 45. Then there are no further samples at steps 46 through 52. So this I-C definition also captures all of the broader I-CE (M223), although surely a better I-CE world wide definition could be constructed.
A good signature is (406, 487) = (10, 12), which itself distinguishes the 8 I-C samples in the Polish Project.
Two of the I-C samples are I-D samples, discussed below. Two other I-C samples have the same family name, very close in STR values. The remaining 4 samples in I-C are not particularly close to each other in STR values. The SNP data for each sample is included in column BX of the “Calculator sheet”; 4 of the samples tested negative for all 4 known haplogroup branches of I-M223. So I-C seems to capture M223* plus P95 (below) in the Polish Project.
My definition is also available at Ysearch, SB6YK. On Ysearch there are plenty of samples from step 20 through 33, so this definition does not work world-wide. The closest fits are not concentrated in Poland, so if I-C truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.
My Excel file I-CE.xls has analysis of this type in column CM, SBP=13%. My definition uses 67 markers, cutoff 19, gap 7. There are no Polish Project samples in the gap from step 19 through 25, so this type is very well isolated. The I-C samples are all at steps 26 to 44, so this definition also nicely separates I-C from I-E in the Polish Project.
A good signature is (393, 459a, 446) = (15, 9, 10), allowing one mutation step, which distinguishes the four P78 samples in the Polish Project.
Both the P78+ Polish Project samples are in the M223-Y-Clan Project, and there are 13 others, but there are many more P78- in M223-Y-Clan, so this is not a particularly large subdivision of M223.
The other two I-E samples in the Polish Project have not been tested for SNPs, but both have P78+ close matches on Ysearch, and no close matches from the other 3 branches of M223, so those are likely also P78+.
There are two other known haplogroup branches of M223: M379 has no positives in M223-Y-Clan, and plenty of negatives, so it is very rare. M284 has plenty of positives in M223-Y-Clan; that branch is a large subdivision with a couple known branches of its own, but no samples in the Polish Project.
My I-E definition is also available at Ysearch, QUXE3. The Ysearch closest matches are I2b1c, so my definition is good at extracting P78 samples, but I suppose a better definition could be constructed for the world-wide P78 data. On Ysearch there are plenty of samples from step 19 through 25, including some I2b1c beyond step 25, so this definition does not work world-wide. The closest fits are not concentrated in Poland, so if I-E truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.
There are only 3 samples P95+ in the M223-Y-Clan Project, and many P95-, so this is a small haplogroup. Those 3 include one but not both of the Polish Project I-D. Two of those 3 have Poland listed as origin, and the third has no origin listed, so this may be a Polish clade, but it is too soon to tell. It is possible that I-C has a larger subdivision Polish branch, of which this I-D may be a branch, but this is just speculation until we get more data.
(640) = (13) seems to be a signature for I-D, but one STR marker should not be very reliable for prediction.
I did not enter a definition into Ysearch. The two I-D samples are highlighted bold blue in column CI of I-CE.xls. Only one sample is P95+ in the Polish Project - the one that is also in the M223-Y-Clan Project, so I used that sample as the definition. There is a sample at step 10, and no others out to step 22, so I tentatively assigned that step 10 sample to I-D, forming a cluster of two samples, SBP=25%, well isolated from others but not a type.
Input new topic 19 Aug 2015 by Paul Stone.
The haplogroup defined by the SNP L22 is a major branch of I1a (DF29) and indeed a major branch of I1 (M253). L22 is concentrated in Scandinavian countries, but is also found elsewhere in Europe. In Poland L22 is about 1.5%.
The paragroup CTS6364+ L22- has the STR signature (390,385a,385b) = (20,14,14) although the signature alone is not a foolproof prediction. This paragroup has recently been divided into a large number of new haplogroups, due to the flood of new SNPs. A good place to view all these new divisions of I1a1 is http://www.yfull.com/tree/I-Z2336/.
I1-Z63. New topic 23 Sep 2015 by Paul Stone.
The Z63 SNP is estimated to have formed around 4,700 ybp. In terms of raw numbers, the Z63 population is small compared to the large CTS6364 and Z58 subgroups and is similar in size to P109 or Z73. The geographic range of Z63 is vast and stretches from Iceland to Spain and into the Balkans and central Russia. The subgroup is primarily Continental in nature with Scandinavian Z63 samples being relatively few in number. Z63 has the highest density in Central Europe based on empirical data. Inside of and throughout Poland, several different Z63 lineages are represented. SNPs downstream of Z63 found in Poland include BY351, FGC14480, L1237, PR683, S2078, S10360 and S15301 with others yet to be determined.
N-G. (N-L551). (N-G Type). Update 22 Mar 2012. Introduced on 17 Oct 2010 as “N1c1(M178)-G type”. The latest ISOGG code is N1c1d1a (L551).
Mayka suggested this one, based on a suggestion by Andrzej Bajor, from his Rurikid Dynasty Project. This type is concentrated in Lithuania, and Andrzej suggests that at least one member might be a male line descendant of Gediminas, the medieval Lithuanian Duke. Hence the “G” code.
This type has 9 samples at 67 markers very well isolated in the Polish Project with SBP = 8.9%. See N-GType.xls. The definition is also available at Haplotypes.xls and at Ysearch as RGE95, using 51 markers, cutoff 3 (samples < step 3). All but one of the N-G samples can be extracted from the Polish Project using only the signature (392, 607, 557) = (15, 14, 13).
This type should not be confused with another G type in the R1a haplogroup.
That new L551 SNP verifies our prior prediction that G type corresponds to a clade. All 9 of the predicted G type samples at 67 markers have tested L551+, and samples predicted just beyond G type are coming out L551-. Of course, there will probably be a few exceptions as more data accumulates, but so far N-G type (STR match) is equivalent to L551 in the Polish Project.
At Ysearch, N-G type is not as well isolated; the SBP is 22% with cutoff 4, due to interference by what might be a Russian clade. There are many Lithuanian samples matching my N-G definition (RGE95), including Lithuanian samples beyond the cutoff (step 3). 46% of the Ysearch samples below step 9 indicate Lithuanian origin. L551 is too new to be included in Ysearch, so this paragraph refers to N-G type as defined by STRs.
I do not know if the Polish Project N-G samples are an independent Polish sub-clade of a larger Lithuanian clade; or if the Polish Project samples are just a random sample of individuals from a larger clade(s). I have not taken the time to search other projects for STR matches to my N-G definition, or to search for more L551+ samples. Someone might inform me before I get a chance to search. Watch this topic for updates.
The age of N-G type seems to be less than 1,000 years, perhaps only 500 years. Check the “ASD” sheet in my analysis file. ASD age is highly uncertain, particularly for such a small sample, but G type has little STR variance, so surely G represents a clade younger than 2,000 years old. Isolation is evidence of an old node, with TMRCA much younger than the node. The age of the L551 mutation can be anywhere in the time span older than the TMRCA of G type and younger than the node. N-G type is well isolated in Lithuania and Poland, but N-G may have a relatively young node with those other clades world-wide with similar STR values. Those other clades can be used to better constrain the age of the L551 mutation.
N-M. (N-L591). (N-M Cluster). Update 22 Mar 2012. Mayka suggested this one also, introducing it at the Polish Project in Jan 2011, as “N1c1(M178)-M Cluster”. The latest ISOGG code is N1c1d1b (L591). Includes Mickevius (Mickewicz) descendants. Hence the “M” code. Also concentrated in Lithuania. These two, N-G and N-M, are a small fraction of the M178 clade.
I call this a cluster because it does not meet my criterion SBP<20% to be called a type. Actually, the original proposed cluster is equivalent to what I am now calling Ma cluster, discussed below. The recent new SNP named L591 is coming out with about twice as many samples, so we have adopted the “M” short code name for the STR data for L591; this larger N-M cluster is now considered equivalent to N-L591.
My analysis is available, N-MCluster.xls, 10 samples at 67 markers. My best automatic definition for N-M, column CL, SBP=25%, is 80% accurate, missing one sample that is obviously L591 and predicting one sample that came out L591-, out of 10 predicted. Actually, this result is a nice confirmation of my SBP method, because although the data has only 10% background (false positives captured by the definition), my SBP formula has an increase to account for statistical confidence; hence 25% is a better upper confidence estimate of the background for so little data. I bet as more data accumulates my best N-M definition will drift below SBP=20%, qualifying as a type. Anyway, this is moot, because L591 is a better criterion for the clade, and there is a logical distinction between the N-M cluster (samples with STR correlation) and the L591 haplogroup. My definition serves as a guide for priority for L591 testing. Testing should be concentrated near the cutoff.
Accordingly, I came up with an improved STR definition for L591, using a mask to manually adjust marker selection. I’ll still call it by the short code N-M. Column CC in that file. SBP=50%, but SBP does not matter here, because the purpose of the definition is not to discover a hypothetical clade, but to predict samples for a known clade. Most clades do not produce low SBP because most clades are not well isolated. Let me elaborate with discussion of the statistical issues for N-M:
Obvious issue: There are three N-M samples with a very rare 6 step mutation at DYS446, from 17 to 11. Without DYS446, two of these three marginally fit the N-M cluster (based on STRs). These three seem to represent a subclade of L591 with modal STRs slightly drifted since their node. I marked them as “Mb” in that Excel file. Only one of these has actually tested L591+. Another one of these is that “obviously L591” sample that I mention above, the “obviously” based on this 6 step mutation, which is almost as good a marker as an SNP. That “obviously” sample is an STR outlier at other markers, which need to be excluded from the L591 definition, assuming more samples like this will show up. This seems obvious, but it needs verification with more data over the near future.
Speculative issue: There are two other outliers, which I labeled Mc and Md. Both tested L591+. These may represent two clades with nodes only slightly younger than the TMRCA for L591, with independent modal drift. Highly uncertain. They might just be statistical outliers, due to the luck of random mutation. Again, more data will tell. For now, I adjusted the N-M definition to capture them, on the assumption that some future samples might come up with similar STR values.
Another issue: That one sample, mentioned above, fitting the M cluster very well but L591-, probably represents a clade with a node slightly older than L591, but similar STRs by coincidence; there may be other such clades. Again, this is speculative, but I adjusted my definition to exclude this one.
Statistical speculation summary: L591 does not seem very well isolated in haplospace, albeit more isolated than most young Y-DNA clades. It seems the L591 tree has nodes close to the SNP age, both younger and older.
This L591 clade seems to be concentrated in Lithuania. The evidence is Ysearch - Lithuanian concentration of the N-M cluster. L591 test data is not available yet at Ysearch. My Ysearch analysis (data in a sheet in that Excel file) is similar to the G type analysis: SBP not as good because of apparent interference from clades world-wide. Using the N-M definition at Ysearch, there is Lithuanian concentration at steps well beyond the cluster cutoff, so there seems to be a larger Lithuanian clade.
In the Polish Project, I spotted evidence of such a larger STR type, about double the size of N-M, including all the N-M samples as a sub-clade. I colored these samples green in column BX of N-MCluster.xls, using all 67 markers. I dubbed this one N-L type. That 67 marker evidence is not satisfactory because it captures a couple N-G samples. In another file, not posted on-line, I came up with a satisfactory definition for N-L; I provide it in the “Haplotypes & Masks” sheet, row 21, of N-MCluster.xls. Mayka advises me that there are two new SNPs, L1025 and L1027, that are currently candidates for a haplogroup larger than L591. We are waiting to see how those come out before introducing N-L. That N-L definition cutoff provides a suggestion of where to prioritize SNP testing.
The age of N-M (L591) comes out similar to the age of N-G type, probably less than 1,000 years; see that short paragraph in the N-G topic above. My comments about isolation of N-G in the Polish Project do not apply to L591. For N-M, it is important to exclude DYS446, because that one marker triples the age as calculated using ASD (STR variance), due to that 6-step deletion mutation mentioned above. You can see this by editing cell BV21 in my mask in my “ASD” sheet in that file. Another way to edit this is to edit the 446 value, to make the mutation count one or two, which is more representative of the age. This is a good example of one of the caveats associated with age calculation based on STR variance.
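The effect of one multi-step mutation on STR variance can be illustrated with a small sketch. The data below is hypothetical (it is not taken from N-MCluster.xls), the function name is made up, and ASD is computed here as the average squared distance from the modal haplotype, which is an assumption about the formula. No age is computed, because that would require mutation rates that are not given here.

```python
def asd(samples, modal, markers):
    """Average squared distance from the modal haplotype, averaged over samples and markers."""
    total = sum((s[m] - modal[m]) ** 2 for s in samples for m in markers)
    return total / (len(samples) * len(markers))


# Hypothetical modal haplotype and samples; one sample has DYS446 = 11 instead of 17.
modal = {"DYS446": 17, "DYS391": 10, "DYS392": 14}
samples = [
    {"DYS446": 17, "DYS391": 10, "DYS392": 14},
    {"DYS446": 17, "DYS391": 11, "DYS392": 14},
    {"DYS446": 11, "DYS391": 10, "DYS392": 14},  # 6-step difference counts as 36
    {"DYS446": 17, "DYS391": 10, "DYS392": 15},
]

all_markers = ["DYS446", "DYS391", "DYS392"]
masked = [m for m in all_markers if m != "DYS446"]  # mask out DYS446

print(asd(samples, modal, all_markers))
print(asd(samples, modal, masked))
```

Because the distance is squared, a single 6-step difference contributes 36 to the sum, as much as 36 ordinary one-step mutations, which is why masking that marker changes the result so strongly.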
N-Ma. New topic 20 Mar 2012. This is the original “N1c1(M178)-M Cluster”, explained in the previous topic. Only 3 samples when introduced Jan 2011, SBP=36%. Now there are 5 Ma samples, SBP=30%. Although still not qualified as a type, there is better than a 30% chance this will improve over the next couple years as data accumulates. Lithuanian concentration, same as N-G and N-M. Again, I do not expect validity world-wide for N-Ma because of interference from other clades world-wide, but this might grow into a nice small, young Lithuanian clade. Analysis is in N-MCluster.xls, where the 61 marker definition for Ma is in column CG.
19 Sep 2016: Update Frequency column per Results Table; update SNP column; PCI column needs update
New Topic 22 Feb 2014. Update 20 Apr 2016.
[2016 comment: This PCI discussion was written in 2014.]
I introduced a preliminary version of this index in my 2009 Publication, page 161, with results in the far right column of Table 1, page 162. In 2009 I defined this preliminary index as the percent of samples, of a given haplotype, that have the word “Poland” included in the “Origin” field for male line ancestry, for Y-DNA data at Ysearch.
In this 2014 version, I extend the index to types and clusters. In this topic, for brevity, let me use the general term “cluster” to mean a cluster, or a type, or a haplotype, or a haplogroup, or a paragroup, or a clade, or any other word for a category of Y-DNA STR data, although I make a technical distinction at the bottom of this topic.
In this 2014 version, the PCI is statistically adjusted for data sample size, as explained a few paragraphs below.
There are a number of reasons why % “Poland” from Ysearch is not an accurate measure of concentration in Poland. I mention some of these reasons in my 2009 Publication. I plan to add a longer discussion to this web page, with more detail about objections to statistics from Ysearch.
However, the objections are not a very serious problem if we are interested in relative concentration in Poland. Comparing clusters, we expect the clusters with higher % Poland at Ysearch to likely be more concentrated in Poland than those with lower % Poland. This is one reason I do not use a % sign for PCI. The other statistical reasons are explained below in this topic.
Caveat: You may use my PCI for another country, for example England, with a new “ECI” related to % “England” at Ysearch. It would not be valid to compare the PCI numbers to the ECI numbers, because we do not expect equal joining probabilities. Men with English male line ancestors do not necessarily join Ysearch in proportion to men with Polish male line ancestors. However, we do not expect significantly different joining probabilities for men with Polish male line ancestors in different clusters. There are exceptions, which I leave for future expansion of this discussion here (for example Polish Ashkenazi clusters).
The problem of false positives: We expect “false positives” - clusters with high % Poland just due to statistical probabilities (the luck of how many Polish and non-Polish men that would fit that particular cluster just happened to join Ysearch). The more clusters we study the more false positives we expect to find. The more clusters we study the more likely we will find a false positive that seems very highly concentrated in Poland.
For small samples of data, the statistical uncertainty is larger, so we expect more false positives. Suppose we check a large number of clusters for % Poland at Ysearch (or at any database), and suppose many of those clusters have fewer than 10 samples at Ysearch, and suppose some of those clusters have fewer than 5 “Poland” samples. We will surely find false positives. I discuss this sample size uncertainty in my 2009 Publication, where I used the lower bound of confidence range as a method to compensate for this statistical problem, particularly in small clusters.
My PCI uses the lower bound of the 95% confidence range. For example, a PCI = 25 means 95% confidence that the “true population % Poland” is 25% or greater. By “true population % Poland” I mean the % value that would show up in a much larger database drawn from the same population in the same way (in the distant future at Ysearch, for example, if Ysearch is still popular in the distant future).
If you are knowledgeable about statistical methods, you may quickly understand the details of my method from my file Ysearch.xls; check the “Summary” sheet and the “Instructions” sheet. There is a technical statistical explanation of PCI near the center of the “Instructions” sheet. Check the other sheets for specific cluster results. My automatic procedures use macros; if you are concerned about macros your browser should allow you to open my file in “View Only” mode.
Even if you are not knowledgeable, you might try following the “Instructions” sheet to evaluate your own clusters.
The “Summary” sheet in the Ysearch.xls file has 5 example rows labeled “50% Tests”: one row shows that a cluster with 100 “Poland” samples out of 200 Total has 50% Poland, and has a lower 95% confidence limit of 44.8%, so the PCI is 45. However another row shows that a cluster with 5 “Poland” samples out of 10 Total also has 50% Poland, but has a lower 95% confidence limit of 25.3%, so the PCI is only 25. In other words, smaller clusters get more downgrading to compensate for small sample statistics, but small clusters are allowed.
This is the main reason I leave out the % - to avoid confusion - probability vs confidence - PCI is a mix of both. PCI allows small clusters to be compared to large clusters, where the smaller clusters get adjusted to a lower index number.
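The sample-size adjustment illustrated by the “50% Tests” rows can be sketched with a one-sided lower confidence bound on a proportion. The sketch below uses the Wilson score bound with z = 1.645 (one-sided 95%); that choice is an assumption made for illustration, the function name is hypothetical, and the formula in Ysearch.xls may differ, so the printed values need not reproduce the 44.8% and 25.3% figures quoted above.

```python
import math


def wilson_lower_bound(poland, total, z=1.645):
    """One-sided Wilson score lower bound for the proportion poland / total."""
    if total == 0:
        return 0.0
    p = poland / total
    denom = 1 + z * z / total
    center = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (center - margin) / denom


# The two 50% examples from the text: same observed % Poland, different sample sizes.
for poland, total in [(100, 200), (5, 10)]:
    bound = wilson_lower_bound(poland, total)
    print(poland, total, round(100 * bound, 1))
```

Whatever the exact formula, the qualitative behavior is the same: both clusters show 50% Poland, but the smaller cluster gets a noticeably lower bound.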
There is another issue, selection bias: in defining clusters, we choose the number of markers, and the cutoff, to best define the cluster. The cutoff should be adjusted to best capture Poland samples, as demonstrated in the example sheets in Ysearch.xls. Selection bias effectively reduces confidence, because we may be selecting parameters based on statistical flukes.
Those “objections to Ysearch”, mentioned above but not listed, also reduce confidence. Many of the objections cause variation of results, with more variation expected for smaller clusters.
So my net confidence is not 95%, but less. This is my style here: calculate PCI using 95% sample size confidence because sample size confidence can be easily calculated in an Excel sheet. All those objections, including selection bias, reduce confidence, with larger effect expected for smaller clusters, but smaller clusters get more reduction in PCI. I suppose my net confidence is about 80%, although this estimate is based on subjective experience - difficult to document with words. I expect about 80% of my PCI predictions to slowly increase over the years as more data accumulates at Ysearch, with smaller clusters increasing faster in PCI. I expect roughly 20% of my PCI predictions to drop over the years, meaning roughly 20% overestimation of Poland concentration confidence. On the other hand, this PCI method automatically ignores small clusters with lesser concentration in Poland, many of which will significantly increase in PCI as data accumulates.
My Excel sheet automatically removes “Modal” Ysearch data, and excess “family set” data, as explained in the “Instructions” sheet, where the user is invited to manually edit the results, because human judgment is an improvement over automatic editing.
The Polish Project is representative of the historical Commonwealth of Poland, a geographic area much larger than modern Poland. PCI is weighted toward men with ancestry self-described as from Poland. One of these days I might add here a discussion topic about this complex topic. Check the web if you wish discussion now.
My PCI index can be used for haplogroups, but technically I do not do this, because the haplogroup assignments are not up to date at Ysearch. I like Ysearch because of the huge amount of STR data at 67 markers. Many of my STR based types are “almost equivalent” to SNP based haplogroups. For example, P type is almost equivalent to L260. “Almost equivalent” means a few haplogroup samples are STR outliers and a few samples from other haplogroups marginally fit the type just below the cutoff. In my tables this technical distinction between P type vs L260, and other equivalents, may not be obvious to you.
Update rewrite 21 Sep 2015. Edit 5 Jul 2016.
Click on the Search for Genetic Matches tab to search for Ysearch members closest to you in STR values.
Ysearch has a Research Tools tab to click, where you can type in other User IDs for detailed comparison to your data.
Ysearch does not keep up very well with new SNPs, so this is not the best place to find out your location in the Y-DNA tree. It is better to join an FTDNA project, where the administrators help to figure out your terminal branch in the tree. However, Ysearch has a lot more data, so you might find closer matches here.
My Definitions. I have entered a number of definitions into Ysearch, for the types of interest to me. These are modal haplotypes; they do not correspond to any real person. These definitions use only some of the STR markers, so it may be misleading if you seem to match one of them closely. Issues: These are based on selected markers from the 67 marker set, so you need to have the full 67 marker data. If you are a perfect match (Ysearch reports Genetic Distance = zero) then it is highly likely you belong to that type. If your match is Genetic Distance 1 or 2, it is less likely but still a good bet you belong to that type. If your match is 3 to 10, it depends on the type; some types are more restrictive than others; you need to read the documentation about that type, so search for it in this web page.
Examples: P type is 8U92G at Ysearch; I type is EKVHX.
29 March 2010 correspondence: I mentioned Russian sites for R1a clusters in my publication. It’s not easy for me to figure out which of those clusters correspond to my types. Mayka worked out a correspondence on 29 March, warning me that the correspondence is not exact. Some of the Russian clusters are broader than my types; some are narrower. Here are Mayka’s findings:
My Type code vs Russian cluster name:
A Ashkenazi Jewish
B Western Eurasian
C Old European
D Baltic - Carpathian
E Northern Eurasian
F Central Eurasian
G Northern European
H Western Carpathian
I Northern Carpathian
N Central European
P Western Slavic
19 Sep 2010 update: A nice tree display of the Russian subdivision of R1a is at http://www.r1a.org/. Robert Sliwinski brought this site to my attention.
My opinion: R1a cannot be highly subdivided with confidence based on STR data. This web site of mine is dedicated to estimating the confidence of each type that I study. I try to indicate which types are speculative. Even for the types with high confidence, the location of the nodes in the R1a tree will be uncertain until corresponding SNPs are discovered. These Russian clusters, apparently by Klyosov, have plus / minus values for accuracy of TMRCA ages that are far too small, because there are serious caveats associated with systematic statistical uncertainties.
Edit 10 Jun 2016.
Update: I published my “Mountain Method” for STR analysis in 2009, back when STR analysis was important because SNP testing was expensive and new SNPs were rare. Now that SNPs are relatively inexpensive, STR analysis is not as important. I still find STR analysis useful because there are still plenty of data on the web for samples with STR data, but with insufficient SNP data.
Here is a summary of terms (in boldface) that I defined for my “Mountains in Haplospace” method. For more explanation, see my publication. By haplospace I mean multidimensional sets of STR values; each haplotype is a point in haplospace.
This topic is about STR analysis, but restricted to Y-DNA genetic genealogy. For a more general introduction consider wiki STR Analysis.
Men submit their Y-DNA data to various web sites. There are lots of STR data available on the web. A cluster is a set of samples with similar STR values. Men are divided into STR clusters as hypothetical subdivisions of the haplogroups, based on similarities of STR values. All such clusters are hypothetical. Some will be validated by new SNP discoveries. There are various statistical methods for estimating the confidence of STR clusters. I published a method that I developed. That publication has references to other methods.
A cluster qualifies as a type if the graph of step frequency (number of samples at that step) vs step looks like an isolated mountain. The step is the genetic distance (mutation count) from the modal haplotype of the cluster. I use the method of Ysearch to calculate step. The cutoff is the next step just beyond the mountain. A good type has low step frequency in a “gap” of step values including the cutoff (only the cutoff for a gap of 1). A good type forms a mountain at step values less than the cutoff, separated by a gap from the rest of the database (the upstream father haplogroup usually) at higher step numbers.
In other words, I use the word “type” for a cluster with high quality, where quality is estimated on the basis of STR isolation. Of course the ultimate measure of quality is when a cluster or type is eventually shown to be equivalent to a newly discovered SNP. Before such an SNP is discovered, a well isolated type provides confidence that it corresponds to a future SNP haplogroup. After such an SNP is discovered, an equivalent type or cluster is used for STR prediction of haplogroup, for samples with STR data but insufficient SNP data. Eventually, we expect most well isolated types to have multiple phyloequivalent SNPs discovered; isolation should lead to both a unique set of STR values and a unique set of SNPs. See also the discussion about smooth branch segments.
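The step frequency graph and cutoff described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Excel tool. The step here is a plain sum of absolute marker differences, whereas Ysearch handles some markers differently. The `find_cutoff` rule (climb the first mountain, then walk down until the frequency stops falling) is a hypothetical stand-in for judging the graph by eye. The frequencies are made up.

```python
from collections import Counter

def step(haplotype, modal):
    """Genetic distance as the sum of absolute marker differences.

    Assumption: a plain stepwise count. Ysearch handles some markers
    differently, which this sketch ignores.
    """
    return sum(abs(a - b) for a, b in zip(haplotype, modal))

def step_frequencies(samples, modal):
    """Number of samples at each step from the modal haplotype."""
    return Counter(step(s, modal) for s in samples)

def find_cutoff(freq):
    """Hypothetical rule: climb the first mountain, then descend to the gap."""
    last = max(freq)
    s = 0
    while s < last and freq.get(s + 1, 0) >= freq.get(s, 0):
        s += 1  # climb toward the top of the mountain
    while s < last and freq.get(s + 1, 0) < freq.get(s, 0):
        s += 1  # descend until the frequency stops falling
    return s

# Made-up step frequencies: a mountain at low steps, a gap,
# then the rest of the database rising at higher steps.
freq = {0: 3, 1: 8, 2: 12, 3: 7, 4: 2, 5: 0, 6: 1, 7: 5, 8: 14}
cutoff = find_cutoff(freq)
in_type = sum(n for s, n in freq.items() if s < cutoff)
print("cutoff:", cutoff, "samples in type:", in_type)
```

Real step frequencies are noisier than this example, so a simple rule like this one can stop at a minor dip. It is meant only to show what "mountain", "gap", and "cutoff" refer to.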
The Statistical Background Percent (SBP) is an objective measure of the quality of the type. Low SBP is taken as evidence that a type corresponds to a clade that is equivalent to a haplogroup defined by an SNP (perhaps not yet discovered). Larger types with lower gaps have lower SBP. SBP is intended as an estimate of the background percent of samples in a type that really do not belong to the corresponding hypothetical clade because they are outliers from other clades. An outlier is a sample that has very unusual STR values due to the luck of mutations. SBP is also intended to account for the estimated percent of samples from small foreign clades that just happen to have the same STR values but are not closely related to the type. (Actually, an individual STR sample represents a clade insofar as fathers, sons, brothers, and cousins should have almost the same STRs.) SBP is approximately the probability that a sample with STRs matching the type does not belong to the corresponding clade, but SBP is adjusted for the confidence interval. Small sample counts have wide confidence intervals. So larger types (more samples) automatically get lower SBP. For a valid clade, SBP should decrease with time as data accumulates in a database. A very well isolated clade will have a low SBP even with only a few samples. SBP < 5% is very rare - a very well isolated type, very likely to be a clade. SBP < 20% is good enough to be announced as a type on the web. SBP > 20% is a cluster worth watching as data accumulates with time. I avoid using the word type for SBP > 20%. SBP > 50% is not statistically meaningful although such clusters might improve as data accumulates. The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look like mountains. 
The number of markers in the definition should be chosen to provide as small an SBP as possible; my Excel tools provide automatic ranking of markers as an aid; human judgment can be used to include or exclude markers with obvious problems. A signature is a small set of markers that rank best, convenient for publication of a type, and for simple demonstration of the correlation of STR values.
I use the word “type” to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes that differ from the modal haplotype by step less than the cutoff. The definition of a type is the modal haplotype plus cutoff. The definition uses only those STR markers that provide the lowest SBP, but the definition uses as many STR markers as possible if there is a tie. The definition of a valid type may change slightly as data accumulates.
I use the word clade in general, meaning a Y-DNA clade that may or may not be a defined official haplogroup. All types have associated hypothetical clades, but most clades cannot be isolated as types with low SBP. The modal value for a marker is the most common value in the cluster. The modal haplotype is the set of most common values, usually the most common haplotype in a cluster. Many people use the adjective “modal” as a noun, meaning “modal haplotype”; so do I; I tried to avoid that in this web document.
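A short Python sketch of the modal haplotype as defined above: the most common value at each marker, taken marker by marker. The marker values are made up, and breaking ties toward the smaller value is an arbitrary choice for this sketch.

```python
from collections import Counter

def modal_haplotype(samples):
    """Most common value at each marker position across the samples."""
    modal = []
    for values in zip(*samples):  # one tuple of values per marker
        counts = Counter(values)
        top = max(counts.values())
        # Ties are broken toward the smaller value (an arbitrary choice).
        modal.append(min(v for v, c in counts.items() if c == top))
    return tuple(modal)

# Four made-up haplotypes with three markers each.
samples = [(13, 25, 16), (13, 25, 15), (13, 24, 16), (14, 25, 16)]
print(modal_haplotype(samples))
```

Because each marker is handled separately, the result need not equal the haplotype of any real sample, although in this example it does.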
Most of my types have been validated in the past few years by newly discovered SNPs that seem equivalent. I say “seem equivalent” instead of “equal” because of two distinctions between types and haplogroups:
Outliers: First, types are defined by STR correlations, and outliers are expected because of the random luck of STR mutations. Even if a type is very well isolated, with all type samples coming out positive for a newly discovered SNP, and with no samples negative for that SNP fitting the type, eventually outliers are bound to show up as new samples provide more STR and SNP data. We expect to eventually discover a few outliers that do not fit the STR type but are positive for that SNP, and we expect to eventually discover outliers that are negative for that SNP but fit the STR type.
Better Equivalents: Second, as new branches are discovered, it is possible a better equivalent SNP might be discovered, slightly younger than the original equivalent, leaving out one or more of the oldest branches of that original equivalent haplogroup, where those oldest branches might have many outliers that do not fit the type. Conversely, a better older equivalent SNP might be discovered, including a few branches with nodes older than that original node, where those older branches might have most of the original outliers that fit the type.
The rest of this topic provides discussions and more definitions that are not part of my Mountain Method. These are discussions and terms that I use often, so I provide them here for easy link reference from my web pages. Some of these terms are not common in genetic genealogy. Some of these I do not recall seeing used in genetic genealogy documents at all, so they might be my inventions, although I suppose other writers may have used these terms with similar meaning. Some of these are common but I do not use them often.
A bimodal marker has a second STR value with many samples - more than expected statistically - in addition to the most common modal value. A multimodal marker is possible if there are more than two common values for the marker and if those common values are not distributed more or less symmetrically on both sides of the most common value. (A Bessel distribution is statistically expected for a low fraction of random independent mutations at an STR marker. A Bessel distribution is close to a Gaussian distribution for a high fraction of independent mutations. A Bessel for a low fraction looks like a tent; a Gaussian looks like a bell.) Step up mutations are more common than step down for short STRs, so for example a modal 8 plus a few more 9 values than 7’s does not necessarily mean the 9’s are statistically significant; experience helps to judge. RecLOH and other issues at compound markers also cause confusion in this regard. A bimodal marker is a hint that there may be a clade associated with that 2nd value, so genetic genealogists study clusters defined by one or a few STRs with such bimodal 2nd values. The main modal value also sometimes makes a good signature at a bimodal marker. In other words, a set of values using one or more bimodal or multimodal markers makes a good signature for a hypothetical cluster.
In the past, I have sometimes called such clusters hypothetical types. I now prefer to reserve the word type for < 20% SBP. Sometimes I make exceptions above 20%, for example when a cluster is regionally concentrated, or associated with an ethnic group.
I had sometimes used “bimodal marker” for that second STR value, but I try to avoid that confusion. It’s the STR marker that is bimodal, with two common values.
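The tent and bell shapes mentioned above for the distribution of values at one marker can be sketched in Python. This sketch assumes a symmetric single-step model in which step-up and step-down mutations along a male line are independent Poisson counts with the same mean `m`. The net offset from the founding value then follows a Skellam distribution, which is written with a modified Bessel function. The parameter `m` and the function names are hypothetical, and the model ignores the excess of step-up mutations noted above.

```python
import math

def bessel_i(k, x, terms=40):
    """Modified Bessel function of the first kind, I_k(x), by its power series."""
    return sum(
        (x / 2) ** (2 * j + k) / (math.factorial(j) * math.factorial(j + k))
        for j in range(terms)
    )

def offset_probability(k, m):
    """P(net offset = k) when step-ups and step-downs are independent
    Poisson counts, each with mean m (a Skellam distribution)."""
    return math.exp(-2 * m) * bessel_i(abs(k), 2 * m)

# Few mutations (m = 0.1): sharply peaked at zero offset.
# Many mutations (m = 5): spread out, close to a bell shape.
for m in (0.1, 5.0):
    row = [round(offset_probability(k, m), 4) for k in range(-4, 5)]
    print(m, row)
```

A marker whose observed counts show a second value well above what such a single-peaked shape allows is a candidate bimodal marker.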
There is no known way to calculate the % confidence that a cluster corresponds to a clade, but an experienced genetic genealogist can roughly estimate confidence based on experience. I developed SBP so that 100% minus SBP expresses my confidence, but only for clusters with less than 30% SBP; SBP breaks down around 50%. I avoid publishing clusters in which I estimate less than 50% confidence, although I may mention some as speculative, particularly if they have been announced by others.
Not all Y-DNA STR data separates into types because the distribution of STR values tends to be continuous. For insights into why types form, please see my discussion of extinction and population bottlenecks.
A main branch of the Y-DNA tree is old, with data on the web for thousands of samples that belong to it, and with many known further branching divisions. I like to use the word twig for a small young branch of the Y-DNA tree. A terminal branch is a smallest known division of the tree; a terminal branch might be a terminal haplogroup, or a subdivision of a terminal haplogroup - a type or a hypothetical cluster. A terminal branch at one web site might not exist at another web site; a terminal branch might be very small (one or only a few samples) or very large (many samples).
(Sometimes more than two branches split off at one node, but we expect future SNPs might usually resolve that node into multiple nodes, each branching into two branches.)
Some genetic genealogists use TMRCA as the age of the corresponding haplogroup (or type, or cluster, or branch or clade). I often do; it’s usually good enough. But there is a technicality that causes confusion:
An SNP is probably older than the TMRCA of the haplogroup it defines, because there are usually many generations between old nodes, due to the statistical pruning of the Y-DNA tree (discussed above in the definition of segments of tree branches and in the paragraph about extinction). The probability is very low that an old SNP mutation happened in exactly the same generation as the TMRCA. (An exception would be a recent private SNP found in an extended male line family.)
Conceptually, we might prefer to consider the age of the SNP as the age of the corresponding haplogroup. But there are usually multiple phyloequivalent SNPs for a haplogroup, and of course, they differ in age. Methods to estimate TMRCA age do not provide distinct ages for all those SNPs.
A third stipulation of the age of a haplogroup might be the age of the previous known node. Then a haplogroup would include the male descendants of a MRCA plus his male ancestors in the immediately older known segment, but this is opposed to the traditional idea of a Y haplogroup being the male clade of descendants of the man who experienced the mutation for the corresponding SNP.
The Yfull Tree solves this confusion by using two ages, the TMRCA and the older “formed” age, which is the TMRCA of the previous known node. Yfull estimates age by analysis of the number of SNPs per segment. Before about 2013 or so, most Y clade age estimates were based on STR distributions, and assumed STR mutation rates. Any method of age estimation has serious caveats. Most of my xls on-line STR analysis files have a sheet that estimates age from STRs in various ways, but I’m not including that sheet in current analyses because the Yfull site does an adequate job of estimating age.
I call the segments between nodes smooth branch segments, where there are no known nodes in that segment of the Y-DNA tree. A long smooth segment in the Y-DNA tree is one way to visualize isolation in haplospace. A type, because it is isolated, probably has a long smooth segment older than the MRCA of the type, with more than the usual number of phyloequivalent SNPs. A smooth segment is necessarily a statistical estimate, because the number of SNPs is influenced by the luck of statistics.
Edited 23 Jan 2015:
Sometimes one individual recruits male line relatives to submit data to a Y-DNA database, for example to the Polish Project. I call these family sets. I count these together as one sample when compiling statistics on frequency. By statistics on frequency I mean the number of samples per clade. By clade I mean a haplogroup or type or cluster. I do this adjustment for family sets because otherwise a small clade might get reported as too large.
My Results Table page is an example of statistics on frequency where I adjust for such recruitment bias.
I do not discourage such recruitment; it is a great research technique. I recruited my third cousin. I don’t mind the effort of adjusting for such recruitment bias.
My adjustment method: I sort databases by name, and automatically flag name repetitions. Then I examine the flagged data to see if the STR data is a very close match, which is a sign of recruitment. Often I make contact by email when it is not obvious if the samples have been recruited. Actually, even with email discussion, the actual correction may not be obvious; for example it may be difficult to say if a particular recruited distant relative may have later joined the project anyway independently, in which case he should be counted. So I may estimate 2 or more effective “independent” samples for some family sets.
I also sort data by close STR matches and look for evidence of recruitment. Recruitment can also be by close STR matches even with different family names. For example, I recruited a man with a last name different than mine, where his daughter noticed that his 12 marker data (at ancestry.com) matched my 12 marker data (at familytreedna.com). I paid for his 111 marker data which I submitted to multiple projects. We match STRs very closely at 111. I determined that his male line ancestors lived in Poland only 10 miles away from the village where my ancestors lived. We are obviously distant male line relatives. I don’t count him in frequency data because I recruited him.
I also sort by email address, again looking for samples with very close STR matches. Most samples that have the same email address are not in the same main haplogroup branch, even when the family name matches, which means most recruitment by family name brings in samples from different male lines. I do not adjust for these, because I assume the recruited samples fall randomly into haplogroups according to frequency in Poland, so such recruited data is OK. It is difficult to judge what to do when a pair of recruited samples are in different terminal branches that branch from a common larger branch.
Sequential kit numbers, or nearly sequential, are additional evidence of recruitment.
I’m not trying to make perfect adjustments. I’m mainly trying to catch all the large family sets. I don’t bother people with emails about sample pairs that may or may not be due to recruitment; I make my own judgment. If I miss a few pairs, or if I discount a few pairs that are really independent, that just adds a little noise to the frequency data.
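The name-sort part of this adjustment can be sketched in Python. This is only the automatic flagging step, under stated assumptions: samples with the same family name and a step of at most `max_step` are collapsed into one family set. The threshold `max_step`, the function names, and the sample data are hypothetical, and the step is a plain sum of absolute marker differences. The email contact and human judgment described above are not modeled.

```python
from collections import defaultdict

def step(a, b):
    """Plain stepwise genetic distance (an assumption for this sketch)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def effective_count(samples, max_step=2):
    """Count samples, collapsing each family set to one sample.

    samples: list of (surname, haplotype) pairs.
    Same surname plus a step of at most max_step (hypothetical
    threshold) is treated as one family set.
    """
    by_name = defaultdict(list)
    for surname, hap in samples:
        by_name[surname.lower()].append(hap)
    count = 0
    for haps in by_name.values():
        representatives = []  # one haplotype kept per family set
        for hap in haps:
            if not any(step(hap, r) <= max_step for r in representatives):
                representatives.append(hap)
        count += len(representatives)
    return count

# Made-up data: three close Kowalski samples form one family set;
# the fourth Kowalski is a different male line with the same name.
samples = [
    ("Kowalski", (13, 25, 16, 11)),
    ("Kowalski", (13, 25, 16, 11)),
    ("Kowalski", (13, 25, 17, 11)),
    ("Nowak", (13, 25, 16, 11)),
    ("Kowalski", (12, 22, 14, 10)),
]
print(len(samples), "raw samples,", effective_count(samples), "effective samples")
```

Sorting by email address or by kit number could be added in the same way, by grouping on that field instead of the family name.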
New Topic 29 Mar 2015:
The Yfull Tree includes ages of SNP nodes calculated from the number of SNPs in segments between nodes, using an average SNP mutation rate. These serve as reasonable estimates for the ages of the corresponding haplogroups, although there are caveats, next topic. Yfull calibrates SNP mutation rate to a very old haplogroup, of assumed age. Yfull does not document the details of their method; if they mix SNPs found by different methods (as most people do) that would introduce an inconsistency. These Yfull ages are consistent insofar as they are calculated from SNPs as found by a single consistent method, although there still are caveats, next topic. Other methods may come up with different numbers of SNPs, and different rates, so the Yfull ages may not be the same as those calculated by others.
Rewrite 29 Mar 2015. Edit 10 Jun 2016.
There are several biases involved, and I mention some of them in this topic about age caveats. I don’t emphasize ages in my web pages because of the uncertainties, but I occasionally discuss rough age calculations.
With STRs, people generally use ASD (to account for back mutations) calculated for each marker, then average the markers. I provide “ASD” sheets in my type files with a simple ASD method for calculating age, but again, I consider this a rough approximation. I provide an introduction to STR age calculations in my Fall 2009 Publication.
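A rough Python sketch of an ASD age calculation of this general kind. It is not the calculation in the “ASD” sheets, which is not reproduced here. It assumes a symmetric single-step mutation model, under which the expected squared distance from the founding value grows in proportion to mutation rate times generations, back mutations included. The modal value stands in for the founding value. The mutation rates and haplotypes are made-up numbers, and the function name is hypothetical.

```python
def asd_age_generations(samples, modal, rates):
    """Average over markers of (average squared distance / mutation rate).

    samples: list of haplotypes (tuples of marker values).
    modal:   modal haplotype, used as a stand-in for the founder.
    rates:   assumed mutation rate per generation for each marker.
    """
    ages = []
    for i, (modal_value, rate) in enumerate(zip(modal, rates)):
        asd = sum((s[i] - modal_value) ** 2 for s in samples) / len(samples)
        ages.append(asd / rate)  # age in generations from this marker
    return sum(ages) / len(ages)

# Made-up example: four samples, two markers, hypothetical rates.
samples = [(13, 25), (14, 25), (13, 24), (13, 25)]
modal = (13, 25)
rates = (0.002, 0.004)
print(asd_age_generations(samples, modal, rates), "generations")
```

Converting generations to years needs a further assumed number of years per generation, which adds one more uncertainty to those discussed in this topic.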
Some publications use a mutation rate from father-son data. This method yields too high a rate insofar as somatic mutations (mutations in the cell lines leading to the test - cheek cells, for example) are included, so the calculated ages are too young.
Chandler published a method for accurately determining relative STR mutation rates, and calibrated the first 37 standard STR markers, to father-son data. Extensions of Chandler’s 37 to more markers are available on-line, but without explanation.
With a fixed mutation rate, older haplogroup nodes tend to come out too young. This is due to the structure introduced by the “pruning” of haplogroup branches that go extinct. To compensate, some people use adjustment factors for older haplogroup nodes; some people use calculation algorithms that are mathematically equivalent to a mutation rate that decreases with age.
Almost all DNA damage is repaired by various cellular mechanisms, so the “mutation rate” measures only the damage that is not repaired. Repair mechanisms vary from person to person due to variations in minor damage to the repair mechanisms. An ancestral line mutation rate depends on the probability of a few ancestors with much higher than normal mutation rate, so mutation rate varies more between ancestral lines than due to simple random number statistics.
It is important to calculate the +/- confidence range for data based on very few samples, using standard Poisson statistics for a small number of samples. Many reports of age calculations include such confidence range. For large samples, however, this is misleading. The confidence range comes out small for large samples, with a small +/- on the age. The various age caveats, however, provide a much larger uncertainty due to bias. In other words, with a large sample, we have excellent confidence that another large sample, taken from the same population, will provide the same age result within a small confidence range. However, insofar as any age calculation is more uncertain due to the biases introduced by the various caveats, those small confidence ranges are meaningless.
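The small-count Poisson confidence range mentioned above can be sketched in Python. This computes the exact two-sided interval for the mean of a Poisson count by bisection, using only the standard library. It is one common way to get such a range, not necessarily the one used in any particular report. The function names are hypothetical.

```python
import math

def poisson_cdf(n, mean):
    """P(X <= n) for a Poisson variable with the given mean (small n only)."""
    term = math.exp(-mean)
    total = term
    for k in range(1, n + 1):
        term *= mean / k
        total += term
    return total

def bisect_decreasing(f, lo, hi, iterations=100):
    """Root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_interval(n, confidence=0.95):
    """Exact two-sided confidence interval for the mean, given an observed count n."""
    alpha = 1 - confidence
    hi = n + 10 * math.sqrt(n + 1) + 10  # generous upper search bound
    upper = bisect_decreasing(lambda m: poisson_cdf(n, m) - alpha / 2, 0.0, hi)
    if n == 0:
        return 0.0, upper
    lower = bisect_decreasing(
        lambda m: poisson_cdf(n - 1, m) - (1 - alpha / 2), 0.0, hi
    )
    return lower, upper

for count in (0, 5, 50):
    low, high = poisson_interval(count)
    print(count, round(low, 2), round(high, 2))
```

For an observed count of 5, the 95% range runs from about 1.6 to about 11.7, a factor of roughly seven. This range covers sample size only. It says nothing about the biases discussed in this topic.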
I mention only some caveats here as examples. There are more. This web page is not the place for a review of all age caveats. I have never seen an article with a thorough review of the caveats associated with age calculations of ancestral lines based on mutation rates. I suppose genetic genealogists are rarely statistics experts, and I suppose statistics experts avoid publishing such articles, because the emphasis would be that most genetic genealogy age calculations are not quite right.
Summary: There is no known way to account for all caveats when calculating the age of Y-DNA nodes or haplogroups. Nevertheless, rough age calculations can be interesting, as long as we realize the results are not very accurate.
Rewrite 10 Jun 2016.
A “population bottleneck” means a significant reduction in population followed by a significant increase in population.
Population bottlenecks generally reduce genetic variation. There are other reasons, covered briefly toward the end of this topic.
I intend this topic to be a discussion of reasons for isolated STR types, which are one instance of reduced genetic variation. As explained in other topics (follow the links), isolated STR types are usually the same as smooth branch segments of the human Y-DNA tree, which are usually the same as haplogroups that have a large number of phyloequivalent SNPs. In this topic I intend the word “type” to be shorthand for reduced genetic variation that results in a type, or a smooth segment, or a large number of phyloequivalent SNPs.
Most male lines go extinct. If you are not familiar with this statistical fact, please see the discussion about extinction.
Because male lines tend to go extinct, the Y-DNA tree tends to prune itself with time, so technically a bottleneck is not required for reduction of genetic variation. However, humans have surprisingly low genetic variation; many geneticists who have worked out the statistics have concluded that the human population must have suffered an extreme bottleneck, because a surprisingly low human population seems required to understand the low genetic variation.
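The pruning can be illustrated with a small Python simulation. This is a sketch under simple assumptions, not the analysis in the extinction discussion: each man has a Poisson number of sons, the mean is 1.0 (a stable population), and lines are independent. The parameters `mean_sons`, `generations`, and `trials`, and the function names, are hypothetical.

```python
import math
import random

def poisson(rng, mean):
    """Poisson random number by Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def line_survives(rng, generations, mean_sons=1.0, cap=10_000):
    """Follow one founder's male line; True if any male descendant remains."""
    men = 1
    for _ in range(generations):
        men = sum(poisson(rng, mean_sons) for _ in range(men))
        if men == 0:
            return False
        if men > cap:  # line is large enough that extinction is negligible
            return True
    return True

def extinct_fraction(trials=2000, generations=50, seed=1):
    """Fraction of founders whose male line is extinct after the given generations."""
    rng = random.Random(seed)
    survived = sum(line_survives(rng, generations) for _ in range(trials))
    return 1 - survived / trials

print(extinct_fraction())
```

Even with no bottleneck and a stable population, most founders in such a model leave no male line after a few dozen generations, so the surviving tree is descended from few of them.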
I have never seen a statistical analysis of the distinction between (a) one extreme bottleneck, vs (b) multiple bottlenecks not so extreme, vs (c) a prolonged period of moderately low population (not low enough to be called a bottleneck). I suppose a detailed study of types can provide a distinction. Consider the nodes corresponding to the TMRCAs of the oldest types. If those TMRCAs turn out to be clustered at mostly one particular age, that would be evidence of a single extreme bottleneck early in human history. It does not seem to be coming out that way, so I have my doubts about the common assumption of one bottleneck.
If those TMRCAs tend to cluster at a few age values, that would be evidence of a few bottlenecks at different times - perhaps with different continental localization.
Most nodes have two downstream branches, but some nodes are bushy, with several branches. A bushy node is evidence of a significant rapid population expansion at that node (any node - not just the node of a type).
Actually, it’s a bit more complex: We should consider a node bushy if there are several branches within a short time distance downstream, where “short time distance” means segments with fewer phyloequivalent SNPs than typical.
A region of the Y-DNA tree with no bushy nodes is evidence that there was no significant rapid population expansion for that particular Y-DNA population. With no bushy node there is no evidence of a bottleneck. A type node that is not bushy (immediately downstream of the TMRCA of a type) is evidence of a prolonged moderately low population (the upstream smooth branch segment) without a bottleneck (no bush).
I say “evidence” and “significant” because statistically, we expect rare bushy nodes just due to luck even without population expansion; similarly we expect rare cases of population expansion without bushy nodes. So we cannot draw conclusions about the population structure of one particular node, for the same reasons that we cannot decide on the basis of statistics alone if a person winning lots of money in a short time at a poker game is cheating or just lucky. All this discussion is for the human Y-DNA tree as a whole, not individual types.
I have not seen a statistical analysis (or computer simulation) that figures out what fraction of human Y-DNA nodes should be bushy just due to random fluctuations without significant rapid population expansion. It seems it should be possible to do such an analysis; I’m not aware if it has been attempted. Of course, a model of human population is needed for such an analysis. I suppose the simplest model would be a Poisson distribution for the frequency distribution of branches per node. Such an analysis could be compared to the actual human Y-DNA tree for an estimate of excess population bottlenecks (bushy nodes in excess of statistical expectation), which could be interpreted as the frequency of real population bottlenecks and not just statistical flukes, for the tree as a whole.
The effective population might be a lot smaller than the actual population. For example if in the past human population structure was dominated by family clans or tribes (small Y-DNA clades) competing with each other, and with most clans going extinct, and with rare clans surviving to grow and split into many new clans and thereby continue the competition, then humans would have low genetic diversity; the Y-DNA tree would have very long smooth branch segments between the oldest nodes, producing SNP haplogroups that can be distinguished as isolated STR types. You might say this is an explanation different than population bottlenecks. Or, you might say this is the same explanation, with effective population bottlenecks, where the full population never gets particularly small, but the breeding population (that produces future population) is small.
Another example of an effective population bottleneck is the Genghis Khan Y-DNA clade, with an effective founding population of only one man, starting a large male line - a Y-DNA clade - in just a few generations. We do not expect a Genghis Khan clade to be an isolated type, because he no doubt belonged to an existing haplogroup, and there is no historical evidence of a population bottleneck shortly after he lived. When the Y-DNA tree was pruned in the past, for whatever reason, any “Genghis Khan style” clade would have had a statistical advantage, due to size, of surviving and forming a type.
Migration can be an explanation for an apparently isolated STR type. I say “apparent” because an immigrant haplogroup can be isolated in one region if that haplogroup was absent before immigration. The result would be a type that is valid in only that region. If the home (emigrant) region is today undeveloped, so that today few men from that home region submit DNA samples on-line, then that would be an apparent type, not truly valid world wide. On the other hand, if there had been a severe population bottleneck only in the home region after the emigration, then the descendants of the migration would form a valid type today.
You can probably think of more examples of effective population bottlenecks. I am not aware of a statistical method to distinguish population bottlenecks from effective population bottlenecks.
I have no conclusions to provide here. My purpose for this discussion is to have one place to refer with links when I discuss various types.
Rewrite 5 Jul 2016
This topic will be dropped in about a year or so, after I rewrite topics that have links to here. In the past, most Polish Project assignments were based on STR predictions, where appropriate SNPs were not available. This topic was a detailed explanation of my statistical methods. With the flood of new SNPs, and the lower cost, it is now usually easy to say which SNPs need to be tested for particular samples.
My publication explains the basis of my STR prediction methods.
Update 21 Jan 2015:
I introduced P, N, and K in the Fall of 2007, publishing this web page 6 Dec of that year. I did not predict that P and N were brother clades; in fact, it looked to me like P was closer to K. I did not make predictions about the P, N, K structure because the statistics did not justify such predictions. I assigned samples to P and N with 80% probability, remarking that my overall confidence that P and N were valid (confidence at step zero) was 95% in 2008. I stated my overall confidence in the subtypes of K type as only 80%, and I did not have high confidence that the various subtypes of K actually belonged to a single unique K clade.
K, which never qualified as a type, represents the R1a modal haplotype. Today it is clear that K is not a clade. Many of my predicted subtypes of K have been verified by SNPs, and so far none of them have been shown to be invalid.
In Fall 2007 I also introduced R (Remainder) as the 4th division of Polish R1a, for those samples that do not fit P, N, or K. R was never intended as a clade. The R category is no longer used because there are many branches of R1a known today, so that each sample with sufficient STRs can be confidently predicted into a branch.
This topic uses R1a as an example, but the same discussion applies to other haplogroup assignments.
My publications have several references of general interest and relevance to my web documents.
My Tools and data for STR analysis are Excel files. These are available at the JoGG publication site as Supplementary Data: www.jogg.info/52/files/cpcindex.htm.
Polish Clades Update. This folder is for update of Tools and for new data: www.gwozdz.org/PolishCladesUpdate
Pawlowski (2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish). This reference is listed in my publications. I specifically mention it here because this is where I originally found the common Polish haplotype that I now call P type. Link to English abstract: Pawlowski 2002.
Lawrence Mayka is the Administrator of the Polish Project. Larry helped me to get started when I was new to genetic genealogy, providing helpful criticism & suggestions. He reviews and makes suggestions regarding this Polish Project web page of mine. He also reviewed the original drafts of my publications. A number of my types were originally suggested to me as STR clusters by Larry. Larry continues to provide data for this web page. Many of my references to other websites in this document were suggested to me by Larry.
Cyndi Rutledge is the administrator of the R1aY-Haplogroup Project. Larry and Cyndi sent me M458 test results when that SNP was new. SNP results are now listed at project web pages.
Anatole Klyosov published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of publications. Some of the STR types that I independently discovered I later found as 25 marker modal haplotypes in Klyosov’s web documents (some in Russian), which predate his publication in JoGG. It was encouraging to me to see independent identification of clusters by different methods. He emailed me an English version of one of his 2008 publications. His Fall JoGG articles have references to his other publications. Here is a web link: Klyosov Home.
Russian web sites: Semargl, http://www.r1a.org/; http://www.rodstvo.ru/; http://dnatree.ru/; http://molgen.org/. These have been active analyzing R1a, brought to my attention by others, particularly by Mayka, who worked out a correlation with my types. These sites clearly have proposed subdivisions of R1a based on STR data, but I cannot quickly understand these due to the language barrier. Klyosov seems to be active at these sites. The sites make use of the FTDNA projects and Ysearch.
Kenneth Nordtvedt published an article about calculating TMRCA in the Fall 2008 issue of JoGG. His Excel files of data and tools are available at his web site. Ken has been active in web discussions, suggesting many STR based clusters.
FTDNA link: http://www.familytreedna.com/. This is a commercial DNA testing company. I make extensive use of the project databases maintained by FTDNA. These are my primary sources of data. Click on the “Projects” tab at the home page to look for projects. Also, the project name can be substituted for /polish/ in the Polish Project link, below. I do not work for FTDNA, and many other companies offer DNA tests. I recommend FTDNA because I like the convenience of most DNA data being available at the projects, particularly the Polish Project.
WTY. “Walk Through the Y”. This is an obsolete commercial product by FTDNA, for reading more than 200,000 base pairs of your Y chromosome, in a search for new SNPs in your branch of the Y-DNA tree. You can read about my WTY at another of my web pages. WTY has been replaced by Big Y.
Big Y: Replacement for WTY. Discussion of Big Y.
Polish Project link: https://www.familytreedna.com/groups/polish/about/background. One of many FTDNA projects. This is my primary source for Polish data. The Polish Project tracks both Y-DNA and mtDNA. The Y-DNA STR data that I use is at https://www.familytreedna.com/public/polish?iframe=yresults. The Y-DNA SNP data is at https://www.familytreedna.com/public/polish?iframe=ysnp.
Paul Stone is an administrator of the Polish Project, with emphasis on the I1 haplogroup.
R1a Project link: http://www.familytreedna.com/public/R1a. Largest R1a project, with multiple co-administrators, active in subdividing R1a data into hypothetical haplogroups. The project home page has a summary chart of R1a SNP subdivision, and other reference links. The administrators have been very helpful to me, particularly Michal Milewski, Lukasz Lapinski, Artur Martyka, and Lawrence Mayka.
R1aY-Haplogroup Project link: www.familytreedna.com/public/R1aY-Haplogroup. Original R1a project.
Ysearch link: http://www.ysearch.org/. Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services. From the FTDNA site, you can register your data with Ysearch. Or you can type your Y-STR data into Ysearch. I have Instructions for use of Ysearch. I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch. I am not associated with the company FTDNA.
Yseq: http://www.yseq.net/. A company that provides Y-SNP tests at a competitive price and with fast turnaround.
Yhrd link: http://www.yhrd.org/. A forensic Y-DNA database. Data is separated by city, with many Polish cities. I relied on Yhrd to figure out the geography of the various haplotypes.
Semargl. R1a site by Vladimir Tangankin. R1a tree in pie chart format using 111 marker data, Oct 2012: http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009tmb700.png
Sorenson link: http://www.smgf.org/. Another DNA testing company.
ISOGG link: http://isogg.org/tree/. Y-DNA tree SNPs and corresponding alphanumeric codes for the haplogroups.
FTDNA Draft Tree link: http://ytree.ftdna.com/index.php?name=Draft. Another Y-DNA tree with SNPs, but not updated in more than a year.
recLOH: A technical detail discussed in many publications, for example http://en.wikipedia.org/wiki/RecLOH. I discuss this and other compound marker issues, and how step is calculated, in the “Documentation” sheet for my Calculator.xls tool.
DYS389: Another technical detail, also discussed on the web and in my Calculator.xls. Briefly, 389II is the sum of 389I plus another STR, so 389II should be figured in terms of the delta value.
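The DYS389 delta idea can be illustrated with a minimal sketch in Python. This is not the Calculator.xls tool: the function names and the sample marker values are hypothetical, and the plain sum of absolute differences is an assumed simplification of how step might be counted. It only shows why DYS389II should be replaced by the delta (389II minus 389I) before comparing two haplotypes.

```python
def to_delta(haplotype):
    """Return a copy with DYS389II replaced by the delta (389II - 389I).

    If DYS389I is missing, the delta cannot be formed, so DYS389II is dropped.
    """
    adjusted = dict(haplotype)
    if "DYS389II" in adjusted:
        if "DYS389I" in adjusted:
            adjusted["DYS389II"] -= adjusted["DYS389I"]
        else:
            del adjusted["DYS389II"]
    return adjusted


def step_distance(a, b):
    """Sum of absolute repeat differences over markers present in both samples.

    Assumption: a simple one-step-per-repeat count; compound markers other
    than DYS389 are not handled here.
    """
    a_adj, b_adj = to_delta(a), to_delta(b)
    shared = a_adj.keys() & b_adj.keys()
    return sum(abs(a_adj[m] - b_adj[m]) for m in shared)


# Hypothetical values: the two samples differ by one repeat at DYS389I only.
sample_a = {"DYS393": 13, "DYS390": 25, "DYS389I": 13, "DYS389II": 30}
sample_b = {"DYS393": 13, "DYS390": 25, "DYS389I": 14, "DYS389II": 31}

# Naive count treats 389I and 389II as independent and counts the change twice.
naive = sum(abs(sample_a[m] - sample_b[m]) for m in sample_a)
adjusted = step_distance(sample_a, sample_b)
print(naive, adjusted)
```

In this example the naive count is 2 steps, while the delta based count is 1 step, because the single mutation in 389I is no longer counted a second time through 389II.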
I’m a very rare type in Poland - E-L540. My maternal 1st cousins are R1a1a, which means my late maternal grandfather was R1a1a. I became interested in Y-DNA in 2004. My maternal family name is Iwanowicz. I discovered a family with that name in my maternal grandfather’s home town in Poland. They are the only Iwanowicz family within 50 miles, so I suspected they might be my 3rd or 4th cousins. I brought a cheek swab kit when I visited them the second time in 2006. Sure enough, the son was a perfect 25 STR marker match to my 1st cousin. I didn’t get around to checking the web for a year. I was shocked to discover that these maternal cousins matched 80 people in the FTDNA database at a perfect 12 out of 12 STR markers. That was a hell of a lot of matches in the summer of 2007. Most of these matches are Polish. I did some research and found an article by Pawlowski (reference in my publication) about this most common Polish Y-DNA haplotype, which I now call P type. That got me interested in doing more research, leading to this web page for others to see my results.
My experience, however, is a reminder that statistics can be misleading. I was confident that my grandfather’s haplotype was P type, based on a perfect match at the first 12 markers. In June 2010 I realized that the probability was really about 93%, because 13 out of the 14 then current Polish Project members who had 67 markers and who also matched P type perfectly at 12 markers were in fact P type as judged by all 67 markers. My grandfather does not match P type at 67 markers. My grandfather is that 14th one. He matches the small clade that I named I type, after Iwanowicz. I type has since been verified as haplogroup S18681, which is also concentrated in Poland. That’s how an outsider ended up studying P type and R1a1a, and writing web pages and articles about common Polish Y-DNA clades.
This web page was originally called “R1a”; it got so many hits from Poland that I eventually renamed it to include all common Y-DNA clades.
2007 Dec 6 First web posting of this file
2007 Dec Two revisions
2008 8 revisions
2009 33 revisions
2010 36 revisions
2011 26 revisions
2012 18 revisions
2013 & 2014 10 revisions
2015 Jan to Mar 8 revisions
2015 Aug to Nov 15 revisions
2016 Feb 24 Full update of the Results Table and Abstract
2016 Apr 1 Edit of I1a-P; also misc minor edits
2016 Apr 15 Rewrite of Np Cluster (YP515)
2016 Apr 16 Nt topic deleted (Nt cluster, speculative, used only briefly for assignments early 2012; not a valid clade)
2016 Apr 16 Na,b,c,d,e: 5 topics deleted (Speculative from 2012, never used for assignments; not valid clades)
2016 Apr 16 Excel files - associated with these speculative clusters - also deleted
2016 Apr 17 Ns Cluster rewrite
2016 Apr 18 Ng Type rewrite
2016 Apr 20 Update PCI table; Nashk Type rewrite
2016 Apr 22 N Type rewrite
2016 Apr 23 L1029 rewrite; P type edit; M458 rewrite
2016 Apr 24 new Results topic, rewrite of Abstract and R1a Abstract
2016 Apr 25 edit lead paragraphs of “Description of the R1a Branches”
2016 Apr 25 delete topic “L260 and M458 Signatures”
2016 Apr 25 delete topic “37 Marker Network”
2016 Apr 25 quite a bit of minor editing, including bookmarks
2016 Apr 25 Z92 rewrite
2016 Apr 26 delete stubs for speculative Kt, Ku, Kw, Ky
2016 Apr 28 Z93 and A type rewrite
2016 Jun 10 Rewrite: Age Caveats, Introduction, My Mountain Method, Population Bottlenecks
2016 Jun 11 Rewrite E type and Z92
2016 Jun 12 Edit P type
2016 Jun 30 Rewrite of B type. Edit of a couple topics
2016 Jul 5 Rewrite of D type. Edit of B type. C type R1a1a (M198+,M417-) deleted (2012 last C type update explained this is obsolete). Rewrite of topic Probability and Confidence. Rewrite of topic Polish Project Assignments. Minor edits of other topics
2016 Sep 18 Update of the Results Table with July 6 database
2016 Sep 19 Update of G Type. Update of J Type (YP977)
2016 Sep 19 Update of frequency % in the PCI Table.