Polish Y-DNA Clades

13 Nov 2015

Peter Gwozdz




             14 Oct 2015 news:  Rewrite of P type.  Rewrite of M458, L260, CTS11962.


             29 Aug 2015 news:  Rewrite of the topic “I1-P type”, which seems to be equivalent to the new SNP Haplogroup Y6349.


             20 Mar 2015 news:  Click here - New Results Table - for a complete update of my summary table with common Polish Clades with % frequency in Poland.  This update has many of the latest new SNP haplogroups.  The older 2013 Results Table is also still on-line, because many of the links in this web page connect to that older version, which will be dropped when this page is finished updating.


             26 Feb 2015 news:  Lots of new SNPs were discovered during 2014, and the pace of discovery continues.  Most of the discoveries are coming from Big Y data.  This web page of mine needs lots of updating to reflect all the new SNPs, most of which are not yet mentioned here.  Most of my STR types are being validated as haplogroups defined by new SNPs. A few of my predictions are being proven invalid.  (My goal has always been about 80% validation - mildly aggressive predictions.)  One good place to find new SNPs arranged into the Y-DNA tree is http://www.yfull.com/tree/.



             Abstract rewrite 27 Aug 2015.

             The Polish Project has assignments of men (samples) to haplogroups and to proposed subdivision clades based on their Y-DNA data.  The Polish Project provides data for this web site of mine.

             Lawrence Mayka is the primary administrator of the Polish Project.  Paul Stone is also an administrator, with emphasis on the I1 haplogroup.  I am also an administrator, helping Mayka with statistical methods for assignment of samples.  This web document is for explanation, details, and update news.

             The topic is common Polish Y-DNA clades - identification of male line Y-DNA clades that are concentrated in the region of Historical Poland.

             This Abstract is for people reasonably familiar with the jargon of genetic genealogy.  If you are new to genetic genealogy you might prefer to read the Introduction first.

             The Results Table has a summary of assignment statistics.  Some assignment categories have a link to more detailed discussion.  If you know your assignment you can click on the link in the right column of the Results Table to read more about your assignment category.  For more recent results (without many category links) see the New Results Table.

             The Results Table lists all the haplogroups that are common in Poland, but most of these are also common elsewhere.  The PCI Table lists the haplogroups that are particularly concentrated in Poland.

             Many of the assignments are to established haplogroups, based on SNP test results.  Many samples without sufficient SNP data, if their STR data matches closely to sample(s) with SNP data, are assigned to those corresponding SNP haplogroups.

             Some assignments are to hypothetical haplogroup branches, based on STR correlations.  Such branches are proposed by many people, including Mayka and me.  In addition, I hypothetically subdivide haplogroups into types when division can be done with 80% confidence.  With less than 80% confidence, my assignment categories are tentative, not called types, usually called clusters.

             About half of Polish men belong to haplogroup R1a.  Most of my work has been on R1a.  The R1a Project has lots of additional information.

             This web document has three purposes:  1. More detailed explanations for the sample assignments in the Polish Project.  2. Summary of my published results.  3. Update with recent results.

             Before 2014, it was expensive to discover new SNPs, so emphasis was on STRs, which were much less expensive.  That was true of this web page, the Polish Project, most other web based projects, and most published articles about genetic genealogy.  2014 was a transition, where the cost of discovering new SNPs was greatly reduced. As of 2015, there is a continuous flood of new SNP discoveries, and testing for SNPs has become inexpensive.  Emphasis is now on SNPs.  Most samples have STR data without test data for the most recent SNPs, so STR matching of samples continues to be needed for assignments.

             Most of the new SNP branches are very small (I call them twigs), with fewer than 5 known samples.

             See Big Y for one way to find new SNPs.  See SNP ordering information for testing individual known SNPs.

             I use the word type to mean an STR cluster with statistical confidence as established by my Mountain Method.  Many of my types have been validated by discovery of new SNPs that qualified the corresponding clades as official haplogroups.  I expect more than 80% of my types to be validated some day, but my method is intended to be slightly aggressive, so I do not expect 90% validity.  I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters.  All types have associated clusters but not all clusters qualify as types.  In my publications and web pages I make it clear which types I have discovered in web data and which types were suggested to me by others, with references.  Often when I discover a type I later find out someone else had mentioned it earlier on the web;  let me know if you the reader have more clues and references for me.

             Most types that I discuss seem to be 1,000 to 5,000 years old, so all the men in each type seem to be descended in direct male lines from one man (MRCA) who lived that long ago (TMRCA).  A few of my types are younger or older than that range.

             I use phrases like “seem to be” over and over because the STR prediction methods are statistical.

             Click here for more discussion of the methods for Polish Project assignments.

             The Polish Project is considered representative of Historical Poland, with caveats explained in my Publication.

             I am interested in Polish origins.  This web document, however, is not for historical analysis and conclusions, except for occasional comments to remind us of the goal.  This document is dedicated to identifying haplogroups and types and clusters concentrated in Poland, with detailed explanations.  I am aware that some people object to the use of Y-DNA for historical analysis, so I try to mention caveats along with my comments.


R1a Abstract

             Update rewrite 22 Sep 2015.

             About half of Polish men belong to haplogroup R1a.  The R1a Project has lots of additional information about that haplogroup.

             When I originally posted this web page in December 2007, no significant haplogroup subdivision of R1a was available, so this page started with hypothetical subdivisions of R1a.  A major division, roughly 50-50, based on the SNP M458, became available in November 2009.  Now, 2015, there are many haplogroup branches known in R1a.

             In 2010 I expanded this page to include clades from other haplogroups.

             More than 99% of Polish R1a are R1a1a (M17).  See the Results Table for breakout.

             There are two large categories in the Polish R1a data.  Since 2007, I have been calling them P type and N type.  P type is now known to be more than 95% equivalent to the haplogroup R1a1a1b1a1a(L260).  N type is more than 95% equivalent to the haplogroup R1a1a1b1a1b(CTS11962).  P type is concentrated in Poland, rare with increasing distance from Poland.  N type seems to be mostly Slavic, widespread in eastern Europe.

             R1a1a1b1a2 (Z280) is another large R1a category, also well represented in Poland.

             R1a1a1b1a3 (Z284) is a large R1a haplogroup, common in Scandinavian countries, rare in Poland.

             Thanks go to Lawrence Mayka, Polish Project administrator, for extensive email information and assistance.

             Reminder:  I am concentrating on Poland.  The statistics of STR clusters depend a lot on the database.  For example, P type stands out dramatically in Polish data.  In other countries far from Poland P type is rare.  If you belong to an R1a cluster that is rare in Poland, I’m sorry, but I’m not covering you.  Check out the R1a Project.

             A graphical representation of the main branches of the R1a tree is available at the R1a Project.  ISOGG has an R1a tree that is not up to date.  Yfull has a continuously updated R1a tree including all the recent new SNPs for which data have been submitted to Yfull.



Introduction

             Update rewrite 21 Sep 2015.  Edit 7 Oct 2015.  Minor edit 14 Oct 2015.

             This Introduction is for people unfamiliar with the jargon of genetic genealogy.

             There are quite a few web sites with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic.  The Y Chromosome Wikipedia article is about male line DNA, also called Y-DNA.

             This is a brief introduction to genetic genealogy for Y-DNA, providing some definitions of jargon needed to read my web pages.  The definition words are boldface.  I often use links to those definitions when I use a jargon word for the first time in a topic.  If you want more detail on those boldface words, consider a web search.  There are more boldface definitions in the summary of my Methods.

             The Y chromosome gets passed from father to son, so it works just like a male family name.  Mutations (changes in the DNA coding) in the Y chromosome are inherited by sons.  Men are divided into haplogroups based on known rare mutations.  These mutations are called single nucleotide polymorphisms - SNP - a change at only one specific location in the Y chromosome.  The human Y-DNA haplogroups, representing all men, can be arranged as branches on a tree.  Diagrams of the tree usually depict the tree branching down (an upside down tree), or sideways.  Example of a sideways image of the Y-DNA tree:  wiki tree.  Usually it is more convenient to arrange the tree upside down as a list, with tabs for the branches;  examples:  Yfull tree and ISOGG tree.  These examples have links allowing you to browse through the thousands of known branches of the human Y-DNA tree.  We don’t really know the full human Y-DNA tree;  all trees are based on current data in a particular database;  new branches are continuously discovered.

             All the men in a haplogroup descend in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup.  The MRCA corresponds to a node, or branching point, in the Y-DNA tree of male line ancestry.  Time of the Most Recent Common Ancestor (TMRCA) is an estimate of how long ago he lived - the age of the node.

             Lots of people, including me, are working to discover more SNPs on the Y chromosome so that the haplogroups can be divided further into smaller haplogroups.

             SNPs used to be difficult to discover and expensive to test.  Costs have been coming down.  SNPs are now discovered relatively easily;  SNP testing is inexpensive.  Since about 2013 there has been an increasing flood of new SNPs and corresponding newly discovered haplogroups.

             SNPs have alphanumeric code names (for example CTS3402), assigned by the people who discover them, and registered at on-line databases.  The corresponding Y-DNA haplogroups have alphanumeric code names assigned by ISOGG (for example the haplogroup for CTS3402 is R1a1a1b1a2b3).  Since 2014, with the flood of new SNPs, ISOGG is not keeping up, and the ISOGG codes have become too long.  ISOGG codes are still used for the main branches (the oldest branches with thousands of samples).  For smaller (younger) branches, sometimes only the SNP code is used (for example haplogroup CTS3402) or a main branch code followed by the SNP code for the smaller subdivision branch of interest (for example haplogroup R-CTS3402 or R1a-CTS3402).

             Clarification:  A haplogroup (or a branch of the Y-DNA tree) includes all the subdivision haplogroups (all the subdivision branches).  For example, R-S18681 is a branch of R-CTS3402, and R-CTS3402 is a branch of R1a.  R1a is a branch of R1, which is a branch of R.  So if a man belongs to the haplogroup R-S18681 he also belongs to the haplogroup R-CTS3402 and he also belongs to the haplogroups R1a, R1, and R.  In this example, R is the oldest and R-S18681 is the youngest branch.
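
             The nesting described in this clarification can be sketched in a few lines of Python.  This is only an illustration:  the PARENT map contains just the branches named above (the real tree has intermediate branches between R1a and R-CTS3402, omitted here), and the function names lineage and belongs_to are hypothetical.

```python
# Hypothetical parent map built only from the branches named in the text.
# Intermediate branches between R1a and R-CTS3402 are deliberately omitted.
PARENT = {
    "R1": "R",
    "R1a": "R1",
    "R-CTS3402": "R1a",
    "R-S18681": "R-CTS3402",
}


def lineage(haplogroup):
    """Return the haplogroup followed by every upstream haplogroup, youngest first."""
    chain = [haplogroup]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def belongs_to(sample_haplogroup, haplogroup):
    """True if a sample assigned to sample_haplogroup also belongs to haplogroup."""
    return haplogroup in lineage(sample_haplogroup)


print(lineage("R-S18681"))            # ['R-S18681', 'R-CTS3402', 'R1a', 'R1', 'R']
print(belongs_to("R-S18681", "R1a"))  # True
print(belongs_to("R1a", "R-S18681"))  # False
```

             The last line shows that membership only runs one way:  a man in R1a does not necessarily belong to the younger branch R-S18681.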

             If you purchase a DNA test for R-CTS3402 and you have this SNP, the result comes out “positive”, or CTS3402+, and you belong to that haplogroup.  If you don’t have this SNP the result comes out “negative”, or CTS3402-.

             Upstream means older (SNPs, haplogroups, branches, etc) within the same branch;  for example R1a is upstream of CTS3402.  Downstream means younger within the same branch;  for example CTS3402 is downstream of R1a.  Upstream haplogroups generally are larger - more samples in a database.  It may be confusing because of the mixing of metaphors, and because most streams and rivers have smaller branches upstream, which is opposite the situation for Y-DNA trees the way the words are used by genetic genealogists.  I suppose upstream and downstream could be visualized as a river delta, where a large river breaks up into smaller rivers downstream, for the Y-DNA metaphor.  Upstream and downstream should be visualized with the Y-DNA tree as a list, with older nodes up and younger nodes down and indented;  see Yfull tree.
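
             As an illustration of the list view, here is a short Python sketch that prints a tree as an indented list and checks whether one haplogroup is upstream of another.  The CHILDREN map holds only the example branches from this Introduction, and the function names are hypothetical.

```python
# Hypothetical child map using only the branches named in this Introduction.
CHILDREN = {
    "R": ["R1"],
    "R1": ["R1a"],
    "R1a": ["R-CTS3402"],
    "R-CTS3402": ["R-S18681"],
}


def tree_lines(root, depth=0):
    """Return the tree as indented text lines, older nodes first."""
    lines = ["  " * depth + root]
    for child in CHILDREN.get(root, []):
        lines.extend(tree_lines(child, depth + 1))
    return lines


def is_upstream(older, younger):
    """True if `older` sits above `younger` on the same branch."""
    return any(
        child == younger or is_upstream(child, younger)
        for child in CHILDREN.get(older, [])
    )


print("\n".join(tree_lines("R")))
print(is_upstream("R1a", "R-CTS3402"))  # True
print(is_upstream("R-CTS3402", "R1a"))  # False
```

             In the printed list, upstream is literally up the page and downstream is down and indented.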

             SNPs are individually very rare.  But the Y has about 60 million locations, and about 1/3 of them are suitable for accurately measuring SNPs inexpensively, so there are thousands of known Y-SNPs.  A man who belongs to R-S18681 inherits from his father, on the Y chromosome, not only the SNP mutation S18681, but also CTS3402, and also the mutations for haplogroups R, R1, and R1a, and also the SNPs for other branches that I am not mentioning in this example.  It is unlikely that a particular man in an entirely different haplogroup might have the S18681 SNP mutation, but it’s like a lottery;  some man somewhere outside S18681 probably has that mutation.  However, it is extremely unlikely one man outside S18681 has many of the mutations from the upstream branches leading to S18681;  it’s like winning a lottery many times.  In other words, it is almost impossible for one man to carry the sets of SNPs for two different haplogroups.  A mistake in SNP testing, or a mix up of SNP data, or someone cheating with DNA samples, are each more likely than valid haplogroup confusion.

             Many haplogroups have an MRCA who lived thousands of years ago, so these span multiple ethnic groups and nationalities.  For example, the R1a haplogroup is of interest to me.  R1a is most common in Slavic countries but calling R1a Slavic can be misleading because it is found throughout Europe and west Asia.  The MRCA lived so long ago that he may have spoken a language that we would not consider Slavic if we could hear it.  It is possible that he did not even live in what is now the Slavic region of Europe;  maybe his descendants moved there in a massive migration from the Asian steppes, or from India, or from somewhere else.  No one knows for sure.  He may have been proto-Slavic in language and culture, but we don’t know for sure.  If he was proto-Slavic, by now some of his descendants long ago moved to other parts of Europe and Asia.  One of the appeals of genetic genealogy is finding such clues about ethnic descent and migration from the statistics of haplogroups.  Some people object, pointing out that ethnicity cannot be defined genetically because of all the moving and mixing of people over the millennia, and because the Y chromosome is only one of our 24 chromosomes.  True enough.  Some individuals and some web sites go too far with genetic genealogy claims based on DNA.  That said, statistical analysis of Y haplogroup data provides many clues on human origins.

             Most known haplogroups have a TMRCA of thousands of years ago, before family names were common, so most men in a haplogroup do not usually have the same family name.

             Some relatively young haplogroups have been discovered that correspond to families, where most men in that haplogroup do have the same family name.

             Y-DNA is biologically accurate, so some men discover that their Y-DNA does not match the DNA of their male line distant cousins identified by genealogy research, due to secret adoptions, illegitimacies, cuckoldry, etc.  Such a situation is called an NPE, non paternal event.  This is one of the reasons some genealogists prefer to avoid genetic genealogy.

             The male line associated with the Y-chromosome is only one ancestral line.  Anyone who tries to make a family tree going back 300 years has more than a thousand root tips to be filled by names of ancestors who lived back then;  the one man at the tip of the male line root is only one of those thousand.  That is another reason some genealogists avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of many.  That said, many men enjoy purchasing a DNA test to find out which ancient Y-DNA haplogroup is theirs.  Many people enjoy the challenging hobby of discovering new SNPs in their male line (including women studying their father’s or maternal grandfather’s male lines), thereby defining their younger branches of the Y-DNA tree.
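
             The "more than a thousand" figure follows from doubling the number of ancestors each generation.  A minimal sketch, assuming 30 years per generation (that figure is an assumption for illustration, not a number from this page):

```python
YEARS_PER_GENERATION = 30  # assumption for illustration


def ancestor_slots(years_back, years_per_generation=YEARS_PER_GENERATION):
    """Return (generations, number of ancestor slots in the oldest generation)."""
    generations = years_back // years_per_generation
    return generations, 2 ** generations


generations, slots = ancestor_slots(300)
print(generations, slots)  # 10 1024
```

             Only one of those slots is the male line ancestor, so under this assumption Y-DNA follows roughly one ancestral line in a thousand at that depth.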

             A paragroup is a haplogroup considered without its known haplogroup branches.  An asterisk is often used in paragroup codes, like R1a1a* or CTS3402*.  R1a1a* usually (not always) means all the samples that belong to R1a1a but do not belong to any of the known branches.  When a new branch is discovered, samples positive for that new SNP get assigned to the new branch;  that changes the meaning of the corresponding paragroup.  The meaning of a paragroup varies between databases, because different databases have different samples with different SNP test results.  As a simple example of the idea of paragroups, “apes” is a biological clade that includes humans, but for many discussions we talk about “apes” as all apes except humans, in which case “apes” is temporarily a paraclade in that discussion.
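
             The way a paragroup shrinks when a new branch is discovered can be sketched as follows.  The kit names and the SNP name NEW1 are hypothetical, and the sketch assumes every sample has been tested for every SNP listed.

```python
# Hypothetical samples; each maps to the set of SNPs it tested positive for.
samples = {
    "kit_A": {"CTS3402"},
    "kit_B": {"CTS3402", "S18681"},
    "kit_C": {"CTS3402"},
}


def paragroup(samples, snp, known_branch_snps):
    """Samples positive for `snp` but negative for every known branch SNP."""
    branches = set(known_branch_snps)
    return sorted(
        kit
        for kit, positives in samples.items()
        if snp in positives and not (positives & branches)
    )


print(paragroup(samples, "CTS3402", ["S18681"]))          # ['kit_A', 'kit_C']

# A hypothetical new branch SNP is discovered, and kit_C tests positive for it.
samples["kit_C"].add("NEW1")
print(paragroup(samples, "CTS3402", ["S18681", "NEW1"]))  # ['kit_A']
```

             The paragroup CTS3402* means something different before and after the discovery, even though no sample changed its DNA.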

             Many SNPs were independently discovered more than once and listed with multiple names.  For example, CTS3402 has two other code names, so it can be listed as CTS3402 / V2670 / S3361, or any of those names can be used alone.  Those three codes are all the same mutation, at the same location on the Y chromosome.

             Most haplogroups have multiple phyloequivalent SNPs.  For example, CTS3402 has two other phyloequivalent SNPs:  Y32 and Y2194.  These are not the same;  they are different mutations at different locations.  So far, every sample that is positive for CTS3402 is also positive for those other two.  This may change;  for example in the future a sample may show up that is positive for CTS3402 but negative for the other two SNPs, in which case all previous branches of CTS3402 will be assigned to a new haplogroup, branching from CTS3402, defined by either of those other two SNPs, while that hypothetical new sample will be assigned into a new branch of CTS3402.
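
             One way to picture phyloequivalence is to group SNPs by the set of samples that carry them:  SNPs with identical carrier sets are phyloequivalent in that database.  The sketch below uses hypothetical kits and assumes every kit was tested for every SNP, so a missing SNP means a negative result.

```python
from collections import defaultdict

# Hypothetical test results: kit -> set of positive SNPs.
results = {
    "kit_1": {"CTS3402", "Y32", "Y2194", "S18681"},
    "kit_2": {"CTS3402", "Y32", "Y2194"},
    "kit_3": {"CTS3402", "Y32", "Y2194", "S18681"},
}


def phyloequivalent_groups(results):
    """Group SNPs that are positive in exactly the same set of kits."""
    carriers = defaultdict(set)
    for kit, positives in results.items():
        for snp in positives:
            carriers[snp].add(kit)
    groups = defaultdict(list)
    for snp, kits in carriers.items():
        groups[frozenset(kits)].append(snp)
    return sorted(sorted(snps) for snps in groups.values())


print(phyloequivalent_groups(results))
# [['CTS3402', 'Y2194', 'Y32'], ['S18681']]

# A hypothetical new kit is CTS3402+ but negative for the other two SNPs.
results["kit_4"] = {"CTS3402"}
print(phyloequivalent_groups(results))
# [['CTS3402'], ['S18681'], ['Y2194', 'Y32']]
```

             The second result shows the split described above:  CTS3402 stands alone, and Y32 and Y2194 together define a new branch under it.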

             R1a has more than 100 phyloequivalent SNPs.

             It’s OK to say “equivalent SNPs” instead of “phyloequivalent” if the meaning is clear, but “equivalent” has other unrelated meanings.

             For clarity let me offer a tree analogy for “phyloequivalent”.  Between branching nodes, a real tree has branch segments, which are smooth, without any branching.  The older branches of a tree have long smooth segments.  When that tree was younger, there were lots of branches along that segment, but those branches died and fell off the tree as it grew over time.  Similarly, the Y-DNA tree has segments between nodes;  the older branches of the Y-DNA tree correspond to haplogroups that have initial segments that are long - sometimes thousands of years in time, so there were many generations along that segment, with multiple independent SNPs, that now seem to be phyloequivalent, because all the other branches became extinct (no surviving males).  I say “seem to be phyloequivalent” because a new sample may show up from a man in a newly discovered branch, thereby splitting that segment into two segments.

             Extinction:  Over the life of a real tree, most branches die and fall off.  Similarly, due to the statistics of male descendancy, most male lines become extinct over time.  That seems surprising to many people, but it is a well known statistical result.  If you want verification, search the web for the theorem called “gambler’s ruin”, whereby a gambler with a fixed stake almost always loses everything when playing at a casino, even if the odds are neutral.  The number of males in a haplogroup fluctuates up and down due to statistics over the generations, almost always eventually fluctuating to zero, similar to a gambler’s stake.  A haplogroup that survives for thousands of years is like a very rare lucky gambler in a casino.  In a casino the odds are usually fixed in favor of the casino;  in the Y-DNA tree the odds were favorable for male lines during population expansions, and the odds were unfavorable for male lines during population bottlenecks.
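
             The extinction claim can be checked numerically.  The sketch below is not gambler’s ruin itself;  it swaps in a simple branching model (Galton-Watson), assuming each man has a Poisson-distributed number of sons with a fixed mean.  The Poisson assumption and the function name are illustrative choices, not something taken from the sources used on this page.

```python
import math


def extinction_probability(mean_sons, generations):
    """Probability that one man's male line has died out within `generations`,
    assuming each man has a Poisson-distributed number of sons."""
    q = 0.0  # the founder himself is alive, so extinction by generation 0 is impossible
    for _ in range(generations):
        # Poisson generating function: the line is extinct one generation later
        # if every son's own line is extinct within the remaining generations.
        q = math.exp(mean_sons * (q - 1.0))
    return q


for mean_sons in (0.9, 1.0, 1.2):
    for generations in (10, 100):
        p = extinction_probability(mean_sons, generations)
        print(f"mean sons {mean_sons}, {generations} generations: {p:.3f}")
```

             With a mean of exactly one son per man (neutral odds), the probability of extinction keeps creeping toward 1 as generations pass.  With a mean above one (population expansion) it levels off below 1, so some lines survive indefinitely.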

             It is common to call haplogroups and corresponding SNPs “nodes”, particularly when discussing the Y-DNA tree represented as an upside down tree, as a list, with haplogroups indented to indicate branching.  In such a list the haplogroups and their corresponding SNPs do appear as nodes, not distinguished from the true nodes, which are the MRCAs.  I avoid using “node” for SNPs, although such use is common in genetic genealogy.

             Actually, phyloequivalent SNPs are almost always spread out in time, within the segment that is older than the TMRCA.  I have more discussion about this in the age topic, below.

             There is another kind of mutation, in a microsatellite, which is also called a short tandem repeat, STR.  Briefly, an STR is like a necklace.  Each bead of the necklace is the same short sequence of DNA, repeated multiple times.  An STR can mutate such that the number of repeats in the necklace changes.  So an STR mutation is expressed as the number of repeats after the mutation.  STRs are not used for haplogroups because they are not rare enough.  You can read more on line, for example wiki STR.

             Here are some common terms (in boldface) for genetic genealogy.  A marker (also “locus”, plural loci) is a DNA location for an SNP or STR or other kind of mutation.  A haplotype is a set of gene values at any number of markers.  In Y-DNA genetic genealogy “haplotype” is usually used to mean a set of numbers, for the values of a particular set of Y-DNA STR markers.  The word sample (plural samples or data or database) refers to the Y-DNA STR and SNP values from one man.  A sample is also commonly called a haplotype, but I avoid calling a sample a haplotype to make it clear that a haplotype may or may not be present in a particular database of samples.  A clade is a general term for common descent, so an SNP haplogroup is one kind of clade.
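
             The distinction between a sample and a haplotype can be shown with a small sketch.  The marker names are real STR markers, but the kits and repeat counts are invented for illustration.

```python
from collections import Counter

MARKERS = ("DYS393", "DYS390", "DYS19")

# Hypothetical samples; the repeat counts are invented for illustration.
database = {
    "kit_1": {"DYS393": 13, "DYS390": 25, "DYS19": 16},
    "kit_2": {"DYS393": 13, "DYS390": 25, "DYS19": 16},
    "kit_3": {"DYS393": 13, "DYS390": 24, "DYS19": 15},
}


def haplotype_of(sample, markers=MARKERS):
    """Return the sample's repeat counts at the chosen markers, in marker order."""
    return tuple(sample[marker] for marker in markers)


haplotype_counts = Counter(haplotype_of(sample) for sample in database.values())

print(haplotype_counts[(13, 25, 16)])  # 2 (two samples share this haplotype)
print(haplotype_counts[(13, 25, 17)])  # 0 (a valid haplotype, absent from this database)
```

             Three samples yield only two distinct haplotypes here, and a perfectly reasonable haplotype can have no sample at all in a given database.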

             Many people, including me, in the past, worked to “stay ahead” of the SNP haplogroups by analyzing STR mutations.  That’s because SNPs used to be difficult to discover and expensive to test, while STR data was relatively inexpensive.  That’s changed since about 2013;  SNPs are now discovered relatively easily;  SNP testing is inexpensive.  So STR analysis is no longer as popular as it was.  STRs still have value for genetic genealogy.  I have more about STRs and haplotypes in my Method topic below.

             Although SNPs are more important than STRs for Y-DNA genetic genealogy, STRs are still valuable because on-line databases have thousands of samples with STR data but not as much SNP data (yet).  You can search for statistical matches to STR data.

             There are many organizations and commercial companies on the web where you can order a cheek swab kit to mail in for genetic genealogy STR and SNP testing and matching and analysis, for example FTDNA.  I am not associated with the company FTDNA;  I mention them because I make extensive use of their data;  check Google for competitors.  At FTDNA, click on Products for cheek swab kits.  DNA results are confidential unless you register the data at a public database;  at FTDNA, click on Projects to register your data into one of the many databases;  for example, most of my analysis is from the data in the FTDNA Polish Project.

             For STR analysis, I prefer the FTDNA standard set of 67 STR markers.  I do some analysis using the standard FTDNA 12, 25, 37, or 111 STR marker sets.  Other companies use standard marker sets that may not overlap with all the FTDNA markers.

             As a first Y-DNA test, I recommend the FTDNA standard STR 37 set, because the result will automatically place you in one of the main large haplogroup branches of the Y-DNA tree, and because FTDNA provides you with matches to other men with similar STR haplotypes.  If cost is not an issue, the 67 set is better for accurate matches and the 111 set is best.  The 12 and 25 sets are no longer available in the Product list at FTDNA.

             Once you know your haplogroup you can follow the on-line tree and purchase SNP tests to determine your younger branches on the Y-DNA tree.  I have instructions available for SNP ordering.

             If you already purchased a DNA test your result probably already has your main Y-DNA haplogroup branch (for men), so you can proceed with SNP testing.

             Ysearch is the largest web database for Y-STR data, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services.  I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch.  From the FTDNA site, you can register your data with Ysearch.  Or you can type your Y-STR data into Ysearch.

             Data sharing:  Thousands of men are sharing their Y-DNA data by making it available on the web for analysis.  Most DNA testing companies give you choices:  you can keep everything private, in which case nothing is shared;  you can allow sharing of everything;  you can be selective, for example sharing STR and SNP data associated with your test kit ID number but withholding your name;  and other options.  The Y chromosome has relatively few genes, and none of those few have been correlated with health issues or significant human characteristics.  Nevertheless, some people feel private about their DNA.  One issue is non paternal events in the past, whereby your DNA matches may provide surprises.  Although I encourage people to submit DNA data to databases, I understand if you are reluctant.  I use public databases, mostly FTDNA projects and Ysearch, so my analysis is based on data that is already public.  I use names of people if those names are already on the web.  I ask permission before using people’s names as references for new information or analysis.

             Many people are using statistical analysis of Y-DNA data to gain insight into human origins and migrations.  I am one of those people.  My interest is Polish origins.  This web document, however, is not for historical analysis and conclusions, except for occasional comments.  This document is dedicated to Y-DNA data and analysis, both SNP and STR, identifying haplogroups, types, and clusters concentrated in Poland, with detailed explanations.

             My Method topic has more definitions, but that topic is more advanced, intended for readers with some experience in genetic genealogy.


L1029 And Np News

             New topic 17 Oct 2012.

             L1029 was a new SNP last March.  L1029 provides a branch of M458, added to the ISOGG tree this year.  The other branch is L260 (update next topic).  L260 was discovered in 2010.  Most M458+ L260- samples are coming out L1029+.  I have been calling M458+ L260- samples N type (very few exceptions - next topic).  It is now clear that L1029 is a major branch, capturing more than 90% of N type (more than 90% of M458+ L260-).

             In the Polish Project, most of the N type L1029- results are samples with Poland given as the ancestral country.  This spring, Mayka started classifying these as the “Np” cluster.

             In this topic I present preliminary evidence that Np corresponds to a Y-DNA clade concentrated in Poland.  I also explain why all Polish N type samples (tested or predicted M458 and not L260) would benefit from the L1029 test, because Np cannot be predicted precisely, and because there is a low fraction of L1029- outlier samples, not fitting Np.

             So far (10 Oct data) there are 20 results L1029- (including a few samples that are not M458+) and 42 results L1029+.  N type requires 67 or more of the standard markers for confident assignment.  Using samples with those 67 markers the numbers are 114 N type, of which there are 12 L1029- and 41 L1029+.  Of the 61 remaining N type samples (at 67 in the Polish Project) not tested for L1029, I estimate only about 5 might come out L1029-, because testing has been concentrated on STR predictions, discussed below in this topic.

             One M458+ L260- L1029- sample is not counted as N type, as discussed in the next topic as Ry type.  This seems to be a very small outlier clade with an old node in M458.

             Two of those 12 differ significantly from the other 10, so I am predicting these two as outliers, with M458 nodes older than the main Np hypothetical clade.

             Np Cluster Definition:  I constructed an STR definition for the remaining 10 samples with similar STR values and L1029- result.  The definition uses 37 of the 67 markers.  The cutoff is 2 (steps less than 2 are considered matches).  I uploaded this definition to Ysearch, code CHFXB.  My analysis file is L1029Study.xls.
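
             For readers unfamiliar with step cutoffs, here is a rough sketch of how such a definition can be applied.  It assumes the step count is the sum of absolute repeat differences from the modal values, which is one common convention and may not match the exact calculation in the analysis file.  The three-marker definition is a toy:  DYS460 = 10 and CDYa = 33 are the signature values quoted in this topic, DYS19 = 16 is invented, and the real definition uses 37 markers not listed here.

```python
# Toy definition. DYS460=10 and CDYa=33 are the signature values quoted in this
# topic; DYS19=16 is invented, and the real definition uses 37 markers.
NP_TOY_DEFINITION = {"DYS460": 10, "CDYa": 33, "DYS19": 16}


def step_distance(sample, definition):
    """Sum of absolute repeat differences from the modal values (assumed convention)."""
    return sum(abs(sample[marker] - modal) for marker, modal in definition.items())


def fits(sample, definition, cutoff=2):
    """A sample matches when its step distance is less than the cutoff."""
    return step_distance(sample, definition) < cutoff


sample_a = {"DYS460": 10, "CDYa": 33, "DYS19": 17}  # hypothetical, one step away
sample_b = {"DYS460": 11, "CDYa": 34, "DYS19": 16}  # hypothetical, two steps away

print(fits(sample_a, NP_TOY_DEFINITION))  # True
print(fits(sample_b, NP_TOY_DEFINITION))  # False
```

             With cutoff 2, a sample may differ from the modal values by at most one step in total;  with cutoff 1, every marker must match.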

             On this basis, 3 of the untested N type samples fit the definition and are predicted L1029- members of the hypothetical Np clade.  Two more are marginal, so perhaps there are 14 Np samples among the 114 N type.  N type is 8.8% of the Polish Project, so that means 14 / 114 * 8.8% = 1.1% Np samples in the Polish Project.  The statistical uncertainty is wide, so my estimated 80% confidence range is 0.5% to 2%.  Insofar as the Polish Project is representative of Historical Poland, it seems the Np hypothetical clade has roughly 1% frequency in the region of Historical Poland.  Of the 10 confirmed Np samples, 8 provide “Poland” as origin, one “Russian Federation” and one “Lithuania”.  The 3 predicted Np samples have two “Poland” and one “Belarus”.  There is no need to subtract the samples without “Poland” because the Polish Project as a whole has a similar frequency of samples not “Poland”;  such samples come from men with evidence of male ancestry from Historical Poland.
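
             The arithmetic behind the 1.1% figure can be reproduced with the counts given in this topic.  The sketch covers only the point estimate and the bookkeeping of tested samples;  it does not attempt to reproduce the 80% confidence range, and the variable and function names are hypothetical.

```python
# Counts quoted in this topic (67 marker N type samples in the Polish Project).
N_TYPE_TOTAL = 114
L1029_NEGATIVE = 12
L1029_POSITIVE = 41
N_TYPE_PERCENT_OF_PROJECT = 8.8


def np_percent_of_project(np_count):
    """Np share of the whole project, in percent, for a given count of Np samples."""
    return np_count / N_TYPE_TOTAL * N_TYPE_PERCENT_OF_PROJECT


untested = N_TYPE_TOTAL - L1029_NEGATIVE - L1029_POSITIVE
print(untested)                              # 61
print(round(np_percent_of_project(14), 1))   # 1.1
```

             Counting only the 13 confirmed and predicted samples, or all 15 including both marginal ones, moves the estimate only slightly, which is small compared with the quoted uncertainty range.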

             DYS460 = 10 is a very strong signature marker for Np.  All 13 of the confirmed and predicted Np samples have this value.  Those two outlier samples also have this value.  Among those 41 L1029+ samples, only 6 have this 10 value;  3 have 12 and the 32 others all have the N type modal 11 value.  The statistics of this paragraph are misleading because DYS460=10 was used to encourage L1029 testing in the Polish Project.  I would expect a few Np to show up in the future with 460 value other than 10 (mutated from the Np ancestral value), and I would expect in the long run a lower fraction (less than 6 / 32) L1029+ to have the 10 value (independent mutations).  Among the 49 N type samples not confidently assigned to sub-categories, only 5 have the 10 value, and 1 of these is a marginal Np sample mentioned above.

             CDYa = 33 is another good signature.  These two markers alone with cutoff 1 (that means both markers match) capture 9 of the 13 Np samples (Np defined as 13 captured by 37 markers cutoff 2).  These two markers also capture 2 marginal samples (at the step 2 cutoff of Np at 37), plus only one other N type, plus a few D type (D are not members of the M458 clade, but DYS460=10 is modal in D).  CDY is a fast mutator, so it is unusual for it to serve as a signature marker.  I ran into this on one other occasion, where I postulated a mutation disabled CDYb;  see my discussion at http://www.gwozdz.org/L540.html#CDYb.  Actually, another reasonable explanation is that this CDYa=33 signature is just luck, because using only 10 samples we should not be too surprised that one of the rapid mutators looks like a signature, by the luck of random mutations.  Yet a third explanation:  Np might really be 2 or more clades where the ancestors (MRCAs) of each clade had the CDYa=33 value by luck, but those ancestors differed at other markers;  this explanation is discussed more below.

             There are no more good Np signature markers.  Np modal values differ from N modal values at only 4 of the 67 markers.  There are only two Np samples at 111 markers, and they do not seem to differ from N at those additional 44 markers.  On this basis, I am not confident that my definition is very precise, because it takes as little as 2 mutations in the male line history for a sample to be incorrectly predicted, using any STR definition.

             There is another reason for my uncertainty about my 37 marker Np definition:  I worked harder than usual to construct this definition, so there is selection bias.  Markers that just happen to have no mutations in those 10 samples are all in the definition.  Any marker got dropped if it produced 2 or more mutations in any sample of those 10.  Surely as more samples show up I’ll need to modify my definition.  Those 37 markers are only a “good bet” definition for Np prediction today.

             I  published my SBP method of quantifying confidence in clade predictions based on Y-DNA STRs.  Lower SBP means higher confidence.  I reserve the word type for clusters with SBP < 20%.  I consider SBP meaningless for SBP > 50%.  Np comes out with SBP = 64%.  This does not necessarily mean that Np is invalid as a clade prediction.  My SBP method gives larger values for SBP with few samples, so valid clades improve with more data (SBP becomes smaller).  A clade with modal STR values close to the father clade (N is the father of Np) necessarily comes out with large SBP.  Concentration in Poland is evidence of validity for Np.  That 460=10 is also evidence of validity.  In my estimation, Np has about 80% confidence of validity, all evidence considered, but only 50% confidence of being a unique clade.  Np might be primarily one clade with interference from other independent small clades with similar STR values.  Or, Np might be 2 or more clades, about the same size, all concentrated in Poland, but distantly related.  Clarification:  two clades with very close nodes to the father branch might be considered a single clade;  here I mean that Np might be 2 clades with nodes that are not close in the tree, perhaps with other small clade nodes between them that do not fit Np STRs (by the luck of random mutations in the ancestor).  More discussion below on this idea.

             In the R1a Project, my 37 marker definition captures 11 samples with SBP = 95% (data at 67 markers, download 14 Oct).  Eight of the 11 have L1029- result and the others are not tested yet.  Seven of the 11 are of “Poland” origin.  Two L1029-  are N type that do not match Np.  There are 38 L1029+ that do not match Np.  Summary:  L1029- are rarer in the R1a Project (compared to the Polish Project) and the L1029- predominantly match Np.  SBP is worse (higher) because of interference at the cutoff by more R1a samples from outside Poland.  This paragraph is not conclusive, however, because the administrators of both projects work together;  many of the samples come from men who joined both projects.  Both projects worked hard on getting L1029 results this year, using 460=10 fit as a guide for emphasis.

             As an independent test, I checked (11 Oct) the “RussiaDNA” Project (another FTDNA project).  Of 260 R1a total, only 12 have been tested for L1029, and only 2 of these 12 came out L1029-:  one Poland and one Russian Federation.  This is preliminary evidence that Np is rare in the Russian Federation, although N is common in all Slavic countries.

             More projects checked (14-15 Oct):

                           Russian_impire:  4 L1029 tests, one negative, not Poland

                           LituaniaPropria:  4 L1029 tests, two negative, both “Lithuania” origin, one L1029- also in the Polish Project, both also in the R1a Project

                                        in addition, both L1029+ are also in the Polish Project, and one in the R1a Project, so these are not independent data

                           Scottishdna:  no L1029 tests

                           Finland:  1 L1029 positive

                           BritishIsles:  1 L1029 positive

             Other projects are not concentrating on L1029 tests.  I hesitate to encourage them, because M458+ L1029- seem to be mostly from Poland.

             I have an R1a database at 67 markers with 1816 samples from 15 FTDNA projects.  I collected this 20 June, when there were fewer L1029 results.  My 37 marker definition captures 13 samples, but 12 of these are in the Polish Project, and the other is in the R1a Project.  No additional samples fit Np.  There are more marginal samples at the cutoff step 2:  10 of them, of which only 2 are in the Polish Project and only one is from Poland.  This is my strongest evidence that the Np cluster is concentrated in Poland.

             Ysearch:  9 samples are captured by my Np definition CHFXB.  Only 2 are from Poland.  Only 2 of the 13 Polish Project Np joined Ysearch (one Poland and one Lithuania).  SBP is poor for Np at Ysearch because there are 6 samples at the step 2 cutoff, none from Poland.  In addition, 2 “Central European” modals fall at step 2 (37 markers used), emphasizing how hard it is to separate Np.  A simple explanation for these Ysearch results is that there are 1 or more other clades concentrated outside Poland, which might be L1029- or L1029+.

             At the top of this topic, I reported “more than 90% of N type” (M458+ L260-) are L1029+.  Since L1029- are concentrated in Poland, it may actually be more than 95% worldwide.  However, there is a reasonable possibility of one or more small clades showing up L1029- from outside Poland when more samples are tested.

             Age of Np:  It is too soon to estimate the age (TMRCA) of L1029, and age based on STR variation is uncertain because of known caveats.  However, L1029 is probably not much younger than N type because L1029 includes almost all of N type.  N type is surely older than 2,000 years.  Indeed, variation of L1029 STRs is looking similar to N type variation.  The L1029- node is necessarily the same or older than the L1029 node, so Np has an old node.  However, the age of the node is almost always older than the age of the clade (TMRCA).  Np seems very young, as evidenced by the unique 460=10 value discussed above.  On the other hand, other markers have significant variation within Np;  that may mean Np is not so young;  or, that may mean Np is composed of 2 or more clades, each of which is young.

             Speculation:  Np reminds me of P type (L260 update, next topic).  In my 2009 publication, and at this web page, I have speculated that L260 may have a very old node, but the P type ancestor (MRCA) may have lived more recently, perhaps not long before formation of the tribes that led to the Polish nation.  It seems to me that M458 is quite old, but not many M458 individuals survived over the millennia, and a few of the M458 survivors were lucky enough to found clades during the population expansion of the last 3 millennia.  Perhaps the Np ancestor, with L1029- and 460=10, also lived long ago and left few survivors;  most of those few formed what are today very small clades, and one was (or perhaps 2 or more, all with 460=10, were) lucky enough to found the medium sized cluster today apparent as Np.  I find it interesting to consider the men who lived 1,000 to 2,000 years ago in the region that is now Poland (and / or maybe in another region from which there was a migration to Poland).  Due to the statistics of Y-DNA inheritance, most men do not form clades that last long, and very few men form large clades.  Human behavior may perhaps broaden the statistical spread of clade size, allowing rare men to produce relatively larger clades.  I speculate that among those proto-Polish men who founded clades that survive today, most were R1a, and many of those were M458, and one or a few of those were Np, and one was P.


Fall 2009 Issue of JoGG

             The Fall 2009 issue of the Journal of Genetic Genealogy has my publication split into two parts:

             Part I is my “mountains in haplospace” method for evidence that certain “types” of STR clusters correspond to clades.

             Part II is the application of that method to Common Polish Clades.  That article has a lot more detail than this web page, but that article was published in the Fall of 2009, so this web page serves as an update.

             PolishCladesUpdate is my folder for updates of the Excel analysis files for those two articles.

             This web page will continue as an introduction and summary, without as much jargon and detail as the articles and update folder.

             The Fall 2010 issue has my publication announcing the L260 SNP.


M458, L260, CTS11962

             Rewrite 12 Oct 2015.  Edit 14 Oct 2015

             Haplogroup M458 is a major branch subdivision of R1a.  M458 divides into L260 and CTS11962, except for very few samples that are M458+ L260- CTS11962-.

             Tree:  http://www.yfull.com/tree/R-M458/.  Shows 48 known SNP defined haplogroup branches of M458 (12 Oct 2015).

             12 Oct 2015 ISOGG code names:

                           R1a1a1b1a1   M458

                                        R1a1a1b1a1a   L260

                                        R1a1a1b1a1b   CTS11962

             See the New Results Table for an overview of M458 and the branching haplogroups that are common in Poland.  There are 464 samples in the Polish Project assigned to M458, in 53 assignment categories (24 Sep 2015). The R1a Project has more M458 samples and more categories.

             The SNP M458 was published by Underhill.  It has been available as an SNP test since early November 2009 at FTDNA.  M458 is common in Eastern Europe and is found throughout Europe and Western Asia.

             L260 is an SNP that I published in the Fall 2010 issue of JoGG.  It has been available as an SNP test since early April 2010 at FTDNA.

             L260 is equivalent to P type, which is a topic in my previous 2009 publication, Part II.  P type is significantly concentrated in Poland, and in the Czech Republic.  It is found at lower frequency in other Eastern European countries, and in eastern Germany.  About 9% of Polish males carry P type Y-DNA.  M458 was published after I submitted my article, during the referee process, so I added a comment about M458.  After my publication, L260 was discovered in a sample from the Polish Project, later found to be equivalent to P type, confirming my prediction that P type is equivalent to an SNP haplogroup.

             My P type topic at this web page has lots of further discussion that applies to L260.

             CTS11962 has been available as an SNP test since early 2013 at FTDNA.

             CTS11962 is equivalent to N type, which I also introduced in my publication.

             L1029 is the main branch of CTS11962.  L1029 has been available as an SNP test since March 2012 at FTDNA.

             My N type topic at this web page has lots of further discussion that applies to CTS11962 and L1029.

             Both P type and N type are code names published by me before the equivalent SNPs were discovered.  Click on the word “equivalent” for an explanation that STR types are not exactly equal to corresponding SNP haplogroups.

             L260 and CTS11962 each has a large number of phyloequivalent SNPs, which you can see listed in the Yfull tree.

             Ry type:  There is a family set (five samples with the same family name, very close STR match to each other) where one of them tested M458+ L260- CTS11962-.  These five are clearly not N or P.  Not even close in STRs.  These 5 samples are now categorized in the Polish Project as “Ry type”.  These were independently noticed by Lapinski, an administrator of the R1a Project, with a new category for these in that project, also.  Since these have been recruited as a group, they should count as only one sample for statistical purposes.  There are three other independent samples tested M458+ L260- CTS11962- in the Polish Project (24 Sep 2015).  These are rare examples of M458 samples that are neither L260 (P type) nor CTS11962 (N type).



             An article was published online, 4 Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.

                           Abstract                         STR Data

             I call this article “Underhill” for short, because his is the lead name in the list of 34 authors for this major work.

             This web page about Polish Clades was completely rewritten using this new information.  Recent L260 and M458 test results are consistent with (albeit not full proof of) my previous R1a subdivision into “types” here on this web page about Polish Clades.

             Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-).

             R1a1a7 is the new M458 haplogroup.  R1a1a7 includes what I have been calling P type and N type here on this web page, even before M458 was available.

             R1a1a* is a new paragroup.  This is M458 negative.  It includes all my other R1a types

             This Underhill article has data for 158 “Poland” samples (Table 2):

                           R1a1a*:            71 samples       44.9%

                           R1a1a7:            87 samples       55.1%

             The 70% confidence interval for R1a1a7 is about 50% to 60% in the Underhill Poland data.

             Worldwide 77% of the Underhill data is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
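
             The percentages and the confidence interval above can be checked with a short Python sketch.  It uses the sample counts quoted from the Underhill tables and a simple normal approximation to the binomial;  the choice of approximation is an assumption for illustration and may differ from how the interval above was obtained.

```python
from math import sqrt
from statistics import NormalDist


def normal_interval(k, n, level):
    """Two-sided normal-approximation interval for a binomial proportion k/n."""
    p = k / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


# Counts quoted above for the "Poland" samples
poland = {"R1a1a*": 71, "R1a1a7": 87}
total = sum(poland.values())
shares = {name: 100 * count / total for name, count in poland.items()}

# 70% interval for the R1a1a7 (M458+) fraction in the Poland data
low, high = normal_interval(poland["R1a1a7"], total, 0.70)

# Worldwide split quoted above: 222 R1a1a* vs 68 R1a1a7
worldwide_star = 100 * 222 / (222 + 68)
```

             With these counts the interval comes out at roughly 51% to 59%, consistent with the "about 50% to 60%" stated above.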

             M458 results are coming in now for this new SNP test and the Polish Project R1a is splitting about evenly, with a few percent more R1a1a7 than R1a1a*, although the latter is more common worldwide.



             Up to here, I have tried to write this web page as news and summary, with links to more discussion below.  I hope anyone having minimal familiarity with genetic genealogy jargon has understood.  If you read this top to bottom, it gets progressively more detailed, with more and more jargon.  I’m sorry about that, but the audience is also readers with genetic genealogy experience who want to know how I came to my conclusions.  If you cannot follow some of this, it is written in a manner that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.

             If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location.


Polish Project Assignments

             Rewrite 6 Oct 2015.  Edit 8 Oct 2015.

             Lawrence Mayka is the administrator of the Polish Project.  Click on the Polish Project web link to see how Larry assigns samples (men) to categories.  The Polish Project has sections for mtDNA and for Y-DNA.  This web document of mine is restricted to Y-DNA.  I help Larry with assignments to types.  Lukasz Lapinski and Paul Stone also help with assignments.

             Haplogroups are defined by SNP mutations.  The goal is to assign samples to their proper terminal haplogroups.  Your terminal haplogroup corresponds to the youngest known SNP haplogroup in your male line of the human Y-DNA tree.  When the terminal haplogroup cannot be assigned with reasonable confidence, assignment may be to an upstream branch (one of the older branches leading to the terminal haplogroup).  New terminal haplogroups (new younger branches) are continuously discovered because of the recent rapid rate of discovery of relevant new SNPs.
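
             The idea of a terminal haplogroup can be sketched in Python as a walk down the tree.  The small tree below uses only the M458 branches named on this page (M458 divides into L260 and CTS11962;  L1029 is the main branch of CTS11962);  the real tree has many more branches.  The function name and the data layout are hypothetical, chosen only for illustration.

```python
# Tiny excerpt of the tree, parent SNP -> downstream (younger) SNPs.
CHILDREN = {
    "M458": ["L260", "CTS11962"],
    "CTS11962": ["L1029"],
    "L260": [],
    "L1029": [],
}


def terminal_haplogroup(root, results):
    """Return the youngest SNP confirmed positive, starting from root.

    results maps SNP name -> True (positive) or False (negative);
    untested SNPs are simply absent.  Returns None if root is not positive.
    """
    if not results.get(root, False):
        return None
    current = root
    while True:
        positive = [snp for snp in CHILDREN.get(current, [])
                    if results.get(snp, False)]
        if not positive:
            return current  # no younger positive branch is known
        current = positive[0]


n_type_not_l1029 = {"M458": True, "L260": False, "CTS11962": True, "L1029": False}
remainder = {"M458": True, "L260": False, "CTS11962": False}
```

             Here the first sample ends at CTS11962 and the second stays at M458, the upstream branch.  A sample with untested downstream SNPs also stops at the last confirmed branch, which is where a recommended SNP test would help.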

             Samples are grouped by assignment categories on the Y-DNA STR data pages, where the category titles appear as horizontal colored rows, followed by rows of samples assigned to that category.

             When appropriate, the assignment category names include a recommended SNP for further testing, in order to confidently determine the terminal haplogroup for those samples.

             I should mention at the outset that you can always save money by waiting.  DNA testing costs are coming down as better testing methods are developed.  More detailed SNP packs will surely be available in the future.  As more data accumulates, a sample may show up eventually that matches your current STR data very closely, and if that sample has recent new SNP data with positive results, your sample will probably be assigned (or predicted) to that same corresponding new haplogroup at no cost to you.  Assignment categories in the Polish Project are provided to help men who are doing male line genetic genealogy research (and women doing research on the male lines of their husbands and fathers and maternal grandfathers), and who would like results soon.

             If you are planning to purchase Big Y (next topic), there is no need to purchase SNP tests, because Big Y includes just about all the commercially known Y SNPs.

             If you are new to Y-DNA testing, you should join an FTDNA haplogroup project corresponding to one of your main haplogroup branches.  The haplogroup project administrators are usually up to date on the latest SNPs for that haplogroup, and are often eager to help beginners figure out where they fall in their branch of the Y-DNA tree.  The Polish Project administrators can also help out, although we may not be quite as knowledgeable about your specific haplogroup.  To find your main haplogroup branches, from your FTDNA home page (Dashboard), click on “Haplotree and SNPs”.  The tree should come up, indicating your FTDNA assigned haplogroup.  The tree should have the main branches indented in rows above your FTDNA assigned SNP haplogroup (upstream haplogroups indented to the left).  To find haplogroup projects, from your FTDNA home page, under the “Projects” tab, click on “Join a Project”, then scroll down to the header “Y-DNA Haplogroup Projects”, then click on the first letter of your haplogroup assignment;  check out the projects that come up.

             STR mutations had been easier to test than SNPs in the past, so many samples have STR data without recent SNP data.  Predicted assignments (for samples without up to date SNP data) are based on STR correlations, by comparison to samples that have both STR and SNP data.  The men with such predicted assignments can verify their assignments by ordering the corresponding SNP test that is named in the assignment.

             The Polish Project also includes FTDNA computer generated assignments in a column labeled “Haplogroup”, which uses a color code;  green text means assignment based on a positive SNP test result;  red text means assignment based on STR prediction.  I do not know the FTDNA computer algorithm for those red STR based predictions, but it is conservative;  I notice they have more than 97% probability - less than 3% of those red predictions end up in different haplogroups when they are eventually SNP tested.  However, that means most of the SNPs for recently found branches of the Y-DNA tree are not predicted by FTDNA, because there is not enough data for 97% probability predictions.  Most of the newer SNPs are for younger branches, where STR prediction simply cannot be done with such high confidence, because those younger branches do not all have unique STR signatures.

             The Polish Project assignments are more aggressive.  The assignment guideline for predicted assignments based on STRs is a future validation rate better than 80%.  The intention is to provide more STR predictions, accepting the risk that some might later be found incorrect.  Over the years, Polish Project predicted assignments, when tested later with new SNPs, have been validated as the correct haplogroups well over 80% of the time.  In other words, >80% probability is the overall goal for predicted assignments.  We avoid recommending SNP tests when a sample has more than 95% confidence of testing positive, because testing money is better spent testing for the branches - younger SNPs.  A negative SNP test result is OK because by eliminating a haplogroup, assignment to another branch can be made with higher confidence.

             Many haplogroups have multiple phyloequivalent SNPs.  Polish Project Assignment categories select one of those phyloequivalents to be used consistently in the category name.  Usually one of the first to be discovered is used, and then changes are made only when new data causes previous phyloequivalents to be split into different branch segments;  so a new code name is used only when necessary.  I do not fully understand the FTDNA computer assignments in this regard;  samples in the same FTDNA assigned haplogroup often use different phyloequivalent SNP code names.

             Many Polish Project SNP prediction categories have qualifiers, such as “Predicted, Recommended, or Needed” to indicate relatively higher confidence, or qualifiers such as “Credible, Consider, or Borderline” to  indicate relatively lower confidence.

             The 80% guideline does not mean each assignment has >80% confidence of validity, for multiple reasons:  I have a separate topic explaining that confidence ranges lower than probability when probability is calculated or estimated from limited data.  Confidence is an interval of %;  I use the lower number.  For example, by >80% confidence I mean an 80% confidence interval that ranges from 80% to a higher number.  The Polish Project makes assignments with roughly >50% and not >95% confidence.  Many assignments are made with >90% confidence if an appropriate SNP is available.  On the other hand, the Polish Project does occasionally make some “borderline” assignments with confidence not much greater than 50%:  where the best SNP from the 50% to 95% range has a confidence not much better than 50%; and where a negative result for such a low confidence SNP test is actually a good result because ruling out that SNP increases the confidence in another SNP.  Sometimes, when 2 or 3 SNPs are available as possible predicted branches but none individually has >50% confidence, the assignment category name has all those SNPs listed.

             Clusters and types are hypothetical haplogroups, used as assignment categories based on STR analysis.  There were more of these a few years ago, before the flood of new SNPs.  Many of those have been validated or shown to be invalid (STR matches not having high probability of belonging to one haplogroup).  In most cases the original cluster or type name is still used for assignments, along with the corresponding equivalent SNP.  There are still some clusters and types that have not been validated or invalidated.  In the Polish Project, types have >80% confidence of validity;  “cluster” is used for hypothetical haplogroups with >50% but not >80% confidence.  Borderline is used for samples that match a type with 50% to 80% confidence and do not match any other category better.
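
             The thresholds in this topic can be summarized in a minimal Python sketch.  The function names and the "no category" label are hypothetical, and the sharp boundaries are a simplification:  actual assignments are judgment calls that also depend on which SNPs are available.

```python
def category_label(confidence):
    """Label for a hypothetical haplogroup, given confidence of validity (0 to 1).

    Follows the wording above: types have >80% confidence; "cluster" is used
    for >50% but not >80%.  The label for lower confidence is an assumption.
    """
    if confidence > 0.80:
        return "type"
    if confidence > 0.50:
        return "cluster"
    return "no category"


def recommend_snp_test(confidence_positive):
    """True when an SNP falls in the roughly >50% and not >95% range.

    Above 95% the test is not recommended, because the money is better
    spent on younger branch SNPs.
    """
    return 0.50 < confidence_positive <= 0.95
```

             For example, a hypothetical haplogroup at 64% confidence would be labeled a cluster, and an SNP with 97% confidence of testing positive would not be recommended.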

             To order a recommended SNP, from your FTDNA home page, click on “Haplotree and SNPs”.  The tree should come up, indicating your FTDNA assigned main haplogroup.  The tree should have the SNP branches indented in rows below (downstream - younger than) your FTDNA assigned SNP.

             FTDNA is not particularly fast to add new SNPs.  Other companies are available;  some of them honor requests for the newest SNPs.  If you are new to SNP testing, ask for help from your haplogroup project administrator in selecting new SNPs appropriate for you.  I generally recommend waiting for FTDNA because I like the convenience of all SNP data for the Polish Project in one place, but I sometimes purchase elsewhere.  For SNPs that are not in the FTDNA tree, see SNP ordering information.

             SNP packs are available and sometimes recommended in the Polish Project assignment name.  SNP packs include many SNPs at a very low cost per SNP.  SNP packs are recommended for samples that have no SNP tests, or have not been SNP tested in a long time, and where the assigned haplogroup has quite a few known younger SNPs.  You can usually find your recommended SNP pack in the FTDNA tree, above your assigned position in the tree.

             Sometimes a Remainder category is used for paragroups, which means the remaining samples from a haplogroup that do not belong to certain of the known subdivision branch categories of that haplogroup.

             Sometimes an Unassigned category is used for samples from a main branch haplogroup without sufficient STR data to assign those samples to one of the known branches.

             STRs:  Until a couple years ago I (and most experienced genetic genealogists) recommended purchasing the maximum number of STR markers.  The FTDNA maximum standard set is 111;  smaller standard sets have 67, 37, 25, and 12.  More markers increased the confidence of assignments;  some types and clusters required 67 or 111 markers for assignment;  more samples with 111 markers allowed discovery of more types and clusters.  Today, this is all still true, but SNPs are more important than STR markers for assignments.  Assignments that need more markers have phrases like “SNPs or Markers Needed” in the assignment name.  A sample with only 12 markers can be assigned to a main haplogroup branch based on only 12, and from there an SNP Pack can be purchased to identify the appropriate downstream haplogroups, and more SNPs can be individually tested in sequence to determine the terminal haplogroup for that sample.  If cost is not an issue, Big Y (next topic) is better than purchasing SNP tests.

             Another purpose for STRs:  Finding male line best matches from the large number of samples without sufficient SNP data.  The FTDNA site automatically lists close STR matches.  For this purpose, more markers are better.  111 markers is best if cost is not an issue.  However, even with 111 markers tested, the FTDNA site does check for matches at fewer markers because there are many samples in the FTDNA database that have only sets with fewer.


Big Y

             New Topic 6 Oct 2015.  Edit 8 Oct 2015.

             “Big Y” is a commercial project  by FTDNA for reading about 12 million base pairs of the DNA of the Y chromosome, which has about 60 million base pairs total.  New SNPs are being discovered in the Big Y data provided by customers.

             Link:  https://www.familytreedna.com/learn/y-dna-testing/big-y/.

             The FTDNA home page for your DNA kit has a link for ordering Big Y, and for later viewing the results.

             If you are new to Y-DNA genetic genealogy, you might ask for help on Big Y, from an administrator of your haplogroup project, because Big Y results are not easy to understand.  Yfull does an excellent job of analysis of Big Y results, at a modest price;  I highly recommend joining Yfull by submitting to Yfull your “BAM”, which is the very large data file in Big Y results.

             The cost is currently $575.  Other companies offer similar tests;  I recommend FTDNA because I like the convenience of Polish Project data being available at one place.

             If you are planning to purchase Big Y, there is no need to purchase SNP tests, because Big Y includes just about all the commercially known Y SNPs.  In addition, Big Y lists your private SNPs, corresponding to recent Y chromosome mutations in your male line ancestors.  There is a good chance one of your private SNPs will match up with a private SNP of someone who previously purchased (or later purchases) Big Y, in which case that private SNP will define a new relatively young terminal haplogroup branch for just the two of you;  more men can later test for only that SNP to see if they belong to that new haplogroup.

             If you purchase Big Y you probably won’t have to purchase any more SNPs for quite a few years.  Even when a test reading more of the Y chromosome at a lower price becomes available, men in your male line using that newer test might discover additional SNPs not in your Big Y results, but those SNPs will usually be phyloequivalent to your Big Y SNPs.

             Other testing companies provide similar tests.  I don’t work for FTDNA.  If I seem to be pushing their tests it’s because I encourage men with Polish ancestry to take up to date tests and to join the Polish Project, where FTDNA data is readily available for analysis.


Description of the R1a Categories

             There are separate topics below for descriptions of selected categories in Haplogroups I, N, and R1b.

             This is a long topic with many short subsections, each for a category.  Many of these subsections are out of date and need to be rewritten.  The subsections without a date on the first line may be a few years old.

             This large topic has descriptions for many of the Y-DNA categories at the Polish Project. Some of these are haplogroups, some are types, some are clusters.  Types and clusters are high confidence hypothetical haplogroups.

             Click the Ysearch web links in the Results Table for modal haplotypes, which are my best fits of web data to groups of men with similar STR data.

             Please don’t get confused.  The following capital letter names are my codes for R1a categories.  Capital letters are also used for the large official haplogroups, but that’s different.

             Some of the following categories are discussed in my November 2009 publication, and may have archive copies of my 2009 Excel analysis files stored in the Supplementary folder.  Many of the following types have my update Excel analysis at PolishCladesUpdate.


             A.  Ashkenazi.  Update 8 Mar 2012.

             A type is a hypothetical clade of L342, which is a branch of Z93.  A type does not correspond to a haplogroup yet, because there are L342+ samples that do not match the A type definition.

             This type is discussed in my publication, Part II.  The definition, using 67 markers, has been available since 2008 at Ysearch, as FCUFG.

             I have consistently expressed more than 98% confidence that A is a valid clade, not just because of my work, but because the modal haplotype closely matches the various versions of the most common Ashkenazi haplotype, which has been widely studied and reported on the web.  It should be emphasized that not all Ashkenazi match this type, and some men in this type may not be descended from Ashkenazi.  This type is not restricted to Poland.  Levy-Coffman wrote an article about Ashkenazi genetic genealogy;  I noticed discussion in a recent Science article.  I expect an SNP to show up someday equivalent to what I have been calling A type.

             Between 2008 and 2011 I predicted that A type was a subtype of K type, but I never had more than 80% confidence in that prediction, which is now seen to be wrong, because K type is in Z283, a brother SNP to Z93.  See the R1a Project for a recent SNP tree.  The match of A type to K type at the first standard set of 12 markers is now seen to be a coincidence.  Older publications call that 12 marker haplotype, very common in Eastern Europe, the “Ashkenazi” haplotype, but we now know that only a small fraction of men who match at 12 markers are Ashkenazim.


             B.  Update 8 Mar 2012.  A hypothetical subtype of K type, identified by Mayka.  Concentrated in Poland.  I have more than 90% confidence that B type represents a clade that will be verified some day with a new SNP discovery.  My confidence is only about 80% that it is a subtype of K;  the node for B type in the R1a tree might be slightly younger or slightly older than the K definition node.  Individual assignments to B type have 80% or higher confidence, depending on how closely each fits.

             C.  Update 10 Mar 2012.  This type code name was dropped from the Polish Project in early 2011.  The two C type samples are both now listed as R1a1a (M198+,M417-), and they are the only samples in this paragroup, so that is a better label.  These are the only two R samples in the Polish Project with the signature (385a,455) = (13,10).  C was added to the Polish Project in Dec 2009 by Mayka, who noted that Didier Vernade first pointed out the unusual DYS392=13 value in 2007.  DYS392=11 is almost universal in R1a.  C type is very small.  There are only 2 Polish Project samples in C type, only 1 at 67 markers, but this type is well isolated on Ysearch, with 4 different samples with 67 markers.  I calculated SBP = 7% using only 37 markers with Ysearch data (in early 2010).  None on Ysearch are identified as “Poland”.  C type differs very much in STR values from the rest of R1a1.  That is because C type has a very old node in the R1a tree.


             CTS11962.  See M458, L260, CTS11962 for a detailed discussion of these major haplogroups.  See also N Type, equivalent to CTS11962.


             D.  Update 12 Nov 2011:  Based on 1 Nov 2011 Polish Project data.  Analysis file:  DType.xls.  59 marker definition, cutoff = 9, no samples in the gap at 9 to 11;  SBP = 5.3%.

             Concentrated in Poland:  Ysearch K49NZ;  34% have origin Poland.

             This type was added to the Polish Project in Jan 2010.  The cluster was brought to my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some time ago, based on DYS462=12.

             Signature (460,481,462,560) = (10,<22,12,18).  Any one of these four markers by itself can distinguish D type with high probability from other R1a1a1i (Z280) samples, but those values can be found individually as independent mutations in other R1a clades.  D type cannot be distinguished using the 25 FTDNA standard markers.  At 37 markers, only 460 is available.
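
             A signature like this can be checked mechanically.  The following is only a minimal sketch, not the method used in DType.xls:  the function name, the dictionary layout, and the example sample are my own assumptions for illustration.  Markers that a sample has not tested are skipped, which is why a 67 marker sample can be checked only at 460 and 481.

```python
import operator

# D type signature from the text: (460,481,462,560) = (10,<22,12,18).
# Each entry is (comparison, value); "<22" becomes (operator.lt, 22).
D_SIGNATURE = {
    "DYS460": (operator.eq, 10),
    "DYS481": (operator.lt, 22),
    "DYS462": (operator.eq, 12),
    "DYS560": (operator.eq, 18),
}


def signature_matches(sample, signature=D_SIGNATURE):
    """Return (matched, tested) counts over the signature markers the sample has."""
    matched = 0
    tested = 0
    for marker, (compare, value) in signature.items():
        if marker not in sample:
            continue  # marker not in the set this sample purchased
        tested += 1
        if compare(sample[marker], value):
            matched += 1
    return matched, tested


# Hypothetical sample with only two of the four signature markers tested.
sample_67 = {"DYS460": 10, "DYS481": 21}
print(signature_matches(sample_67))
```

             A full match on few markers is weaker evidence than a full match on all four, because each value can also arise as an independent mutation in other R1a clades.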

             At 67 markers, 481<22  is an effective signature:  16 total D type:  13 D have 481=21, and only one other R1a sample has the 21 value. 2 D have <21, with no other R1a samples.  One D has the 22 value along with several other R1a.  481=25 is modal for R1a.

             DYS462 is a standard STR marker at Sorenson, and has been  available for years at Ysearch;  462 is now available at FTDNA with the 111 marker set.  In Nov 2011 I noticed that DYS560=18 is another marker for D type from the 111 set, but that is not available at Ysearch (Nov 2011).

             That DType.xls analysis file provides SBP = 5.3%, although I did manual editing of the definition to improve SBP, providing some selection bias.  On the other hand, isolation of D type is even better than indicated by SBP for two reasons:  Samples just beyond D type, steps 12 and 13, all have solid assignments to other types.  Most of the D samples have 462=12 and a few have 560=18, and those samples beyond step 11 with data have other values at those 2 markers, so a future definition using all 111 markers should provide even better (lower) SBP.  Only 3 D type have 111 markers;  most of the DYS462 data was obtained some time ago by purchasing that marker separately.

             D type seems to be Z280+ Z92-, based on only 1 sample (10 Nov 2011 - columns BW and BX in that analysis file).  Z92 is a new SNP, so not much data is available;  confirmation should be available soon.

             D type is clearly a Polish type:  In the Polish Project 10 of the 16 D type at 67 markers indicate “Poland” ancestry;  the exceptions are 2 “Unknown” (one with an obvious Polish name and one with a name that might be Polish), 2 Slovakia, 1 Germany, and 1 Czech Republic.

             On Ysearch, there are 32 samples below the D type cutoff, and 11 of them (34.4%) indicate Poland Origin, which is quite high for Ysearch.  SBP is 15% on Ysearch, implying there are clades near the cutoff that are rare in Poland;  indeed none of the 5 samples in the gap at steps 9 and 10 indicate Poland.  For details see the “Ysearch” sheet in DType.xls.

             Age (ASD sheet cell N12) comes out 1,385 years using all 67 markers.  Old human Y-DNA clades have age older than the raw ASD calculation because of population bottlenecks and because of other statistical adjustments.  However, D type is not very old, so this correction may not be needed.  On the far right of that ASD sheet I sorted markers by age, and I added notes about problem values, and suggested four markers that should be masked out, but the age with these 4 masked out (ASD sheet cell N29) is not much different, 1,216 years.  I see evidence of subclades, so D type might be composed of younger subclades that might be identified with more data.
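
             For readers without the analysis file, here is a rough sketch of one common form of an ASD (average squared distance) age estimate, including the effect of masking markers.  This is an assumption about the general technique, not a copy of the ASD sheet:  the mutation rates, the 25 years per generation, and the toy samples below are all made up for illustration, and none of the statistical adjustments mentioned above are applied.

```python
def asd_age_years(samples, modal, rates, masked=(), years_per_generation=25):
    """Rough ASD age: total squared distance from the modal over total mutation rate."""
    markers = [m for m in modal if m not in masked]
    total_asd = 0.0
    for m in markers:
        # Average squared difference from the modal value at this marker.
        squared = [(s[m] - modal[m]) ** 2 for s in samples]
        total_asd += sum(squared) / len(samples)
    total_rate = sum(rates[m] for m in markers)
    generations = total_asd / total_rate
    return generations * years_per_generation


# Hypothetical toy data: 3 markers, 4 samples, made-up rates per generation.
modal = {"DYS393": 13, "DYS390": 25, "DYS458": 15}
rates = {"DYS393": 0.001, "DYS390": 0.003, "DYS458": 0.008}
samples = [
    {"DYS393": 13, "DYS390": 25, "DYS458": 15},
    {"DYS393": 13, "DYS390": 25, "DYS458": 16},
    {"DYS393": 13, "DYS390": 24, "DYS458": 15},
    {"DYS393": 13, "DYS390": 25, "DYS458": 17},
]
print(asd_age_years(samples, modal, rates))                     # all markers
print(asd_age_years(samples, modal, rates, masked={"DYS458"}))  # one marker masked
```

             Masking a marker removes both its squared distances and its mutation rate from the totals, so the age can move either up or down.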

             I noted three markers (on the far right of the ASD sheet) that I consider hints for subclades.  Last year in this topic I mentioned Da, with the signature (458,576,444)= (16,20,14) and that still looks promising, but not convincing.  One of the three D samples with 111 markers fits Da, and provides a hint that markers 463 and 715 from the 111 extension might help to resolve Da, so it will be interesting to see what happens as more D men order the 111 extension.


             E.  Update 8 Mar 2012.  V. Rudich entered a modal for this cluster into Ysearch as ID MW7DP, named “North Eurasian”.  Mayka modified it slightly for the modal used here by me, GNYBG, named “Belarus”.  67 markers.  It’s an excellent type;  on 25 May 2010 it had 16 samples at 67 markers in the Polish Project, with SBP = 14%.  In late 2011 E type samples tested positive for the new Z92 SNP, corresponding to the R1a1a1g2d haplogroup (ISOGG early 2012).  However, not all Z92+ samples fall into types.


             FH Clade.  F and H types were suggested by Mayka.  They have the signature (439,511,452) = (11,11,28).  They differ from each other, so I could not make a combined FH type.  I can make a reasonable FH cluster, but it is not necessary, since the FH clade can be better defined as the combination of the three types Fa, Fb, and H.  The original F type (introduced Jun 2010) was split into Fa and Fb in Dec 2010.  DYS452 is not one of the FTDNA standard markers, so not many Polish Project members have this marker evaluated.  Mayka and I helped most of the Polish Project members in FH, and members just beyond FH, to get 452 evaluated.  Samples beyond FH have 452=30.  My analysis files do not use 452 for determination of SBP.  452 would not significantly lower SBP because most of the background near the cutoff for each type are samples from the other two.  In other words, Fa, Fb, and H are very well isolated from the rest of R1a, but not so well isolated from each other.  These three FH types do not seem to be specifically concentrated in Poland (per Ysearch) although they are concentrated in Slavic countries including Poland.  All three types seem quite young, with relatively low STR variance (see the ASD sheets in the analysis files).


             FH Borderline.  The borderline samples from Fa, Fb, and H are combined into a single FH Borderline category in the Polish Project, because these clearly belong to the FH clade but have less than 80% probability of belonging to any one of the 3 types.


             Fa.  Ysearch YQ6D2.  66 markers, cutoff 9, gap 2.  SBP = 27%.  See FH clade, above.


             Fb.  Ysearch EFQM7.  56 markers, cutoff 5, gap 4.  SBP = 23%.  These samples were the original F type, before Fa was split off.  See FH clade, above.


             H.  Ysearch 559EE.  58 markers, cutoff 7, gap 3.  SBP = 14.5%.  See FH clade, above.


             G.  This type was suggested to me by Mayka, who calls it the Pomeranian cluster.  Pomerania is the name of the region on the south shore of the Baltic Sea including regions of both Germany and Poland.  Marcin Wozniak found the G modal haplotype (at 12 markers) to be very common among Kashubians.  Kashubians consider themselves an ethnic group or nationality within Poland.  It will be interesting to determine if Kashubians in Poland have a higher % concentration of G type than German Pomeranians.  Meanwhile, “Pomeranian” is a convenient neutral name, suggests Mayka.

             G type is mentioned only briefly in my publication because not much data was available to me at that time.  My GType.xls update analysis file with June 2010 data has excellent results:  There are 12 samples in a nice type with SBP = 11.2%.  There is preliminary evidence of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in Ysearch;  see Haplotypes.xls for a list including hypothetical working modals.

             11 Jan 2011 news:  Mayka informs me that one of the new SNPs, L365, is positive for all of 5 G type samples that were tested so far.  A few samples from other types all tested negative for L365.  It seems like G type is included in the new haplogroup defined by L365.  One of those 5 is in that tentative Ga subtype.

             Of course, this is very preliminary.  It is possible, if unlikely, that some of the G type samples still might turn out negative for L365.  It is quite possible other samples not matching G type might be found L365 positive.  I’ll provide updates here.

             Those 5 samples are positive for M417, negative for M458, and negative for a few other new SNPs. 

             L365 is one of a few new SNPs that look like they will receive the notation R1a1a1x, where x = i, j, k, etc.

             This type should not be confused with another G type in the N haplogroup.

             14 May 2011 comment:  Sorry I have not taken the time to update this G type topic.  Recent data continues to verify that G type seems the same as the haplogroup defined by L365, now called R1a1a1i.


             I.  S18681. 

             13 Nov 2015 link to Stanaszek facebook page:  https://www.facebook.com/R1aS18681/?fref=ts

             Map showing location of S18681 samples (men):  https://www.google.com/maps/d/viewer?hl=pl&authuser=0&mid=zIcwIZnt7lUg.kLMb_kZH4B3c

             Update 26 Feb 2015.  Lots of new SNPs have been discovered in the past few months, so my Nov rewrite is already out of date. 

             For the latest status check http://www.yfull.com/tree/R-S18681/.

             Rewrite 9 Nov 2014:

             I type seems to be about equivalent to the new SNP S18681.

             In other words, if you test positive for the SNP S18681, that places you in the S18681 haplogroup, which is closely equivalent to what I have been calling I type.

             Samples that match I type at 111 STR markers are coming out positive for S18681.  Samples that do not match I type at  111 markers are coming out negative for S18681, with only one exception.

             At 67 markers, there are three S18681+ outlier samples that do not match I type.  I type is defined by STRs, so future S18681+ outliers may not match I type, and a few samples beyond the I type cutoff may come out S18681+ in the future.  I have been slightly adjusting the definition of I type as more 67 marker data accumulates, so the definition has been improving with time.

             At less than 67 markers the probabilities of outliers are higher.  More discussion about this below.

             Most but not all I type samples in the Polish Project are also coming out positive for the new SNP YP331.  There are two newer SNPs, YP314 and YP315, that are located between S18681 and YP331.  The most recent SNP finding is Y5973.2.

             The “father” of S18681 is CTS8816, with the two “brothers” L1280 and Y2902.

             The SNP sequence is R1a > Z280 > CTS3402 > CTS8816 > S18681 > YP315 > YP314 > YP331 > Y5973.2.
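
             The sequence can be used to label a sample from its SNP test results, in the “positive for this SNP, negative for the next” notation used for paragroups in this topic.  The following is only a sketch under my own assumptions:  the function name and the True/False result format are hypothetical, the chain starts at Z280, and inconsistent results (for example a positive below a negative) are not checked.

```python
# SNP chain from the text, oldest to youngest.
CHAIN = ["Z280", "CTS3402", "CTS8816", "S18681", "YP315", "YP314", "YP331", "Y5973.2"]


def classify(results, chain=CHAIN):
    """Label a sample from SNP results (True = positive, False = negative).

    Returns a label such as "YP314+ YP331-" (a paragroup), or "YP331+" when the
    next SNP down the chain is untested, or None when nothing in the chain is positive.
    """
    deepest = None
    for i, snp in enumerate(chain):
        if results.get(snp) is True:
            deepest = i  # keep the youngest positive SNP seen so far
    if deepest is None:
        return None
    label = chain[deepest] + "+"
    next_index = deepest + 1
    if next_index < len(chain) and results.get(chain[next_index]) is False:
        label += " " + chain[next_index] + "-"
    return label


print(classify({"S18681": True, "YP314": True, "YP331": False}))
print(classify({"S18681": True, "YP315": False}))
```

             A label without a trailing negative, such as “YP331+”, only says that the next SNP has not been tested yet.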

             This recent work on new SNPs is being done by Stanaszek, Milewski, Lapinski, and Mayka.  Łukasz Stanaszek has a document R1a_S18681.doc with a listing of I type samples from both projects, along with discussion of the possible origin of the S18681 haplogroup.  Michał Milewski has a tree chart for Z280, which includes S18681, at the forum:  http://eng.molgen.org/viewtopic.php?f=77&t=1464&start=120  Check that forum topic for the most recent update.

             My definition for I type is published at Ysearch EKVHX, uses 58 of the 67 STR markers, cutoff 8, SBP 16.2%.  My analysis file is available as IType.xls.  My Aug 2011 definition, which used 62 markers, still works quite well, as demonstrated in that xls file.

             I type shows the highest concentration in Poland using my Poland Concentration Index, as listed in the table at the top of this web page.  For details see Ysearch.xls.

             This analysis so far is for I type results posted at the Polish Project.  There are more I type results posted at the R1a Project, so let me continue with comments for both:

             My I type definition works OK in the R1a Project and at Ysearch, but not as well as in the Polish Project.  One obvious difference is that there are only four samples in the Polish Project confirmed or predicted into the paragroup YP314+ YP331-, while this paragroup at the R1a project is almost as large as the haplogroup YP331+.  This paragroup does not seem to be as concentrated in Poland as YP331+, which explains why my I type definition, tuned to the Polish Project, does not work quite as well in the R1a project.  Those four paragroup samples in the Polish Project are highlighted in that file IType.xls.

             SNP confirmed data:  So far (14 Oct 2014, both projects) there are only 3 samples confirmed with SNP tests in the paragroup S18681+ YP315- (plus one cousin assumed).  There is only one sample confirmed in the paragroup YP315+YP314-.  There are only three samples confirmed in the paragroup YP314+YP331-.  There are 10 samples (59% of 17 fully confirmed samples) confirmed in the haplogroup YP331+ (plus one cousin assumed).  Three of those 10 YP331+ are in the new haplogroup Y5973.2, and one of them is confirmed Y5973.2-.
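
             The 59% figure can be reproduced from the counts in this paragraph.  A small worked check, leaving out the two assumed cousins;  the variable names are mine:

```python
# Confirmed counts from the paragraph above (14 Oct 2014, both projects).
confirmed = {
    "S18681+ YP315-": 3,
    "YP315+ YP314-": 1,
    "YP314+ YP331-": 3,
    "YP331+": 10,
}

total = sum(confirmed.values())
# Percent share of each group among the fully confirmed samples, rounded.
shares = {group: round(100 * n / total) for group, n in confirmed.items()}

print(total)             # 17
print(shares["YP331+"])  # 59
```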

             There are about 20 samples confirmed or predicted S18681 that need testing for the branches;  we guess that most of these will come out YP314+, and most of those will come out YP331+.

             In other words, the largest subdivision of I type is looking like the haplogroup YP331+.  The second largest looks like the paragroup YP314+YP331-.

             The I type samples not yet SNP tested with low step at 67 or more markers are predicted S18681 with high confidence.  There are “Borderline” samples close to the cutoff for I type, and samples close to I type at <67 that might be S18681, but cannot be predicted with high confidence.

             The Polish Project and the R1a Project categorize all samples and recommend which SNPs should be purchased for those interested in determining the paragroup or haplogroup for their sample.  See Stanaszek (link above) for the combined recommendations.

             History of I type:  This name was introduced by me in my Fall 2009 publication, Part II, page 178.  I named it after my Polish Iwanowicz grandfather, who carried this type.  Later, I was informed that Russian web sites had been calling this STR cluster “Northern Carpathian”.

             The best ranked signature marker for I type is DYS578=9.  The ancestral value is 8.  DYS578 has the second slowest mutation rate of the 67 standard markers per the Chandler rates.  This marker is in the 37 set, but not in the 25 set.  So the 37 set is a reasonable predictor for I type, while the 25 set is not.  The 9’s are colored orange in that analysis file IType.xls.  Three other good signatures are in the standard 111 set, but not in the 67 set:  DYS463=24;  DYS532=12;  DYS504=14.  Another fairly good signature, available in the 25 set, is DYS458=14, again orange in the file.  This is a rapid mutator, so there is more variance.

             DYS511>10, available in the 67 set, highlighted in IType.xls, seems to be a marker for paragroup YP314+YP331-, while the ancestral value 10 dominates the rest of S18681.  However, one STR marker does not provide very confident assignments.

             Previous subdivision attempts of I type:  At this web page, I have reported that I type has a particular structure:  I type has always formed a well isolated mountain in STR haplospace, with relatively few samples near the cutoff.  However, the mountain is broad, not tall, more like a range of foothills, in the graph of number of samples vs step (STR mutation count).  I type has a few bimodal STR markers.  All this implies subtypes.  Over the years, I have tried various subtype definitions within I type, and I have seen attempts by others on the web.  But the subtypes have never provided statistical confidence;  they are not well isolated in STRs.

             Now we know why.  Those four paragroups and that one haplogroup all seem to be old branches.  It seems the S18681 tree is more like a bush.  Indeed, the confirmed samples do not come out together when grouped by STRs at 67 markers.  (Not many samples have 111 markers yet.)  In fact, most of the confirmed samples from small paragroups seem closer using 67 STRs to samples in other groups than to members of their own group.  That’s because there are many more samples in the database from other groups, and some are bound to match more closely just due to the luck of random mutations.

             In other words, I type is a broad mountain in haplospace because it has many relatively old branches.

             What does this mean?  It’s difficult to say with certainty, but here is what I consider the simplest speculative explanation:  S18681 I type originally grew quickly and prospered while other related clades died out, leaving this clade well isolated in STR haplospace.  Over the years, a number of I type clades survived, all with approximately the same age.


             J. This type was suggested by Mayka.  Only 6 members in the Polish Project, but this type is well isolated at SBP= 13%.


             K.  Update 28 Feb 2014.  K type dropped.  No longer needed.  This “K” topic will be removed in a future update.

             Since 2007, I had been using the name “K type” for a large R1a Polish category.  Over the years I had subdivided K into several smaller types and clusters, although I did not have high confidence that all of them in fact belonged to a single unique clade, as discussed at this web page over the years.  As types were subdivided, the remaining samples that fit the general K definition did not form a type.  My K group is now known to be a mix of independent haplogroups, so the Polish Project stopped using K as a category in Oct 2012, although quite a few small clusters with names such as Kx and Kz are still predicted, because the confidence in the clusters has always been higher than the confidence in K.  The various K categories are now clusters, types, and confirmed haplogroups within the two major haplogroup branches R1a1a1b1a2 (Z280) and R1a1a1b2 (Z93).

             Most of the samples originally classified as K are now in B type, D type, I type, and J type, all significant branches of Z280.  If an SNP shows up that captures many of these branches, I’ll be inclined to re-use the short code name K to discuss that branch.


             Kv, Kx.  Small clusters, need documentation.

             I have been using the subscripts “z”, “y”, “x”, etc backwards through the alphabet because I am running out of letters for new clusters and types.  These small hypothetical clades seem to be subclades of K, although I do not have high confidence about the subclade status.


             Kt, Ku, Ky.  Clusters with STRs similar to K type.  These came up Z92+, so their match to K type is a coincidence.  Need documentation as a new topic at this web page.


             Kw.  These came up equivalent to the new SNP L366.  Needs documentation here.


             Ky.  Update 8 Mar 2012:  This is now called Z92y type.

             Update 7 Oct 2011:  Based on 1 Oct 2011 Polish Project data.  Analysis file:  KyType.xls.  Ysearch BBB9T.

             Ky type was suggested to me by Mayka on 21 Dec 2010.  There were only 3 samples in Ky last year;  now there are 5.

             That KyType.xls file demonstrates that the same 5 samples are extracted using any number of markers from 11 to 67, although at some of those definitions one or two other samples are also extracted.  The full 67 markers work best, SBP=23%.

             Ky was more isolated last year;  a few samples showed up in the gap, reducing SBP.

             I’m using a hand edited definition, Ky63, using 63 markers, for the following reasons:

             Ky is unusual in that 4 of the 5 samples have an unusual value for at least one marker.  I highlighted these values in red in that file.  Notice also the high step values for those four, 8 through 11, using all 67 markers (column BX), although SBP came out 23%, which is an excellent low result for 67 markers.  The obvious (but speculative) interpretation:  each of the 5 samples seems to be a representative of a branch of this hypothetical clade, where each of the 5 branches has a node not much younger than the TMRCA.

             Hand editing like this does introduce some selection bias, so the calculated SBP=13.6% for Ky63 is misleading.  Countering the selection bias, some if not all of those 4 markers that I masked out might represent small tribal sized subclades, so future prediction of new Ky samples should work better using Ky63 with those 4 removed.

             The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out.  You can see that my selection is a bit arbitrary;  I could have masked less than 4, or more than 4.

             ASD age using all 67 markers comes out 917 years, cell N12.  ASD age using the 63 markers not masked out comes out 878 years, cell N29, not much less.  ASD age has a number of caveats, and 4 samples are not significant, so this age is highly uncertain.  Ky seems young, as haplogroups go.

             Ky does not have a prominent signature.


             Kz.  Update 5 Oct 2011:  Based on 1 Oct 2011 Polish Project data.  Analysis file:  KzType.xls.  Ysearch 9QJFQ.

             Kz type was suggested to me by Mayka on 6 Oct 2010.  Mayka speculates this might be a clade of Kazakh origin.  There were only 3 samples in Kz last year;  now there are 6.

             That KzType.xls file demonstrates that the same 6 samples are extracted using any number of markers from 2 to 67, so the definition is not critical for this well isolated type.

             Kz is effectively more isolated than the SBP values (row 12 in that file) indicate, because the samples just beyond Kz are all confidently assigned to other clades and types.  For this reason, those SBP values are moot.

             I’m using a hand edited definition, Kz59, using 59 markers, for the following reasons:

             Kz is unusual in that 5 of the 6 samples have an unusual value for at least 2 markers.  I highlighted these values in red in that file.  Notice also the high step values for those 6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%, which is an excellent low result for 67 markers.  The obvious (but speculative) interpretation:  each of the 6 samples seems to be a representative of a branch of this hypothetical clade, where each of the 6 branches has a node not much younger than the TMRCA.

             Hand editing like this does introduce some selection bias, so the calculated SBP=10.7% for Kz59 is misleading (but moot).  Countering the selection bias, many if not most of those 8 markers that I masked out might represent small tribal sized subclades, so future prediction of new Kz samples should work better using Kz59 with those 8 removed.  Again, this is moot, because any number of markers extract the same samples.

             The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out.  You can see that my selection is a bit arbitrary;  I could have masked less than 8, or more than 8.

             ASD age using all 67 markers comes out 724 years, cell N12.  ASD age using the 59 markers not masked out comes out 704 years, cell N29, not much less.  ASD age has a number of caveats, and 6 samples are not significant, so this age is highly uncertain.  Kz is clearly young, as haplogroups go.

             Additional information supplied to me by Mayka:  Three of the Kz type samples are from non-Polish men who suspect they have Polish male line ancestry, so it is not certain Kz type is Polish.  Kit number 152824 in Kz is from a man who purchased WTY and found the new SNP L399, but that SNP appears to be private, restricted to his family.  Insofar as that man recruited 3 more Kz samples into the Polish Project, Kz seems proportionally twice as large.  My next edit of the Results Table will reduce the percent size of Kz.

             Kz has the prominent signature DYS459b=18.  Mayka points out the additional signature DYS461=12, not one of the 67 marker set;  most of the samples in Kz have been verified with this 12 value.  Since the Polish Project neighbors (step at or beyond cutoff of Kz) are all assigned to other hypothetical clades, we do not know if the signature markers define a larger father clade.


             L.  This cluster is highly hypothetical.  It is rare in Poland, but second in size to K in European R1a1.  Larry Mayka suggested this cluster to me.  It is a well known Scandinavian cluster.  I checked it briefly, and it seems to be a “type” by my definition.  However, no Polish Project sample matches at 80% probability yet, so I am not yet using it for classification here.  More documentation about L will be available here when I find time to study it.


             L1029.  See L1029 And Np News.  See also L260 and M458.  See also N type.


             L1080.  New SNP needs documentation here.


             L260.  See M458, L260, CTS11962 for a detailed discussion of these major haplogroups.  See also P Type, equivalent to L260.


             L342.2.  New topic 30 Oct 2011.  This SNP was recognized as a new haplogroup by ISOGG during the summer of 2011.  This was an L342 haplogroup category at the Polish Project for a short time in the summer and fall of 2011, but it has been replaced by Z93, because it seems all the L342.2+ samples are also Z93+ in the Polish Project.  Apparently there are very few men elsewhere in the world found to be Z93+ L342.2-.

             Z93 is a more reliable SNP than L342.2, so it is recommended that men first test for Z93.  L342.1 is the same mutation as L342.2, discovered earlier in the E haplogroup.  L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests together are more reliable.  These 4 mutations are in the same segment, which is apparently a segment that mutates relatively rapidly.  Z93 is recommended as the better test for R1a samples that do not fit STR definitions of other R1a haplogroups;  the Z93+ samples can do the L342.2 test.  This information about L342.2 was supplied to me by Mayka.

             The Z93 category has the samples that do not fit the two known subdivisions:  A type and L342T cluster (next topic).


             L342T.  New topic 30 Oct 2011.  Based on 26 Oct 2011 Polish Project data.  Analysis file:  L542TCluster.xls.  I just noticed this cluster.

             L342T is not a type, because SBP did not come out low enough.  However, I included this cluster discussion here for the following reasons:

             Seven samples at 67 markers fit my new 48 marker definition for L342T.    There are 19 A type samples, which should all be in the same L342.2 (Z93) haplogroup, but those A samples do not fit L342T;  the closest A’s are at step 8, where the cutoff is 6.  There are 5 more L342.2 (Z93) samples at 67 markers, and those 5 also do not fit L342T, falling at steps 11 through 21.  In other words, L342T is well isolated from the other L342.2 (Z93) samples, including the A type branch.  The one background sample (STR values fit the L342T definition) and the four samples beyond the cutoff, are assigned to K type and to subtypes of K;  Z280 has recently become available for K type;  as those background samples get tested in the future for Z280, my L342T cluster will start looking better.  Let me say that another way:  a cluster should be analyzed with data from its own haplogroup, so L342T should be compared only to L342.2 (Z93) data.  But there is very little L342.2 (Z93) data available, so I used the full R1a database in that xls file.  That means L342T is likely more isolated than it seems right now, so it is more likely to correspond to a valid haplogroup.

             Mayka pointed out to me that some of the L342T samples have Tatar ancestors.  That’s why I used the “T” in the code name.  Of course, Tatars may belong to only a branch of L342T;  I have no idea what fraction of L342T in Poland are Tatar.  And of course Tatars are expected to be a mix of multiple haplogroups.

             Three of the L342T samples, with the name Muchla, are apparently a family set, so they count statistically as only one sample, reducing the current count from 6 to 4, so SBP as calculated in that xls file should be increased (not as good).  This is evidence against L342T being valid.
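
             The family set adjustment is simple counting.  Here is a minimal sketch, assuming each sample record carries an optional family label;  the record layout and the kit labels are placeholders of mine, not Polish Project data:

```python
def independent_count(samples):
    """Count samples, treating each named family set as a single sample."""
    families_seen = set()
    count = 0
    for sample in samples:
        family = sample.get("family")
        if family is None:
            count += 1  # unrelated sample counts on its own
        elif family not in families_seen:
            families_seen.add(family)
            count += 1  # first member of a family set counts once
    return count


# Six cluster samples, three of them in one family set.
cluster = [
    {"kit": "a", "family": "Muchla"},
    {"kit": "b", "family": "Muchla"},
    {"kit": "c", "family": "Muchla"},
    {"kit": "d", "family": None},
    {"kit": "e", "family": None},
    {"kit": "f", "family": None},
]
print(independent_count(cluster))  # 4
```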


             M.  Needs documentation.  M type was brought to my attention by Larry Mayka, who informs me others have called this haplotype the Viking haplotype because of its concentration in northwest Europe.


             M458.  See M458, L260, CTS11962 for a detailed discussion of these major haplogroups.  See also P Type, equivalent to L260.  See also N Type, equivalent to CTS11962.


             N.  N type.  Comments 7 Nov 2013:  N type is almost equivalent to M458+ L260-.  The only exceptions in the Polish Project are a family set of a few samples (called Ry type).  N type also seems equivalent to the new SNP CTS11962.  See the M458 topic for discussion of L1029, which divides N type.  See also L1029 and Np News.

             Complete rewrite of this topic 25 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NType.xls

             N type is concentrated in Slavic countries.  N type is discussed in my publication, page 179.

             According to Ysearch and Yhrd N type seems to be spread all around the Slavic lands and central Europe, common from East Germany to Russia.  Within Poland N type seems to be about the same size as P type, both about 9% of men.  Worldwide, N is much larger than P.  N type should be properly studied in a database that is not restricted to Poland.  However, there seem to be subtypes of N that are concentrated in Poland.  See the discussions on N subtypes below.  I’ll continue to watch the Polish Project, because it will be interesting if more data provide more Polish subtypes within N.

             During review of my publication in 2009, the SNP called M458 was published.  I added notes about this to my publication on page 184.  The corresponding haplogroup is now called R1a1a1g.  This haplogroup seems to be equivalent to what I have been calling P type (M458+ L260+) plus N type (M458+ L260-).  M458+ samples may turn up someday that do not fit either N type or P type, but I have not noticed any yet.

             My current definition for N type, N46, is a modal haplotype using 46 of the 67 standard markers.  The cutoff is 8, which means all samples less than step (genetic distance) 8 from N46 are predicted N type (predicted M458+ L260-).  That definition is available in the NType.xls analysis file, in my Haplotypes.xls file, and at Ysearch as 3SEJK.
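
             The step and cutoff rule above can be sketched in a few lines of code.  This is only an illustration, assuming that step is the sum of absolute differences in repeat counts over the markers in the definition;  the marker values and the toy cutoff of 3 below are hypothetical, not taken from NType.xls or from the real N46 definition.

```python
# Hypothetical illustration of a modal-haplotype definition with a cutoff.
# Marker values are invented; they are NOT the real N46 definition.

def step_count(sample, modal):
    """Sum of absolute repeat-count differences over the markers in the definition.

    Only markers present in the modal definition are compared, so a sample
    tested at more markers than the definition uses is handled naturally.
    """
    return sum(abs(sample[m] - modal[m]) for m in modal)


def predicted_in_type(sample, modal, cutoff):
    """A sample is predicted to belong to the type if its step is below the cutoff."""
    return step_count(sample, modal) < cutoff


toy_modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10}
near = {"DYS393": 13, "DYS390": 25, "DYS19": 15, "DYS391": 10}
far = {"DYS393": 14, "DYS390": 23, "DYS19": 14, "DYS391": 11}

near_step = step_count(near, toy_modal)  # 1
far_step = step_count(far, toy_modal)    # 1 + 2 + 2 + 1 = 6
```

             With a toy cutoff of 3, the first sample is inside the type and the second is outside.  The real analysis files may treat compound markers differently, so this sketch should not be read as the exact spreadsheet calculation.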

             N type age (age means TMRCA) is about 2,000 years.  That’s highly uncertain, but I’m 80% confident that age of 2,000 years is not off by more than a factor of 2 - age 1,000 to 4,000 years.  The M458 mutation is likely much older than the age of N type.

             I’m suspicious that N type includes many younger clades that just happen to have similar STR values, difficult to resolve into clusters or types.  I offer some speculation along these lines in the hypothetical subclade topics below.

             I highly recommend that someone from N type purchase WTY, a commercial product for discovering SNPs.  No sample from N type has been submitted for WTY.  That means there is a good chance that the first N man to submit his sample to WTY will discover one or more SNPs - perhaps an SNP that captures all of N type - or perhaps an SNP that captures about half of N type - or perhaps an SNP that captures a small subclade - or perhaps multiple such SNPs.  My WTY was the first in a long time in my haplogroup, so I found 14 new SNPs.

             It’s interesting to wonder why R1a1a1g seems to be composed of two types that differ substantially in STR values (N and P are separated in haplospace).  I speculate about this in the P type topic.  Much of my P type discussion is also related to N type, so I avoided repeating all the details here;  please read my P type discussion if you are interested in more about N type.

             N seems to be older than P.  I wonder if there are subtypes of N about the same age as P.  I avoid too much speculation in this web page - just enough to indicate my motivation.  I’m wondering if there are clades in various haplogroups, mostly P and N, associated with the origin of the Polish nation - a few centuries more than a millennium ago.

             I have only identified 4 small subclades of N so far:  I am quite confident of Ng type, but less confident of N-Ashk type.  The Nt and Ns clusters are hypothetical;  I have about 70% confidence in them.  These 4 are used for assignments at the Polish Project web page.  I also identified a few clusters with roughly 50% confidence;  these are too speculative for formal assignments.  All are discussed below.  I made speculative assignments based on all these types and clusters within N type, in column CD of that file NType.xls, Calculator sheet.  My file NClusterAssignments.xls has lots of details.  If you are N type, you can find your row with your kit number, and see your speculative assignment.  For the “clusters”, I estimate a 50-50 chance an assignment will need to be changed in the next year or so, as more data becomes available.

             In addition, N type has many bimodal markers, hints at yet more subclades not discussed here.  This is evidence that N type experienced population expansion when it was young (not long after the TMRCA).  More discussion below.

             The paragraphs up to here are a brief summary.  The rest of this topic is a detailed discussion about N type and hypothetical subclades:

             This Sep 2011 analysis includes only data from the Polish Project.  I’ll wait a few months before reviewing data outside the Polish Project.  My last analysis including data from outside the Polish Project for P type, N type, L260, and M458 was Jan 2011.  For those last results, see the following topics, which have not been updated for several months:





             For the size of N type, please see the table at the top of this page, where N has only 4 more samples than P (87 vs 83 - 5 Aug 2011 data).  In my 2009 publication N had one less than P (28 vs 29, Table 6 page 169).  The 70% confidence interval for 87 samples is 77 to 98 (8.4% to 10.6%) so N and P are equal in the Polish Project (and by implication in Poland) within statistical sampling accuracy, at about 9%.
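
             For readers who want to reproduce a sampling interval like the one above, here is a minimal sketch.  It assumes the count of 87 is treated as a Poisson variable and uses the exact central interval;  that choice of method is an assumption for illustration, and the function names are hypothetical.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed in log space for stability."""
    return sum(
        math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1))
        for i in range(k + 1)
    )


def solve(f, lo, hi, tol=1e-9):
    """Bisection for a root of f, which must change sign on [lo, hi]."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_interval(k, confidence):
    """Exact central interval for a Poisson mean, given an observed count k."""
    tail = (1 - confidence) / 2
    # Lower limit: the mean at which seeing k or more has probability `tail`.
    lower = solve(lambda mu: (1 - poisson_cdf(k - 1, mu)) - tail, 1e-9, k)
    # Upper limit: the mean at which seeing k or fewer has probability `tail`.
    upper = solve(lambda mu: poisson_cdf(k, mu) - tail, k, 4 * k + 10)
    return lower, upper


low, high = poisson_interval(87, 0.70)
```

             The values of low and high can be compared with the 77 to 98 range quoted above.  Since the P count of 83 falls inside that range, the two types are not distinguishable in size by this measure.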

             My 2009 published definition for N type, N45, still works very well.  I did not change that definition at my Jan 2011 update and analysis here in this topic.  This Sep 2011 N46 update is just a tweak, adding and subtracting a few markers to better fit the M458+ L260- SNP data that has accumulated over the past year.  Both definitions are compared in that analysis file NType.xls, Calculator sheet, columns BZ to CC.

             Tweaking the definition like this, to better fit SNP data, introduces some selection bias.  I discuss this issue in the P type topic, where I did a similar tweak;  please read that topic if you are interested in the statistical justification.  The justification is not as good for N type, so I’ll return to this issue in the “old branches” paragraph below.

             This new N46 definition fails to capture only one M458+ sample, which falls at the cutoff step 8.  This new N46 definition captures only one foreigner, L540+, at step 7, the last step of the type.  The other samples at step 8 have tested either M458- or L260+, except one that fits D type well, so they are all confirmed as not N type.  Similarly, 7 of the 20 samples at step 9 have been SNP tested, 11 of the 20 are good fits to other types, with only 2 that are Borderline fits to other types.  In other words, the N46 definition captures the M458+ L260- samples with apparent 98% accuracy.  However, my confidence is about 80% for step 7, about 90% for step 6, and 95% or better for step <6.  Again, please see the P type discussion about confidence for a general explanation.  P and N are similar in this regard.  I have related discussion about N type confidence in the “old branches” paragraph below.

             Almost all the samples near the cutoff for the previous N45 definition have been SNP tested.  This high testing rate is not a coincidence;  Mayka and I have been encouraging men with marginal samples to do the M458 and L260 tests.  (We paid if cost was a problem.)

             The NType.xls analysis file has 10 columns (CF to CO in the Calculator sheet) using from 2 to 67 markers as tentative N type definitions, with automatic selection of the best markers.  For each column, I colored the step count violet for samples captured by that definition.  You can see at a glance that any definition using 2 to 67 markers captures more than 80% of the N type (M458+ L260-) samples, and not many foreigners, so just about any definition works surprisingly well.  In other words, N type is very well isolated in haplospace.

             For the two best automatic definitions, I used boldface to highlight the N type samples missed by that definition, and also boldface to highlight the foreign samples captured by that definition.  I used boldface similarly for my prior N45 definition, using 3 columns (BZ to CB) to demonstrate the effect of 3 different cutoff choices.

             You might try resorting the sheet by column (select everything from cell A14 to the end) to better compare the results.

             The issue of SBP is moot for N type now that the SNPs M458 and L260 are available, but an analysis is instructive:  That NType.xls file has automatic marker selection of N type, and automatic calculation of SBP, disregarding the SNP data.  The best automatic definition, N61, has SBP=13.2%, vs N46 with SBP=14.1%.  However, N46 is a better definition because N61 captures only 80 of the 87 N type plus that same one foreigner.  But still, 8 errors out of 87 (7 misses plus the one foreigner) is not bad for N61, better than the 13.2% SBP (SBP is a high estimate for statistical confidence).

              I considered calling N46 a definition for M458+ L260-, with a different definition for N type as a slightly smaller subtype, leaving out some samples that do not fit the N type definition with lowest SBP.  I could not come up with a convincing definition for such a smaller subtype.  So at least for now, I am considering N type as the same as M458+ L260-, with the understanding that may change in the future.

             The summary conclusion for all those columns of trial definitions:  My preferred N46 definition (column CC) does the best job of capturing N type (M458+ L260-).  Most of the other columns are trying to define N type as slightly smaller, leaving out a few of the samples (not always the same samples).  Most definitions for N type have many samples at or near the cutoff.  My explanation is in the next paragraph:

             Old branches:  A type is a hypothetical unique clade.  Of course, every clade is composed of subclades - branches in the Y-DNA tree.  Here is a simple explanation for the previous few paragraphs of discussion:  N type seems to have a few small old sub-clades, where the ancestors (MRCAs) of those small clades differed from the main N type MRCA at a few STR values from the standard 67 set.  Those old branches have many younger branches (twigs) that differ at yet more STRs.  In other words:  the N tree might have a few small branches near the ground.  Those small old clades provide samples in the database with large step, but each sample is from a different twig, so these do not correlate into obvious clusters.  Any clade has statistical outliers with large step;  a few small old branches would provide more outliers for N.

             Those old branches may not be small world wide.  One possibility - a large subclade of N concentrated outside Poland might have one small branch in Poland, corresponding to a man or tribe that moved to Poland long ago.  I am watching for evidence along these lines, but so far this paragraph is speculative.

             In addition, there might be additional large old subclades that seem young.  I consider this possibility in the discussions below.  The age of a clade can be much younger than the node.  I discuss this in another topic, where I call such clades smooth branches.  The N tree might have a number of small smooth trunks with nodes near the ground - that would not necessarily be evident as STR correlations.  On the other hand, the N tree might have only one main trunk, almost smooth, with only a few small branches near the ground.  The actual situation might be more complicated, with multiple trunks of various sizes, at various distances from the ground.  I can’t tell yet from the STR data.  Perhaps another year of additional STR data may help.

             Why am I speculating about N type smooth branches?  I see plenty of hints for more branches in the N type data, but little statistical confirmation.  In the discussion below for subclades, I offer evidence (not definitive proof) for many more significant sub clades within N type.

             This discussion is personal.  It is my opinion, based on my statistical analysis.  Someone might send me an email any day now pointing out a convincing cluster or type in N that I missed.  Someone else might disagree with my analysis about particular hypothetical N subclades.

             Reminder:  This discussion is limited to Poland, as represented by the Polish Project.  Outside Poland there is additional probability of M458 branches showing up someday that fit neither N type nor P type.  Outside Poland I expect yet more N type branches.

             Regarding concentration in Poland, I use percent of samples in Ysearch with “Origin” Poland as an objective measure.  This is discussed in my publication, where Table 1 shows P12 (the P type modal haplotype using only the original standard 12 markers from the Polish Project) with 42%, while N12 has only 14%.  Those numbers 42% vs 14% are not calibrated (because of the unknown concentration of men with Poland origin in Ysearch) but those numbers are a relative indication of concentrated in Poland vs not particularly concentrated in Poland.  My file NYsearch.xls has an update with data from 5 Aug 2011, with N12 at 17%, a reasonable drift due to more data.  That same file has the N46 definition at 24%.  This is evidence that N type, defined using 46 of 67 markers, is only slightly more concentrated in Poland than the 12 marker equivalent.  The simplest explanation:  There are probably large M458 clades outside Poland that match N12 and also match N46 at less than the cutoff, but the Polish samples are only twigs on those branches, descended from one man or family or tribe that moved to Poland a millennium or so ago.  It makes sense that clades within M458 might be regionally concentrated.  That 24% concentration for N46 is of course an average;  there are subclades of N with higher and lower concentration.  I found a few, discussed below;  that file NYsearch.xls has a sheet for each subclade analysis.

             Age:  N type comes out 2,340 years old using all 67 markers.  See cell N12 in the ASD sheet in NType.xls.

             Because of recLOH issues, the compound markers 464, YCA, and CDY present difficulties estimating age in the N type data.  Other compound markers are OK.  The ASD sheet allows a mask, row 21, where I masked out the 8 markers for these recLOH difficulties.  The result, using 59 markers, cell N29, is 2,010 years.  That’s my best guess for the age.

             On the far right of the ASD sheet I sorted the markers by apparent age.  YCAb comes out 20,704 years, demonstrating the recLOH problem.

             The second (apparently) oldest marker is DYS454, at 18,744 years.  This old age is due to only 5 mutations in this slow mutator.  DYS454 is clearly bimodal.  In my notes, I use the Nj code for the 2nd mode with these 5 samples, defined by 454>11.  This is evidence of a subclade, but the statistics are not convincing yet.  Maybe with more data in the near future I might call some of these samples the Nj cluster.  It’s not fair to exclude this “old” marker, DYS454, because there are 7 markers with zero age (no mutations in the 87 samples) and there are 7 more markers with less than 1,000 years apparent age.  The reason for averaging markers is that apparently old markers should be averaged out with apparently young markers.  Anyway, you can go ahead and mask out DYS454 by deleting the mask number at cell AE21, and the new age (58 markers) without 454 is 1,990 years, only a 20 year decrease.  I offer this paragraph of discussion as one example of preliminary evidence of an N type subclade, based on the Nj signature at DYS454.

             The third oldest marker is DYS531, at 14,319 years;  at this bimodal marker I use the code Np for the 2nd mode value.  Again, I’m waiting for more statistical evidence for a subclade.

             That far right side of the ASD sheet has more notes about markers with old apparent age.

             Age estimation from STR variance is highly uncertain.  At another of my web pages, I use M458 as an example of age caveats.  I have more discussion about age estimation methods in the P type topic;  please read those two topics if you would like more discussion;  N is similar to P in this regard.

             I’m not too concerned about getting the age of N type correct in Polish data because I suspect in less than a year there will be enough evidence to subdivide N - new SNPs and / or more STR data for better statistical significance.  I suspect there will be younger subclades.  Furthermore, M458+ L260- is not really a tree;  it seems to be a branch of the Y-DNA tree that is well isolated - a long smooth segment near the node;  but I mentioned above my suspicion that the main branch might not be really smooth - there might be significant old branches concentrated outside Poland;  if this is true I’ll need to soon redefine N type as younger, excluding any such significant branches.  I’ll leave it for someone else to estimate the age of M458+ L260- from worldwide data;  I’ll concentrate on N type, and hypothetical sub clades in Poland.

             There are 12 samples from N type available with the new 111 STR marker set (18 Jul 2010).  Only DYS532=12 is an obvious signature marker for N type from the 44 new markers;  10 of the 12 have this value.  Modal for R1a is 532=11.  P type also has the 532=12 value, also 10 of 12 samples, so this marker also provides a signature for M458 with good statistical significance.  I type also has the 532=12 value;  see the I type discussion below.

             The following topics are my proposed subclades for N type in the Polish Project.  Please consider reading the section P Type Bimodal Markers, if you would like more discussion of how I use bimodal markers as hints for subclades;  that same discussion applies here for N type.  If you are curious about my code names, like Na, Nb, etc, check out Haplotypes.xls.  Near the bottom of the “Haplotypes” sheet is a list of 70 code names for signatures that I considered for N type subdivision.  I discuss only a few of these here.  I spent a lot of time studying tentative subclades of N because I’m anxious to find significant subtypes that are concentrated in Poland.  I uploaded a total of 17 Excel analysis files associated with N and tentative subclades, all discussed above and below.


             Ng.  Rewrite finished 22 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NgType.xls.  Ng is a small subtype of N type, but it has highest confidence.

             This is a very small subtype, only 3 samples, but it is very well isolated.  The definition uses 56 markers, cutoff 4, gap 9.  There are no samples in the gap, from step 4 to 12.  SBP = 15.8%.

             These same 3 samples are present in Ysearch, where the gap with no samples is from 4 to 11.  Two samples at step 12 are from Germany and Unknown.  There are none at step 13 and 11 samples at step 14.  It seems Ng is concentrated in Poland.

             The signature is (537, 492) = (10, 14).  These are the only 3 Polish Project samples in N type that have any mutation from the 12 value at 492, and they have a 2-step mutation.  492 is ranked 18th of 67 in the extended Chandler mutation rates.  The 10 value at 537 is also rare - only these 3 plus 2 other samples have it in N type in the Polish Project.  The same 3 Ng samples are extracted from N type using 1 to 67 markers.  They are well isolated using as few as 7 markers because they have little variation from each other in the rapidly mutating markers, so those rank well for the Ng definition.  ASD age comes out 619 years using all 67 markers but of course that is a very rough estimate.

             The simplest explanation is that the MRCA of Ng type lived in Poland less than a millennium ago and passed on those 2 unusual mutations.

             The 3 Ng samples fall at steps 4, 5, 6 with the N45 definition of N type, a hint that the Ng node is near the center of the N type branch, not one of those old branches I speculated about, but this is just a preliminary hint.

             I introduced Ng type in Oct 2010;  there have been no new 67 marker data in the STR neighborhood of Ng type, so SBP has been 15.8% since, with the same definition.

             Ng also has what I call the Na signature, discussed below.

             The “g” is only my arbitrary code name that I have been using for the DYS492=14 signature.


             N-Ashk.  Rewrite finished 25 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NashkType.xls.  N-Ashk is a small subtype of N type.  Only 4 samples.

             These seem to be Ashkenazi samples.  Mayka pointed out to me that the names seem Ashkenazi, per his experience.  The samples beyond the cutoff are apparently not Ashkenazi.

             Signature (19,385a,594) = (15,12,11)

             I introduced this type in Jan 2011, with SBP 23%, slightly more than my stated 20% limit for using the word type.  Two reasons:  First, the Ashkenazi names are independent evidence of a clade.  Second, the N-Ashk modal haplotype differs from the N modal at 6 markers, which is evidence of a fairly old node in the N branch of the Y-DNA tree.

             I introduced this type as Nca type, because of what I have been calling the Nc signature, DYS19=15.  The “a” meant Ashkenazi, but that was confusing because the samples do not match what I have been calling the Na marker.  Nc is large;  I doubt N-Ashk is a twig in a large Nc branch;  the Nc mutation more likely arose independently in the N-Ashk hypothetical clade.

             This Sep 2011 reanalysis makes a cleaner cluster of data, although still small with only 4 samples.  The 594=11 marker is very clean;  these 4 samples are the only R1a samples in the Polish Project with this value.  SBP increased to 47%, so it is a stretch to call this a type, but the Ashkenazi connection is improved now and the 594=11 marker seems to be strong evidence.  Also, I avoid making changes in classification names without significantly more data, so I’ll continue to call this a “type” for now.  There are no longer any N-Ashk Borderline samples at 67 markers;  the Borderline category is used for apparent Ashkenazi samples that match well with only 37 markers.

             The improved definition uses 58 markers, cutoff 3, no samples in the gap at steps 3 and 4.  (The previous definition used 59, cutoff 5.)  The improvement:  I masked out CDY.  The previous definition used CDYb, missing an Ashkenazi sample that fits the type well, but has recLOH, providing a misleading step of 5 at this one marker.  With that new sample the ranking of markers came out slightly differently, so a few other markers were added or removed from the definition.  The old and new definitions are available in NashkType.xls.  The new definition is also available at Ysearch as 2TZKF, and in my Haplotypes.xls file.

             The ASD age comes out only 668 years, cell N29 in the ASD sheet in NashkType.xls.  Age calculated from only 4 samples is highly speculative, but N-Ashk seems young because of little variation in marker values.  The ASD should use (4-1) in the denominator instead of the total 4 samples (although most genetic genealogists do not do this for small sample sizes);  with that adjustment the age comes out 890 years, but that is still highly speculative.  That cell N29 is using 61 markers;  CDY and 464 are masked out.  (The mask is row 21, which you can easily edit.)  All 67 markers yield 1,024 years, cell N12, because of CDY.  DYS464 has no mutations in the set of 4, so including those 4 markers reduces the age, but I left 464 out because most people routinely exclude the 464 set from ASD.
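
             The (4-1) adjustment above is simple arithmetic, sketched here under the assumption that the ASD age is directly proportional to the variance, so that changing the denominator from n to n-1 just rescales the age.  The function name is hypothetical.

```python
def rescale_age_to_n_minus_1(age_years, n_samples):
    """Rescale an age computed with n in the variance denominator to use n - 1.

    Assumes the age estimate is proportional to the variance, so the
    correction is a single factor of n / (n - 1).
    """
    if n_samples < 2:
        raise ValueError("need at least 2 samples")
    return age_years * n_samples / (n_samples - 1)


# 668 years from 4 samples becomes 668 * 4 / 3, roughly 890 years.
corrected = rescale_age_to_n_minus_1(668, 4)
```

             The factor is 1.33 for 4 samples but only about 1.01 for the 87 samples of N type as a whole, which is why the adjustment matters only for very small sets.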

             N-Ashk is quite young, but the node seems old because of the 6 marker distinction from N type.  The simplest explanation:  N-Ashk has a long smooth branch, having an old node with N, but no further branching near that main node.  The samples in the Polish Project all seem to come from twigs with young nodes.  I speculate that there may actually be some branches of N-Ashk outside Poland.  Perhaps the Ashkenazi ancestor of N-Ashk moved to Poland somewhat less than a millennium ago.  More data will eventually confirm or refute this speculation.

             2TZKF is the modal haplotype at Ysearch, where only two of these samples are present, and where there are 2 additional samples in the gap, from Russia and Belarus;  the simple explanation is that N-Ashk is concentrated in Poland, although there is too little data for confidence.  See NYsearch.xls for my Ysearch analysis.

             N-Ashk has what I call the Nb signature, discussed below.


             Nt.  Edited 25 Sep 2011.  New topic 20 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NtCluster.xls.

             With 17 samples, Nt cluster is my largest speculative subclade of N type identified so far.

             SBP = 27%;  this cluster is close to the 20% maximum SBP for Polish Project assignments as a type.  I am suspicious of this Nt cluster due to selection bias:  I considered 70 signatures for N type during the summer of 2011, and carefully analyzed more than 30 of them.  With that many attempts, a false positive is likely.  One of the clusters I analyze will necessarily have the lowest SBP, but that might be just the luck of the data.  No one knows how to calculate the statistical confidence in such a case.  I discovered Nt at the end of this major effort.  If SBP improves with more data for Nt I’ll upgrade it to a type, but if SBP gets worse (bigger) as data accumulates I’ll lose interest in Nt.

             If Nt is valid, it is probably concentrated in Poland.  See NYsearch.xls.  See my Ysearch method discussed above.  I consider this additional evidence that Nt corresponds to a clade, boosting my estimated confidence to about 70%.  We don’t always use 70% confidence for assignments, but everyone is anxious for more subdivision of N type in the Polish Project, so we started using Nt in Sep 2011.

             The Nt definition uses 48 markers, cutoff 4, one sample in the gap at step 4.  The definition is available at Ysearch as 2544E.

             Nt is based on the signature DYS442<14.  However, there are 29 samples with that signature, and 5 of the 17 Nt cluster samples have the N modal 14 value at this marker.  My simple speculative explanation:  the 442 mutation from 14 to 13 occurred independently in the Nt clade after the node with the main N type branch.  Other speculative explanations are possible - those 14’s might be a back mutation within a much larger “father” clade that carries the Nt signature on most but not all samples.

             One Nt cluster sample has the 12 value at 442, which could be another mutation or an independent double mutation.

             If we subtract the 12 Nt signature samples with <14, that leaves 17 more samples (not included in my Nt cluster) with this second modal value at 442.  There are only 3 samples at 15 in all of N, and we expect step up to be more common than step down for a slow mutator (see my publication for references), so that still leaves an excess of samples with <14, implying yet another hypothetical clade with an independent mutation, or a larger “father clade”, but this paragraph is getting highly speculative.  I have more speculation like this about independent clades vs large clades in the Na, Nb, and Nc topics below;  similar speculation applies to Nt.

             Thirteen Nt samples match what I call the Na signature, discussed below, but two samples match the alternate mode Nb;  the last two samples are one step away from Na.  This is evidence of an even larger Na father clade, but as discussed below the Na vs Nb signatures may have arisen multiple times independently, so I’m not confident to speculate further along these lines.

             See also NclusterAssignments.xls.


             Ns.  Edited 23 Sep 2011.  New topic 20 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  NsCluster.xls.  Ns cluster is a speculative subclade of Nt cluster.

             With 6 samples and SBP = 27%, this cluster is close to the 20% maximum SBP for Polish Project assignments as a type.  I am suspicious of this Ns cluster for the same reasons given above for Nt.  On the other hand, Ns looks like a credible subclade of Nt, which adds credibility to both of them.

             If Ns is valid, it is probably concentrated in Poland.  See NYsearch.xls.  The 67% concentration is the best I have seen so far, but this % is highly uncertain because it is based on only 2 Ns samples at Ysearch.  Such as it is, I consider this additional evidence that Ns corresponds to a clade, same as my confidence for Nt.

             The definition uses 47 markers, cutoff 2, no samples in the gap at steps 2 and 3.  The definition is available at Ysearch as A5NSG.

             Ns is based on two signatures.  Ns is my code for DYS446=12, 9 samples, vs 446=13 modal for N type.  Nt is my code for DYS442=13, 5 samples, vs 442=14 modal for N type.  The 6 Ns samples are all at steps 0 and 1 with the 47 marker definition;  the other 3 with that signature are at steps 9 and 10, so it is reasonable to suppose the Ns mutation happened twice independently in the N type clade.  Five of the 6 Ns samples have the Nt signature, but that 6th one has the value 12, two steps from the N modal 14, so it should be considered Nt also.

             See also NclusterAssignments.xls.

             All 6 Ns have what I call the Na signature, discussed below.


             Nd.  Edited 24 Sep 2011.  New topic 20 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  Nd53Cluster.xls.

             Based on the signature DYS389I = 14, vs N modal 389 = (13,29).  Nine samples have the Nd signature.  Only 3 of these 9 fit Nd53.  My confidence is only about 50% that these 3 samples really belong to the same clade;  I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic.

             DYS389II has the value 30 for Nd but this is not a mutation at 389II.  See compound markers for an explanation.
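
             A short sketch of why 389II = 30 is not a separate mutation, assuming the usual reporting convention in which the DYS389II value includes the DYS389I repeats.  The function name is hypothetical.

```python
def dys389_independent_parts(dys389i, dys389ii):
    """Split a reported (389I, 389II) pair into two independent repeat counts.

    Under the assumed convention, the reported 389II value contains 389I,
    so the independent second segment is the difference.
    """
    return dys389i, dys389ii - dys389i


n_modal_parts = dys389_independent_parts(13, 29)  # (13, 16)
nd_parts = dys389_independent_parts(14, 30)       # (14, 16)
```

             Both pairs have the same second segment of 16, so the Nd signature is a single one-step change in the 389I segment, counted as one step and not two.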

             I call this Nd53 because the 53 marker definition is somewhat arbitrary - there is no very likely definition.  It is likely I’ll need to change the definition soon, when more STR data becomes available.  Also, “Nd53” makes it clear that this is not the same as the cluster formed using only the Nd signature.

             Nd53 is not used for assignments in the Polish Project;  see NclusterAssignments.xls for speculative assignments.

             The 3 samples do not have Poland as origin, although I suppose those men have suspicion of Polish ancestry, because that is usually the case for Polish Project samples.  On the other hand, Nd53 might be representative of a clade that is concentrated outside Poland.


             Ne.  Edited 24 Sep 2011.  New topic 23 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file:  Ne40Cluster.xls.

             Based on the signature DYS390 = 24, vs N modal 25.  Twelve samples have the Ne signature.  Only 3 of these 12 fit the Ne40 cluster.  My confidence is only 50% that these 3 samples really belong to the same clade;  I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic.  Nd and Ne have similar status.

             I call this Ne40 because it is likely I’ll need to change the 40 marker definition soon, when more STR data becomes available.

             Ne40 is not used for assignments in the Polish Project;  see NclusterAssignments.xls.

             Only one of the 3 samples has Poland as origin, although I suppose the other two Ne men have suspicion of Polish ancestry, because that is usually the case for Polish Project samples.  On the other hand, Ne40 might be representative of a clade that is concentrated outside Poland.


             Na and Nb.  I have been rewriting this topic throughout the late summer of 2011.  Finished 24 Sep 2011.  Based on 5 Aug 2011 Polish Project data.

             Clusters based on DYS464, a marker set that is multimodal in N type.  Analysis files Na45Cluster.xls and Nb32Cluster.xls.

             I introduced Na and Nb in my publication, page 179 and Table 3.  I have been updating the discussion for Na and Nb here at this web page.  I consistently emphasize that these are speculative subclades.  In retrospect, I should have avoided the word “type” for these because more data over the years has convinced me that the explanation for what is going on is not two subtypes of N.  It will take me a few paragraphs to explain the issue of Na and Nb:

             One way to split the N type data, obvious at a glance, is by the number of markers for 464.  Some samples have 4 values, some have 6, just a few have 5 or 7.

             I understand that the 464 set is the most prone to genetic testing evaluation errors, so this or any categorization using 464 will have uncertainties.  If 464 is taken in combination with other markers that means some statistical uncertainty due to possible evaluation errors at 464.  Specifically, a sample in a database with 4 values at 464 might really have 5 or more values, and vice versa.

             Follow my links if you wish to read more about compound markers and recLOH issues, which introduce confusion for the 464 marker set.  Briefly, copy mutations can increase the number of 464 markers, but recLOH mutations might reduce the number.  A single copy mutation can change more than one value in the set.  Copy mutations and recLOH mutations are rare, about the same frequency as very slowly mutating STR markers.  Net mutations in the 464 set are common, with frequency among the fastest in the standard 67 set.  For the Chandler rates, each of the four markers 464a to 464d is assigned a rate 1/4th the net rate for single mutations for the set of 4.

             I use Na as my code for the signature 464 = (12,12,15,15,15,16) - the most common value set for 464.  28 of the 87 samples.  My Nb signature is the next most common, 464 = (12,15,15,16).  16 samples.  I say 464 is multimodal because there are also two sets with 4 samples each;  that’s why I’m using Na as a signature even though it is the modal value for N type as a whole.  This is for the 87 N type samples in my 5 Aug 2011 download of the Polish Project;  the proportions change every few months as data accumulates due to the statistics of small sample sizes.

             Na and Nb differ by 2 steps following the Ysearch method, but that is misleading because Na can turn into Nb in a single recLOH mutation, which might have happened more than once in the past in this N type database.  Nb can turn into Na with a single copy mutation.  I may not be exactly correct in this paragraph if my assumption of the structure of 464 in N type is incorrect, but this paragraph is certainly a brief example of the kind of confusion that arises with 464.

             It is easy to construct clusters using 464 in N type.  Too easy.  Too many choices for clusters, as I discuss in the following.  I could not come up with clusters with good statistical confidence.  My Excel analysis files allow setting maximum step, so I also tried using maximum 1 for the 464 set - 1 step for any variation of a sample from a trial definition;  still I found no clusters with confidence.

             My analysis files allow an alternate method, treating the 464 markers as individual markers.  This is the method I used in my 2009 publication, still no clusters with confidence.

             My default is to follow the Ysearch method for counting step at 464, although this method is obviously less than perfect.

             My list of code names is available in Haplotypes.xls.

             When trying individual markers, DYS464b is best.  In my notes I use Na1 - 464b<14, and Nb1 - 464b>13;  these two signatures neatly split all the N type data.  Na1 captures all the Na samples plus mostly  samples with more than 4 markers;  Nb1 captures all the Nb plus mostly samples with 4 markers;  there are exceptions.  Using Na1 vs Nb1 I come to the same conclusions as using Na and Nb, discussed below.

             DYS464e provides another way to split the data.  In my notes I use Nx - any value for 464e, and Ny - no value for 464e.  Nx captures all the samples with more than 4 markers including the Na samples;  Ny captures all the samples with 4 markers including the Nb samples.  Using Nx vs Ny I come to the same conclusions as using Na and Nb, discussed below.
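             The two alternative splits described above (464b and 464e) can be sketched in a few lines of Python.  This is only an illustration, not the method in my Excel files.  It assumes the 464 values are listed in reported order (a, b, c, ...), and the function names are hypothetical.

```python
from typing import Sequence

# Signature value sets quoted in the text.
NA_464 = (12, 12, 15, 15, 15, 16)  # Na: six values at DYS464
NB_464 = (12, 15, 15, 16)          # Nb: four values at DYS464


def split_by_464b(dys464: Sequence[int]) -> str:
    """Na1 if 464b < 14, Nb1 if 464b > 13 (464b assumed to be the 2nd value)."""
    return "Na1" if dys464[1] < 14 else "Nb1"


def split_by_464e(dys464: Sequence[int]) -> str:
    """Nx if any value is present for 464e (more than 4 values), else Ny."""
    return "Nx" if len(dys464) > 4 else "Ny"


if __name__ == "__main__":
    for name, values in (("Na", NA_464), ("Nb", NB_464)):
        print(name, split_by_464b(values), split_by_464e(values))
```

             With these two signature sets, Na lands in Na1 and Nx, and Nb lands in Nb1 and Ny, which agrees with the description above.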

             Consider my definitions Na45 and Nb32, with 45 and 32 markers.  See those two Excel files for details.  My choices for 45 and 32 are arbitrary.  Those files show columns with trial definitions using a wide range of markers, automatically chosen by rank.  A wide range of marker counts seems roughly equivalent.  It is remarkable how many samples fit very well using up to 50 markers for trial definitions:  Na has 16 samples at step zero using 11 markers, and 15 samples at step less than 2 using 45 markers;  Nb has the same 14 samples at step zero using from 11 to 32 markers.  When the 464 set is excluded from the definition, some Na samples fit the Nb definition, and some Nb samples fit the Na definition.  One simple explanation:  Na45 and Nb32 might correspond to two very young clades.  However, there is an alternate explanation:  Na45 might correspond to two or more young clades, and Nb32 might correspond to two or more young clades, and they may be a “bushy” set of branches where some Na45 clades are connected by nodes to some Nb32 clades.  I see no way to be confident that most of the Na samples are in a branch distinct from a branch with the Nb samples.  I suppose if your sample matches Na45 at step zero or one, there might be better than a 50-50 chance that you and others who match at <2 belong to a unique clade that may someday have an SNP definition, but such a clade will surely exclude some of the step <2 samples, and include some samples from steps 2 and 3, so Na45 does not provide a definition.  The same can be said if you match Nb.

             Some samples that fit the Na signature at 464 = (12,12,15,15,15,16) come out at high step using more markers.  Similarly, some samples that fit the Nb modal at 464 = (12,15,15,16) come out at high Nb step using more markers.  You can see this at a glance in those two files.  Two opposite simple explanations come to mind:  Na and Nb may have independently arisen more than once, followed by population expansion - multiple branches in the N tree.  The opposite explanation:  Na and Nb sets might be signatures for two old clades that each have a few old subclades - two main N branches that have a few old branches and where both Na and Nb have a bushy clump of branches at the ends.  More complicated explanations also come to mind.  That second explanation, two main branches, is attractive, but I see no proof that is true, or even highly likely.

             In the file NclusterAssignments.xls, I make speculative assignments.  Most of the Na45 and Nb32 samples fit other more believable types and clusters.  I went ahead and assigned the few leftovers to Na and Nb, but these are just speculative assignments, meant to show you which of my clusters you best fit.

             Summary:  There is not enough evidence to consider Na and Nb to be two unique subclades of N.  Maybe Na45 and Nb32 do correspond to the top of two main branches of the N tree, with most of the samples that fit Na45 or Nb32 belonging to the corresponding clades.  Maybe not.  I see no way of ruling out multiple independent clades (branches far apart in the tree) for both Na45 and Nb32, or for any other definitions based on the 464 set.  Perhaps in a year or so more STR data will provide convincing subclades along these lines.  Perhaps in a few years SNPs will be discovered to subdivide N type.

             I have more discussion along these lines below, in the Nc topic.

             At all 67 standard markers, the Na and Nb modal haplotypes are essentially the same for STR markers other than 464.  I say “essentially” because the rapid mutators, particularly the CDY pair and DYS576, typically vary modally from month to month due to the statistics of small samples.  At CDYb, Na type signatures with multiple markers are typically modal 40, while Nb are typically modal 39, but this marker always ranks poorly for definitions because of the wide range of values.  In Nb less than 1/3 of the samples typically have the modal value at CDYb.

             The Russian site independently came up with this same haplotype distinction.  Two modal haplotypes are available on Ysearch, from the Russians.  Each use 78 markers and each match my Na and Nb types at 67 markers, including that 39 value for CDYb in Nb.  Central European-1 Modal GTAVR corresponds to my Nb, using only 4 values, 464a-d.  Central European-2 Modal 495M5 corresponds to my Na, using 6 values, 464a-f.


             Nc.  New topic 25 Sep 2011.  Based on 5 Aug 2011 Polish Project data.  Analysis file Nc32Cluster.xls.

             My Nc code is for the signature DYS19 = 15, compared to the modal value of 16.  Similar to Na and Nb, my publication and previous versions of this web page proposed Nc as a tentative subdivision cluster of Nb.  Last year the samples with the 15 value were mostly Nb samples, but this year that correlation is not significant.

             My opinion of Nc is very similar to my opinion of Na vs Nb:  No confident conclusion.  Nc might correspond to a single large clade.  Then again, Nc might correspond to independent unrelated clades where the Nc mutation arose independently.

             My Nc analysis complements my Na and Nb analysis:  If you look at Nc32Cluster.xls, you see at a glance that the best fit samples are a mix of Na and Nb.  If you look at Na45Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16.  If you look at Nb32Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16.  If Nc32 vs modal 16 is a valid division of N type, then Na vs Nb cannot be valid.  If Na vs Nb is valid, Nc vs modal 16 cannot be valid.  All three files have, at the bottom, at large step, some Na, Nb, and Nc samples.

             Next, let me consider the 4 combinations using DYS464 and DYS19:

             Nbc42Cluster.xls is my analysis file using both the Nb and Nc signatures together.

             Nac32Cluster.xls is my analysis file using both the Na and Nc signatures together.  This is very different from Nc32;  the latter has a mix of Na and Nb;  the former is a new analysis using the additional restriction to Na match.  They both have 32 markers by coincidence.  As in Na45 and Nb32, the number of markers is my arbitrary choice;  there is no obvious best choice;  the number of markers will likely change as data accumulates for all these definitions where I specify the number of markers in the code name.

             Nb5_37Cluster.xls is my analysis file using my Nb5 signature, which is the 4 Nb DYS 464 markers plus the modal value at DYS19.

             Na7_26Cluster.xls is my analysis file using my Na7 signature, which is the 6 Na DYS 464 markers plus the modal value at DYS19.

             In the file NclusterAssignments.xls, I make speculative assignments to these 4 clusters, but samples that fit one of the more confident types (Ng and N-Ashk) and clusters (Ns and Nt) get that more confident assignment if they also fit these 4 combinations.

             The 3 Ng samples are all Na, but they are a mix of values at DYS19.  The neighborhood (just beyond the Ng cutoff) is all Na.  This is a tantalizing hint of a “father” clade with the Na signature.

             The 4 N-Ashk samples are all Nb, but in this case the neighborhood is a mix of Na and Nb.  This is a hint of an independent mutation to Na somewhat older than N-Ashk.  Three of the 4 N-Ashk are Nc, as are most of the neighborhood.  The other has the modal DYS19=16 value.  This is a hint of a father clade with the Nc signature, DYS19=15, plus recent back mutations to the modal value.

             The 6 Ns samples are all Na, with a neighborhood mostly Na but some Nb.  The Ng, N-Ashk, and Ns samples are all very far from each other.  You can see this in the file NclusterAssignments.xls, where each type and cluster has a column, with step value for each samples.  I consider this strong evidence against a large Na clade;  it seems more likely that the Na (464=12,12,15,15,15,16) set arose independently by copy mutation 3 times in these three hypothetical clades.

             Nt, the purported father of Ns, has 17 samples;  13 Na signature, 2 Nb, 2 one step away from Na.  It is reasonable to speculate that those 2 Nb are due to an independent recLOH in Nt, and that the father clade has the Na signature.  Unfortunately, it is also reasonable to speculate that there were multiple mutations to the Na signature within Nt, making the 464 set irrelevant.

             The 3 Nd samples match Nb but again the immediate neighborhood is a mix of Na and Nb, again evidence for independent mutations at 464.

             Ne is another example of a mixed Na Nb neighborhood.  In this example, 2 of the 3 match Na.  That third one, 464=(12,13,14,14,15,16) is 3 steps away from Na but those two 14 values are a hint at another copy mutation.

             NYsearch.xls has a sheet with Ysearch data analysis for each type or cluster.  The Polish percent, in boldface, is my important result.  Although this analysis is based on very little data for each of those 4 combination clusters here is the tentative finding:  Nbc42 is not concentrated in Poland.  The other 3 seem to be concentrated in Poland;  that is evidence that each of those 3 clusters (Nac32, Nb5_37, and Na7_26) harbors one or more clades that are concentrated in Poland.

             Ns seems related to Na7_26, because 4 of the 6 Ns samples match at step zero, but the other 2 are at steps 2 and 3, so this technique of 4-way combination is good for hints, but not conclusive.


             Summary;  Na, Nb, and Nc clusters:  25 Sep 2011.  That was a lot of analysis to justify my opinion that Na, Nb, and Nc, although tantalizing, cannot be trusted without correlation to more markers.  N type probably experienced population expansion not long after the TMRCA, whereby the main N branches come out today with similar STR distributions.  DYS464 is multimodal;  DYS19 is bimodal;  the 4 main combination modes based on 464 and 19 provide evidence of twigs that are concentrated in Poland.  I bet there are many more small Polish clades based on Na, Nb, and Nc waiting to be discovered in N type.  I’ll continue to watch the STR data.  New SNP markers within N type someday will be even better.


             Np.  Update 1 Apr 2015

             The Np cluster seems equivalent to YP515, and YP515 is so far equivalent to CTS11962+ L1029-.  N type is equivalent to CTS11962, and most N type are coming out L1029+;  about 30% of N type are coming out YP515 (Np cluster).

             My original notes for Np are still available, but these notes are more than 2 years old - in need of updating;  see L1029 And Np News.


             P.  Rewrite 14 Oct 2015.

             P type is equivalent to the haplogroup defined by the SNP L260.  Very rare exceptions are L260+ samples that do not match P type, or P type samples testing L260-.

             For more discussion about L260, see M458, L260, CTS11962.

             L260 has 17 known branches (13 Oct 2015);  for update see http://www.yfull.com/tree/R-L260/.  The main branches of L260 in the Polish Project are YP414, Y2905, Y4135, and YP1337.  See New Results Table.  I tried (Oct 2015) to construct STR types or clusters for a branch of P type, using SNP data to identify samples, but I did not come up with any significant signatures or definitions.  Perhaps as more data accumulates it may be possible, but for now SNP testing seems needed for assignment of samples to branches of P type (L260).

             P type is a major topic in my publication, Part II.  P type is significantly concentrated in Poland, and in the Czech Republic.  It is found at lower frequency in other Eastern European countries, and in eastern Germany.  About 9% of Polish males carry P type Y-DNA.  L260 was discovered shortly after my publication, found to be equivalent to P type, confirming my prediction that P type corresponds to a haplogroup.

             My current definition for P type, P50, is a modal haplotype using 50 of the 67 standard STR markers.  The cutoff is 9, which means all samples less than step (genetic distance) 9 are predicted P type (predicted L260+).  That definition is available in the PType.xls analysis file and at Ysearch as 8U92G.  That file and definition are from an analysis done in Feb 2014 using Polish Project data downloaded 20 Jan 2014.
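             The cutoff rule above can be illustrated with a short Python sketch.  It assumes the step (genetic distance) is the sum of absolute differences between a sample and the modal haplotype over the markers in the definition;  multi-copy markers and recLOH events are counted differently in practice, so this is a simplification.  Only DYS385a=10 is taken from this page;  the other marker names and values in the toy modal are hypothetical placeholders, not the real P50 definition.

```python
from typing import Dict

# Toy three-marker stand-in for a modal haplotype. Only DYS385a=10 comes
# from the text; STR_X and STR_Y are hypothetical placeholders.
TOY_MODAL: Dict[str, int] = {"DYS385a": 10, "STR_X": 13, "STR_Y": 30}


def step_distance(sample: Dict[str, int], modal: Dict[str, int]) -> int:
    """Sum of absolute differences over the markers in the definition.

    Assumption: every marker in the definition is present in the sample.
    """
    return sum(abs(sample[marker] - value) for marker, value in modal.items())


def predicted_in_type(sample: Dict[str, int], modal: Dict[str, int],
                      cutoff: int = 9) -> bool:
    """A sample is predicted to belong when its step is below the cutoff."""
    return step_distance(sample, modal) < cutoff


if __name__ == "__main__":
    close = {"DYS385a": 10, "STR_X": 14, "STR_Y": 29}  # step 2
    far = {"DYS385a": 14, "STR_X": 16, "STR_Y": 32}    # step 9
    print(step_distance(close, TOY_MODAL), predicted_in_type(close, TOY_MODAL))
    print(step_distance(far, TOY_MODAL), predicted_in_type(far, TOY_MODAL))
```

             With a cutoff of 9, a sample at step 8 is still predicted P type and a sample at step 9 is not.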

             For an update, please see my file PType2015Oct.xls.

             In that update, there are 176 P type samples, with only one outlier at step 8 (the highest step of the type);  that outlier tested CTS11962+, so it cannot be L260+.  In addition, there are seven L260+ outliers that do not fit the type, at steps 9 through 13.  So that’s 8 outliers out of 176 predicted = 4.5% rate - not bad for STR based prediction.
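             The outlier arithmetic in the previous paragraph can be checked with a small Python sketch.  The counts are the ones quoted above;  the function name is hypothetical.

```python
def outlier_rate(inside_outliers: int, outside_outliers: int,
                 predicted: int) -> float:
    """Fraction of outliers relative to the number of predicted samples.

    inside_outliers:  samples that fit the STR type but fail the SNP test.
    outside_outliers: SNP-positive samples that fall beyond the cutoff.
    """
    return (inside_outliers + outside_outliers) / predicted


if __name__ == "__main__":
    # 1 outlier inside the type (CTS11962+), 7 L260+ outliers beyond the
    # cutoff, 176 samples predicted P type.
    rate = outlier_rate(1, 7, 176)
    print(f"{rate:.1%}")
```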

             As data accumulates, my statistical definitions of types often change by a few STR markers.  The P type definition has changed very little in the past few years.  I could not significantly improve on that P50 definition in Oct 2015, so I left it unchanged.  That file Ptype.xls has columns of various trial definitions for comparison, with indication of which STR markers are included in each.  That file Ptype2015Oct.xls has only one column of data for P50, with blank columns where you can try other definitions.

             Those files have a sheet “Haplotypes & Masks” with previous P type definitions going back to 2007.

             Pawlowski originally noticed what I now call P type.  My publication has more about this history.  The STR isolation of P type in the Polish Project is now even more impressive than at the time of my publication due to the accumulation of more data.

             Ludvik Urban pointed out to me that P type is common in the Czech Y-DNA Database.  FTDNA also has a Czech Y-DNA Project.  Karen Melis, administrator of the FTDNA Zamagurie Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on the border of Slovakia with Poland.  I’m not sure of the concentration in Slovakia.

             I added a “Ysearch” sheet to that PType.xls analysis file, with analysis from Ysearch.  The Western Slavic Modal haplotype, Ysearch 28WGP, matches P type perfectly at all 50 markers used in my definition.  That Western Slavic Modal uses 76 markers, but many of those are highly variable due to high mutation rate.  That modal is one of the Russian site modals.

             Age of P type:  The Yfull tree for L260 estimates 4400 ybp as the formation date, and 2500 ybp as the TMRCA.  In this case, the formation date is the node where the main M458 tree branch splits forming the two branches L260 (P type) and CTS11962 (N type).  The TMRCA is the node where the L260 haplogroup splits into known branches.  Subtraction gives 1900 years for the length (in time) of a smooth branch segment that includes L260 and the several phyloequivalent SNPs, spread out in time over that segment.  Estimation of such ages is uncertain due to a number of caveats, and subtraction compounds the uncertainty, but this is a long estimated time with no known branches.  This explains why P type is so well isolated;  there was plenty of time for both SNP and STR mutations, shared by all P type samples, providing a significant STR signature.

             In my 2009 publication I put the TMRCA of P type as 2,000 to 3,000 years ago.  That estimate still stands.

             Why does P type have such a long smooth initial branch?  The simplest explanation is the luck of statistics;  all the side branches became extinct.  Extinction can be accelerated by a population bottleneck.  Bottlenecks can be associated with mass migrations, wars, famines, etc.  However, a bottleneck explanation is not required because our viewpoint of P type is similar to our viewpoint of the winner of a lottery;  it is very unlikely that a particular person can win a lottery, but someone always does.

             Why is P type so large and concentrated in Poland?  One obvious explanation is a rapid population expansion in the region that is now Poland.  In my 2009 publication I speculated about such an expansion perhaps 1,600 years ago.

             Mutation analysis will not provide a definite explanation for why P type is well isolated in STRs and concentrated in Poland.  We can look in history for explanations that match in time with the TMRCA time of roughly 2500 years ago, and population expansion afterward.

             P type Signature:  DYS385a=10 is the best single STR marker for predicting P type.  In that update file PType2015Oct.xls:

             step 0;  385a=10;  182 samples:

                           169       P type L260+

                           6           L260+ P type outliers not predicted P type by P50

                           7           L260- not predicted P type by P50

             step 1;  385a=9 or 11




             step 2   17 samples no P type, no L260+

             step 3   0 samples

             step 4   1 P type L260+ with recLOH 385a,b=14,14

             That one marker alone does almost as well as the definition P50 (above).  Of course, that file only has the


             Z93.  New topic 31 Oct 2011.  This new SNP was recognized earlier this month by ISOGG as R1a1a1h.

             So far, all Z93 samples in the Polish Project are coming out L342.2+, and vice versa.

             A type, discussed here at this web page since origination, and mentioned in my 2009 publication, is a branch of Z93 (L342.2).  A type samples are coming out positive for both SNPs.

             I just today added L342T as a new cluster, a hypothetical branch of Z93 (L342.2).

             The Z93 category at the Polish Project web page has the samples that are Z93+ or L342.2+ and are not predicted A type or L342T cluster.  Z93 also includes samples that were not tested for Z93 but are close STR matches to a sample that tested Z93+.

             I tried to come up with an STR definition for Z93 (L342.2).  I could not.  Z93 does not have good signature STR markers. 

             Z93 is a good example of why calculating age of haplogroups is highly uncertain.  A type seems to be very young.  A type dominates Z93 in the Polish Project.  Maybe A type had a particularly vigorous population expansion; or maybe A type luckily avoided a severe population bottleneck;  or maybe the A type ancestors moved to Central Europe from distant lands;  whatever.  Age is calculated from STR variance, so the age of Z93 is dominated by the age of A, which is misleading and too young.  If A type samples are excluded, the age of Z93 still would come out too young, because the A type samples have a unique STR signature, which means significant STR mutations, which means the A type MRCA lived at a time when Z93 was already quite old, so the A data needs to be considered when estimating the age of Z93.  I’ll try to come up with an age estimate, for next time I update this topic.


             End of R1a Categories.


R1b Haplogroup Types


             On 20 July 2010 I added the following three R1b Types to this web document (next three subtopics, L23EE, L47P, L47A).

             Mayka had already added these three to the Polish Project web page during the previous week, based on my recommendation, based on my SBP analysis.

             I independently found these three by analyzing the Polish Project R1b data, but Mayka pointed out they were previously known as clusters.  We judge that my analysis justifies adding them to our list of types.  Since I’m using 639 samples with 67 marker data as representative of Poland, a small type clade at 1% of the Polish population would be expected to have roughly 6 samples in the database (70% confidence interval 4 to 10).  These three small types are roughly 1% each.
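             The expected count quoted above follows from simple arithmetic, and the spread around it can be explored with a binomial model.  The Python sketch below is an illustration under the assumption that each of the 639 samples independently belongs to a 1% clade.  How the 70% interval above was derived is not stated here, so the sketch does not try to reproduce it;  it only computes the expected count and the binomial probability that the count lands in a given range.  The function names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k members among n samples at frequency p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_count_between(low: int, high: int, n: int, p: float) -> float:
    """Probability that the number of members is in [low, high] inclusive."""
    return sum(binom_pmf(k, n, p) for k in range(low, high + 1))


if __name__ == "__main__":
    n_samples = 639      # 67 marker samples taken as representative of Poland
    frequency = 0.01     # a small type at 1% of the population
    print("expected count:", n_samples * frequency)
    print("P(4 <= count <= 10):", prob_count_between(4, 10, n_samples, frequency))
```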

             I’m following the current ISOGG codes for these types, which may be confusing.

             The STR definitions for these are available at Haplotypes.xls.  PolishCladesUpdate has a link to an Excel analysis file for each of these three types.

             Reminder:  These three types are calibrated to Polish data.  The definition modal haplotypes may not be optimal for other regions.  If you have Polish ancestors, and if you have all 67 markers, and if you match one of these within a step distance of 10 there is more than 80% probability that you belong to the corresponding clade.  Up to step 15 there is lower probability that you belong.  You should test the appropriate SNPs (explained below) for higher confidence.  If your ancestors are not from Eastern Europe and you are a marginal match (step distance 5 to 15) for one of these, it is not very probable that you belong to the corresponding Polish clade, because each of these types has some overlap with other clades that are rare in Poland.


             L23EE.  20 Jul 2010 documentation:  This type is positive for the L23 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a.  This type is negative for L51, the only current known branch - R1b1b2a1 - of L23.

             Nordtvedt pointed out the cluster for this type some years ago, calling it R1b-EE (Eastern Europe).  Mayka suggested the L23EE code to me.

             There are only 6 samples in the Polish Project in this type (13 Jul 2010).  SBP = 10.7% using all 67 markers, which is excellent for such a small type.  The cutoff is 12, but if you match at step 10 through 12 I estimate your probability of belonging at slightly better than 80%, so you really should test for the L51 SNP - a negative result would boost the probability to about 95%.  In the Polish Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of the samples from step 17 to 20 are L51+.  So this type is very well isolated in haplospace in Poland.

             On Ysearch (code CX94E) there are also 6 samples in this type (13 Jul 2010), but 3 are the same as in the Polish Project.  There are 7 samples at step 12 (vs zero in the Polish Project) and only 2 of those 7 are East European - one each in Germany and Russia.  That means this type is not well isolated world wide, meaning samples near the cutoff are highly uncertain.  I interpret this as evidence that my definition of L23EE type is really a Polish subtype within a larger L23EE cluster.

             This type has evidence of structure.  A number of markers are bimodal with no obvious correlation.  To me, that means there are probably at least 3 sub-clades that may become evident as data accumulates.

             If you match this type closely at 37 markers I highly recommend getting the full 67, because the statistics for assignment are not convincing at 37 markers.  Even at 67 markers, I recommend the L51 test;  a negative result confirms membership in this hypothetical clade, and a positive result means you are not a member.  We do not know the probability of outsiders matching L23EE in STR values, particularly outside Poland, so there is still a slim chance of a surprise - a close match to the definition but with L51+.


             L47P.  20 Jul 2010 documentation:  This type is positive for the L47 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1.  This type is probably negative for L44, the only current known branch - R1b1b2a1a1d1a - of L47, but that L44 negative indication is based on only one sample so far, so it is not certain.

             Mayka announced the cluster corresponding to this type on the web in March 2009.

             There are only 4 samples in the Polish Project in this type (13 Jul 2010).  SBP = 9.3% using 64 markers, which is excellent for such a small type.  The cutoff is 7 and the gap is 10.  There are no samples from step 7 to 16.  Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace in Polish data.
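             The relation between cutoff and gap can be sketched in Python.  This assumes the gap is the number of consecutive empty step values starting at the cutoff, which matches the numbers quoted for L23EE (cutoff 12, nothing from step 12 through 16) and for L47P (cutoff 7, nothing from step 7 to 16).  The individual step values in the example lists are hypothetical;  only the cutoffs and the first occupied step beyond each gap come from the text.

```python
from typing import Iterable, Optional


def gap_beyond_cutoff(steps: Iterable[int], cutoff: int) -> Optional[int]:
    """Count the empty step values from the cutoff up to the nearest outsider.

    Samples with step < cutoff belong to the type. Returns None when there
    is no sample at or beyond the cutoff.
    """
    outside = [s for s in steps if s >= cutoff]
    if not outside:
        return None
    return min(outside) - cutoff


if __name__ == "__main__":
    # Hypothetical step lists: members below the cutoff, outsiders from step 17.
    l47p_steps = [1, 2, 3, 5, 17, 18, 20]
    l23ee_steps = [2, 4, 5, 7, 8, 10, 17, 18, 19, 20]
    print(gap_beyond_cutoff(l47p_steps, cutoff=7))    # gap for the L47P example
    print(gap_beyond_cutoff(l23ee_steps, cutoff=12))  # gap for the L23EE example
```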

             This type is very robust;  the same 4 samples are selected using any number of markers from 10 to 67 with SBP <25%.

             Actually, this type is even better than the SBP = 9.3% indicates, because some of the samples at step 17 and beyond have tested negative for the SNPs in the R1b trunk leading to L47 so they clearly do not belong to this L47P hypothetical clade.

             Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but 3 of them are the same as the Polish Project.  Ysearch has 8 samples at steps 8 to 12, so the type is not as well isolated worldwide.

             The “P” in the code L47P represents my hypothesis that this type is Polish.

             Members of this type should test for L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b, which is equivalent to P312, an “uncle” haplogroup that is L47 negative.  That means there may be some interference in STR matching, probably less than 10% in Polish data, but I do not know what the exact percent interference will be until more data accumulates.

             See the last paragraph of L47A, next topic, for more comments.


             L47A.  20 Jul 2010 documentation:  This type is positive for the L47 SNP, hence this type is another hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1.  I do not know yet if this type is negative for L44, a known branch of L47.

             Mayka suggested the “A” code, since this type is obviously Ashkenazi, based on family names (see also Ysearch results, a few paragraphs down).  I presume this one is known to the administrators of Jewish DNA projects, although I did not do the research to find a first web publication at 67 markers;  I would appreciate an email of a reference to add here, even if it does not exactly match my definition.  It’s OK if an international modal haplotype differs by a few markers from a haplotype determined in Poland, particularly if the difference is at markers that are bimodal, indicating subtype structure.

             There are only 5 samples in the Polish Project in this type (13 Jul 2010).  SBP = 7.6% using all 67 markers, which is excellent for such a small type.  The cutoff is 10 and the gap is 9.  There are no samples from steps 10 to 18.  Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace.

             This type is very robust;  the same 4 samples are selected using any number of markers from 30 to 67 with SBP <10%.

             This type is better yet on Ysearch (code 7HB9C), with 18 samples (13 Jul 2010) for better statistics;  SBP = 4.6%, which is remarkable.  It might be even better with an optimized definition;  I used the modal haplotype that I extracted from the 4 Polish Project samples.

             This one does not seem as Polish as L47P, although those 18 Ysearch samples are concentrated in "Greater Poland" including Lithuania.

             So far, see ISOGG, L47 and L148 are the only two known branch haplogroups of L48.  In the Polish Project so far (20 July), no one has tested yet for L148, and all L48 so far at 67 markers are either L47P (previous topic) or L47A.  SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that do not match either L47P or L47A so belong to yet other clades.  I also have not searched the web for the STR values expected for L148.  (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.)  All this will quickly become visible when FTDNA updates their haplotree.  As of 20 Jul 2010, L48 is a terminal branch at FTDNA, so only administrators have visibility of SNP test results beyond L48, including L47 and L148.  Mayka provided the SNP data that I have documented here.


I Haplogroup Types

             Update 25 Mar 2012.

             At the end of July 2010 I added two types from the I haplogroup to this web document.  I independently found these two by analyzing the Polish Project I data.  Mayka informed me that they were previously known as clusters, hypothetical clades, discussed some time previously by Nordtvedt.  Mayka added these two to the Polish Project web page in July 2010, based on my recommendation, based on my SBP analysis.  One is a branch of what has previously been called I2-CE, and seems to represent a Polish collection of M223 branches so we named it M223CE type, discussed in the next topic.  The other seems to be a Polish branch of I1-M253, so we named it M253P type, discussed in a topic below.  I am now also using the short code names I-CE and I1-P for these.  I am now splitting I-CE into I-C, I-D, and I-E, topics below.

             My STR definitions for these are available at Haplotypes.xls, in the Excel analysis files discussed below, and at Ysearch.

             These types are calibrated to Polish Project data.  The I1-P definition WC8JD forms a type in the Ysearch database, so it seems to be reasonably valid world wide.  The I-C definition SB6YK, and the I-E definition QUXE3, are probably not valid at Ysearch for a sample with origin remote from Historical Poland, because of interference by other clades with similar STR values, particularly from Russia.


             I1.        Edited 27 Aug 2015.

             Input new topic 19 Aug 2015 by Paul Stone.

             I1, defined by the SNP M253, is unique in that it has the signature STR value DYS455 = 8, which is present in approximately 99% of all I1 samples.  It is also useful that DYS455 = 8 is nearly non-existent outside of I1.  I1 is unusual for the very large number of equivalent SNPs.  See http://www.yfull.com/tree/I1/ for the list of about 300 SNPs that are equivalent to M253.  Surely I1 must have had many male line branches in the past along that long smooth branch, but apparently they went extinct;  it is possible a branch (a new node along this smooth branch) will be discovered in the future.  Perhaps there was a population bottleneck.  This is supported by the calculations done by Yfull, with the result listed in that tree reference:  a formation date of I1 at 27,500 ybp but a TMRCA of only 4,700 ybp, the long interval between the two corresponding to the large number of equivalent SNPs.

             In the Polish Project, I1 is almost all I1a (DF29), which comprises about 6.4% (high confidence range 4% - 9%) of the Polish male population based on the statistically adjusted Polish Project data (see New Results Table).  The website www.ydna.eu lists the I1 population of Poland at 8.5% based on a sample size greater than 1,000.


             I-CE.  (M223).  Update 25 Mar 2012.  ISOGG code is now I2a2a;  last year’s code for M223 was I2b1, still being used at FTDNA and the Polish Project.

             All the I-CE samples in the Polish Project fall into one of the 3 branches discussed in the following topics.

             The M223 clade is very well isolated in STR haplospace.  FTDNA is able to predict I2b1(M223) with high confidence using only the first 12 standard markers, for more than 90% of the samples.  Using 67 markers, I found that any reasonable definition does a good job of extracting M223 samples from Y-DNA STR data.  A good definition is available on Ysearch, code 4H6C9, using 62 of the 67 standard markers plus 8 additional markers (Mar 2012).

             STR isolation in the Polish Project is generally evidence of a single Polish clade.  It is possible that two or more clades with distant nodes in the Y-DNA tree might have similar STR values by coincidence.  In the case of Polish I-CE, since the larger I-CE world-wide clade is well isolated, my Polish I-CE type might well be a collection of multiple clades, perhaps including some clades that are not particularly concentrated in Poland.  My original M223CE type used 4 of the 8 I-CE samples back in 2010.  There are now 12 I-CE samples, and they form two types plus one cluster.  It may seem silly to split these into 3 branches, but there are new SNPs, discussed below, that justify the split as valid haplogroups.  These small types are interesting because they are preliminary evidence of small Polish clades.

             CE stands for Continental Europe, including Britain.

             The M223-Y-Clan project has lots of data;  I used this project data for reference.

             A good signature is (392, 437, 450) = (12, 14, 9), which distinguishes almost all M223 samples from others, allowing one mutation step.  (594) = (11) is also an excellent signature for M223, with the value 10 dominant outside M223, but this one is strange in the Polish Project, where 4 of the 12 samples have value >11;  this is evidence that I-C might comprise two clades.
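             The signature test described above can be sketched in Python.  This is only a minimal illustration, not the procedure used in the Excel files.  It assumes that "allowing one mutation step" means the total repeat difference over the three signature markers is at most 1;  the function names and the three sample haplotypes are hypothetical, invented for the example.

```python
# Signature from the text: (DYS392, DYS437, DYS450) = (12, 14, 9).
M223_SIGNATURE = {"DYS392": 12, "DYS437": 14, "DYS450": 9}


def signature_steps(sample, signature):
    """Total step difference between a sample and a signature.

    Assumption: a difference of k repeats at one marker counts as k steps.
    """
    return sum(abs(sample[marker] - value) for marker, value in signature.items())


def matches_signature(sample, signature, allowed_steps=1):
    """True if the sample is within `allowed_steps` of the signature."""
    return signature_steps(sample, signature) <= allowed_steps


# Hypothetical samples, invented for illustration only.
exact = {"DYS392": 12, "DYS437": 14, "DYS450": 9}
one_off = {"DYS392": 12, "DYS437": 15, "DYS450": 9}
far = {"DYS392": 11, "DYS437": 16, "DYS450": 8}

print(matches_signature(exact, M223_SIGNATURE))    # True
print(matches_signature(one_off, M223_SIGNATURE))  # True
print(matches_signature(far, M223_SIGNATURE))      # False
```

             A short signature like this is a quick screen only;  the 67 marker definitions remain the basis for assignment.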

             At Ysearch, the percent Polish samples for I-M223 is low.  The following 3 STR definitions, my proposed Polish branches, capture a small fraction of M223 at Ysearch.

             My Excel file I-CE.xls has analysis of this type and also analysis of the following three branches.  That file has ASD analysis, but ASD age is very misleading when calculated from samples that are a collection from multiple large old clades.  The three branches have too few samples to attempt age estimates.


             I-C.  (M223+ P78-).  (I-C Type Branch).  New topic 25 Mar 2012.  I-C type is a hypothetical subdivision of I-CE (M223).

             I-C type includes all 4 samples assigned to I-CE last year, plus one that was missed last year, plus 3 new ones, for 8 total at 67 markers in the Polish Project.  SBP has improved from 19% to 2.6% over the past year, so this is a clade with high confidence due to the excellent isolation, although there is a chance it may be two or more independent clades as discussed above.

             My Excel file I-CE.xls has analysis of this type in column CJ, SBP=2.6%.  My definition uses 67 markers, cutoff 20, gap 14.  There are no Polish Project samples in the gap from step 20 through 33, so this type is very well isolated.  This definition also isolates I-E type, 4 samples, steps 34 to 42, but there is a better definition for I-E, see the next topic. 
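             The relation between cutoff and gap can be sketched as follows.  This is a hedged illustration that assumes the step count of each sample from the definition has already been computed, that samples below the cutoff are assigned to the type, and that the gap is the number of consecutive empty steps starting at the cutoff.  The function name and the step counts are hypothetical, chosen only to mimic the I-C description (8 samples below step 20, nothing at steps 20 through 33).

```python
def split_by_cutoff(step_counts, cutoff):
    """Return (members, gap) for a list of per-sample step counts.

    members: step counts below the cutoff (assigned to the type).
    gap: number of consecutive empty steps starting at the cutoff,
         or None if no sample lies at or beyond the cutoff.
    """
    members = [s for s in step_counts if s < cutoff]
    outside = [s for s in step_counts if s >= cutoff]
    gap = min(outside) - cutoff if outside else None
    return members, gap


# Hypothetical step counts, invented for illustration only.
steps = [3, 5, 7, 9, 11, 12, 15, 18, 34, 36, 40, 42, 45]
members, gap = split_by_cutoff(steps, cutoff=20)
print(len(members), gap)  # 8 14
```

             With these invented numbers the nearest outside sample is at step 34, so the gap is 34 - 20 = 14, matching the "cutoff 20, gap 14" wording.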

             There are no Polish Project samples at step 43 or 44.  There is only one I2b2 sample (not M223) at step 45.  Then there are no further samples at steps 46 through 52.  So this I-C definition also captures all of the broader I-CE (M223), although surely a better I-CE world wide definition could be constructed.

             A good signature is (406, 487) = (10, 12), which itself distinguishes the 8 I-C samples in the Polish Project.

             Two of the I-C samples are I-D samples, discussed below.  Two other I-C samples have the same family name, very close in STR values.  The remaining 4 samples in I-C are not particularly close to each other in STR values.  The SNP data for each sample is included in column BX of the “Calculator sheet”;  4 of the samples tested negative for all 4 known haplogroup branches of I-M223.  So I-C seems to capture M223* plus P95 (below) in the Polish Project.

             My definition is also available at Ysearch, SB6YK.  On Ysearch there are plenty of samples from step 20 through 33, so this definition does not work world-wide.  The closest fits are not concentrated in Poland, so if I-C truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.


             I-E.  (M223+ P78+).  (I-E Type Branch).  New topic 25 Mar 2012.  ISOGG now I2a2a3;  last year’s code for P78 was I2b1c, still being used at FTDNA and the Polish Project.

             My Excel file I-CE.xls has analysis of this type in column CM, SBP=13%.  My definition uses 67 markers, cutoff 19, gap 7.  There are no Polish Project samples in the gap from step 19 through 25, so this type is very well isolated.  All the I-C samples are at steps 26 to 44, and no other samples fall in that range, so this definition also nicely separates I-C from I-E in the Polish Project.

             A good signature is (393, 459a, 446) = (15, 9, 10), allowing one mutation step, which distinguishes the four I-E (P78) samples in the Polish Project.

             Both the P78+ Polish Project samples are in the M223-Y-Clan Project, and there are 13 others, but there are many more P78- in M223-Y-Clan, so this is not a particularly large subdivision of M223.

             The other two I-E samples in the Polish Project have not been tested for SNPs, but both have P78+ close matches on Ysearch, and no close matches from the other 3 branches of M223, so those are likely also P78+.

             There are two other known haplogroup branches of M223:  M379 has no positives in M223-Y-Clan, and plenty of negatives, so it is very rare.  M284 has plenty of positives in M223-Y-Clan;  that branch is a large subdivision with a couple known branches of its own, but no samples in the Polish Project.

             My I-E definition is also available at Ysearch, QUXE3.  The Ysearch closest matches are I2b1c, so my definition is good at extracting P78 samples, but I suppose a better definition could be constructed for the world-wide P78 data.  On Ysearch there are plenty of samples from step 19 through 25, including some I2b1c beyond step 25, so this definition does not work world-wide.  The closest fits are not concentrated in Poland, so if I-E truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.


             I-D.  (M223+ P95+).  (I-D Cluster).  New topic 25 Mar 2012.  ISOGG now I2a2a4;  last year’s code for P95 was I2b1d.

             There are only 3 samples P95+ in the M223-Y-Clan Project, and many P95-, so this is a small haplogroup.  Those 3 include one but not both of the Polish Project I-D.  Two of those 3 have Poland listed as origin, and the third has no origin listed, so this may be a Polish clade, but it is too soon to tell.  It is possible that I-C has a larger subdivision Polish branch, of which this I-D may be a branch, but this is just speculation until we get more data.

             (640) = (13) seems to be a signature for I-D, but a single STR marker is not very reliable for prediction.

             I did not enter a definition into Ysearch.  The two I-D samples are highlighted bold blue in column CI of I-CE.xls.  Only one sample is P95+ in the Polish Project - the one that is also in the M223-Y-Clan Project, so I used that sample as the definition.  There is a sample at step 10, and none others out to step 22, so I tentatively assigned that step 10 sample to I-D, forming a cluster of two samples, SBP=25%, well isolated from others but not a type.


             I1a1b (L22).  Edited 29 Aug 2015.

             Input new topic 19 Aug 2015 by Paul Stone.

             The haplogroup defined by the SNP L22 is a major branch of I1a (DF29) and indeed a major branch of I1 (M253).  L22 is concentrated in Scandinavian Countries, but is also found elsewhere in Europe.  In Poland L22 is about 1.5 % (see ResultsTableNew.xls).

             L22 is a branch of CTS6364, which is also called I1a1 (ISOGG) and also equivalently called Z2336 (Yfull).

             The paragroup CTS6364+ L22- has the STR signature (390,385a,385b) = (20,14,14) although the signature alone is not a foolproof prediction.  This paragroup has recently been divided into a large number of new haplogroups, due to the flood of new SNPs.  A good place to view all these new divisions of I1a1 is http://www.yfull.com/tree/I-Z2336/.


             I1-P (Y6349). 

             Comment 24 Sep 2015.  For clarity, we’ll start using the name I1-P for this haplogroup, changing from I-P.

             Rewrite 29 Aug 2015.  Edited 11 Sep 2015.

             Input rewrite 19 Aug 2015 by Paul Stone.

             The recently discovered SNP Y6349 defines a haplogroup that seems to be equivalent to what we have been calling I-P type, changing to the name I1-P.  Marek Skarbek Kozietulski has recently started calling this haplogroup I1-Vistula, because most of the samples are from the basin of the Vistula river.

             Y6349 is more than half of that paragroup CTS6364+ L22- discussed in the previous topic:  I1-P about 1.0%;  paragroup including I1-P about 1.7% (see ResultsTableNew.xls).

             Samples (men) with predicted assignment to I1-P can verify their status by testing for Y6349 at FTDNA.  At Yfull, Y6349 is listed as the equivalent Y6340.  Another equivalent SNP, Y6354, is the one actually available for ordering at Yfull, so Y6354 is the recommended equivalent SNP test at Yfull.  See SNP ordering information.  Be sure to send me an email if you test at Yfull, so your results can be analyzed along with the other I1-P type samples.  Actually, Big Y is a better test (if cost is not an issue), because for I1-P samples Big Y is very likely to define a new branch, further dividing Y6349.

             Sequence for locating Y6349/Y6340 at the Yfull tree (end of August):  IJK> I1> DF29> Z2336 (do a word search for it)> Y3866> S4767> S7642> Y6340.

             History:  On 26 July 2011, I added this I-P Polish type for I1 haplogroup to this web page.  This type had already been known as a cluster for a few years.  Mayka pointed out to me that Nordtvedt listed it on the web.  Marek Skarbek Kozietulski has studied this cluster quite a bit, since he’s a member.  I mentioned this type briefly in my publication, where I was previously calling it Y type, considering it not high confidence based on the data available then in 2009.  I have also called this M253P type, because M253 defines the main branch I1.

             My analysis file is I-PType.xls.  That file was generated in 2012 with only 11 samples.  Recent data (Sep 2015) has an additional 4 samples that fit this type well, enhancing the confidence in this type.  My definition for I1-P type, from 2010, uses 54 markers, cutoff 4, gap 5, no samples in the gap from steps 4 through 8 in the Polish Project at 67 markers.  SBP came out 5.0%.  Marek informs me that he had identified 4 men who matched at 12 markers and actively recruited them to obtain all 67 markers and to join the Polish Project.  That means only 7 of these 11 samples should be used for statistical purposes.  SBP calculated on the basis of 7 samples is 8.7%.  This low SBP along with that large gap of 5 are compelling evidence that I1-P is a clade that is isolated in haplospace.  I used all 11 samples in my analysis file in order to best estimate the definition, which is also available at Haplotypes.xls.  That confident prediction from a few years ago looks like it will soon be confirmed by Y6349;  we need a few more samples to be tested, to confirm that all I1-P samples, and only I1-P samples, fall into Y6349/Y6340.

             Nordtvedt's I1 Tree has this I1-P type as I1*-P1, with related clusters I1*-P2 and AS4.

             A good signature for I1-P is (391, 392, 447) = (11, 12, 24), although this signature alone is not foolproof for distinguishing I1-P from all other I haplogroup samples.  My definition gives better than 80% confidence of assignment to I1-P for samples below step 8, and better than 90% confidence for samples below step 4.

             Here is some interesting speculation for which I do not have convincing statistical evidence:  Marek points out that a sample at step 4 on Ysearch is Danish, which adds to his evidence that there might be a related clade in Denmark, perhaps with a node in the I1 tree slightly older than the node for the I1-P Polish clade.  I do not know where that Danish sample falls in Nordtvedt’s tree.  I do not know where that Danish sample falls in the recent flood of new SNPs.

             Ysearch provides evidence of concentration in Poland.  My definition is available as WC8JD.  The PCI Table provides my results using Ysearch to identify clades concentrated in Poland;  I1-P comes in third in confidence.

             Age estimate of Y6340 at Yfull, based on the large number of equivalent SNPs:  Formation 3,400 years before present;  TMRCA 1,500 years before present.


             I1-Z63. New topic 23 Sep 2015 by Paul Stone.

             Haplogroup Z63 is also called I1a3 (ISOGG, Sep 2015).

             The Z63 SNP is estimated to have formed around 4,700 ybp.  In terms of raw numbers, the Z63 population is small compared to the large CTS6364 and Z58 subgroups and is similar in size to P109 or Z73.  The geographic range of Z63 is vast and stretches from Iceland to Spain and into the Balkans and central Russia.  The subgroup is primarily Continental in nature, with Scandinavian Z63 samples being relatively few in number.  Z63 has the highest density in Central Europe based on empirical data.  Throughout Poland, several different Z63 lineages are represented.  SNPs downstream of Z63 found in Poland include BY351, FGC14480, L1237, PR683, S2078, S10360 and S15301, with others yet to be determined.


N Haplogroup Types


             N-G.  (N-L551).  (N-G Type).  Update 22 Mar 2012.  Introduced on 17 Oct 2010 as “N1c1(M178)-G type”.  The latest ISOGG code is N1c1d1a (L551).

             Mayka suggested this one, based on a suggestion by Andrzej Bajor, from his Rurikid Dynasty Project.  This type is concentrated in Lithuania, and Andrzej suggests that at least one member might be a male line descendant of Gediminas, the medieval Lithuanian Duke.  Hence the “G” code.

             This type has 9 samples at 67 markers very well isolated in the Polish Project with SBP = 8.9%.  See N-GType.xls.  The definition is also available at Haplotypes.xls and at Ysearch as RGE95, using 51 markers, cutoff 3 (samples < step 3).  All but one of the N-G samples can be extracted from the Polish Project using only the signature (392, 607, 557) = (15, 14, 13).

             This type should not be confused with another G type in the R1a haplogroup.

             That new L551 SNP verifies our prior prediction that G type corresponds to a clade.  All 9 of the predicted G type samples at 67 markers have tested L551+, and samples predicted just beyond G type are coming out L551-.  Of course, there will probably be a few exceptions as more data accumulates, but so far N-G type (STR match) is equivalent to L551 in the Polish Project.

             At Ysearch, N-G type is not as well isolated;  the SBP is 22% with cutoff 4, due to interference by what might be a Russian clade.  There are many Lithuanian samples matching my N-G definition (RGE95), including Lithuanian samples beyond the cutoff (step 3).  46% of the Ysearch samples below step 9 indicate Lithuanian origin.  L551 is too new to be included in Ysearch, so this paragraph refers to N-G type as defined by STRs.

             I do not know if the Polish Project N-G samples are an independent Polish sub-clade of a larger Lithuanian clade;  or if the Polish Project samples are just a random sample of individuals from a larger clade(s).  I have not taken the time to search other projects for STR matches to my N-G definition, or to search for more L551+ samples.  Someone might inform me before I get a chance to search.  Watch this topic for updates.

             The age of N-G type seems to be less than 1,000 years, perhaps only 500 years.  Check the “ASD” sheet in my analysis file.  ASD age is highly uncertain, particularly for such a small sample, but G type has little STR variance, so surely G represents a clade younger than 2,000 years old.  Isolation is evidence of an old node, with TMRCA much younger than the node.  The age of the L551 mutation can be anywhere in the time span older than the TMRCA of G type and younger than the node.  N-G type is well isolated in Lithuania and Poland, but N-G may have a relatively young node with those other clades world-wide with similar STR values.  Those other clades can be used to better constrain the age of the L551 mutation.


             N-M.  (N-L591).  (N-M Cluster).  Update 22 Mar 2012.  Mayka suggested this one also, introducing it at the Polish Project in Jan 2011, as “N1c1(M178)-M Cluster”.  The latest ISOGG code is N1c1d1b (L591).  Includes Mickevius (Mickewicz) descendants.  Hence the “M” code.  Also concentrated in Lithuania.  These two, N-G and N-M, are a small fraction of the M178 clade.

             I call this a cluster because it does not meet my criterion SBP<20% to be called a type.  Actually, the original proposed cluster is equivalent to what I am now calling Ma cluster, discussed below.  The recent new SNP named L591 is coming out with about twice as many samples, so we have adopted the “M” short code name for the STR data for L591;  this larger N-M cluster is therefore considered equivalent to N-L591.

             My analysis is available, N-MCluster.xls, 10 samples at 67 markers.  My best automatic definition for N-M, column CL, SBP=25%, is 80% accurate, missing one sample that is obviously L591 and predicting one sample that came out L591-, out of 10 predicted.  Actually, this result is a nice confirmation of my SBP method, because although the data has only 10% background (false positives captured by the definition), my SBP formula has an increase to account for statistical confidence;  hence 25% is a better upper confidence estimate of the background for so little data.  I bet as more data accumulates my best N-M definition will drift below SBP=20%, qualifying as a type.  Anyway, this is moot, because L591 is a better criterion for the clade, and there is a logical distinction between the N-M cluster (samples with STR correlation) and the L591 haplogroup.  My definition serves as a guide for priority for L591 testing.  Testing should be concentrated near the cutoff.

             Accordingly, I came up with an improved STR definition for L591, using a mask to manually adjust marker selection.  I’ll still call it by the short code N-M.  Column CC in that file.  SBP=50%, but SBP does not matter here, because the purpose of the definition is not to discover a hypothetical clade, but to predict samples for a known clade.  Most clades do not produce low SBP because most clades are not well isolated.  Let me elaborate with discussion of the statistical issues for N-M:

             Obvious issue:  There are three N-M samples with a very rare 6 step mutation at DYS446, from 17 to 11.  Without DYS446, two of these three marginally fit the N-M cluster (based on STRs).  These three seem to represent a subclade of L591 with modal STRs slightly drifted since their node. I marked them as “Mb” in that Excel file.  Only one of these has actually tested L591+.  Another one of these is that “obviously L591” sample that I mention above, the “obviously” based on this 6 step mutation, which is almost as good a marker as an SNP.  That “obviously” sample is an STR outlier at other markers, which need to be excluded from the L591 definition, assuming more samples like this will show up.  This seems obvious, but it needs verification with more data over the near future.

             Speculative issue:  There are two other outliers, which I labeled Mc and Md.  Tested L591+.  These may represent two clades with nodes only slightly younger than the TMRCA for L591, with independent modal drift.  Highly uncertain.  They might just be statistical outliers, due to the luck of random mutation.  Again, more data will tell.  For now, I adjusted the N-M definition to capture them, on the assumption that some future samples might come up with similar STR values.

             Another issue:  That one sample, mentioned above, fitting the M cluster very well but L591-, probably represents a clade with a node slightly older than L591, but similar STRs by coincidence;  there may be other such clades.  Again, this is speculative, but I adjusted my definition to exclude this one.

             Statistical speculation summary:  L591 does not seem very well isolated in haplospace, albeit more isolated than most young Y-DNA clades.  It seems the L591 tree has nodes close to the SNP age, both younger and older.

             My L591 definition is available in that Excel file, in Haplotypes.xls, and at Ysearch as 64RUG.

             This L591 clade seems to be concentrated in Lithuania.  The evidence is Ysearch - Lithuanian concentration of the N-M cluster.  L591 test data is not available yet at Ysearch.  My Ysearch analysis (data in a sheet in that Excel file) is similar to the G type analysis:  SBP not as good because of apparent interference from clades world-wide.  Using the N-M definition at Ysearch, there is Lithuanian concentration at steps well beyond the cluster cutoff, so there seems to be a larger Lithuanian clade.

             In the Polish Project, I spotted evidence of such a larger STR type, about double the size of N-M, including all the N-M samples as a sub-clade.  I colored these samples green in column BX of N-MCluster.xls, using all 67 markers.  I dubbed this one N-L type.  That 67 marker evidence is not satisfactory because it captures a couple N-G samples.  In another file, not posted on-line, I came up with a satisfactory definition for N-L;  I provide it in the “Haplotypes & Masks” sheet, row 21, of N-MCluster.xls.  Mayka advises me that there are two new SNPs, L1025 and L1027, that are currently candidates for a haplogroup larger than L591.  We are waiting to see how those come out before introducing N-L.  That N-L definition cutoff provides a suggestion of where to prioritize SNP testing.

             The age of N-M (L591) comes out similar to the age of N-G type, probably less than 1,000 years; see that short paragraph in the N-G topic above.  My comments about isolation of N-G in the Polish Project do not apply to L591.  For N-M, it is important to exclude DYS446, because that one marker triples the age as calculated using ASD (STR variance), due to that 6-step deletion mutation mentioned above.  You can see this by editing cell BV21 in my mask in my “ASD” sheet in that file.  Another way to edit this is to edit the 446 value, to make the mutation count one or two, which is more representative of the age.  This is a good example of one of the caveats associated with age calculation based on STR variance.
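             The effect of one multi-step mutation on ASD can be illustrated with a small sketch.  This is not the calculation in the “ASD” sheet;  it assumes ASD is the squared distance from the modal value, averaged over samples and then over the unmasked markers.  The function name and the five haplotypes are hypothetical, invented for the example;  only the 17 to 11 change at DYS446 is taken from the text.

```python
from statistics import mean, mode


def asd(samples, markers):
    """Average squared distance from the modal value, averaged over markers.

    Assumption: each marker is weighted equally and the modal value
    stands in for the ancestral value.
    """
    per_marker = []
    for marker in markers:
        values = [sample[marker] for sample in samples]
        modal = mode(values)
        per_marker.append(mean([(v - modal) ** 2 for v in values]))
    return mean(per_marker)


# Hypothetical haplotypes, invented for illustration only.
# One sample carries the 6-step change at DYS446 (17 -> 11).
samples = [
    {"DYS446": 17, "DYS390": 23, "DYS391": 10},
    {"DYS446": 17, "DYS390": 23, "DYS391": 10},
    {"DYS446": 17, "DYS390": 24, "DYS391": 10},
    {"DYS446": 17, "DYS390": 23, "DYS391": 11},
    {"DYS446": 11, "DYS390": 23, "DYS391": 10},
]

with_446 = asd(samples, ["DYS446", "DYS390", "DYS391"])
without_446 = asd(samples, ["DYS390", "DYS391"])  # DYS446 masked out
print(with_446 > without_446)  # True
```

             A single 6-step change contributes 36 to the squared distance, as much as 36 ordinary one-step mutations, which is why masking the marker (or counting the event as one or two steps) gives a more representative age.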


             N-Ma.  New topic 20 Mar 2012.  This is the original “N1c1(M178)-M Cluster”, explained in the previous topic.  Only 3 samples when introduced Jan 2011, SBP=36%.  Now there are 5 Ma samples, SBP=30%.  Although still not qualified as a type, there is better than a 30% chance this will improve over the next couple years as data accumulates.  Lithuanian concentration, same as N-G and N-M.  Again, I do not expect validity world-wide for N-Ma because of interference from other clades world-wide, but this might grow into a nice small, young Lithuanian clade.  Analysis is in N-MCluster.xls, where the 61 marker definition for Ma is in column CG.


Poland Concentration Index;  PCI


             New Topic 22 Feb 2014.

             [PCI Table.  Recovered column heading:  “in Polish Project %”.  Recovered labels, in order:  I type;  division of Z92;  Z92Y type;  division of;  I1a1-P type;  G type;  P type;  division of;  J type;  L366 type.]

             I introduced a preliminary version of this index in my 2009 Publication, page 161, with results in the far right column of Table 1, page 162.  In 2009 I defined this preliminary index as the percent of samples, of a given haplotype, that have the word “Poland” included in the “Origin” field for male line ancestry, for Y-DNA data at Ysearch.

             In this 2014 version, I extend the index to types and clusters.  In this topic, for brevity, let me use the general term “cluster” to mean a cluster, or a type, or a haplotype, or a haplogroup, or a paragroup, or a clade, or any other word for a category of Y-DNA STR data, although I make a technical distinction at the bottom of this topic.

             In this 2014 version, the PCI is statistically adjusted for data sample size, as explained a few paragraphs below.

             There are a number of reasons why % “Poland” from Ysearch is not an accurate measure of concentration in Poland.  I mention some of these reasons in my 2009 Publication.  I plan to add a longer discussion to this web page, with more detail about objections to statistics from Ysearch.

             However, the objections are not a very serious problem if we are interested in relative concentration in Poland.  Comparing clusters, we expect the clusters with higher % Poland at Ysearch to likely be more concentrated in Poland than those with lower % Poland.  This is one reason I do not use a % sign for PCI.  The other statistical reasons are explained below in this topic.

             Caveat:  You may use my PCI for another country, for example England, with a new “ECI” related to % “England” at Ysearch.  It would not be valid to compare the PCI numbers to the ECI numbers, because we do not expect equal joining probabilities.  Men with English male line ancestors do not necessarily join Ysearch in proportion to men with Polish male line ancestors.  However, we do not expect significantly different joining probabilities for men with Polish male line ancestors in different clusters.  There are exceptions, which I leave for future expansion of this discussion here (for example Polish Ashkenazi clusters).

             The problem of false positives:  We expect “false positives” - clusters with high % Poland just due to statistical probabilities (the luck of how many Polish and non-Polish men that would fit that particular cluster just happened to join Ysearch).  The more clusters we study the more false positives we expect to find.  The more clusters we study the more likely we will find a false positive that seems very highly concentrated in Poland.

             For small samples of data, the statistical uncertainty is larger, so we expect more false positives.  Suppose we check a large number of clusters for % Poland at Ysearch (or at any database), and suppose many of those clusters have fewer than 10 samples at Ysearch, and suppose some of those clusters have fewer than 5 “Poland” samples.  We will surely find false positives.  I discuss this sample size uncertainty in my 2009 Publication, where I used the lower bound of confidence range as a method to compensate for this statistical problem, particularly in small clusters.

             My PCI uses the lower bound of the 95% confidence range.  For example, a PCI = 25 means 95% confidence that the “true population % Poland” is 25% or greater.  By “true population % Poland” I mean the % value that would show up in a much larger database drawn from the same population in the same way (in the distant future at Ysearch, for example, if Ysearch is still popular in the distant future).

             If you are knowledgeable about statistical methods, you may quickly understand the details of my method from my file Ysearch.xls;  check the “Summary” sheet and the “Instructions” sheet.  There is a technical statistical explanation of PCI near the center of the “Instructions” sheet.  Check the other sheets for specific cluster results.  My automatic procedures use macros;  if you are concerned about macros your browser should allow you to open my file in “View Only” mode.

             Even if you are not knowledgeable, you might try following the “Instructions” sheet to evaluate your own clusters.

             The “Summary” sheet in the Ysearch.xls file has 5 example rows labeled “50% Tests”:  one row shows that a cluster with 100 “Poland” samples out of 200 Total has 50% Poland, and has a lower 95% confidence limit of 44.8%, so the PCI is 45.  However another row shows that a cluster with 5 “Poland” samples out of 10 Total also has 50% Poland, but has a lower 95% confidence limit of 25.3%, so the PCI is only 25.  In other words, smaller clusters get more downgrading to compensate for small sample statistics, but small clusters are allowed.
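             The downgrading of small clusters can be illustrated with a short sketch.  The formula used in Ysearch.xls is not reproduced here;  as a stand-in, this example assumes a one-sided Wilson score lower bound, so its results are close to, but not the same as, the 44.8% and 25.3% quoted above.  The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(hits, total, confidence=0.95):
    """One-sided Wilson score lower bound for the proportion hits/total.

    Assumption: this stands in for the spreadsheet's own confidence
    formula, which may differ.
    """
    if total == 0:
        return 0.0
    z = NormalDist().inv_cdf(confidence)
    p = hits / total
    centre = p + z * z / (2 * total)
    spread = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - spread) / (1 + z * z / total)


# Both clusters are 50% "Poland", but the small one is downgraded more.
large = wilson_lower_bound(100, 200)
small = wilson_lower_bound(5, 10)
print(round(100 * large), round(100 * small))
```

             Whatever the exact formula, the pattern is the same as in the “50% Tests” rows:  the same raw 50% yields a much lower index for 5 out of 10 than for 100 out of 200.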

             This is the main reason I leave out the % - to avoid confusion - probability vs confidence - PCI is a mix of both.  PCI allows small clusters to be compared to large clusters, where the smaller clusters get adjusted to a lower index number.

             There is another issue, selection bias:  In defining clusters, we choose the number of markers, and the cutoff, to best define the cluster.  The cutoff should be adjusted to best capture Poland samples, as demonstrated in the example sheets in Ysearch.xls.  Selection bias effectively reduces confidence, because we may be selecting parameters based on statistical flukes.

             Those “objections to Ysearch”, mentioned above but not listed, also reduce confidence.  Many of the objections cause variation of results, with more variation expected for smaller clusters.

             So my net confidence is not 95%, but less.  This is my style here:  calculate PCI using 95% sample size confidence because sample size confidence can be easily calculated in an Excel sheet.  All those objections, including selection bias, reduce confidence, with larger effect expected for smaller clusters, but smaller clusters get more reduction in PCI.  I suppose my net confidence is about 80%, although this estimate is based on subjective experience - difficult to document with words.  I expect about 80% of my PCI predictions to slowly increase over the years as more data accumulates at Ysearch, with smaller clusters increasing faster in PCI.  I expect roughly about 20% of my PCI predictions to drop over the years, meaning roughly 20% overestimation of Poland concentration confidence.  On the other hand, this PCI method automatically ignores small clusters with lesser concentration in Poland, many of which will significantly increase in PCI as data accumulates.

             My Excel sheet automatically removes “Modal” Ysearch data, and excess “family set” data, as explained in the “Instructions” sheet, where the user is invited to manually edit the results, because human judgment is an improvement over automatic editing.

             The Polish Project is representative of the historical Commonwealth of Poland, a geographic area much larger than modern Poland.  PCI is weighted toward men with ancestry self-described as from Poland.  One of these days I might add here a discussion topic about this complex topic.  Check the web if you wish discussion now.

             My PCI index can be used for haplogroups, but technically I do not do this, because the haplogroup assignments are not up to date at Ysearch.  I like Ysearch because of the huge amount of STR data at 67 markers.  Many of my STR based types are “almost equivalent” to SNP based haplogroups.  For example, P type is almost equivalent to L260.  “Almost equivalent” means a few haplogroup samples are STR outliers and a few samples from other haplogroups marginally fit the type just below the cutoff.  In my tables this technical distinction between P type vs L260, and other equivalents, may not be obvious to you.


Instructions for Use of Ysearch


             Update rewrite 21 Sep 2015.

             Link to the site:  http://www.ysearch.org.  Brief description of Ysearch.

             Click on the Create A New User tab, where you can upload your Y-DNA STR data from a number of testing services.  Or, you can type in your data.  You end up with a “User ID”.

             Click on the Search for Genetic Matches tab to search for Ysearch members closest to you in STR values.

             Ysearch has a Research Tools tab to click, where you can type in other User ID’s for detailed comparison to your data.

             Ysearch does not keep up very well with new SNPs, so this is not the best place to find out your location in the Y-DNA tree.  It is better to join an FTDNA project, where the administrators help to figure out your terminal branch in the tree.  However, Ysearch has a lot more data, so you might find closer matches here.


             My Definitions.  I have entered a number of definitions into Ysearch, for the types of interest to me.  These are modal haplotypes;  they do not correspond to any real person.  These definitions use only some of the STR markers, so it may be misleading if you seem to match one of them closely.  Issues:  These are based on selected markers from the 67 marker set, so you need to have the full 67 marker data.  If you are a perfect match (Ysearch reports Genetic Distance = zero) then it is highly likely you belong to that type.  If your match is Genetic Distance 1 or 2, it is less likely but still a good bet you belong to that type.  If your match is 3 to 10, it depends on the type;  some types are more restrictive than others;  you need to read the documentation about that type, so search for it in this web page.

             Examples:  P type is 8U92G at Ysearch;  I type is EKVHX.  For a full list see Haplotypes.xls.


L260 and M458 Signatures


             For background explanation of L260 and M458, see the topic L260 and M458.

             This topic was written 17 Jan 2011.

             DYS385a.  The single STR marker called 385a is by far the best signature for predicting P type vs N type.  All 34 samples with L260+ result so far have the value 10.  All 25 samples with L260- result so far have the value 11.

             Mayka also independently noticed this and mentioned it to me.

             Usually, a signature with more STR markers predicts better.  In this case, discriminating P (L260+) from N (L260-), 385a=10 predicts best by itself.  No signature with 2 or more markers discriminates better.  In fact, just 385a=10 works as well as the 46 marker P type definition.

             This seems amazing, but is not entirely unexpected.  STR markers have lower mutation rates at lower values, and step down mutations are less frequent than step up.  Since N type has mostly 385a=11, a step down to 10 should occur less often than a step up to 12.

             The mutation rate of 385a=10 in P type (L260+) seems very low.  At another of my web pages I postulate a rare SNP in the middle of a long STR chain to explain a low mutation rate, but such a postulate does not seem necessary in this 385a case because of the short STR chain value.  For the lower rate at lower STR values, I provide a reference to Whittaker (2003) in my publication.

             We can predict that future M458+ samples will be L260+ if 385a=10 and L260- otherwise.  The probability is 100%.  Exceptions are zero out of 59 L260 results so far.  I figure the confidence of this prediction at 94%:  Poisson 94% confidence interval for zero is the interval zero to 3.5;  (1-3.5/59) = 94%.  In other words, I am 94% confident that 3 or fewer samples out of the next 59 L260 measurements in the Polish Project will be exceptions to this new rule - that 385a=10 means L260+.  Exceptions will be found eventually, of course, due to rare independent mutations from 11 to 10.
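
             The arithmetic above can be checked with a short Python sketch.  It assumes a two-sided Poisson interval, so that the upper limit for zero observed events at confidence c is -ln((1-c)/2);  that assumption reproduces the 3.5 quoted above.  The function names are hypothetical.

```python
import math


def poisson_upper_limit_zero(confidence: float) -> float:
    """Upper end of a two-sided Poisson interval when zero events were seen.

    Assumption: two-sided interval, so the upper tail is (1 - confidence) / 2.
    """
    return -math.log((1.0 - confidence) / 2.0)


def self_consistent_confidence(n_samples: int) -> float:
    """Solve c = 1 - upper_limit(c) / n_samples for c by bisection.

    Assumes n_samples is large enough that a solution above 0.5 exists.
    """
    low, high = 0.5, 1.0 - 1e-12
    for _ in range(200):
        mid = (low + high) / 2.0
        if 1.0 - poisson_upper_limit_zero(mid) / n_samples > mid:
            low = mid   # the confidence can still be raised
        else:
            high = mid
    return (low + high) / 2.0


upper = poisson_upper_limit_zero(0.94)
conf = self_consistent_confidence(59)
print(f"upper limit {upper:.2f}, confidence {100 * conf:.1f}%")
```

             The bisection simply looks for the confidence level at which the level used for the interval and the resulting (1 - limit/59) agree.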

             In the Polish Project, all 96 samples assigned to P type and all 15 samples assigned to P Borderline have the value 10 for 385a.  There are 89 samples assigned to N and N Borderline.  Only 7 of these have the value 12 for 385a;  the other 77 have the value 11.  In this case, predicting P type based on 385a=10, zero exceptions out of 200 samples, provides 97.8% confidence.

             I postulate that 385a has only a slightly higher mutation rate in the N branch, at value 11.  I postulate that those 7 N branch samples with 385a=12 belong to 2 or 3 subtypes in the N branch, 2 or 3 independent instances of a mutation from 385a=11 to 12.  Most of these belong to a hypothetical Ncm type.  The data is not sufficient yet to provide statistical evidence along these lines.

             385a does not work quite that well for discriminating P type from all of R1a.  Among the 91 M458- samples not tested for L260 there are 2 with 385a=9 and 4 with 385a=10.  None of those are expected to be L260+ because L260 is a subhaplogroup of M458.  The 385a marker is still the best single marker for extracting P type from a full R1a database, including M458- samples from outside the M458 (P+N) haplogroup.  However, in this case, using 2 or 3 markers works better, and of course the definitions (46 markers for P, 45 markers for N) work much better than any short signature.

             A few samples with 385=(10,10) represent a hypothetical subtype within P.  I call this Pk.  I’ll discuss it more if and when there are enough samples for statistical significance.

             Other signatures.  Table 3 of my publication provides other signature markers.  DYS572=12 continues to be 2nd best for P type.  DYS537 continues to be best for N type.

             My R1a page has a handy 3 marker signature table.  I announced this more than a year ago, as a handy prediction signature for the dominant types in R1a, using only the first 25 markers most common on the internet.  It still works well.  That signature uses (385a, 439, 447).  The values for P type (L260+) are (10,10,23).  The values for N type (M458+ L260-) are (11,11,23).  The values for K type (M458-) are (11,10,24).
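
             As an illustration, the 3 marker signature can be written as a small lookup in Python.  This is only a sketch:  the function name is hypothetical, the labels are the SNP statuses quoted above, and "closest by sum of absolute differences" is my assumption for samples that do not match any signature exactly.

```python
# (385a, 439, 447) values quoted in the text, keyed by SNP status.
SIGNATURES = {
    "L260+": (10, 10, 23),
    "M458+ L260-": (11, 11, 23),
    "M458-": (11, 10, 24),
}


def predict_from_signature(dys385a: int, dys439: int, dys447: int) -> list[str]:
    """Return the label(s) whose signature is closest to the sample.

    More than one label is returned when the sample is equally close
    to several signatures, which means the signature cannot decide.
    """
    sample = (dys385a, dys439, dys447)

    def distance(label: str) -> int:
        return sum(abs(s - m) for s, m in zip(sample, SIGNATURES[label]))

    best = min(distance(label) for label in SIGNATURES)
    return [label for label in SIGNATURES if distance(label) == best]


print(predict_from_signature(10, 10, 23))
print(predict_from_signature(11, 11, 24))
```

             The second call is a sample that sits one step from two signatures, so both labels come back;  a short signature is a quick hint, not a substitute for the full definitions.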


37 Marker Network


             Lawrence Mayka (independently, March 2007) constructed a “median joining network” Network for the 37 marker samples of the Polish Project.  This network supports the definitions of the P & N clusters, and of the A subcluster.  The P cluster is the left side of Mayka’s network;  N is the top branch, and A is a small branch on the lower right.


Russian Sites


             29 March 2010 correspondence:  I mentioned Russian sites for R1a clusters in my publication.  It’s not easy for me to figure out which of those clusters correspond to my types.  Mayka worked out a correspondence on 29 March, warning me that the correspondence is not exact.  Some of the Russian clusters are broader than my types;  some are narrower.  Here are Mayka’s findings:

My Type code vs Russian cluster name:

A    Ashkenazi Jewish
B    Western Eurasian
C    Old European
D    Baltic - Carpathian
E    Northern Eurasian
F    Central Eurasian
G    Northern European
H    Western Carpathian
I    Northern Carpathian
N    Central European
P    Western Slavic

             19 Sep 2010 update:  A nice tree display of the Russian subdivision of R1a is at www.r1a.org.  Robert Sliwinski brought this site to my attention.

             My opinion:  R1a cannot be highly subdivided with confidence based on STR data.  This web site of mine is dedicated to estimating the confidence of each type that I study.  I try to indicate which types are speculative.  Even for the types with high confidence, the location of the nodes in the R1a tree will be uncertain until corresponding SNPs are discovered.  These Russian clusters, apparently by Klyosov, have plus / minus values for accuracy of TMRCA ages that are far too small, because there are serious caveats associated with systematic statistical uncertainties.


My Mountain Method

             Edit 13 Oct 2015.

             Here is a summary of terms (in boldface) that I defined for my “Mountains in Haplospace” method.  For more explanation, see my publication.  By haplospace I mean multidimensional sets of STR values;  each haplotype is a point in haplospace.

             This topic is about STR analysis, but restricted to Y-DNA genetic genealogy.  For a more general introduction consider wiki STR Analysis.

             Men submit their Y-DNA data to various web sites.  There are lots of STR data available on the web.  A cluster is a set of samples with similar STR values.  Men are divided into STR clusters as hypothetical subdivisions of the haplogroups, based on similarities of STR values.  All such clusters are hypothetical.  Some will be validated by new SNP discoveries.  There are various statistical methods for estimating the confidence of STR clusters.  I published a method that I developed.  That publication has references to other methods.

             A cluster qualifies as a type if the graph of step frequency (number of samples at that step) vs step looks like an isolated mountain.  The step is the genetic distance (mutation count) from the modal haplotype of the cluster.  I use the method of Ysearch to calculate step.  The cutoff is the next step just beyond the mountain.  A good type has low step frequency in a “gap” of step values including the cutoff (only the cutoff for a gap of 1).  In other words, the cluster forms a mountain at step values less than the cutoff, separated by a gap from the rest of the database (the parent haplogroup usually) at higher step numbers.
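
             Here is a minimal Python sketch of the step frequency idea.  It assumes step is the plain sum of absolute differences per marker (the Ysearch method may treat some markers differently), and it uses a made-up 4 marker data set.  The function names and the max_gap_count parameter are hypothetical.

```python
from collections import Counter


def step(haplotype, modal):
    """Genetic distance as the sum of absolute differences per marker.

    Assumption: a simple stepwise count, not necessarily the Ysearch method.
    """
    return sum(abs(a - b) for a, b in zip(haplotype, modal))


def step_frequencies(samples, modal):
    """Number of samples at each step from the modal haplotype."""
    return Counter(step(h, modal) for h in samples)


def find_cutoff(frequencies, max_gap_count=0):
    """First step past the peak whose frequency is at most max_gap_count."""
    peak = max(frequencies, key=frequencies.get)
    s = peak + 1
    while frequencies.get(s, 0) > max_gap_count:
        s += 1
    return s


# Made-up data: six samples near the modal haplotype, two far away.
modal = (10, 10, 23, 12)
samples = [
    (10, 10, 23, 12), (10, 10, 23, 12),
    (10, 11, 23, 12), (10, 10, 24, 12), (10, 10, 23, 13),
    (11, 10, 23, 13),
    (12, 12, 25, 13), (13, 12, 25, 14),
]
freq = step_frequencies(samples, modal)
cutoff = find_cutoff(freq)
in_type = sum(count for s, count in freq.items() if s < cutoff)
print(sorted(freq.items()), "cutoff:", cutoff, "samples in type:", in_type)
```

             In this toy data the mountain sits at steps 0 to 2, the gap starts at step 3, and the two distant samples stand in for the rest of the database.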

             The Statistical Background Percent (SBP) is an objective measure of the quality of the type.  Low SBP is taken as evidence that a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP (yet to be discovered).  Larger types with lower gaps have lower SBP.  SBP is intended as an estimate of the background percent of samples in a type that really do not belong to the corresponding hypothetical clade.

             SBP is increased to account for the estimated probability of outliers from other clades.  An outlier is a sample that has very unusual STR values due to the luck of mutations.  SBP is also increased to account for the estimated probability of small foreign clades that just happen to have the same STR values but are not closely related to the type.  The SBP is also increased to provide the rough equivalent of the maximum in a confidence interval.  Small sample counts have wide confidence intervals.  So larger types (more samples) automatically get lower SBP.  For a valid clade, SBP should decrease with time as data accumulates in a database.  A very well isolated clade will have a low SBP even with only a few samples.

             SBP < 5% is very rare - a very well isolated type, very likely to be a clade.  SBP < 25% is good enough to be announced on the web.  SBP > 25% is a cluster worth watching as data accumulates with time, although I avoid using the word type for SBP > 25%.  SBP > 50% is not statistically meaningful although such clusters might improve as data accumulates.  The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look like mountains.

             The number of markers in the definition should be chosen to provide as small an SBP as possible;  my Excel tools provide automatic rank of markers as an aid;  human judgment can be used to include or exclude markers with obvious problems.  A signature is a small set of markers that rank best, convenient for publication of a type, and for simple demonstration of the correlation of STR values.

             I use the word “type” to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes that differ from the modal haplotype by step less than the cutoff.  The definition of a type is the modal haplotype plus cutoff.  The definition uses only those STR markers that provide the lowest SBP, but the definition uses as many STR markers as possible if there is a tie.  The definition of a valid type may change slightly as data accumulates.

             I use the word clade in general, meaning a Y-DNA clade that may or may not be a defined official haplogroup.  All types have associated hypothetical clades, but most clades cannot be isolated as types with low SBP.  The modal value for a marker is the most common value in the cluster.  The modal haplotype is the set of most common values, usually the most common haplotype in a cluster.  Many people use the adjective “modal” as a noun, meaning “modal haplotype”;  so do I;  I tried to avoid that in this web document.
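
             The modal haplotype is easy to compute.  Here is a short Python sketch with made-up 3 marker data;  the function name and the tie rule (first value seen wins) are my assumptions for illustration.

```python
from collections import Counter


def modal_haplotype(samples):
    """Most common value at each marker position across the samples.

    Ties go to the value seen first; that tie rule is an assumption.
    """
    columns = zip(*samples)  # one tuple of values per marker
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)


# The usual case: the modal haplotype is also the most common haplotype.
common = [(10, 10, 23), (10, 10, 23), (10, 11, 23), (11, 10, 23)]
print(modal_haplotype(common))

# It is built marker by marker, so it need not match any real sample.
no_match = [(10, 11, 23), (11, 10, 23), (10, 10, 24)]
print(modal_haplotype(no_match), modal_haplotype(no_match) in no_match)
```

             The second data set shows why I say “usually”:  the set of most common values can be a haplotype that no sample actually has.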

             Most of my types have been validated in the past few years by newly discovered SNPs that seem equivalent.  I say “seem equivalent” instead of “equal” because of two distinctions between types and haplogroups:

             Outliers:  First, types are defined by STR correlations, and outliers are expected because of the random luck of STR mutations.  Even if a type is very well isolated, with all type samples coming out positive for a newly discovered SNP, and with no samples positive for that SNP fitting the type, eventually outliers are bound to show up as new samples provide more STR and SNP data.  We expect to eventually discover a few outliers that do not fit the STR type but are positive for that SNP, and we expect to eventually discover outliers that are negative for that SNP but fit the STR type.

             Better Equivalents:  Second, as new branches are discovered, it is possible a better equivalent SNP might be discovered, slightly younger than the original equivalent, leaving out one or more of the oldest branches of that original equivalent haplogroup, where those oldest branches might have mostly original outliers that do not fit the type.  Conversely, a better older equivalent SNP might be discovered, including a few branches with nodes older than that original node, where those older branches might have most of the original outliers that fit the type.

             The rest of this topic provides discussions and more definitions that are not part of my Mountain Method.  These are discussions and terms that I use often, so I provide them here for easy link reference from my web pages.  Some of these terms are not common in genetic genealogy.  Some of these I do not recall seeing used in genetic genealogy documents at all, so they might be my inventions, although I suppose other writers may have used these terms with similar meaning.  Some of these are common but I do not use them often.

             A bimodal marker has a second STR value with many samples - more than expected statistically - in addition to the most common modal value.  A multimodal marker is possible if there are more than two common values for the marker and if those common values are not distributed more or less symmetrically on both sides of the most common value.  (A Bessel distribution is statistically expected for a low fraction of random independent mutations at an STR marker.  A Bessel distribution is close to a Gaussian distribution for a high fraction of independent mutations.  A Bessel for a low fraction looks like a tent;  a Gaussian looks like a bell.)  Step up mutations are more common than step down for short STRs, so for example a modal 8 plus a few more 9 values than 7’s does not necessarily mean the 9’s are statistically significant;  experience helps to judge.  RecLOH and other issues at compound markers also cause confusion in this regard.  A bimodal marker is a hint that there may be a clade associated with that 2nd value, so genetic genealogists study clusters defined by one or a few STRs with such bimodal 2nd values.  The main modal value also sometimes makes a good signature at a bimodal marker.  In other words, a set of values using one or more bimodal or multimodal markers makes a good signature for a hypothetical cluster.

             In the past, I have sometimes called such clusters hypothetical types.  I now prefer to reserve the word type for < 20% SBP, which Mayka and I take as evidence for 80% confidence that more than 80% of the samples belong to a clade that will someday be confirmed as a haplogroup by a newly discovered SNP.  Sometimes we make exceptions above 20%, for example when a cluster is regionally concentrated, or associated with an ethnic group.

             I had sometimes used “bimodal marker” for that second STR value, but I try to avoid that confusion.  It’s the STR marker that is bimodal, with two common values.

             There is no known way to calculate the % confidence that a cluster corresponds to a clade, but an experienced genetic genealogist can roughly estimate confidence based on experience.  I developed SBP so that 100% minus SBP expresses my confidence, but only for clusters with less than 30% SBP;  SBP breaks down around 50%.  I avoid publishing clusters in which I estimate less than 50% confidence, although I may mention some as speculative.

             Not all Y-DNA STR data separates into types because the distribution of STR values tends to be continuous.  A type corresponds to a clade that experienced a population bottleneck - isolation or migration or very rapid population growth.

             A main branch of the Y-DNA tree is old, with data on the web for thousands of samples belonging, and with many known further branching divisions.  I like to use the word twig for a small young branch of the Y-DNA tree.  A terminal branch is a smallest known division of the tree;  a terminal branch might be a terminal haplogroup, or a subdivision of a terminal haplogroup - a type or a hypothetical cluster.  A terminal branch at one web site might not exist at another web site;  a terminal branch might be very small (one or only a few samples) or very large (many samples).

             Age (often years before present, ybp):  By definition, the TMRCA (Time of the Most Recent Common Ancestor) corresponds to the age of a node, or branching point, in the Y-DNA tree.

             (Sometimes more than two branches split off at one node, but we expect future SNPs might usually resolve that node into multiple nodes, each branching into two branches.)

             Some genetic genealogists use TMRCA as the age of the corresponding haplogroup (or type, or cluster, or branch or clade).  I often do;  it’s usually good enough.  But there is a technicality that causes confusion:

             An SNP is probably older than the TMRCA of the haplogroup it defines, because there are usually many generations between old nodes, due to the statistical pruning of the Y-DNA tree (discussed above in the definition of segments of tree branches).  The probability is very low that an old SNP mutation happened in exactly the same generation as the TMRCA.  (An exception would be a recent private SNP found in an extended male line family.)

              Conceptually, we might prefer to consider the age of the SNP as the age of the corresponding haplogroup.  But there are usually multiple phyloequivalent SNPs for a haplogroup, and of course, they differ in age.  Methods to estimate TMRCA age do not provide distinct ages for all those SNPs.

             A third stipulation of the age of a haplogroup might be the age of the previous known node.  Then a haplogroup would include the male descendants of a MRCA plus his male ancestors in the immediately older known segment, but this is opposed to the traditional idea of a Y haplogroup being the male clade of descendants of the man who experienced the mutation for the corresponding SNP.

             The Yfull Tree solves this confusion by using two ages, the TMRCA and the older “formed” age, which is the TMRCA of the previous known node.  Yfull estimates age by analysis of the number of SNPs per segment.  Before about 2013 or so, most Y clade age estimates were based on STR distributions, and assumed STR mutation rates.  Any method of age estimation has serious caveats.  Most of my xls on-line STR analysis files have a sheet that estimates age from STRs in various ways, but I’m not including that sheet in current analysis because the Yfull site does an adequate job of estimating age.

             I call the segments between nodes smooth branches, where there are no known nodes in that segment of the Y-DNA tree.  A long smooth branch in the Y-DNA tree is one way to visualize isolation in haplospace.  A type, because it is isolated, probably has a long smooth branch older than the MRCA of the type, with more than the usual number of phyloequivalent SNPs.  A smooth branch is necessarily a statistical estimate, because the number of SNPs is influenced by the luck of statistics.

             In addition, there may be small branches with living men who have not registered Y-DNA data on the web.  So a “smooth” segment really includes the possibility of small twigs along that apparently smooth segment.  The metaphor of a tree is appropriate, because a large branch segment with very few twigs looks smooth from a distance.  A Y-DNA branch can be smooth in one database (like the Polish Project) and not smooth in a larger database (like Ysearch, if significant branches in that segment are rare or absent in Poland).  This discussion applies to age of hypothetical STR clusters, but with lower confidence.


Family Sets;  Recruitment Bias;  Statistics on Frequency

             Edited 23 Jan 2015:

             Sometimes one individual recruits male line relatives to submit data to a Y-DNA database, for example to the Polish Project.  I call these family sets.  I count these together as one sample when compiling statistics on frequency.  By statistics on frequency I mean the number of samples per clade.  By clade I mean a haplogroup or type or cluster.  I do this adjustment for family sets because otherwise a small clade might get reported as too large.

             My Results page is an example of statistics on frequency where I adjust for such recruitment bias.

             I do not discourage such recruitment;  it is a great research technique.  I recruited my third cousin.  I don’t mind the effort of adjusting for such recruitment bias.

             My adjustment method:  I sort databases by name, and automatically flag name repetitions.  Then I examine the flagged data to see if the STR data is a very close match, which is a sign of recruitment.  Often I make contact by email when it is not obvious if the samples have been recruited.  Actually, even with email discussion, the actual correction may not be obvious;  for example it may be difficult to say if a particular recruited distant relative may have later joined the project anyway independently, in which case he should be counted.  So I may estimate 2 or more effective “independent” samples for some family sets.
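
             The automatic part of this adjustment can be sketched in Python as follows.  This is only the flagging step, not the judgment step:  the names, the sample data, and the max_step threshold for a “very close match” are all hypothetical.

```python
from collections import defaultdict


def step(a, b):
    """Sum of absolute differences per marker (an assumed distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def count_independent(samples, max_step=2):
    """Count samples, treating same-surname close STR matches as one.

    samples: list of (surname, haplotype) pairs.
    max_step: hypothetical threshold for a "very close match".
    """
    by_name = defaultdict(list)
    for surname, haplotype in samples:
        by_name[surname.strip().lower()].append(haplotype)

    total = 0
    for haplotypes in by_name.values():
        representatives = []  # one haplotype per presumed family set
        for h in haplotypes:
            if not any(step(h, r) <= max_step for r in representatives):
                representatives.append(h)
        total += len(representatives)
    return total


# Made-up example: three close matches share a surname, so they count once.
project = [
    ("Kowalski", (10, 10, 23)),
    ("Kowalski", (10, 10, 23)),
    ("kowalski", (10, 11, 23)),
    ("Kowalski", (13, 12, 25)),   # same name, different male line
    ("Nowak", (10, 10, 23)),      # close match, different name: not flagged here
]
print(count_independent(project))
```

             A count like this is only a starting point, because the flagged sets still need to be examined by hand.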

             I also sort data by close STR matches and look for evidence of recruitment.  Recruitment can also be by close STR matches even with different family names.  For example, I recruited a man with a last name different than mine, where his daughter noticed that his 12 marker data (at ancestry.com) matched my 12 marker data (at familytreedna.com).  I paid for his 111 marker data which I submitted to multiple projects.  We match STRs very closely at 111.  I determined that his male line ancestors lived in Poland only 10 miles away from the village where my ancestors lived.  We are obviously distant male line relatives.  I don’t count him in frequency data because I recruited him.

             I also sort by email address, again looking for samples with very close STR matches.  Most samples that have the same email address are not in the same main haplogroup branch, even when the family name matches, which means most recruitment by family name brings in samples from different male lines.  I do not adjust for these, because I assume the recruited samples fall randomly into haplogroups according to frequency in Poland, so such recruited data is OK.  It is difficult to judge what to do when a pair of recruited samples are in different terminal branches that branch from a common larger branch.

             Sequential kit numbers, or nearly sequential, are additional evidence of recruitment.

             I’m not trying to make perfect adjustments.  I’m mainly trying to catch all the large family sets.  I don’t bother people with emails about sample pairs that may or may not be due to recruitment;  I make my own judgment.  If I miss a few pairs, or if I discount a few pairs that are really independent, that just adds a little noise to the frequency data.


Age of SNPs

             New Topic 29 Mar 2015:

             The Yfull Tree includes ages of SNP nodes calculated from the number of SNPs in segments between nodes, using an average SNP mutation rate.  These serve as reasonable estimates for the ages of the corresponding haplogroups, although there are caveats, next topic.  Yfull calibrates SNP mutation rate to a very old haplogroup, of assumed age.  Yfull does not document the details of their method;  if they mix SNPs found by different methods (as most people do) that would introduce an inconsistency.  These Yfull ages are consistent insofar as they are calculated from SNPs as found by a single consistent method, although there still are caveats, next topic.  Other methods may come up with different numbers of SNPs, and different rates, so the Yfull ages may not be the same as those calculated by others.


Age Caveats

             Rewrite 29 Mar 2015.

             Ages can be calculated using either STRs or SNPs.  In either case, an average mutation rate is used to calculate the age from the observed mutations.

             There are several biases involved, and I mention some of them in this topic about age caveats.  I don’t emphasize ages in my web pages because of the uncertainties, but I occasionally discuss rough age calculations.

             With STRs, people generally use ASD (to account for back mutations) calculated for each marker, then average the markers.  I provide “ASD” sheets in my type files with a simple ASD method for calculating age, but again, I consider this a rough approximation.  I provide an introduction to STR age calculations in my Fall 2009 Publication.
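
             For readers who like to see the arithmetic, here is a rough Python sketch of an ASD age calculation.  It assumes a symmetric single-step mutation model, for which the expected squared distance from the ancestral value grows as rate times generations.  The mutation rates, the 30 years per generation, the sample data, and the choice to average the per-marker ages are all illustrative assumptions, not the contents of my “ASD” sheets.

```python
def asd_age_years(samples, ancestral, rates_per_generation, years_per_generation=30.0):
    """Rough age from average squared distance (ASD), marker by marker.

    Under a symmetric single-step model, ASD / rate estimates the number
    of generations for each marker; the markers are then averaged.
    """
    n = len(samples)
    per_marker_generations = []
    for i, rate in enumerate(rates_per_generation):
        asd = sum((h[i] - ancestral[i]) ** 2 for h in samples) / n
        per_marker_generations.append(asd / rate)
    generations = sum(per_marker_generations) / len(per_marker_generations)
    return generations * years_per_generation


# Made-up data: 4 samples, 2 markers, modal haplotype taken as ancestral.
ancestral = (10, 23)
samples = [(10, 23), (11, 23), (10, 24), (10, 23)]
rates = (0.0025, 0.0025)   # hypothetical mutations per generation
age = asd_age_years(samples, ancestral, rates)
print(f"rough age: {age:.0f} years")
```

             Squaring the distance is what compensates for back mutations:  the average squared distance keeps growing in proportion to time even when some mutations cancel each other.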

             Some publications use a mutation rate from father-son data.  This method yields too high a rate insofar as somatic mutations (mutations in the cell lines leading to the test - cheek cells, for example) are included, so the calculated ages are too young.

             Chandler published a method for accurately determining relative STR mutation rates, and calibrated the first 37 standard STR markers, to father-son data.  Extensions of Chandler’s 37 to more markers are available on-line, but without explanation.

             Mutation rates can be calibrated to an old haplogroup.  There are still remaining caveats, including the uncertainty in the age of that old haplogroup.

             With a fixed mutation rate, older haplogroups come out too young, or younger haplogroups come out too old.  This is due to the structure introduced by the “pruning” of haplogroup branches that go extinct.  To compensate, some people use adjustment factors for older haplogroups;  some people use calculation algorithms that are mathematically equivalent to a mutation rate that decreases with age.

             Almost all DNA damage is repaired by various cellular mechanisms, so the “mutation rate” measures only the damage that is not repaired.  Repair mechanisms vary from person to person due to variations in minor damage to the repair mechanisms.  An ancestral line mutation rate depends on the probability of a few ancestors with much higher than normal mutation rate, so mutation rate varies more between ancestral lines than due to simple random number statistics.

             It is important to calculate the +/- confidence range for data based on very few samples, using standard Poisson statistics for a small number of samples.  Many reports of age calculations include such confidence range.  For large samples, however, this is misleading.  The confidence range comes out small for large samples, with a small +/- on the age.  The various age caveats, however, provide a much larger uncertainty due to bias.  In other words, with a large sample, we have excellent confidence that another large sample, taken from the same population, will provide the same age result within a small confidence range.  However, insofar as any age calculation is more uncertain due to the biases introduced by the various caveats, those small confidence ranges are meaningless.

             I mention only some caveats here as examples.  There are more.  This web page is not the place for a review of all age caveats.  I have never seen an article with a thorough review of the caveats associated with age calculations of ancestral lines based on mutation rates.  I suppose genetic genealogists are rarely statistics experts, and I suppose statistics experts avoid publishing such articles, because the emphasis would be that most genetic genealogy age calculations are not quite right.

             Summary:  There is no known way to account for all the caveats in calculating the age of Y-DNA nodes or haplogroups.  Nevertheless, rough age calculations can be interesting, as long as we realize the results are not very accurate.


Population Bottlenecks

             New Topic 30 Mar 2015:

             A “population bottleneck” means a reduction in population.

             Population bottlenecks generally reduce genetic variation, including STR variation.  Population bottlenecks are the primary reason for the “pruning of haplogroup branches” mentioned in the previous topic, leading to uncertainty in the calculation of age.  Population bottlenecks are the primary reason for the large number of equivalent SNPs for old nodes of the Y-DNA tree.

             Most male lines go extinct.  If you are not familiar with this statistical fact, let me provide an analogy:  In the card game “war”, a group of card players mostly lose cards, until one by one they lose all their cards and are eliminated from the game, until there is only one winner with all the cards.  In this analogy, the number of cards for one player represents the number of men in one haplogroup, and each card round represents one generation.  The human population, over long time periods in the past, lost male lines one by one, until there were fewer and fewer remaining Y-DNA haplogroups.  Going back far enough, there is only one MRCA and one haplogroup.

             These statistics go much faster in a smaller population.  A smaller population is not required for the statistics, but if there are population bottlenecks then the haplogroup pruning can be dominated by the bottlenecks.
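
             The loss of male lines can be illustrated with a small simulation.  This is only a sketch under a toy model that I am assuming here (constant population size, every man in a new generation picks his father at random from the previous generation);  the function name and the population sizes are hypothetical, chosen only for illustration.

```python
import random

def surviving_lines(population, generations, seed=1):
    """Count surviving founder male lines, generation by generation.

    Assumed toy model: constant population; every man in a new
    generation picks his father at random from the previous one.
    """
    rng = random.Random(seed)
    # Generation 0: every man founds his own line, labeled 0..population-1.
    lines = list(range(population))
    counts = [len(set(lines))]
    for _ in range(generations):
        # Each son inherits the line label of a randomly chosen father.
        lines = [rng.choice(lines) for _ in range(population)]
        counts.append(len(set(lines)))
    return counts

small = surviving_lines(population=50, generations=200)
large = surviving_lines(population=500, generations=200)
print("lines left, population 50: ", small[::50])
print("lines left, population 500:", large[::50])
```

             In this model the number of surviving lines can only go down, never up.  In runs like this the smaller population typically loses its lines much sooner than the larger one, which is the sense in which a bottleneck speeds up pruning.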

             Various selection processes can speed up pruning.  For example, if a small human tribe replaces a much larger population in a particular geographic region, then the effective population was the population of the original small tribe.

             Experts use the term “effective population” for analysis of any such selection method, with effective bottlenecks.

             We suppose the human population had bottlenecks in the past and we suppose much if not most of the Y-DNA tree pruning happened during bottlenecks.


Probability and Confidence


             Rewrite 21 Jan 2015.  Edited 6 Oct 2015.

             See Polish Project Assignments for a brief overall explanation of how assignments are done.

             The assignment guideline is minimum 80% probability and minimum 80% confidence for each individual sample.

             Probability and confidence can sometimes be calculated.

             Example:  Consider a situation where 10 samples match a type with an STR prediction method.  Suppose there is a definitive SNP test available, and 9 of those 10 samples test positive for the SNP, and 1 negative.  That means 9 of the 10 really belong to the haplogroup and that 1 mismatch must come from a different haplogroup that matched the STRs by the luck of mutations.  Next, consider a new sample that matches that same STR prediction method.  What is the confidence that the new sample will pass the SNP test for the haplogroup?  The probability is 90% because we know that 9 out of 10 previous samples like this matched the SNP.  However, 1 mismatch out of 10 is a very small sample.  As explained in my publication, I use Poisson statistics for quick calculation of confidence intervals.  Poisson statistics are simple to calculate in Excel.  My tool Type.xls has an “SBP” sheet with a set of cells for quick Poisson calculations.

             The 80% confidence interval for 1 observed mismatch is 0.11 to 3.89 mismatches, which is 1.1% to 38.9% out of 10.  Subtracting from 100%, the 80% confidence range for a match, based on 9 out of 10, is 61.1% to 98.9%.  The lower number, 61.1%, is below 80%, so the minimum confidence is lower than 80%, although the probability is 90%.
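
             For readers who prefer code to a spreadsheet, the interval can be reproduced with a short Python sketch.  This is not the “SBP” sheet of Type.xls;  it is an independent illustration that assumes the standard exact two-sided Poisson interval, solved by bisection, and the function names are hypothetical.  The exact lower limit for 1 mismatch is about 0.105, which rounds to the 0.11 quoted above.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson count X with the given mean."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total

def solve_decreasing(func, target, lo, hi):
    """Bisection: find x in [lo, hi] with func(x) == target, func decreasing."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_interval(count, confidence=0.80):
    """Exact two-sided interval for the Poisson mean, given an observed count."""
    tail = (1 - confidence) / 2
    top = count + 10 * math.sqrt(count + 1) + 10  # generous search ceiling
    if count == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= count) = tail, i.e. P(X <= count - 1) = 1 - tail.
        lower = solve_decreasing(lambda m: poisson_cdf(count - 1, m), 1 - tail, 0.0, top)
    # Upper limit: P(X <= count) = tail.
    upper = solve_decreasing(lambda m: poisson_cdf(count, m), tail, 0.0, top)
    return lower, upper

tested, mismatches = 10, 1
low, high = poisson_interval(mismatches)
match_low = 1 - high / tested    # pessimistic end of the match range
match_high = 1 - low / tested    # optimistic end of the match range
print(f"mismatches: {low:.2f} to {high:.2f}")
print(f"match range: {match_low:.1%} to {match_high:.1%}")
```

             The pessimistic end of the match range is the number that gets compared against the 80% guideline.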

             In other words, statisticians are not confident of probability calculated from too few samples.

             In this web page, I like to use a single number for confidence - the lower number of the confidence interval - matched to the confidence.  For example, by “>80% confidence” I mean an 80% confidence interval from 80% to a higher number.  That higher number of the interval is not important here.

             Confidence can be lowered by biases;  it is generally difficult to calculate corrections for biases.  An example is selection bias:  suppose we analyze 100 candidate clusters looking for STR correlations, and one of the 100 clusters has excellent correlation with a few selected STR markers;  we might be tempted to have 99% confidence in the validity, since this one is the best one of 100.  However, we will always find a best one when we study 100 candidate clusters.  Even if we generate data for 100 STR clusters on a computer generated database, using a random generation method with no real correlation, some of those 100 will look like valid clusters, and one of those 100 will look best.  In other words, if we do a lot of STR analysis, we are sure to turn up some false STR clusters.
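
             The selection bias can be seen in a small random experiment.  The sketch below is my own illustration, not part of my tools:  it assumes 100 candidate clusters of 20 samples each (both numbers hypothetical), where each sample “agrees” with a marker pattern by pure chance with probability 50%, so there is no real correlation anywhere in the data.

```python
import random

rng = random.Random(2015)
CLUSTERS = 100   # candidate clusters examined
SAMPLES = 20     # samples per candidate cluster (hypothetical)

def random_cluster_score():
    """Fraction of samples that agree with a marker pattern by pure chance."""
    hits = sum(rng.random() < 0.5 for _ in range(SAMPLES))
    return hits / SAMPLES

scores = sorted(random_cluster_score() for _ in range(CLUSTERS))
typical = scores[CLUSTERS // 2]   # a middle-of-the-pack cluster
best = scores[-1]                 # the one we would be tempted to report
print(f"typical cluster: {typical:.0%} agreement, best of {CLUSTERS}: {best:.0%}")
```

             The best of the 100 random clusters usually shows noticeably higher agreement than a typical one, even though none of them is real.  Picking the best candidate after the fact is what makes the apparent confidence too high.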

             My publication has more detailed discussion of my statistical method for types.  Briefly, I use SBP as a measure of confidence for predicting types based on STRs, including an adjustment for estimated biases.

             It gets more complicated:  Let’s go back to that 9 out of 10 SNP matching example above.  When matching samples to a type, each sample has a step number (number of STR mutations from the modal, also called genetic distance).  Now consider that new sample that has not yet been SNP tested:  If that new sample has a low step, lower than most of those 9 validated samples, the probability and confidence can be estimated as higher than 90%.  Conversely, a new sample at relatively high step can be estimated to have probability and confidence much lower than 80%.  To calculate probability and confidence vs step we need several samples in that type with the same step that have already been SNP tested;  but such data is almost never available.

             One more simple example:  If a new sample without SNP data matches STRs very closely to a previous sample with SNP tests, there is high probability and confidence the new sample will test positive for the same SNPs.  In fact, if the family name is the same the two samples may have come from relatives.  This simple observation is difficult to quantify, but there is a simple qualitative test:  With an STR database, it is easy to compare one new sample to the entire database, and sort for closest matches.  If 3 or 4 of the closest matches test positive for a particular SNP, and none of the other closest matches test negative for that SNP, we conclude that new sample can be predicted into the haplogroup for that SNP with high confidence.
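
             The qualitative test above can be sketched in a few lines of Python.  Everything here is hypothetical:  the 6-marker haplotypes, the sample names, and the rule “at least 3 SNP-tested samples among the 4 closest, none negative” are made up for illustration.  The step is taken as a plain sum of absolute STR differences, which ignores the compound marker issues (recLOH, DYS389) that my Calculator.xls tool documents.

```python
def step(a, b):
    """Genetic distance: sum of absolute STR differences, marker by marker."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical 6-marker haplotypes; SNP result is True, False, or None (untested).
database = [
    ("s1", (13, 25, 16, 10, 11, 14), True),
    ("s2", (13, 25, 16, 11, 11, 14), True),
    ("s3", (13, 25, 17, 10, 11, 14), True),
    ("s4", (13, 24, 15, 11, 12, 14), False),
    ("s5", (14, 23, 15, 11, 12, 13), False),
    ("s6", (13, 25, 16, 10, 11, 15), None),
]

def predict(sample, db, closest=4):
    """True if the closest matches support predicting the SNP as positive."""
    ranked = sorted(db, key=lambda row: step(sample, row[1]))
    # Keep only the SNP-tested samples among the closest matches.
    near = [snp for _, _, snp in ranked[:closest] if snp is not None]
    return len(near) >= 3 and all(near)

new_sample = (13, 25, 16, 10, 11, 14)
other_sample = (13, 24, 15, 11, 12, 14)
print(predict(new_sample, database))     # closest tested matches are all positive
print(predict(other_sample, database))   # closest matches include negatives
```

             This only automates the sorting and counting;  the judgment about how many close matches are enough still comes from experience.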

             The gist of all this discussion:  We STR predictors occasionally calculate probabilities and confidence, but usually we estimate based on experience.  The verification comes with time:  as some of our predictions later get shown to be wrong we tend to get tighter with future estimates;  if almost all our predictions get verified we may loosen up a bit.

             A person who assigns samples to haplogroups based on STR values acts like a bookie who provides advance estimates for gambling odds, using a combination of calculations, experience, educated guesses, and intuition.  A bookie’s estimates are usually tested by reality very quickly.  Probabilities of an STR estimator used to require years for verification or refutation by a newly discovered SNP;  the time now is months.

             You need to be skeptical of STR based predictions.  In the past, a number of STR based assignments have been shown wrong by new SNP discoveries.  This long web document is provided so you can read as much as you wish about our methods in the Polish Project, judging for yourself the reliability of our assignments.


Validation Comments


             Update 21 Jan 2015:

             I look forward to the discovery of SNPs validating more than 80%, probably more than 90%, of my type predictions.

             I introduced P, N, and K in the Fall of 2007, publishing this web page 6 Dec of that year.  I did not predict that P and N were brother clades, in fact it looked to me like P was closer to K.  I did not make predictions about the P, N, K structure because the statistics did not justify such predictions.  I assigned samples to P and N with 80% probability, remarking that my overall confidence that P and N were valid (confidence at step zero) was 95% in 2008.  I stated my overall confidence in the subtypes of K type as only 80%, and without high confidence that the various subtypes of K actually belonged to a single unique K clade.

             P type has been validated by the SNP L260.

             N type has been validated by the SNP CTS11962.

             K, which never qualified as a type, represents the R1a modal haplotype.  Today it is clear that K is not a clade.  Many of my predicted subtypes of K have been verified by SNPs, and so far none of them have been shown to be invalid.

             In Fall 2007 I also introduced R (Remainder) as the 4th division of Polish R1a, for those samples that do not fit P, N, or K.  R was never intended as a clade.  The R category is no longer used because there are many branches of R1a known today, so that each sample with sufficient STRs can be confidently predicted into a branch.

             This topic uses R1a as an example, but the same discussion applies to other haplogroup assignments.


References and Sources


             My publications have several references of general interest and relevance to my web documents.

             My Tools and data for STR analysis are Excel files.  These are available at the JoGG publication site as Supplementary Data:  www.jogg.info/52/files/cpcindex.htm.

             Polish Clades Update.  This folder is for update of Tools and for new data:  www.gwozdz.org/PolishCladesUpdate

             Pawlowski (2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish).  This reference is listed in my publications.  I specifically mention it here because this is where I originally found the common Polish haplotype that I now call P type.  Link to English abstract:  Pawlowski 2002.

             Lawrence Mayka is the Administrator of the Polish Project.  Larry helped me to get started when I was new to genetic genealogy, providing helpful criticism & suggestions.  He reviews and makes suggestions regarding this Polish Project web page of mine.  He also reviewed the original drafts of my publications.  A number of my types were originally suggested to me as STR clusters by Larry.  Larry continues to provide data for this web page.  Many of my references to other websites in this document were suggested to me by Larry.

             Paul Stone is an administrator of the Polish Project, with emphasis on the I1 haplogroup.

             Cyndi Rutledge is the administrator of the R1aY-Haplogroup Project.  Larry and Cyndi had been sending me M458 test results when that SNP was new.  SNP results are now listed at project web pages.

             Lukasz Lapinski is an administrator of the R1a Project. 

             Anatole Klyosov published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of publications.  Some of the STR types that I independently discovered I later found as 25 marker modal haplotypes in Klyosov’s web documents (before his  publication in JoGG - some in Russian).  It was encouraging to me seeing independent identification of clusters by different methods.  He emailed to me an English version of one of his 2008 publications.  His Fall JoGG articles have references to his other publications.  Here is a web link:  Klyosov Home.

             Russian web sites:  Semargl,  http://www.r1a.org;  http://www.rodstvo.ru;  http://dnatree.ru/;  http://molgen.org/.  These have been active analyzing R1a, brought to my attention by others, particularly by Mayka, who worked out a correlation with my types.  These sites clearly have proposed subdivisions of R1a based on STR data, but I cannot quickly understand these due to the language barrier.  Klyosov seems to be active at these sites.  The sites make use of the FTDNA projects and Ysearch.

             Kenneth Nordtvedt published an article about calculating TMRCA in the Fall 2008 issue of JoGG.  His Excel files of data and tools are available at his web site.  Ken has been active in web discussions, suggesting many STR based clusters.

             FTDNA link:  www.familytreedna.com.  This is a commercial DNA testing company.  I make extensive use of the project databases maintained by FTDNA.  These are my primary sources of data.  Click on the “Projects” tab at the home page to look for projects.  Also, the project name can be substituted for /polish/ in the Polish Project link, below.  I do not work for FTDNA;  many other companies offer DNA tests;  I recommend FTDNA because I like the convenience of most DNA data being available at the projects, particularly the Polish Project.

             WTY.  “Walk Through the Y”.  This is an obsolete commercial product by FTDNA, for reading more than 200,000 base pairs of your Y chromosome, in a search for new SNPs in your branch of the Y-DNA tree.  You can read about my WTY at another of my web pages.  WTY has been replaced by Big Y.

             Big Y:  Replacement for WTY.  Discussion of Big Y.

             Polish Project link:  https://www.familytreedna.com/groups/polish/about.  One of many FTDNA projects.  This is my primary source for Polish data.  The Polish Project tracks both Y-DNA and mtDNA.  The Y-DNA STR data that I use is at https://www.familytreedna.com/public/polish?iframe=yresults.  The Y-DNA SNP data is at https://www.familytreedna.com/public/polish?iframe=ysnp.

             R1a Project link:  https://www.familytreedna.com/groups/r-1a/about.  Newer R1a project, with multiple co-administrators, active in subdividing R1a data into hypothetical haplogroups.  The project home page has a summary chart of R1a SNP subdivision, and other reference links.

             R1aY-Haplogroup Project link:  www.familytreedna.com/public/R1aY-Haplogroup.  Original R1a project.  Has not been updated for a few years.

             Ysearch link:  www.ysearch.org.  Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services.  From the FTDNA site, you can register your data with Ysearch.  Or you can type your Y-STR data into Ysearch.  I have Instructions for use of Ysearch.  I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch.  I am not associated with the company FTDNA.

             Yfull link:  http://www.yfull.com/.  Yfull SNP tree:  http://www.yfull.com/tree/

             Yhrd link:  www.yhrd.org.  A forensic Y-DNA database.  Data is separated by city, with many Polish cities.  I relied on Yhrd to figure out the geography of the various haplotypes.

             Semargl.  R1a site by Vladimir Tangankin.  R1a tree in pie chart format using 111 marker data, Oct 2012:  http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009tmb700.png

Blowup:  http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009.png

             Sorenson link:  http://www.smgf.org/.  Another DNA testing company.

             ISOGG link:  http://isogg.org/tree/  Y-DNA tree SNPs and corresponding alphanumeric codes for the haplogroups.

             FTDNA Draft Tree link:  http://ytree.ftdna.com/index.php?name=Draft  Another Y-DNA tree with SNPs, but not updated in more than a year.

             recLOH:  A technical detail discussed in many publications, for example http://en.wikipedia.org/wiki/RecLOH.  I discuss this and other compound marker issues, and how step is calculated, in the “Documentation” sheet for my Calculator.xls tool.

             DYS389:  Another technical detail, also discussed on the web and in my Calculator.xls.  Briefly, 389II is the sum of 389I plus another STR, so 389II should be figured in terms of the delta value.

             Chandler mutation rates.  Mentioned in my publication.  From Chandler, Fall 2006 www.jogg.info, 37 markers.  67 marker extension on line at mutation rates.


Peter Gwozdz;  My Interest

Peter Gwozdz


             I’m a very rare type in Poland - E-L540.  My maternal 1st cousins are R1a1a.  That means my late maternal grandfather was R1a1a.  I became interested in Y-DNA in 2004.  My maternal family name is Iwanowicz.  I discovered a family with that name in my maternal grandfather’s home town in Poland.  They are the only Iwanowicz family within 50 miles, so I suspected they might be my 3rd or 4th cousins.  I brought a cheek swab kit when I visited them the second time in 2006.  Sure enough, the son was a perfect 25 STR marker match to my 1st cousin.  I didn’t get around to checking the web for a year.  I was shocked to discover that these maternal cousins matched 80 people in the FTDNA database, for a perfect 12 out of 12 STR markers.  That’s a hell of a lot of matches in the summer of 2007.  Most of these matches are Polish.  I did some research and found an article by Pawlowski (reference in my publication) about this most common Polish haplotype, which I now call P type.  That got me interested in doing more research, leading to this web page for others to see my results.  My experience, however, is a reminder that statistics can be misleading.  I was confident that my grandfather’s haplotype was P type, based on a perfect match at the first 12 markers.  In June 2010 I realized that the probability was really about 93%, because 13 out of the 14 then current Polish Project members who had 67 markers and who also matched P type perfectly at 12 markers were in fact P type as judged by all 67 markers.  My grandfather does not match P type at 67 markers.  My grandfather is that 14th one.  He matches the small clade that I named I type, after Iwanowicz.  I type has since been verified as haplogroup S18681, which is also concentrated in Poland.  That’s how an outsider ended up studying P type and R1a1a, and writing web pages and articles about common Polish Y-DNA clades.
This web page was originally called “R1a”;  it got so many hits from Poland that I eventually renamed it to include all common Y-DNA clades.


Revision History

 2007 Dec 6  First web posting of this file

2007 Dec  Two revisions

2008  8 revisions

2009 33 revisions

2010 36 revisions

2011 26 revisions

2012 18 revisions

2013 3 revisions

2014 7 revisions

2015 Jan to Mar 8 revisions

2015 Aug 29 Rewrite of Abstract and I-P type.  Also new I1 and I-L22 topics.  Also minor update edits.

2015 Sep 21 Update rewrite of Introduction

2015 Sep 22 Deleted topic “Polish  Project R1a Assignment News”, an old 2010 item

2015 Sep 22 Removed Ysearch instructions for matching to my types;  links also removed;  no longer recommended

2015 Sep 22 Update rewrite of R1a Abstract

2015 Sep 23 Delete some obsolete topics, including “R1a Confusion” and “L260 and M458 Test Results”

2015 Sep 24 Rename I-P to I1-P.  New topic I1-Z63

2015 Oct 1 More edit of “Introduction” and “My Mountain Method” topics

2015 Oct 6 More edit of “My Mountain Method”, complete rewrite of “Polish Project Assignments”, new topic “Big Y”, edit of “Probability and Confidence”

2015 Oct 7 More edit of “Introduction”, “Polish Project Assignments”, “Big Y”, and “Mountain Method”

2015 Oct 8 Proof read;  a few more edits

2015 Oct 12 Rewrite of “M458, L260, CTS11962”

2015 Oct 13 Rewrite of P type

2015 Oct 14 Rewrite of P type.  Rewrite of M458, L260, CTS11962

2015 Nov 13 Update of Stanaszek links, facebook & map