Y-DNA R1a Clades

22-Jul-10

Peter Gwozdz

pete2g2@comcast.net

 

            I have a similar web page for details about Y-DNA clades common in Poland:

                        http://www.gwozdz.org/PolishClades.html.

            My methods and results have been recently published.

            This web document was intended to be an analysis of the entire R1a haplogroup worldwide, with hypothetical division into types.  Sorry, I have not been finding the time to get this done.  I’ll be taking this page down in a few months when it becomes out of date.  Some of the information here may still be of interest, so I’m leaving this here for now.

            The identification of some R1a signatures is still valid, based on data from December 2009.

            My primary interest is Y-DNA clades that are common in Poland, so I continue to expand and update my PolishClades document, including lots of R1a information.  Much of the information in this R1a document has already been updated in that PolishClades document.

            If you are interested in R1a discussion about samples from places other than Eastern Europe, I’m sorry, but I’m not covering your interest.

 

          Comment added 14 May 2011:  This web document about R1a is almost a year out of date.  Lots more data is available now.  I’m leaving this here because many of the detailed results are still valid, even if not up to date.  Also, my other web pages use links to some of the background information here.  Over the next year I’ll move that background information to other web pages.  I plan to remove this page sometime during 2011.

 

L260 and M458 News

 

            See the L260 and M458 News topic at the top of my Polish Clade page.  The Polish Project R1a Assignment News has a discussion that applies to R1a men from anywhere.

 

R1a Table

 

            This topic was written in December 2009 with only minor edits since.

 

 

 

| Group | Code | Type | Size (% of R1a, Ysearch) | Comments |
|---|---|---|---|---|
| R1a1a* |  | K | 26% | Most Common |
| R1a1a* |  | L | 23% | Common in northwest Europe |
| R1a1a* | KB |  | 25% | K Borderline |
| R1a | R |  | 15% | Remainder |
| R1a1a7 |  | N | 10% | Common in Slavic Countries |
| R1a1a7 |  | P | 6% | Concentrated in Poland |
|  | U |  |  | Unassigned |

 

            This Table is an introductory overview.  A longer list is below.

            Those Types are my own code letters, for brevity.  Please do not confuse these code letters with official haplogroups.  I have been using such code letters for R1a assignments in the Polish Project for 2 years.  The color coding is for ease of comparison on my web pages.

            KB is not a statistically defined type.  The modal haplotype for K type is essentially the modal haplotype for R1a as a whole, so many R1a samples come out close to K type.  I call these “K Borderline”, KB, which is the main trunk of the R1a tree.

            The R samples (Remainder) are distant in STR values from all the types I have identified.  The split between KB and R is my arbitrary choice.  Neither KB nor R has statistical validation as a “type”.  So far none of these have M458+, so most of KB and R are surely R1a1a*, although R includes other rare groups.

            There is an unknown risk that a small R1a1a7 clade is hiding in the KB data, so continued testing of KB samples for the M458 mutation will have value.  A small foreign clade can theoretically be hiding within any STR defined type, no matter how statistically unlikely, so continued M458 testing improves confidence in all types.

            The percent sizes in this table are based on Ysearch samples with all 67 standard FTDNA markers.  Many samples with fewer markers can be assigned to types with statistical confidence.  I do that for the Polish Project.  I use U (Unassigned) for samples with fewer than 67 markers that cannot be assigned with 80% confidence.

            L, N, and P types are well defined statistically.  Statistical confidence intervals can be calculated for the size of these types, but those intervals are meaningless because Ysearch has much larger population bias.  No doubt L type is over represented in Ysearch and N type is under represented.  I’m pretty sure P type is over represented.

            In the next section I provide a method to match individual samples to my types.  A sample that matches well has high confidence of belonging to the type.  A sample near the cutoff of the definition has relatively lower confidence.

            The size of K type is statistically uncertain on Ysearch because I discovered it in Polish data.  I am confident of the validity of K type, but it does not have a clean cutoff on Ysearch, blending with KB samples due to non-Polish unidentified types.  I explain this later in this document and with more discussion in my publication.

            STR values are not available for the rare R1a groups.  I presume a very low percent of the samples that I classify as R based on STR values belong to those rare groups.  Aside from such rare exceptions, R1a1a* means R1a minus R1a1a7.  There may be significant exceptions in remote regions;  in fact Underhill reports one such exception;  see population bias.

            No doubt there are more types to be discovered in KB and R.  I already found some, mentioned below, but left out of that first table to keep it simple.  See the next topic for a more detailed list of types.  There is no reason to expect that KB and R will be fully resolved into types using the standard 67 STR markers.  R1a1a7 does seem to be fully resolved by 67 STR markers except for rare ambiguous samples.

 

List of R1a Types

 

            This topic was written in December 2009 with only minor edits since.

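| Group | Type | Subtype | Ysearch | Markers (out of 67) | Cutoff (less than) | SBP (<25% is good) | Size (% of R1a, Ysearch) | Link |
|---|---|---|---|---|---|---|---|---|
| R1a1a* | K |  | MN8R3 | 34 | 4 | 71% | 26% | K |
|  |  | A | FCUFG | 67 | 15 | 4.3% | 5.1% | A |
|  |  | I | EKVHX | 59 | 8 | 25% | 3.0% | I |
|  |  | B |  | 65 | 11 | 50% | 2.7% | B |
|  | L |  | 24MB4 | 20 | 11 | 19% | 23% | L |
|  |  | M | 24MB4 | 20 | 4 | 14% | 4.9% | M |
|  | C |  |  | 26 | 7 | 20% | 0.8% | C |
| R1a1a7 | P |  | 8U92G | 36 | 5 | 9% | 6.5% | P |
|  |  | Pc | RQK32 | 47 | 2 | 61% | 1.1% | Pc |
|  |  | Pg | 92HEK | 21 | 2 | 68% | 0.8% | Pg |
|  | N |  | 3SEJK | 45 | 7 | 28% | 10% | N |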

 

            Those Types and Subtypes are my own code letters, for brevity.  Please do not confuse these code letters with official haplogroups.  I have been using such code letters for R1a assignments in the Polish Project for 2 years.  The color coding is for ease of comparison on my web pages.

            Those Links in the far right column jump to discussion of each type in my other web page, which emphasizes Poland.  It is not practical for me to rewrite each discussion here.  Although written for the Polish Project, those discussions are written in a manner that is valid world wide.

            The Ysearch links provide the full modal haplotypes, using the indicated number of STR markers, out of the standard FTDNA set of 67 markers.  I entered this data into Ysearch for our convenience.  All my modal haplotype definitions are available in the Excel file Haplotypes.xls, which also has experimental types not mentioned here, and very rare types like C type.  My Polish Project web page has Ysearch instructions for quickly comparing your haplotype to all my types at once.
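            As a rough illustration of comparing a haplotype to modal haplotypes, here is a minimal Python sketch.  It assumes that distance is counted as the sum of absolute differences in repeat counts over the compared markers, and that a sample falls inside a type when its distance is less than the cutoff.  Those counting conventions, the cutoff of 3, and the sample values are assumptions made only for this illustration.  The values used are the short 8 marker signatures from the More Signatures topic, not the full modal haplotypes in Haplotypes.xls, so the cutoffs in the table above do not apply to them.

```python
# Short 8 marker signatures, in the order
# DYS 385a, 439, 447, YCAIIb, 406, 594, 481, 565
SIGNATURES = {
    "K": (11, 10, 24, 23, 11, 10, 23, 13),
    "L": (11, 10, 23, 21, 11, 11, 23, 12),
    "N": (11, 11, 23, 23, 12, 10, 25, 13),
    "P": (10, 10, 23, 23, 12, 10, 25, 13),
}


def step_distance(sample, modal):
    """Sum of absolute repeat-count differences (assumed step-counting rule)."""
    return sum(abs(s - m) for s, m in zip(sample, modal))


def closest_types(sample, cutoff):
    """Return (code, distance) pairs with distance < cutoff, nearest first."""
    hits = [(code, step_distance(sample, modal))
            for code, modal in SIGNATURES.items()]
    return sorted((h for h in hits if h[1] < cutoff), key=lambda h: h[1])


# Hypothetical sample: one step away from the P signature at DYS 447.
sample = (10, 10, 24, 23, 12, 10, 25, 13)
print(closest_types(sample, cutoff=3))
```

            With real data the same comparison would use the full modal haplotype and the cutoff listed for each type.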

            Statistical Background Percent (SBP) is my measure of quality for types.  Lower SBP is more likely to represent a valid clade - a predicted haplogroup - to be someday in the future verified by a new SNP discovery.  My published SBP values are better for P, Pc, and Pg because these are concentrated in Poland, so the SBP comes out better when calculated using only Polish data.  The data in this table are from Ysearch.

            K type provides a 26% SBP when restricted to Polish data.  On Ysearch, when I flag the Polish samples, it is obvious there are many at step 3 but few Polish samples at the cutoff step 4 just beyond the mountain of K type.  The number of samples at the cutoff step 4 has a big effect on causing large SBP.  On Ysearch step 4 is dominated by non-Polish samples.  That indicates there is at least one undiscovered clade, rare in Poland, that lands at step 4 in K type causing interference in the Ysearch data.

            L type is a recent finding, not yet optimized.  The optimized M modal haplotype produces L, but with optimization the L modal haplotype should come out slightly different.  Watch this web site for updates.  K and L seem to be the largest types in R1a1a*, corresponding to the signatures that I found in the Underhill data.  Actually, what I call K, L, and M are the most common haplotypes in R1a using only the standard FTDNA 12 markers, so I have been watching the 67 marker data accumulate for these three for two years now.  The 12 marker data is not convincing, but with all 67 markers K and L seem to be major divisions, with M a large subtype of L.

            L and M have been called the Viking haplotype.

            Size is given as a percent of samples that are indicated as R1a in Ysearch.  This analysis is from my 17 July 2009 download.  See Ysearch Download below.  You can quickly check my results by using those Ysearch ID codes in the table;  recent results might be slightly different due to new data.  Those percent sizes are very uncertain due to confidence intervals associated with small samples;  that download has 370 samples, so for example “I type” at 3% is only 11 samples, highly uncertain.  In addition, size is uncertain due to population bias.

            Size of R1a1a7:  P plus N are 16.5% of R1a.  This represents R1a1a7, because M458+ samples are cleanly falling into P and N types as M458 results come in.  The 70% confidence interval is 14% to 19%, but note this is only the statistical sampling confidence for data on Ysearch.  Such a size estimate may be wrong due to population bias or due to hidden clades.
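            The sampling arithmetic can be sketched in a few lines of Python.  This sketch assumes a simple two-sided normal approximation for a sampled proportion;  the method behind the interval quoted above is not stated here, so this is only a plausibility check that should land close to 14% to 19%, not a reproduction of the published calculation.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p, n, confidence):
    """Two-sided normal-approximation interval for a sampled proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


n_samples = 370  # size of the Ysearch download used for the table

# "I type" at 3% of 370 samples is only about 11 samples.
print(round(0.03 * n_samples))

# P plus N at 16.5% of 370 samples, 70% confidence.
low, high = proportion_interval(0.165, n_samples, 0.70)
print(f"{low:.1%} to {high:.1%}")
```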

            Adding the Size values for the types (not the subtypes), the total is 66%.  That means 34% of samples do not fit one of my types.  These should all be R1a1a* because only P type and N type are coming out R1a1a7.  Again, there might be exceptions due to population bias or due to hidden clades.  See M458 Prediction for more discussion about this.
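            For anyone who wants to check the totals, the sums can be written out as a short Python sketch using the Size values of the five types in the list above (K, L, C, P, N), leaving out the subtypes.

```python
# Size (% of R1a on Ysearch) for the types only, subtypes excluded
type_sizes = {"K": 26.0, "L": 23.0, "C": 0.8, "P": 6.5, "N": 10.0}

assigned = sum(type_sizes.values())
r1a1a7 = type_sizes["P"] + type_sizes["N"]

print(f"assigned to a type: {assigned:.1f}%")
print(f"not fitting any type: {100 - assigned:.1f}%")
print(f"P plus N (R1a1a7): {r1a1a7:.1f}%")
```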

 

Fall Issue of JoGG

 

            The Fall issue of the Journal of Genetic Genealogy came out on 21 Nov.  My publication is split into two parts there:

            Part I is my “Mountains in Haplospace” method for evidence that certain “types” of STR clusters correspond to hypothetical haplogroups.

            Part II is the application of that method to Common Polish Clades.

            Supplementary Data is a folder index at the JoGG site with more details, including Excel based tools for analysis using my method.

            PolishCladesUpdate is my folder for future updates to those two articles, including Haplotypes.xls, my Excel file with my current modal haplotypes including experimental ones not worthy of mention here.

            PolishClades is my web page that serves as an introduction and summary for Polish results.

            Polish Assignments is my Excel file with detailed explanation of my assignment of Polish Project members to R1a types.

            This web page you are now reading will continue as an introduction, summary, and update for R1a worldwide.

 

Underhill

 

            A new article was published online, 4 Nov, essentially dividing R1a into two large groups, based on a new SNP, M458.

                        Abstract                       STR Data

I call this article “Underhill” here for short, because his is the lead name in the list of 34 authors for this major work.

            Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-).  See R1a Subdivision for a brief summary of other groups, and for a clarification of what I mean by R1a1a*.

            Jump to UnderhillAnalysis for some analysis of data from this new article.

            Recent test results for M458 are consistent with (albeit not full proof of) my previous R1a subdivision into “types”.

 

Age of R1a1a7, with P Type and N Type

 

            Below in this document, I use P Type, N Type, and the combined R1a1a7 as an example, to discuss the way age (TMRCA) is calculated along with the uncertainty due to caveats, including population bottlenecks.

            In my publication, I provide a longer discussion of the age of P type.

            P type is quite young, and large.  My best age estimate is 1,600 years, although considering those caveats it may be 1,000 to 3,000 years old, with significant population expansion less than 1,600 years ago.  Although I am considering a wide range of ages, I judge the extremes to be unlikely.

            N type is older, 2,500 years, and again considering caveats perhaps 2,000 to 6,000 years, including unlikely but possible extremes.

            The combined R1a1a7 comes out 3,200 years but in this case caveats are more likely, so I would consider the range 2,500 to 11,000 years.  Underhill puts R1a1a7 at 10,700 years old.  Actually, I set my maximum just beyond that as a courtesy.  I consider the high end of that range very unlikely, and I explain my reasoning at the end of the age caveats topic.  Briefly, I use the same method as the Underhill authors but they use the maximum adjustment for caveats, which I don’t think applies in this case.  Underhill does not mention the caveats, but I parse the Underhill wording below to demonstrate that they know about this.  One of the Underhill authors (Zhivotovsky) published an article where I first read about these caveats.

            Along with the discussion topics below is speculation about how P type may have come to be, again as an example.  Here is a link to that speculation paragraph.

 

Abstract

 

            This Abstract is for people reasonably familiar with the jargon of genetic genealogy.  If you are new to genetic genealogy you might prefer to first read my Introduction.  That link goes to my companion web document, which has definitions of terms in boldface.  This Abstract uses links to those definitions at the first instance of a term.

            This web document explains my proposed subdivision of the R1a haplogroup, based on statistically correlated STR markers.  My statistical methods and results are published in the Fall Issue of JoGG.  The companion document has a summary of my method.  This document provides more detail and update.

            I use the word type to mean an STR cluster with statistical validity as established by my Mountain Method.  I expect my types to be validated some day by discovery of new SNPs that will qualify them as haplogroups.

            A new division of R1a became available in November, R1a1a* vs R1a1a7.

            I divide the haplogroup R1a1a7 into what I call P type and N type.  Within the paragroup R1a1a* I have identified K type and L type, but I have not yet divided much of R1a1a* into types.  K type is further divided into A type, B type, and I type.  L type has a subtype M type.  This web page also mentions a few rare types.  This web page also mentions candidate types and subtypes for which the statistical evidence is not strong.

            By paragroup R1a1a* I mean haplogroup R1a1a without haplogroup R1a1a7.  Almost all of R1a breaks out into these two major groups.  There are other rare groups.

            I introduced P, N and K types through the Polish Project in Nov 2007.  The recent Underhill article announced M458 as the defining SNP for the new haplogroup R1a1a7.  Test results are now coming in for M458.  The Polish Project samples I had assigned to P and N types are coming out in R1a1a7, and all others are coming out in R1a1a*.  This is consistent with (albeit not full proof of) my prior assignments.  For samples without the M458 test, the STR based types on this web page serve as M458 prediction.

            So far, R1a1a7 (M458+) samples cleanly fall into P type and N type.  P type is concentrated in Poland and N type is concentrated in Slavic lands.  Of course future results may turn up rare R1a1a7 data that matches neither P nor N, particularly if there are R1a1a7 clades localized to regions from which M458 has not been tested yet.  N type seems older and larger than P.  As data accumulates, both may subdivide into statistically significant subtypes.

            The TMRCA age of R1a1a7 is probably about 3,200 years but because of caveats in the way age is calculated it is reasonable to consider a range from 2,500 to 11,000 years.  I have a discussion regarding the older age estimates by Underhill.  P type seems to have formed in a recent population expansion less than 1,600 years ago.  N type is roughly 2,500 years old.

            R1a1a* (M458-) is not completely separating into high confidence types, but that is not surprising.  For Y-DNA to fall into types (unique STR correlations with statistical significance) there must have been a past population bottleneck, or migration, or isolation, or very rapid population growth of a founding male line ancestry.  If R1a1a* grew rapidly into several significant clades that survive to modern times the STR value ranges should overlap due to the random diffusion of mutations.  There are some obvious types.  One example in R1a1a* is A type, associated with Ashkenazi, the best isolated type I have analyzed in R1a.

 

Now For The Details

 

            Up to here, I have tried to write this web page as news and summary, with links to more discussion below.  I hope anyone having minimal familiarity with genetic genealogy jargon has understood.  From here on it gets more detailed.  I’m sorry about that, but the audience from now on is readers with genetic genealogy experience who want to know how I came to my conclusions.  If you cannot follow the remainder due to jargon, it is written in a manner that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.

            If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location.

 

M458Signatures.xls

 

            This and the following two topics were written in December 2009 with only minor edits since.  These three will not be updated, since the issue of signatures only needs preliminary evidence.  Updating is a lot of effort.  There is no particular advantage to better statistics, because more markers are easily available for high confidence M458 prediction.  It’s interesting however to document here the fact that M458 can be predicted by a short signature.

            The file M458Signatures.xls has a listing of sample counts for the best three markers from three sources:  Underhill Analysis, Ysearch Download, and M458 Detailed Results.

            That Excel file has all the haplotypes using DYS (385a,439,447).  They are sorted by 447 first and 385a last.
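            The sort order can be illustrated with a small Python sketch.  It uses only the seven signature haplotypes listed in the Signatures topic rather than the full contents of the Excel file, and it assumes DYS 439 is the middle sort key, since only the first and last keys are stated above.

```python
# (385a, 439, 447) values for the seven signatures in the Signatures topic
haplotypes = [
    (11, 10, 24), (11, 10, 23), (11, 11, 23), (11, 11, 24),
    (11, 10, 25), (10, 10, 23), (10, 11, 23),
]

# Sort by 447 first and 385a last (439 assumed to be the middle key).
ordered = sorted(haplotypes, key=lambda h: (h[2], h[1], h[0]))

for dys385a, dys439, dys447 in ordered:
    print(dys385a, dys439, dys447)
```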

            By signatures I mean haplotypes using only a few key markers, handy for publication information.  This Excel file is an example of the surprising value of only 3 markers.  This is a nice demonstration, but please realize SNP markers are the gold standard, and 67 markers do a much better job of classifying Y-DNA into a subdivision list.

            Summary:

43 total haplotypes ranging from 11,10,21 to 11,10,27

Underhill data:  68 M458+ and 222 M458- for 290 total but 18 have 447 blank leaving 278 for comparison

Ysearch data:  370 samples all are R1a with 67 markers available;  these do not have M458 results;  for comparison

            The following summary is for 29 Dec, the data used in the Excel file:

M458 Results:  56 M458+ and 60 M458- for 116 total but 1 had 447 blank (only 12 markers) leaving 115 for comparison

            Comments:  The haplotypes in that Excel file are color coded using the colors of the Signature Table in the following topic.  The first 25 standard FTDNA markers have these 3 signature markers.  Discussion is in the next topic.

 

Signatures

 

            Here are my >80% confidence R1a 3 marker signatures for predicting M458 results:

            These are arranged in order of frequency on Ysearch.

            For a list of all the haplotypes using only these 3 markers, see M458Signatures.xls.

| Group | Code | DYS 385a | DYS 439 | DYS 447 | Comments |
|---|---|---|---|---|---|
| R1a1a* | K | 11 | 10 | 24 | Most Common |
| R1a1a* | L | 11 | 10 | 23 | Common in northwest Europe |
| R1a1a7 | N | 11 | 11 | 23 | Common in Slavic Countries |
| R1a1a* | K1N1 | 11 | 11 | 24 | Between N and K |
| R1a1a* | K1 | 11 | 10 | 25 | One Step from K |
| R1a1a7 | P | 10 | 10 | 23 | Concentrated in Poland |
| R1a1a7 | P1N1 | 10 | 11 | 23 | Between P and N |

            Those “Codes” are my own code names, for brevity.  Please do not confuse these codes with haplogroups. Notice I am not calling them “types” in this table.  To represent a statistically valid type, a modal haplotype must be statistically isolated.  These are not well isolated because they differ by only 1 or 2 mutation steps from each other.  However, those K, L, N, P signatures use the same values at these three markers as the corresponding type definitions.

            The following discussion is for 29 Dec, combining the Underhill published M458 data with recent M458 Results:

            Samples that match P or N at these 3 markers have >80% confidence of belonging to the corresponding type, and thereby >80% confidence of belonging to haplogroup R1a1a7.  Samples that match one of the other 3 marker haplotypes do not have high confidence of belonging to a particular type, but they do have >80% confidence of being M458- and thereby belonging to paragroup R1a1a*.
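            The table lookup described above can be sketched in Python.  The signature values and groups come from the table;  the function name and the choice to return nothing for haplotypes outside the table are assumptions for this illustration.  The predictions carry the >80% confidence discussed here, not certainty.

```python
# Signature table keyed by (DYS 385a, DYS 439, DYS 447)
SIGNATURES = {
    (11, 10, 24): ("K", "R1a1a*"),
    (11, 10, 23): ("L", "R1a1a*"),
    (11, 11, 23): ("N", "R1a1a7"),
    (11, 11, 24): ("K1N1", "R1a1a*"),
    (11, 10, 25): ("K1", "R1a1a*"),
    (10, 10, 23): ("P", "R1a1a7"),
    (10, 11, 23): ("P1N1", "R1a1a7"),
}


def predict_m458(dys385a, dys439, dys447):
    """Return (code, group, predicted M458 result), or None if not in the table."""
    entry = SIGNATURES.get((dys385a, dys439, dys447))
    if entry is None:
        return None
    code, group = entry
    result = "M458+" if group == "R1a1a7" else "M458-"
    return code, group, result


print(predict_m458(10, 10, 23))  # matches the P signature
print(predict_m458(11, 10, 24))  # matches the K signature
print(predict_m458(12, 10, 22))  # hypothetical haplotype not in the table
```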

            For K and L signatures, 100% of the data so far is M458- for R1a1a*, and the 80% sampling confidence interval on that zero misfit is 97% to 100%.  For N, only 2 samples (one Underhill and one M458 Results) came out M458-, for 96% M458+, with 80% confidence lower value 89%.  For P, only 4 samples came out M458-, for 92% M458+, with 80% confidence lower value 83%.
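            For readers who want to experiment with sampling confidence on small counts, here is a minimal Python sketch of one standard approach, an exact one-sided binomial (Clopper-Pearson) lower bound found by bisection.  This is an assumption chosen for illustration, not the calculation behind the numbers quoted above, which are down graded for small samples by a method not described in this topic, so the values will not match exactly.  The counts in the example are hypothetical.

```python
from math import comb


def tail_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


def lower_bound(k, n, confidence):
    """One-sided exact lower bound on the true proportion, given k hits in n."""
    if k == 0:
        return 0.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection works because the tail grows with p
        mid = (lo + hi) / 2
        if tail_at_least(k, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical counts: 7 of 7 matching, and 48 of 50 matching.
print(round(lower_bound(7, 7, 0.80), 3))
print(round(lower_bound(48, 50, 0.80), 3))
```

            When every sample matches, the bound reduces to (1 - confidence) raised to the power 1/n, which shows how quickly a small sample limits the confidence.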

            In other words, we have 83% to 97% confidence for each future sample prediction:  those that match the P or N signature will come out M458+ (R1a1a7), and those that match the K or L signature will come out M458- (R1a1a*).  The overall confidence is much better than 95% because I have down graded each haplotype prediction to account for uncertainties associated with small sample sizes at each haplotype.

            The 4th and 5th entries, K1N1 and K1, may be signatures of types that I have not yet identified.  These are rare in Poland, where I have been concentrating my research.  These two are common both on Ysearch and in the Underhill data.  Based on the latter, these two also have better than 80% overall confidence of M458-.

            It is interesting that the signature K1N1 comes out 88% M458- with 80% confidence lower value 78%.  This is despite the fact that it is only one step from N.  This is evidence that many of these belong to a type that is M458-.  This is consistent with the interpretation that K is larger and older than N, thereby having suffered more mutations, so there are more samples from K than from N in this K1N1 haplotype count.

            P is the most common signature in Poland, 6th on Ysearch.  The M458 results available to me are biased toward Poland, so P is the most common signature with 21 samples;  K and N are tied with 18 each.

            I arranged the table as the 7 most common haplotypes on Ysearch, because Ysearch should be most representative for readers of this web page.  M458Signatures.xls has data for all haplotypes.  These 7 represent 80% of the data on Ysearch.  Ysearch does not have M458 results, so the results quoted here are combined from Underhill with M458 Results.

            P1N1 is 100% M458+ with 80% confidence lower value 67% due to the sampling statistics, only 7 samples.  It cannot be determined using only 3 markers if these are P type or N type because this signature haplotype is between the two, but future matches are very likely M458+.  Anyway, each of these samples has more markers for determining P vs N.  The point of this topic is a demonstration of the value of these 3 markers alone.

            The other samples in the Excel file follow the same guidelines:  haplotypes one step from P or N but not close to K or L come out M458+.  Haplotypes one step from K or L but not close to P or N come out M458-.  Samples one step between (P or N) and (K or L) tend to come out mostly M458-.
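            Those guidelines can be written as a short Python sketch.  It assumes a step is counted as the sum of absolute differences over the three markers, and it reads “close” as an exact match or one step away;  both readings are assumptions for this illustration, and the returned labels are tendencies, not guarantees.

```python
POSITIVE = {"P": (10, 10, 23), "N": (11, 11, 23)}  # M458+ signatures
NEGATIVE = {"K": (11, 10, 24), "L": (11, 10, 23)}  # M458- signatures


def steps(a, b):
    """Sum of absolute differences over DYS 385a, 439, 447 (assumed step count)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def guideline(haplotype):
    """Apply the one-step guidelines to a 3 marker haplotype."""
    near_pos = min(steps(haplotype, s) for s in POSITIVE.values())
    near_neg = min(steps(haplotype, s) for s in NEGATIVE.values())
    if near_pos == 0:
        return "M458+"
    if near_neg == 0:
        return "M458-"
    if near_pos == 1 and near_neg == 1:
        return "between: mostly M458-"
    if near_pos == 1:
        return "M458+"
    if near_neg == 1:
        return "M458-"
    return "no guideline"


examples = {"P1N1": (10, 11, 23), "K1N1": (11, 11, 24), "K1": (11, 10, 25)}
for name, haplotype in examples.items():
    print(name, guideline(haplotype))
```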

            Summary:  A sample not tested yet for M458 can be predicted with roughly 80% confidence by comparison to the data by 3 marker haplotype in M458Signatures.xls.  Confidence by haplotype is available there, down graded for haplotypes with low sampling statistics.  80% of samples can be predicted using the table above, with the 7 most common haplotypes.  Overall confidence is much higher than 80%.

            These particular three markers are surprisingly stable in R1a, but they do mutate, so some samples that match a signature do not truly belong to the corresponding type.  I find it amazing that 3 marker signatures are this good.  My conclusion is that we are lucky that R1a1a7 is composed of two clades, P type and N type, which just happen to have MRCA’s who differed significantly from the modal haplotype of the parent R1a1a haplogroup in those STR values that we now measure.

            I published the first version of this signature table on my Polish Clades web page on 11 Nov, just a week after the Underhill article came out.

 

More Signatures

 

            For your information, here are better signature tables.   These signature modal haplotypes use only a few key markers, handy for publication information.  They work fine for predictions, but the full definitions are much better, available through those Ysearch ID codes and also in Haplotypes.xls.  I published that first table below a week after the Underhill publication although only the first 3 markers are in Underhill, because those 3 correlate well with Ysearch haplotype frequencies.  Since then, the M458 test results have confirmed (albeit not fully proven) the type definitions that I use.

 

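| Part | Code | DYS 385a | DYS 439 | DYS 447 | YCAIIb | DYS 406 | DYS 594 | DYS 481 | DYS 565 |
|---|---|---|---|---|---|---|---|---|---|
| R1a1a* | K | 11 | 10 | 24 | 23 | 11 | 10 | 23 | 13 |
| R1a1a* | L | 11 | 10 | 23 | 21 | 11 | 11 | 23 | 12 |
| R1a1a7 | N | 11 | 11 | 23 | 23 | 12 | 10 | 25 | 13 |
| R1a1a7 | P | 10 | 10 | 23 | 23 | 12 | 10 | 25 | 13 |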

 

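| Type | Subtype | Ysearch | Markers | Cutoff | DYS 19 | DYS 385a | DYS 439 | DYS 389-1 | DYS 458 | DYS 447 | DYS 442 | DYS 406 | DYS 594 | DYS 481 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K |  | MN8R3 | 34 | 4 |  | 11 | 10 | 13 |  | 24 |  | 11 | 10 | 23 |
|  | A | FCUFG | 67 | 15 | 16 | 11 | 10 | 13 | 15 | 24 | 14 | 11 | 10 | 23 |
|  | I | EKVHX | 59 | 8 | 17 | 11 | 10 | 13 | 14 | 24 | 14 | 11 | 10 | 23 |
|  | B |  | 65 | 11 | 16 | 11 | 10 | 14 | 15 | 24 | 12 | 10 | 10 | 23 |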

 

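| Type | Ysearch | Markers | Cutoff | DYS 19 | DYS 385a | DYS 439 | DYS 389-1 | DYS 389-2 | DYS 459a | DYS 447 | YCAIIb | DYS 406 | DYS 594 | DYS 565 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M | 24MB4 | 20 | 4 | 15 | 11 | 10 | 14 | 31 | 8 | 23 | 21 | 11 | 11 | 12 |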

 

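| Type | Subtype | Ysearch | Markers | Cutoff | DYS 19 | DYS 385a | DYS 439 | DYS 389-1 | DYS 389-2 | DYS 447 | DYS 537 | DYS 406 | DYS 481 | DYS 565 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P |  | 8U92G | 36 | 5 | 17 | 10 |  | 13 | 30 | 23 | 12 | 12 | 25 |  |
|  | Pc | RQK32 | 47 | 2 |  | 10 | 11 | 13 | 30 | 23 |  | 12 |  | 12 |
|  | Pg | 92HEK | 21 | 2 | 17 | 10 | 10 |  |  | 23 | 12 | 12 | 25 | 13 |
| N |  | 3SEJK | 45 | 7 |  | 11 | 11 | 13 | 29 | 23 | 11 | 12 | 25 | 13 |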

 

 

R1a Subdivision

 

            Almost all of R1a divides into R1a1a* and R1a1a7.  R1a also has several known rare groups:  R1a*, R1a1*, R1a1aN, where N = 1 to 6 and 8.  There is also a very rare R1a1a7a.  By paragroup R1a1a* I mean haplogroup R1a without haplogroup R1a1a7;  technically, that’s incorrect because those groups should be excluded from R1a1a*, but so far I don’t see any way to identify these rare exceptions.

            Underhill is the announcement of the M458 definition of R1a1a7.  Underhill also has a recent summary of the SNP definitions for R1a.  Check ISOGG for updates.  Wikipedia  has a nice R1a entry with primary contribution by Andrew Lancaster.

            The rare R1a groups are not in my list of R1a types because I cannot find STR values for these.  It’s a shame the corresponding STRs are generally not published in SNP announcements.  I don’t know if the rare groups all together add up to 0.1% or 1% of R1a.  Surely they are less than 3%.  My percentage calculations in that list probably do not need adjustment because any Ysearch samples that might belong to these rare clades would probably have unusual STR values, not falling into one of my types, but still be counted in the totals. In my R1a Table, rare samples should be included in row R.  That row R might have a few percent from these rare groups, but I don’t know exactly how many.

            Underhill mentions 7 samples (men) from R1a*, 9 from R1a1*, 14 from R1a1a6, and 1 from R1a1a7a.

 

Ysearch Population Bias

 

            A large Y-DNA database necessarily has regional bias.  I’m not criticizing Ysearch;  any R1a database would have a bias without significant effort to collect data from all over Eurasia in a rigorous statistical manner.  United States men are more likely to join Ysearch, biasing data toward the European countries that contributed more immigrants to the US.  Men in developed countries are more likely to join than men in undeveloped countries.  I bet grandsons of immigrants, like me, are particularly more likely to join.  It is obvious from the Ysearch data that men with Jewish names seem particularly likely to join.  Ysearch clearly ends up with more Y-DNA data for men whose 19th century ancestors lived in Europe, with a representation gradient biased toward the northwest.  Perhaps there is an exception for the extreme southeast, India, where the gradient turns back up a bit.  It seems to me another uptick in the gradient of representation is at Poland, or more accurately Polish Americans.

            R1a is common in west Asia, but that part of R1a is underrepresented in Ysearch.  The far east of Europe is probably also underrepresented.  In that respect, those “Size” percentages in the Type List, and most data counts on this web page, are only rough estimates of the world-wide R1a distribution.  A number of web sites take on this issue of true distribution of haplogroups, beyond the scope of my web page here.  It seems to me N type, a type common in Slavic lands, is proportionally under represented in Ysearch.

            There may well be specific regions of the world hiding what I call STR “types”.  If R1a men with ancestry in such a region are unlikely to have many samples in Ysearch, or in other R1a databases, their STR combinations will not show up in “data mining” analysis from the web (such as my work).  Even a major DNA collection project, such as the effort documented in the Underhill article, might miss some R1a types.

            It is my opinion that P type, concentrated in Poland, is overestimated in Ysearch at almost half of R1a1a7, because of the bias mentioned above.  My publication estimates the size of Polish clades by Ysearch ratios.  This P type discussion is an example;  no doubt other types are over represented in Ysearch.

 

Hidden Clades

 

            P type is an excellent example for explanation of caveats that apply generally to STR based prediction.  Let me call these the “hidden clades” caveats.  I use the word “haplospace” for combinations of STR values.  Reminder:  I use the word type for a hypothetical haplogroup to distinguish different kinds of clades.

            P type and N type are far apart in haplospace but nevertheless come out together in the same haplogroup, R1a1a7.  In fact, P and N are both far away from the main R1a “tree trunk”, and they are far away in different directions in haplospace, with different markers mutated.  I provide STR ages of 1,600 years for P type and 2,500 years for N type (with many caveats).  P type is concentrated in Poland and N type is generally concentrated in Slavic lands.

            That means it is highly unlikely, although possible, that P type branched from N type.  I say “possible” because a male line from N type with an unusual accumulation of mutations might by luck have produced a man who is the MRCA of the P clade.  However, mutations would have been needed to restore the R1a modal values in all those markers that uniquely correlate with N, which is very unlikely.  More likely, P type represents an older branch.  The node where P joins N in the tree is the MRCA of the combined R1a1a7.  It is possible another undiscovered clade has an older branch node in R1a1a7, but that’s not relevant to this discussion.  The node age is of course older than the TMRCA for P.  The node age is likely older than the TMRCA for N because it is unlikely the man who founded N lived exactly at the node.

            That old P branch almost died away, because today P has very low STR diversity, significantly less diversity than N.  P type seems only 1,600 years old.  The simplest explanation is that a single man or his immediate descendants were highly successful in a male line population expansion starting about 1,600 years ago.  Another explanation is that P type is hiding as an old clade in a remote region of Eurasia, the remnant of a larger clade that flourished in that region long ago, from which very few descendant men have provided DNA data to US databases, and from which a large paternal tribe migrated to Poland about 1,000 years ago.  Another explanation is that plagues and other disasters between 3,000 and 1,000 years ago reduced the diversity of STR values in R1a1a7, causing extinction in all but two male lines, P and N, which by luck came out the way they are.  No doubt you are thinking of another scenario right now as you read this.  Whatever.  The point here is:  P and N types popped up as statistically compelling large types with unusual STR values, spreading out the R1a1a7 haplogroup into two mountains in haplospace.  If P type just happened to have had STR values close to the main R1a tree trunk, I would not be typing this web page today, because I would not have distinguished P type (my maternal grandfather’s original 12 markers) as a unique hypothetical clade.

            So P type is an example:  a large clade can turn up isolated in haplospace, far away from its haplogroup, and even far away from a sister clade (N type).  That’s unusual, but it happens.  Most large clades are closer to their parent haplogroup in haplospace, difficult to distinguish, so most clades cannot be called “types”.  Because P type is unusual, it is highly unlikely it is hiding another large foreign clade that also just happened to be unusual and turned up in the same haplospace region.  Same for N type.  The M458 results are bearing this out.  However, smaller clades no doubt are more numerous.  P, or N, or any isolated type might be hiding small foreign clades that just happened to come out in the same region of haplospace (about the same STR value combination).  By “foreign” I mean clades with old nodes, in the P example either an R1a1a7 node older than the P TMRCA, or even a small clade from R1a1a*.

            To compensate for such caveats, I included an increase in my formula for SBP.  Smaller SBP means a better type, so the increase is an automatic downgrade in quality to higher SBP.  However, I do not have a rigorous way to quantify these caveats, so please be aware that the increase is a rough estimate, justified in my publication.

            Summary:  Most clades hide near the main tree trunk, so they do not qualify as types, which are clades isolated in haplospace.  Large types are unlikely to be pure clades because of the possibility of small hidden clades.  SBP is offered as a measure of clade purity, with an estimated accounting for hidden clades.

            Clades can also hide geographically, due to population bias in databases.

 

Age Caveats for Haplogroups and Types

 

            R1a1a7 is an excellent example for explanation of caveats that apply generally to clade age estimation.  For this section, I assume that R1a1a7 is only P type and N type.  It is possible another undiscovered clade has an older branch node in R1a1a7, in which case this discussion would need modification, but no essential change.

            These caveats apply equally to official haplogroups and also to types.  The types of course have additional uncertainty insofar as the types may not capture all STR outliers and instead capture outliers from foreign clades.  We expect types with relatively lower SBP to more likely be pure clades, with age estimation relatively more confident.

            Age is taken as the Time to Most Recent Common Ancestor (TMRCA).

            Published literature and web sites have lots of argument concerning the best way to statistically estimate TMRCA.  Since age is figured from mutation rates, there is also much discussion about how to calibrate mutation rates.  The caveats in this section apply to all methods, because they all depend upon STR variance due to mutations as a measure of age.  STR values diffuse with time, so more STR variance in the cluster of data for a clade represents older age.  “Variance” is the official statistical measure of how much something varies.  Average Squared Distance (ASD) is equivalent to variance.  Most people use ASD in genetic genealogy, as I do in my publications.  Some people figure age by methods other than ASD, but again those other methods are related to variation in STR values.  In this topic I use the word “variance” because it makes intuitive sense without imposing any mathematical equation.
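
            For readers who do want to see the arithmetic, here is a minimal sketch of an ASD style age calculation.  It assumes the simplest textbook model, in which ASD measured from the founder haplotype grows by one mutation rate per generation, so age in generations is ASD divided by the rate.  The haplotypes, the rate of 0.002 and the 25 years per generation are placeholders chosen for the example, not calibrated values and not taken from any publication.

```python
def asd_from_founder(haplotypes, founder):
    """Average squared distance of all marker values from an assumed founder haplotype."""
    total = 0
    count = 0
    for hap in haplotypes:
        for value, ancestral in zip(hap, founder):
            total += (value - ancestral) ** 2
            count += 1
    return total / count

def age_in_years(asd, rate_per_generation, years_per_generation):
    """Age = ASD / rate (in generations), converted to years."""
    return asd / rate_per_generation * years_per_generation

# Hypothetical four-marker haplotypes; the founder is taken to be the modal.
founder = (11, 10, 23, 16)
samples = [
    (11, 10, 23, 16),
    (11, 10, 24, 16),
    (11, 11, 23, 16),
    (11, 10, 23, 15),
]

asd = asd_from_founder(samples, founder)
age = age_in_years(asd, rate_per_generation=0.002, years_per_generation=25)
```

            Three single-step differences spread over sixteen marker values give a small ASD;  everything else in the result comes from the assumed rate and generation length, which is why calibration matters so much.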

            A number of publications and web sites misrepresent the confidence of age estimation by providing statistical sampling calculations, without mentioning the caveats due to systematic uncertainties that are there regardless of the statistical certainty in the data.  These caveats are not my ideas.  This topic is my review of issues in the literature, using P type and N type as examples.

            We are quite fortunate that P type and N type differ significantly from each other.  They are in different haplospace directions away from the R1a1a (K type) tree trunk.  It seems that population bottlenecks a few thousand years ago severely reduced the parent populations of P and N types, and by luck the most recent common ancestor (MRCA) for both P type and N type just happened to have a few unusual STR values.  In the population bottleneck topic I offer speculations about the details of how that may have come to pass.  Whatever the reason, both P type and N type are now easy to identify using STR markers.

            Population bottlenecks reduce STR variation in the long run, as explained in that topic.  It can be proven statistically that variance increases at a constant rate with time in a large growing clade (no bottleneck).  That means STR variation in a young, rapidly growing clade increases at a default rate.  If a clade suffers a severe population bottleneck, the STR variation suddenly decreases and then continues to grow again at the default rate, so the net growth is less than default. If a clade suffers only mild bottleneck pressure (slow growth or no growth) over a long period of time, the increase of STR variance will be slower than default.  This is well known, so experts use a “factor” in their standard age equations.  The factor artificially decreases mutation rates in order to increase the age.  Experts know this is not really a constant factor, but in fact a discontinuous slowdown of the increase of variance (apparent aging) of clades, and the slowdown varies from time to time and from clade to clade.  Nevertheless, experts publish papers that do not remind the reader of all this, using a standard factor, usually about 3, for old haplogroups.  No factor should be used for a rapidly growing young clade.  P type seems to me the paradigm example for young clades;  P type must have grown rapidly to get as big as it is in apparently less than a couple millennia.  No factor, or at most a factor not much larger than 1, should be used for P.  Perhaps a factor might be appropriate for N, but surely not as large as 3.  I may be wrong about P and N if they did suffer bottlenecks and grew rapidly nonetheless, but the traditional factor of 3 for old haplogroups is not reasonable here.
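
            The way a bottleneck makes a clade look younger can be sketched with a deterministic toy model.  It assumes variance grows by a fixed amount each generation and that a single bottleneck removes a chosen share of it.  The rate, the timing and the surviving fraction are hypothetical numbers picked only to make the arithmetic visible.

```python
def variance_over_time(generations, rate, bottleneck_at=None, surviving_fraction=1.0):
    """Expected variance after each generation under a toy model."""
    variance = 0.0
    history = []
    for generation in range(1, generations + 1):
        variance += rate                      # default growth
        if generation == bottleneck_at:
            variance *= surviving_fraction    # sudden loss of variation
        history.append(variance)
    return history

rate = 0.002      # hypothetical mutation rate per generation
true_age = 200    # generations

no_bottleneck = variance_over_time(true_age, rate)
with_bottleneck = variance_over_time(true_age, rate,
                                     bottleneck_at=150, surviving_fraction=0.2)

apparent_age = with_bottleneck[-1] / rate   # age implied by the variance that is left
implied_factor = true_age / apparent_age    # correction needed to recover the true age
```

            In this toy case the clade looks 80 generations old instead of 200, so a factor of 2.5 would be needed.  A different bottleneck time or severity gives a different factor, which is the point:  no single standard factor fits every clade.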

            Sometimes a population bottleneck increases variance instead of decreasing it.  R1a1a7 with P and N might be a perfect example of that.  We cannot be sure of my example, but the possibility is obvious.  Usually, when there is a population bottleneck, the survivors have fewer male line ancestors than there were ancestors of the population before the bottleneck.  Each male line has less variance than the full population, so of course there is usually less variance afterward.  Even if one ancestor just happens to have extremely rare STR values, the data of his descendants will just shift the modal haplotype in his direction in haplospace.  It is very unlikely that there will be two MRCA after the population bottleneck that just happen to have extreme STR values in different directions in haplospace.  More likely, the new modal haplotype serves to reduce the variance in the combined population.  In the case of only two surviving clades, the new modal haplotype will be between them.  None of this is my idea.  It is a “Statistics 101” theorem that a small sample of data usually has less variance than the variance of the general population from which the sample is drawn.  But that’s usually.  The opposite happens occasionally - more variance sometimes after a population bottleneck - but with a long term trend toward less variance.
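
            That “Statistics 101” statement can be checked by brute force on a tiny example.  The sketch below lists every possible pair of survivors from a made-up population of repeat values at one marker and compares the variance within each pair to the variance of the whole population.  Drawing the pair with replacement, and dividing by n rather than n-1, are simplifying assumptions.

```python
from fractions import Fraction
from itertools import product

def variance(values):
    """Variance with divisor n, kept exact by using Fraction."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((Fraction(v) - mean) ** 2 for v in values) / n

# Hypothetical repeat counts at one marker before a bottleneck.
population = [23, 23, 23, 24, 24, 25]
full_variance = variance(population)

# Every possible ordered pair of surviving founders (drawn with replacement).
pair_variances = [variance(pair) for pair in product(population, repeat=2)]
mean_pair_variance = sum(pair_variances) / len(pair_variances)
pairs_with_more = sum(1 for v in pair_variances if v > full_variance)
```

            On average a pair carries half the variance of the population, yet a few of the 36 pairs (those combining 23 with 25) carry more.  That is the occasional exception described above.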

            R1a1a7, with the two surviving MRCA’s of P and N, might be an example of an unusual increase in variance, since they are far apart.

            The equations (or computer programs) for estimating age assume the most probable.  Because P and N are far apart in haplospace, the age calculation necessarily produces a corresponding old age for the parent R1a1a7.

            What is commonly missed is the uncertainty here.  The age calculation of R1a1a7 is strongly influenced by the separation of P and N.  Individually P and N have some variance, and that contributes to the age calculation, but P and N are so far apart that so far I have seen only one sample out of about a hundred with step values half way between them.  In other words, the uncertainty in the age of R1a1a7 has a component that is exactly the same as the uncertainty in estimating the TMRCA of two individuals chosen at random from a haplogroup.  This is not an analogy.  This is a fact.  There was a bottleneck (or a series of bottlenecks, or continuous population pressure over a long time) that reduced the variance of R1a1a7 until there were only two male lines left.  This is exactly the same as choosing two individuals randomly from a haplogroup.  If the P and N ancestors were separated in haplospace by the typical variance (one standard deviation) expected for the age of R1a1a7, then the age comes out exactly right.  But they are unlikely to be exactly typical.  They might be farther or closer apart than typical.  If they are farther, then R1a1a7 comes out too old - it is younger than it seems.  If they are closer, then R1a1a7 comes out too young - it is older than it seems.
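
            A small sketch shows how sensitive a two-man TMRCA is to luck.  It assumes the simple stepwise model in which the expected squared difference per marker between two lines is twice the mutation rate times the number of generations back to their common ancestor.  The five-marker haplotypes and the rate are hypothetical.

```python
def pair_tmrca_generations(hap_a, hap_b, rate_per_generation):
    """TMRCA estimate for two haplotypes: mean squared difference / (2 * rate)."""
    squared = [(a - b) ** 2 for a, b in zip(hap_a, hap_b)]
    asd = sum(squared) / len(squared)
    return asd / (2 * rate_per_generation)

rate = 0.002  # hypothetical mutation rate per marker per generation

# Hypothetical founders of two surviving lines.
founder_1 = (11, 10, 23, 16, 13)
founder_2 = (11, 11, 23, 17, 13)          # two single-step differences
founder_2_unlucky = (12, 11, 23, 17, 13)  # one extra chance mutation

estimate = pair_tmrca_generations(founder_1, founder_2, rate)
estimate_unlucky = pair_tmrca_generations(founder_1, founder_2_unlucky, rate)
```

            One extra chance mutation moves the estimate from about 100 to about 150 generations.  With only two founders there is nothing to average that luck away.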

            This topic is not unique to R1a1a7, but P and N make this a nice demonstration of caveats that are often missed in discussions on the web.

            I figure the TMRCA for P type at about 1,600 years, and the TMRCA of N type at about 2,500 years.  Other experts might get different numbers in the same ballpark.  Perhaps N type needs a factor, pushing the age out to about 3,000 years, but I doubt the factor should be anywhere close to 2.  The TMRCA of the pair, which is the TMRCA of R1a1a7, is about 3,200 years, which might have a factor as large as 2, pushing the age out to about 6,000 years.  I doubt it is that old, but I cannot be sure.

            But that factor is only one caveat.  Another caveat is the one related to having only two men (the MRCAs of P type and N type) who are widely separated in STR haplospace.  Because we do not know with confidence the TMRCA separating those two individual men, when they lived, I see no way to rule out a range from 2,500 to 11,000 years for TMRCA for R1a1a7.

            Underhill uses that standard factor of 3 in estimating, so the age in that publication is given as 10,700 years using Polish data.  The Underhill article does not mention that their R1a1a7 data cleanly separates into two clusters using only 3 markers.  I doubt all these experts are unaware of all this.  One of the Underhill authors, Zhivotovsky, published a paper with computer simulations demonstrating how young rapidly growing clades do not need a factor;  I review this paper and others in my publication.  The Underhill paper does not explicitly mention any caveats associated with age.  They use the term “coalescent time” instead of “age” or “TMRCA”, along with Zhivotovsky’s mutation rate that has the factor of 3.  To me, this indicates that the Underhill article is avoiding caveats and just projecting with a standard formula.  I find it highly misleading to quote a “coalescent time” of “10.7 ± 4.1 KYA”, with statistical precision of about 40%, without emphasizing that a factor of 3 is being used without any particular justification.  I agree that 10,700 years is a possible TMRCA age for R1a1a7, but because of the large factor being used I consider this an extremely high end estimate.
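
            The effect of the factor on the quoted number is easy to check.  The sketch below assumes the factor acts as a simple multiplier on age, which follows if it only rescales the mutation rate;  the variable names are mine, chosen for the illustration.

```python
quoted_years = 10_700   # "10.7 KYA" as quoted above
quoted_sigma = 4_100    # "± 4.1 KYA" as quoted above
standard_factor = 3     # factor built into the mutation rate

# The same data with no factor applied (assumes age scales linearly with the factor).
years_without_factor = quoted_years / standard_factor

# Statistical precision of the quoted figure, as a fraction of the estimate.
relative_precision = quoted_sigma / quoted_years
```

            Without the factor the same data would point to roughly 3,600 years, and the quoted sampling error is about 38% of the estimate, the “about 40%” mentioned above.  The sampling error says nothing about whether the factor itself is right.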

            I have another caveat.  A different subject.  Just a reminder for what I hope you consider obvious:  A TMRCA is generally quite a bit younger than a defining SNP.  I’ll again use R1a1a7 as an example:  The M458 mutation must be at least as old as the TMRCA for R1a1a7 because all descendants have it.  But that mutation is unlikely to have happened exactly at the MRCA.  Surely M458 is younger than the TMRCA of R1a1a, because most R1a1a men do not have it.  There is a span of time between TMRCA of R1a1a and TMRCA of R1a1a7, perhaps millennia.  Over this entire span of time there is a chain of men, a continuous line of sons, who are each an ancestor of that one MRCA of R1a1a7.  The M458 mutation could have happened anywhere in that span of time.  As more R1a1a haplogroups are defined in the next few years we may be able to narrow that span of time for M458 by the youngest R1a1a node without M458.  A good guess right now is that M458 is likely a millennium older than the MRCA for R1a1a7.

 

Population Bottlenecks

 

            Population bottlenecks generally reduce genetic variation, including STR variation.  Let me use R1a1a7 as an example.  Most of this section is a consensus of published DNA literature.  I make clear which statements would produce disagreement between experts, and which statements are my own ideas about R1a1a7.

            The most commonly cited population bottleneck is just the statistics of elimination contests.  An example is the card game “war” where a group of card players lose, one by one, as they lose all their cards, until there is only one winner with all the cards.  Similarly, a small human population, over a long enough time period, should lose male line families one by one, until there is only one MRCA.  P type does not qualify because it is too young and too large for such statistics.  I doubt even R1a qualifies, although a few experts may disagree on that.  This is a factor in the oldest Y-DNA haplogroups, a few experts might say the only factor, but I agree with most experts that there is more to the story:
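
            The elimination contest can be simulated in a few lines.  The sketch below is a toy model with assumptions that are not part of the discussion above:  a constant number of men per generation, each son picking his father at random from the previous generation, and no selection of any kind.  The population size and the random seed are arbitrary.

```python
import random

def generations_until_one_line(population_size, seed=0, max_generations=100_000):
    """Count generations until every man descends from a single founder."""
    rng = random.Random(seed)
    founders = list(range(population_size))  # label each starting male line
    generations = 0
    while len(set(founders)) > 1 and generations < max_generations:
        # Each man in the new generation inherits the line of a random father.
        founders = [rng.choice(founders) for _ in range(population_size)]
        generations += 1
    return generations, len(set(founders))

generations, lines_left = generations_until_one_line(20, seed=1)
```

            Rerunning with different seeds gives quite different generation counts for the same population size, which is a reminder of how much luck is involved.  Larger populations take much longer, which is why this mechanism alone is a poor fit for a young, large clade.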

            Darwinian selection speeds up the process for most genes.  As far as we know, Y-DNA does not confer selective advantage for individual men, because very few genes are there.  Y-DNA selection for Y bearing sperm is known for some insects, but not seen in mammals.  Many people think human Darwinian evolution was competition between small families, which would speed up the elimination of male family lines.  Many experts strongly disagree with this notion.

            Sexual selection also speeds up the male line elimination process.

            Maybe you would consider the last two paragraphs human selection, not Darwinian selection.

            Experts use the term “effective population” for any selection method.  In the case of Y-DNA the effective male population is smaller than the full population insofar as some men contribute more than their proportional share of genes to the next generation, and some men contribute none.  Most experts would agree that R1a1a7 is too young to have been determined by simple continuous selection.  Experts argue with vehemence if major haplogroups like R1a were produced by simple selection.  Many people have philosophical problems with the idea of selection within the past 10,000 years.  I don’t.  I agree with most experts that simple selection was a factor in the formation of deep haplogroups like R, but the next paragraphs introduce more important factors, particularly for younger haplogroups.

            The four horsemen accelerate the process.  War, famine, pestilence, and plague.  These are often cited as the most common population bottlenecks for human DNA before modern times, temporarily reducing population so statistics works faster to eliminate male line families.  I have never seen a DNA publication disagree with this.  I have not even seen a religious disagreement, although no doubt they are out there.  P type and N type may have been produced this way, but I’m not sure, because they seem large and young.  I suppose some experts might disagree with me and say strongly yes they were produced this way, and some might say no.  R1a1a7, whose age is the TMRCA of P and N types combined, is a better candidate for being produced by disasters, but again I’m not sure, and experts would disagree.

            Cultural selection is probably the dominant cause of human population bottlenecks in the past several millennia, if not longer.  Successful cultures grow and displace other cultures.  A minority of experts might disagree, or might quibble to define “bottleneck” to exclude cultural selection.  Most genetic genealogists strive to identify specific cultures with specific haplogroups, although that’s beyond the scope of this web page.  I don’t have an opinion yet for P type or N type or R1a1a7 as far as association with specific cultures.  Here, I’ll agree with most experts that cultural selection is a key factor.

            It’s not clear to me how to limit the meaning of cultural selection, so a few paragraphs are needed for explanation.

            Some may consider a warlike culture to be just more wars, but there is a distinction here.  Any war will reduce population and speed up the loss of DNA variation.  Too many wars within a culture might reduce that DNA and allow the growth of others.  These last two sentences belong up there in the four horsemen paragraph.  I’m talking here about a young, rapidly growing culture that expands by conquest much faster than it assimilates other people, thereby increasing its DNA representation.  I consider this one type of cultural selection but I discuss it separately because some people may prefer a separate category.  I have seen discussions by experts with opinions about this.  Some say warlike cultures are irrelevant, some say it’s the whole story for humans, and most are in between.  I go with the latter, proposing that P type, N type, and R1a1a7 probably grew mostly or at least in part due to the warlike culture of the ancestors, although I do not see evidence to justify a strong conviction.

            Sexual selection is generally considered a Darwinian reduction of effective population, mentioned above.  I suppose if most cultures in the past practiced the keeping of harems, or the fathering of peasant children by feudal lords, those practices could be considered Darwinian selection of dominant men, reducing effective population size.  I can understand where you may disagree, calling this human selection of tyrants.  Insofar as specific cultures in the past, particularly young fast growing cultures, may have practiced more than typical sexual selection, that could be considered a form of cultural selection, whereby that specific culture established a new common haplogroup, initially with reduced STR variation, and a MRCA from an ancestor of that culture.

            Migration is not usually considered cultural selection.  But migration can enhance the genetic isolation produced by cultural selection.  Consider the following scenario.  Suppose P type grew rapidly by some kind of cultural selection.  Within the P culture there would be very little STR variation.  In the surrounding region the population would have wider STR variation, not statistically distinguished from P.  It would take extreme growth for P to wipe out most of its neighbors who are distant relatives, in order to stand out as a “mountain in haplospace”.  However, suppose a large tribe from the P culture migrates as a group to a distant land, later to be called Poland, and suppose due to the trials and tribulations of history the entire parent region is wiped out, either by the four horsemen or by rapid growth of a neighboring culture.  P would look isolated, as it does today.  There is even a possibility of a remnant clade, similar to P type but with much wider STR variation, left over in a remote region of Eurasia, to be discovered as the P type homeland.  This sounds like an attractive hypothesis to me, because it easily explains the unusual isolation of P type.  This explains why P type is so different from its sister, N type, and so different from its parent, R1a as represented by K type.  I saved this for last.  I see no compelling reason to believe it strongly.  It is just as easy to believe that P type arose in the region that is now Poland and simply outcompeted or killed its neighbors and distant relatives.  If a P type pocket is found far away however, that would be evidence of cultural selection enhanced by migration.

 

DYS 447

 

            The marker DYS 447 is remarkably powerful as a predictor for a*, demonstrated in the Underhill analysis below.  The Underhill article does not mention this.  The a* samples are mostly 447=24.  The 7* samples are almost all 447=23.

            At 447>23, only 3 samples are 7*.  97.8% of the 447>23 samples are a*.  This by itself is an excellent predictor.

            Of course that 97.8% is too good to be true.  There must have been 7* men over the years with a 23 to 24 mutation at 447 who produced male descendant clades today common only in the home region of each such man.  There is no way to predict how many such populations there are in the world.  The Underhill data only caught three such samples, two in Germany and one in Slovakia, but maybe there is a larger population somewhere else, so that 97.8% might be somewhat misleading.

            That 97.8% also has two statistical objections:  First, the 70% confidence interval for the number 3 is 1.3 to 6.0.  The 95% interval is 0.8 to 7.8.  That 7.8 reduces 97.8% to <92% with 2.5% probability of being correct - just due to sampling uncertainty.

            Second, there is selection bias.  One of the 39 Underhill markers necessarily has to come out best if checking for prediction value, so it makes sense that the best marker may not really be as good as it seems.  As a reader, you need to consider that many people are analyzing STR data and only writing web reports for those markers that have good prediction value.  Lucky positive results are more likely to be published, while negative results are usually not published.  There is no way to calculate how much such systematic statistical “type 1” bias lowers our confidence interval for that 97.8% number for DYS 447.

            The same kind of statistical caveats apply to the rest of my analysis on this web page.  I’ll not repeat this.

            That said, the 447 marker seems good enough to me that I’ll predict that better than 80% of new R1a samples with 447>23 will be a*, testing negative for M458.  This is the reason for that 447>23 rule in my signature table at the top of this page.  However, we can do better prediction by considering more markers, which I do with those type tables, and which I’ll do more with future updates to this page.

            All but 4 of the 68 Underhill 7* samples have 447=23.  However, there are a* samples with 447=23, so this 23 value is not a good indicator by itself.

            In the Underhill analysis section below, I show that two more markers, 385a with 439, can resolve 447=23.  The result is in the table at the top of this web page.

 

Structure of DYS 447

 

            Usually, STRs with few repeats (<12, for example) mutate at a lower rate than STRs with many repeats (>20, for example).  DYS 447 is an exception because it is not a simple chain.

            Whittaker Figure 3 is a graph showing how mutation rate usually increases with STR length.  This makes sense.  A longer chain is more likely to slip and re-stick in a chain shifted configuration during the piecewise copying of the reverse strand during DNA replication.  This explains why 385a and 439 (next topic), with most common repeat numbers 10 and 11, are stable in R1a.

            The DYS 447 structure is mostly the “TAATA” motif repeated many times.  However, in the repeat chain at about 1/3 and 2/3 the length are two exception motifs “TAAAA”, which prevent the chain from matching and sticking incorrectly at those locations.  So the chain is less likely to slip and stick in those two places during replication.  In other words, 447 is not really a single chain of length 23 or 24.  It is sort of like a compound set of 3 smaller chains.
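
            The “compound set of 3 smaller chains” idea can be made concrete with a short sketch.  The exact layout used here is hypothetical:  it assumes an allele of 23 motifs in which the two exception motifs count toward the 23 and sit exactly at the one third and two thirds points.

```python
def uninterrupted_runs(motifs, repeat="TAATA"):
    """Lengths of consecutive runs of `repeat`, broken by any other motif."""
    runs = []
    current = 0
    for motif in motifs:
        if motif == repeat:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

# Hypothetical layout of a 447=23 allele, for illustration only.
allele = ["TAATA"] * 7 + ["TAAAA"] + ["TAATA"] * 7 + ["TAAAA"] + ["TAATA"] * 7

total_motifs = len(allele)
runs = uninterrupted_runs(allele)
```

            Under these assumptions the longest stretch that can slip as a unit is 7 repeats, not 23, which would put 447 among the short, slow markers.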

 

Kyrgyzstan;  Eurasia

 

            Underhill has a lot of data from Kyrgyzstan, which is dominated by the a* signature (385a,439,447) = (11,10,23).  This is a signature for what I call L.  L is a large category but not a type.  See the tables above.  At 67 markers, my signature for L has 5 more markers that are not available in the Underhill data:  (YCAIIb,406,594,481,565) = (21,11,11,23,12).  An obvious hypothesis is that Kyrgyzstan is dominated by L, with all 8 markers.  This makes sense, because L is large and looks old (lots of variation).  However, my L signature is based on Ysearch, where few men from Kyrgyzstan enter data.  It is possible Kyrgyzstan is dominated by a large clade that does not show up in Ysearch.  This is an example of a caveat that always applies to projections based on STR data analysis:  It is always possible there is a particular region (or any number of regions) dominated by an STR combination that does not show up in existing databases.
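
            The difference between matching 3 markers and matching all 8 can be sketched as follows.  The signature values are the ones given above;  the two samples are hypothetical, and requiring an exact match on every available marker is a simplifying assumption rather than a description of my actual method.

```python
L_SIGNATURE = {
    "385a": 11, "439": 10, "447": 23,           # available in the Underhill data
    "YCAIIb": 21, "406": 11, "594": 11,          # available only at 67 markers
    "481": 23, "565": 12,
}

def matches_signature(haplotype, signature):
    """Compare only the signature markers the sample actually has.

    Returns (all_compared_markers_match, number_of_markers_compared).
    """
    compared = [m for m in signature if m in haplotype]
    ok = all(haplotype[m] == signature[m] for m in compared)
    return ok, len(compared)

# Hypothetical samples: one with only the 3 shared markers, one with all 8.
sample_3 = {"385a": 11, "439": 10, "447": 23}
sample_8 = {"385a": 11, "439": 10, "447": 23,
            "YCAIIb": 21, "406": 11, "594": 11, "481": 23, "565": 13}

result_3 = matches_signature(sample_3, L_SIGNATURE)
result_8 = matches_signature(sample_8, L_SIGNATURE)
```

            The second sample matches on the 3 Underhill markers but fails at 565, so a 3 marker match is only a hint that a sample would fit the full signature.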

            Within L there is a specific M type, which is concentrated in northern Europe.  There is no reason to expect M type to be common in Kyrgyzstan.  If it is, then that would be strong evidence of a significant migration in the past.

            East Asia, including the Indian subcontinent, is known to have R1a that is not well represented at 67 markers on the web.  The same caveat applies.

 

DYS 385a and DYS 439

 

            I could not find a single marker to help resolve 447 = 23 in the Underhill data.  The pair 385a and 439 do the job.  The result is in the signature table, above.  The details are in the UnderhillAnalysisThreeBest analysis, below.

 

Hypothesis Regarding the 385a, 439, 447 Trio

 

            One simple hypothesis explains the simple table for distinguishing M458:  The L signature (385a,439,447) = (11,10,23) is ancestral for R1a1a.  Out of all the 447 mutations from 23 to 24 that happened over the millennia in R1a1a, only one survived to form a substantial clade, leading to the K (11,10,24) signature.  The M458 mutation is relatively old, leading to R1a1a7, and it produced no substantial 447=24 clade.  Out of all the 385a and 439 mutations that happened over the millennia in R1a1a7, only two combinations survived to form substantial clades, leading to the N and P signatures.  Other substantial undiscovered regional clades are possible.  There are surely many insubstantial clade exceptions (small male line descent clades with a mutation in these markers).

            Another hypothesis is reasonable, albeit a bit more complicated:  The 447=24 value might be ancestral.  Out of all the 447 mutations from 24 to 23 that happened over the millennia, three survived, one in R1a1a producing the L signature, and two producing the R1a1a7 N and P signatures.  The ancestral K category might seem younger than L because it went through a recent population bottleneck, and the ancestral (385a,439) values survived the bottleneck.  P and N differ at (385a,439) because they parted long ago, leaving enough time for mutations to be likely (but not necessary in these slow mutators).

            Many other hypotheses are possible.  These are only the simplest two I thought of.

            Whatever.  All we know for sure today is that 447>23 is a good indicator for distinguishing much of a* from 7*, and the trio do the job nicely for the 447=23 value.

 

Underhill Modal Haplotype

ZDUK8.  Ysearch link.  Posted by Lawrence Mayka on 5 Nov 2009.  This is the modal fit of the M458+ data by Underhill.

39 markers.  DYS461 Underhill value 10 has been changed to 11 to be compatible with the FTDNA equivalent.

 

Analysis of Underhill et al. Data;  Three Best Markers

 

            290 samples in the 39 STR markers table S7.

            DYS 447 is the best marker;  272 samples have it.  123 have 447=24;  all but 3 are M458-, which is 98%, so this marker alone is an excellent signature for R1a1a*.  I have a discussion about why 447 is so stable.  However, 447=23 is a mix of M458+ and M458-, so 447 alone cannot split R1a.
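
            The 98% figure follows directly from the counts just given.  A short check, using only those numbers (the variable names are mine):

```python
samples_with_447_equal_24 = 123
m458_positive_exceptions = 3

m458_negative = samples_with_447_equal_24 - m458_positive_exceptions
share_m458_negative = m458_negative / samples_with_447_equal_24
percent = round(share_m458_negative * 100)
```

            That is 120 of 123 samples, which rounds to 98%.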

            I could not find a pair of markers that correlate well enough to split R1a.

            I found a great trio, (385a,439,447), that splits R1a into M458+ and M458- with better than 80% confidence.  See Signatures for a summary.  Previous versions of this document had here a complete listing of all the 3 marker haplotypes from Underhill, with data counts.  That data has been moved to M458Signatures.xls.

            Reminder:  The 3 signature M458 predictions are a nice demonstration of the unusual power of only 3 markers.  However, 67 markers is much better for distinguishing M458, and of course the SNP is the gold standard.

            DYS 19 is an excellent marker for P type;  DYS 389-2 (actually 389-2 minus 389-1) is an excellent marker for N type.  P plus N are equivalent to M458+.  So M458 results can be predicted with better than 90% confidence using 5 markers.  See my list of More Signatures.  Those 5 are available in the Underhill data, and in Ysearch data using the standard 25 FTDNA marker set.  I did not make the effort to document this, providing instead 3 markers as a demonstration and 67 markers as my best method.  A marker with the Underhill 39 or the FTDNA 25 can be compared to my Ysearch codes because the Ysearch method will restrict analysis to the available markers.

 

Ysearch R1a 67 Marker Download

 

            Direct link:  www.ysearch.org.  Description of Ysearch.  Ysearch is not up to date for subhaplogroups within R1a.  I adjust Ysearch data, removing modal haplotypes and family sets, as explained in my published method.  Downloading and editing the data for this table is very tedious, so I do not refresh often.

This analysis is from my 17 July 2009 download.  All 67 marker data within 42 steps of K type, using a 67 marker modal:

            Archive copy:  www.jogg.info/52/files/PolishClades/KTypeYsearch.xls

                        It has a “Download” sheet with detailed explanation of the download method

            741 samples

            39 removed;  modals removed;  also family sets - all but one removed from each family set, to remove bias

            332 removed not R1a;  mostly Unknown and R1b;  a few R1

            370 net samples                        includes only one R1a1c* (should be updated to R1b1b2), at step 42

                                                            includes one each R1a1 (tested) most distant at steps 34, 30;  two at step 29

                                                            365 samples K step <29

            This file is useful for evaluating hypothetical types.  It is easy enough to update the evaluation on Ysearch, particularly for those types where I have entered the data into Ysearch codes.
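
            The sample counts in the list above can be checked with a few lines of Python.  This is only a bookkeeping sketch;  the variable names are invented here, and it reads the list as five distant samples in total (one at step 42, one each at steps 34 and 30, and two at step 29).

```python
# Counts quoted in the list above (17 July 2009 download).
downloaded = 741
removed_modals_and_family_sets = 39
removed_not_r1a = 332

net_samples = downloaded - removed_modals_and_family_sets - removed_not_r1a

# Samples at K step 29 or more: one at step 42, one each at 34 and 30, two at 29.
distant_samples = 1 + 2 + 2
close_samples = net_samples - distant_samples

print(net_samples, close_samples)
```

            The two results agree with the 370 net samples and the 365 samples at K step <29 quoted in the list.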

            Previous versions of this document listed here all the haplotypes using the 3 best markers, with the number of samples for each.  That data has been moved to M458Signatures.xls.

 

Discussion

 

            This topic is a collection of old discussions, which will be moved to new topics as I continue to write more.

            “A” type is small (few samples) but is the best isolated type I have found so far, using all 67 standard FTDNA markers.  I have close to 100% confidence that the A cluster corresponds to a valid clade.  Some samples have 90% confidence as A type but only 80% confidence as K type, because the confidence in A as a type (low SBP) is higher than the confidence that A is truly a subtype of the current K definition.  K type is large (many samples) but not well isolated (higher SBP).  The K modal haplotype is close to the modal haplotype for all of R1a, so it is not surprising that there seem to be other small types with nodes not far from the K node.  B type seems to be like that:  not very certain as a subtype of K.  “A” type is very young, with a definition much younger than the K definition, but it is difficult to be sure that A is a son of K as opposed to a nephew, that is, the young tip of another branch in the tree whose node is slightly older than the K definition.  As more data becomes available, with better statistics, I will likely come up with a better definition for K type, allowing me to identify more small types related to K.  I do not have high confidence that A, B, and I are each truly subtypes of K, but any of them that are not sons of K would surely be nephews of K.

            My cutoff confidence method is not documented in my publication.  I explain this at my PolishClades site.  My intent is to move that explanation here and to expand it.  Briefly, these confidence estimates are partly calculated and partly estimated - conservatively - using my experience.  The basis:  samples at zero step are the most likely to belong to a hypothetical clade, samples just one step short of the definition cutoff are less likely to belong, samples in between have intermediate confidence, and even samples at or beyond the definition cutoff keep a low, but not zero, confidence of belonging as outliers (type members with high step count due to the luck of mutations).

            It is possible that a sample may show up that fits two different types.  Obviously, a sample should be evaluated in the type it fits best.
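
            One way to picture a sample that fits two types is the Python sketch below.  It makes several assumptions that are not taken from my published method:  the step count is a plain sum of absolute differences over the markers present in both haplotypes, a sample fits a type when its step count is below that type's cutoff, and "fits best" means the smallest step count relative to the cutoff.  The type names X and Y, the marker values, and the cutoffs are hypothetical.

```python
def step_count(sample, modal):
    """Sum of absolute repeat-count differences over markers present in both."""
    shared = sample.keys() & modal.keys()
    return sum(abs(sample[m] - modal[m]) for m in shared)


def best_type(sample, types):
    """types maps name -> (modal haplotype, cutoff).

    Returns (name, steps) for the best-fitting type, or None if the
    sample falls at or beyond every cutoff.
    """
    fits = []
    for name, (modal, cutoff) in types.items():
        steps = step_count(sample, modal)
        if steps < cutoff:
            fits.append((steps / cutoff, name, steps))
    if not fits:
        return None
    _, name, steps = min(fits)
    return name, steps


# Hypothetical types and values, for illustration only.
types = {
    "X": ({"DYS19": 16, "DYS385a": 11, "DYS439": 10, "DYS447": 24}, 4),
    "Y": ({"DYS19": 17, "DYS385a": 10, "DYS439": 11, "DYS447": 23}, 5),
}
sample = {"DYS19": 16, "DYS385a": 11, "DYS439": 11, "DYS447": 24}

print(step_count(sample, types["X"][0]), step_count(sample, types["Y"][0]))
print(best_type(sample, types))
```

            Here the sample is inside both cutoffs (1 step from X, 3 steps from Y), and it is assigned to X because it sits closer to the X modal relative to the X cutoff.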

            Most M458- samples do not match any of my types.  Most of my experience is with Polish data, so I may find more types as I continue to analyze the data.  If you are reading this and know of a well isolated type that you would like added, please send me an email.  Since M458- is the parent paragroup R1a1a* with most of R1a, I expect that much of M458- will not resolve into types, for the reason mentioned at the end of my mountain summary.

            M type was brought to my attention by Larry Mayka, who informs me others have called this haplotype the Viking haplotype because of its concentration in northwest Europe.  I intend to add links to other web pages in cases like this.

            At the first 12 standard markers K is the most common R1a haplotype on Ysearch, and M is second.  However, many samples that match K or M at 12 markers do not resolve into the corresponding type.  Many samples come close to K type, falling at or just beyond the K type cutoff.  I call those K Borderline.  Many samples come close to M type, falling at or just beyond the M type cutoff.  I call those Mb but the Mb SBP is not low enough to be considered a valid type.  L type is third on Ysearch at 12 markers.

            When I say confidence, I include statistical sampling uncertainty.  By 90% confidence I mean the worst case (lower end) of the 90% confidence interval, and by 80% I mean the worst case of the 80% confidence interval.
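
            To illustrate the difference between a raw percentage and the worst case of a confidence interval, the Python sketch below applies a Wilson score interval to the DYS 447 counts quoted earlier (120 of 123 samples with 447=24 are M458-).  The choice of the Wilson interval, and of a two-sided interval, is an assumption made for this example;  it is not necessarily how the confidence values in this document were calculated.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(successes, n, confidence):
    """Lower end of the two-sided Wilson score interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


raw = 120 / 123
low90 = wilson_lower_bound(120, 123, 0.90)
low80 = wilson_lower_bound(120, 123, 0.80)

print(f"raw {raw:.3f}, 90% worst case {low90:.3f}, 80% worst case {low80:.3f}")
```

            The worst case values come out a few points below the raw 98%, and the 90% worst case is lower than the 80% worst case, because a wider interval reaches further down.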

 

Peter Gwozdz

Peter Gwozdz

pete2g2@comcast.net

            There is a Peter Gwozdz topic in my other web document, including a story about how I became interested in R1a.

 

Revision History

2009 Nov 14 first web posting

2009 Nov 18 update with M458 results and slight change of signature list

2009 Nov 19 split the table into signatures and two type tables;  major rewrite of the documentation

2009 Nov 20 update the M458 results with tables

2009 Nov 21 update M458 results, add topic about 447 being compound

2009 Nov 21 my method & results published in JoGG, Fall issue, today

2009 Nov 22 new K, L, M table;  drop G type, rename S as M type, edit L type, more 447 discussion added

2009 Nov 24 fix 429 type;  clarify L & M codes with another table;  significant rewrite of the documentation

2009 Dec 11 rewrite of the introductory sections at the top of this web page - not finished

2009 Dec 12 continue rewrite

2009 Dec 13 continue rewrite

2009 Dec 14 update M458 results

2009 Dec 17 finished rewrite for now

2009 Dec 24 rewrite Abstract, links to Introduction and Mountain Method in the companion document

2010 Jan 1 extensive rewrite including M458 results update and M458Signatures.xls

2010 Jan 4 redo of links for References and Sources

2010 Jan 5 M458 News topic

2010 Jan 10 links to PolishClades.html for discussions of the types

2010 Apr 26 L260 for P type

2010 May 27 move L260 & M458 results to the Polish Clades page

2010 Jun 1 delete M458 results here, link added to Polish Clades page

2010 Jul 22 comment at the top that this page is becoming out of date and will be discontinued