What is the calculation for determining “Ancestry Composition”, down to single percentage points, by the commercialized DNA kits?

Probably “you have X out of Y = Z% polymorphisms associated with geo/ethnic group A”. I’m sure they’ve got their actual methodology detailed somewhere though.


The general idea is that there are places in your DNA which hold “Single Nucleotide Polymorphisms”, or SNPs. An SNP is a place where a single “character” in your DNA might differ from the same character in someone else’s. Because people from the same population tend to have similar DNA, there are SNPs, or combinations of SNPs, that are much more common in one population than in another. No single SNP separates every African person from every Chinese person, but many SNPs taken together can tell such populations apart quite reliably.
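To make the “combinations of SNPs” idea concrete, here is a minimal sketch in Python. The population names and allele frequencies are invented for illustration (they are not real data, and this is not any company’s actual method); the point is only that each SNP nudges the odds a little, and the nudges add up.

```python
import math

# Hypothetical allele frequencies: the chance of seeing the "alternate"
# letter at each of three SNPs in two made-up reference populations.
ALLELE_FREQ = {
    "pop_A": [0.90, 0.20, 0.70],
    "pop_B": [0.10, 0.60, 0.65],
}

def log_likelihood(alleles, freqs):
    """Log-probability of `alleles` (1 = alternate letter, 0 = reference letter)."""
    return sum(math.log(f if a == 1 else 1.0 - f) for a, f in zip(alleles, freqs))

def best_population(alleles):
    """Return the best-matching population and the score for every population."""
    scores = {pop: log_likelihood(alleles, f) for pop, f in ALLELE_FREQ.items()}
    return max(scores, key=scores.get), scores

observed = [1, 0, 1]  # one hypothetical person's letters at the three SNPs
label, scores = best_population(observed)
print(label, scores)
```

Note that the third SNP (0.70 vs 0.65) barely helps on its own, which is why real tests look at hundreds of thousands of them.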

Making this estimate more accurate (down to single percentage points) is something we do with complex statistics and machine learning. Commercialised DNA tests often vary in their exact methodology, but they are broadly similar to each other.

For [Ancestry.com](https://Ancestry.com), which is a very popular service using a common model, what they do is this:

1) Find more than 700,000 spots in your DNA that might have SNPs

2) Grab those same 700,000 spots from a “reference panel”, which is a large database of people whose ethnicity [Ancestry.com](https://Ancestry.com) already knows.

3) Use something called a “Hidden Markov Model” to estimate how well your SNPs match the reference panel. The short story is that the model takes the first chunk of your DNA and makes two guesses: first, the chance that this chunk comes from each of the 43 areas, for each of your parents; second, how likely it is that the guess will change in the next chunk. It then takes the next chunk and uses the earlier guesses to inform the new ones. If it looks like both your parents are from Nigeria in one chunk, it will take a lot of evidence to convince the model that one of them is from Sweden in the next chunk.

4) We now have a list of chunks and, for each chunk, a guess about which area each of your parents is from. Every chunk contributes two guesses, one per parent. If you have four chunks (the real number is much larger) and the guesses for those chunks are Sweden & Sweden, Sweden & Japan, Sweden & Japan, and France & Germany respectively, then you have eight guesses in total and the final outcome is 4/8 Swedish, 2/8 Japanese, 1/8 French, and 1/8 German. Extend this to 43 ethnicities and over a thousand chunks and you can see how estimates down to single percentage points can be made.
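Steps 3 and 4 can be sketched roughly as follows. This is a simplified illustration, not the model from the white paper: `viterbi` tracks a single region per chunk rather than a pair of parents, the per-chunk scores are invented, and `stay_prob` is an assumed “how unlikely is it that my guess changes” setting. `tally` reproduces the counting from step 4.

```python
import math
from collections import Counter
from fractions import Fraction

def viterbi(chunk_scores, stay_prob=0.95):
    """Most likely region for each chunk (needs at least two regions).

    chunk_scores: one dict per chunk, mapping region -> log-likelihood of
                  that chunk's SNPs under that region (hypothetical inputs).
    stay_prob:    assumed chance that the region does NOT change between
                  neighbouring chunks; must be strictly between 0 and 1.
    """
    regions = list(chunk_scores[0])
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (len(regions) - 1))

    def move(prev, cur):
        return log_stay if prev == cur else log_switch

    # Start with a uniform prior over regions for the first chunk.
    best = {r: -math.log(len(regions)) + chunk_scores[0][r] for r in regions}
    back = []
    for scores in chunk_scores[1:]:
        step, pointers = {}, {}
        for r in regions:
            prev = max(regions, key=lambda p: best[p] + move(p, r))
            step[r] = best[prev] + move(prev, r) + scores[r]
            pointers[r] = prev
        best = step
        back.append(pointers)

    # Walk the back-pointers from the best final region to the first chunk.
    path = [max(regions, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def tally(chunk_pairs):
    """Turn per-chunk (parent 1, parent 2) guesses into ancestry fractions."""
    counts = Counter(region for pair in chunk_pairs for region in pair)
    total = sum(counts.values())
    return {region: Fraction(n, total) for region, n in counts.items()}

# Made-up scores: the middle chunk looks slightly more Swedish than Nigerian.
scores = [
    {"Nigeria": -1.0, "Sweden": -10.0},
    {"Nigeria": -5.0, "Sweden": -4.0},
    {"Nigeria": -1.0, "Sweden": -10.0},
]
sticky = viterbi(scores, stay_prob=0.95)  # weak evidence is overruled
loose = viterbi(scores, stay_prob=0.5)    # weak evidence is enough to switch

pairs = [("Sweden", "Sweden"), ("Sweden", "Japan"),
         ("Sweden", "Japan"), ("France", "Germany")]
result = tally(pairs)
```

With a sticky `stay_prob`, the slightly Swedish-looking middle chunk is still called Nigerian, which is the “it will take a lot of evidence” behaviour described in step 3. The `tally` call gives the 4/8, 2/8, 1/8, 1/8 split from step 4 (as reduced fractions).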

[The white paper describing this can be found here](https://www.ancestrycdn.com/dna/static/images/ethnicity/help/WhitePaper_Final_091118dbs.pdf) if you’re interested. There’s some fun stuff in there. For example, people from Turkey have very low “precision”, which tells us that [Ancestry.com](https://Ancestry.com) tells a lot of people who *aren’t* actually Turkish that they are. They think this might be because people from the places around Turkey have very Turkish-looking DNA. You can imagine that a lot of people have crossed borders and had families with others around there. Japan, on the other hand, appears to be very isolated in its DNA, and [Ancestry.com](https://Ancestry.com) is nearly perfect at telling you whether you are or are not Japanese.
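For anyone unsure what “precision” means here: it is the share of people the test labels as X who really are X. A quick sketch with invented counts (these are not numbers from the white paper):

```python
def precision(true_positives, false_positives):
    """Fraction of people labelled as a group who actually belong to it."""
    return true_positives / (true_positives + false_positives)

# Hypothetical counts, purely for illustration.
low = precision(true_positives=60, false_positives=40)   # many wrong labels
high = precision(true_positives=99, false_positives=1)   # almost none wrong
print(low, high)
```

A low value means many people are told they belong to a group when they don’t, which is the pattern described for Turkey; a value near 1 is the pattern described for Japan.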