WGS vs. Genotyping Arrays: What Actually Matters for Ancestry Analysis

Most people's first encounter with personal genomics is a spit kit from 23andMe or AncestryDNA. You mail a tube, wait a few weeks, and get back a pie chart telling you you're 43% British and 12% "Broadly European." Behind that pie chart sits a genotyping array, a chip that reads somewhere between 600,000 and 900,000 pre-selected SNP positions out of the roughly 3 billion base pairs in your genome. That's roughly 0.02 to 0.03% of your DNA.

Whole Genome Sequencing (WGS) reads essentially everything: every callable position in the genome rather than a pre-selected subset. The two technologies have different strengths, different failure modes, and different cost profiles. Which one makes sense depends on what you're trying to do.

How genotyping arrays work

A SNP array is a glass slide dotted with hundreds of thousands of short DNA probes, each designed to hybridize to a known variant position. The Illumina Global Screening Array (GSA), the chip behind most DTC kits today, targets around 650,000 SNPs. The older Illumina OmniExpress hits about 700,000. 23andMe and AncestryDNA use custom versions of these chips with their own curated SNP lists, and those lists don't overlap nearly as much as you'd think. A 2018 analysis found that each company picks quite different SNP sets, with 23andMe skewing heavily toward low-frequency, high-population-differentiation variants (79% of their SNPs fell in this category) while AncestryDNA and MyHeritage took more conservative approaches.

The SNPs on these arrays were chosen in advance, mostly discovered by sequencing European-ancestry individuals. This introduces what's called ascertainment bias, a systematic skew worth understanding before comparing the two approaches.

Ascertainment bias: the array's fundamental tradeoff

When you pre-select SNPs based on what's polymorphic in Europeans, you get a skewed view of human genetic diversity. The allele frequency spectrum shifts toward intermediate frequencies. Rare variants, which are disproportionately population-specific and therefore the most informative for fine-scale ancestry, are systematically missed. Fst estimates get compressed. Tajima's D gets inflated. PCA plots get warped.
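The shift toward intermediate frequencies can be illustrated with a toy simulation. This is a minimal sketch, not a population-genetic model: the log-uniform frequency spectrum, the four-chromosome discovery panel, and the function name simulate_ascertainment are all assumptions made for illustration.

```python
import random

def simulate_ascertainment(n_sites=20000, panel_size=4, seed=0):
    """Compare mean minor allele frequency before and after ascertainment.

    Assumptions (illustrative only): true frequencies follow a density
    proportional to 1/x on [0.005, 0.5], and a site makes it onto the
    "chip" only if it is polymorphic in a small discovery panel.
    """
    rng = random.Random(seed)
    low, high = 0.005, 0.5
    all_freqs, kept_freqs = [], []
    for _ in range(n_sites):
        # Inverse-transform sampling of a log-uniform frequency.
        freq = low * (high / low) ** rng.random()
        all_freqs.append(freq)
        # Draw panel_size chromosomes from the discovery population.
        copies = sum(rng.random() < freq for _ in range(panel_size))
        # Keep the site only if both alleles were observed in the panel.
        if 0 < copies < panel_size:
            kept_freqs.append(freq)
    mean_all = sum(all_freqs) / len(all_freqs)
    mean_kept = sum(kept_freqs) / len(kept_freqs)
    return mean_all, mean_kept, len(kept_freqs)

mean_all, mean_kept, n_kept = simulate_ascertainment()
print(f"mean frequency, all sites:  {mean_all:.3f}")
print(f"mean frequency, ascertained: {mean_kept:.3f} ({n_kept} sites kept)")
```

Rare variants are unlikely to appear in a small discovery panel, so the ascertained set is enriched for common alleles. Real arrays use larger and more complicated discovery schemes, but the direction of the skew is the same.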

Lachance & Tishkoff (2013) compared high-coverage WGS of African hunter-gatherer populations with genotyping array data and showed that arrays produce a measurably different picture of demographic history and natural selection. SNPs on arrays tend to be older and shared across populations, exactly the ones that carry the least information about recent population-specific history.

The Human Origins array used in academic archaeogenetics was specifically designed to mitigate this by using SNPs ascertained in single individuals from diverse populations. It performs substantially better than commercial arrays for population genetics work. But it still covers only ~600,000 positions, and any fixed panel will miss variation relevant to populations not represented in the design.

That said, ascertainment bias doesn't affect all analyses equally. For many standard applications (GWAS, common-variant PRS, basic ancestry estimation), array data performs well because the methods were built around the properties of array data in the first place. The bias matters most when you're working across divergent populations or need the site frequency spectrum to be accurate.

What WGS brings to the table

A 30x WGS BAM file contains reads covering essentially every callable position in the genome. When you variant-call against GRCh37 or GRCh38 and then intersect with a reference SNP set like the AADR 1240k panel (~1.23 million SNPs), a well-processed WGS sample routinely achieves 99%+ overlap. With careful force-calling at known positions using tools like bcftools mpileup, you can push that to ~99.9% overlap with the AADR target positions. That's achievable in practice with standard 30x Illumina short-read data, proper alignment (BWA-MEM2), and appropriate variant calling pipelines.
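One way force-calling at known positions might be set up is sketched below. It only builds the command lines and does not run them. The file names are hypothetical, the targets file is assumed to be a bgzipped, tabix-indexed table of CHROM, POS and REF,ALT, and the flags follow the bcftools documentation as I understand it, so check them against your installed version.

```python
import shlex

def force_call_commands(bam, reference_fasta, targets, out_vcf):
    """Build bcftools commands that genotype only the listed target sites.

    All paths are placeholders supplied by the caller. `targets` is
    assumed to be a bgzipped, tabix-indexed TSV: CHROM, POS, REF,ALT.
    """
    mpileup = [
        "bcftools", "mpileup",
        "-f", reference_fasta,   # reference the BAM was aligned to
        "-T", targets,           # restrict pileup to the panel positions
        "-a", "AD,DP",           # keep allele depths for later QC
        "-Ou",                   # uncompressed BCF for piping
        bam,
    ]
    call = [
        "bcftools", "call",
        "-m",                    # multiallelic caller
        "-C", "alleles",         # genotype the alleles given in the targets file
        "-T", targets,
        "-Oz", "-o", out_vcf,    # no -v, so hom-ref sites are written too
    ]
    return mpileup, call

def as_shell_pipeline(mpileup, call):
    """Render the two commands as a single quoted shell pipeline."""
    return " | ".join(shlex.join(cmd) for cmd in (mpileup, call))

mpileup_cmd, call_cmd = force_call_commands(
    "sample.bam", "GRCh37.fa", "aadr_1240k.tsv.gz", "sample.1240k.vcf.gz"
)
print(as_shell_pipeline(mpileup_cmd, call_cmd))
```

Leaving out the variants-only flag matters here: ancestry tools need a genotype at every panel position, including the ones where the sample matches the reference.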

Compare this to a DTC array, which gives you maybe 600k-900k positions total, of which perhaps 200k-300k overlap with the ~600k Human Origins SNPs or ~1.2M 1240k SNPs used in the AADR.
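Taking the figures above at face value, 200k-300k shared sites out of ~1.23 million is roughly 16-24% of the 1240k panel. Overlap itself is just a set intersection on (chromosome, position) pairs, as in this small sketch. The function name and the toy coordinates are made up for illustration, and both site lists are assumed to use the same reference build.

```python
def panel_overlap(sample_sites, panel_sites):
    """Return (shared_count, fraction_of_panel_covered).

    Sites are (chromosome, position) tuples; both inputs are assumed
    to be on the same reference build.
    """
    sample = set(sample_sites)
    panel = set(panel_sites)
    shared = sample & panel
    fraction = len(shared) / len(panel) if panel else 0.0
    return len(shared), fraction

# Toy data only: four panel sites, three genotyped sites, two in common.
panel = {("1", 1000), ("1", 2000), ("2", 500), ("2", 900)}
array_sites = {("1", 1000), ("2", 900), ("3", 42)}

shared, fraction = panel_overlap(array_sites, panel)
print(f"{shared} shared sites, {fraction:.0%} of the panel")
```

In practice the site lists would come from the array raw file and the panel's SNP file, and a build mismatch (GRCh37 vs. GRCh38 coordinates) would silently produce a near-zero overlap.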

Beyond overlap counts, WGS has several distinct properties:

No ascertainment bias. Every variant is discovered from the sample itself, not pre-selected from a discovery panel. The allele frequency spectrum reflects reality rather than the design choices of the array manufacturer.

Structural variant and indel access. Arrays are SNP-only by design. Insertions, deletions, copy number variants, and inversions are invisible to them. These matter for functional genomics and increasingly for population structure work.

Imputation-free genotypes. Arrays rely on statistical imputation to fill in gaps between genotyped positions, using phased reference panels like 1000 Genomes or TOPMed. Imputation accuracy drops for rare variants and for populations poorly represented in the reference panel. WGS reads the actual bases.

Better IBD detection. Identity-by-descent segment calling from phased WGS data outperforms array-based IBD. The ancIBD tool, designed for ancient DNA, shows that WGS data can be imputed at roughly 3x lower coverage than 1240k capture data for equivalent IBD calling performance. More markers mean better phasing, which in turn means more accurate IBD segments.

Higher-resolution ROH. Runs-of-homozygosity (ROH) analysis, used to estimate parental relatedness, is more accurate with dense, unbiased marker coverage. You can detect shorter ROH segments and distinguish recent from ancient consanguinity more reliably.

Future-proofing. A BAM file is a permanent asset. As new reference panels, SNP sets, and analytical methods emerge, you re-extract from the same BAM. An array raw file gives you exactly the positions that were on the chip at the time, nothing more.
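To make the ROH point above concrete, here is a deliberately simplified scanner for a single chromosome. It is a sketch under stated assumptions, not how production tools work: the function name find_roh and its thresholds are hypothetical, genotypes are coded as alternate-allele counts (0, 1, 2, or None for missing), and a single heterozygous call ends a run, whereas real ROH callers tolerate some errors.

```python
def find_roh(positions, genotypes, min_snps=50, min_length_bp=1_000_000):
    """Return (start, end, n_snps) for runs of homozygous calls.

    positions: sorted base-pair coordinates on one chromosome.
    genotypes: alternate-allele counts (0, 1, 2) or None for missing.
    Thresholds are illustrative defaults, not recommendations.
    """
    segments = []
    start = last = None
    count = 0

    def close_run():
        # Record the current run if it is long and dense enough.
        if start is not None and count >= min_snps and last - start >= min_length_bp:
            segments.append((start, last, count))

    for pos, gt in zip(positions, genotypes):
        if gt is None:
            continue  # a missing call neither extends nor breaks a run
        if gt == 1:
            close_run()  # a heterozygous site ends the run
            start = last = None
            count = 0
        else:
            if start is None:
                start = pos
            last = pos
            count += 1
    close_run()
    return segments

# Toy input: one marker every 10 kb, with a single heterozygous site.
toy_positions = list(range(0, 3_000_000, 10_000))
toy_genotypes = [0] * len(toy_positions)
toy_genotypes[150] = 1
print(find_roh(toy_positions, toy_genotypes))
```

Marker density sets the floor on detectable segment length: with sparse array markers, the min_snps requirement rules out short segments that dense WGS calls could still support.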

Where arrays hold their ground

Arrays have real advantages that aren't going away overnight.

Operational simplicity. You get a genotype file directly from the scanner: no alignment, no variant calling, no QC filtering. For a company processing millions of kits per year, this matters. The bioinformatics overhead of WGS is nontrivial, requiring compute infrastructure, storage, and expertise that arrays simply don't demand.

Reproducibility. Two samples run on the same chip version produce directly comparable genotype calls with minimal batch effects. WGS variant calls can vary depending on the aligner, caller, and reference build. Standardized pipelines (GATK Best Practices, DRAGEN) have reduced this, but haven't eliminated it entirely.

Cost at scale. A DTC genotyping test costs $99-199. A consumer 30x WGS runs $299-399 through providers like Nebula Genomics, SelfDecode, or Dante Labs. Lab-side sequencing costs are dropping fast (Complete Genomics' DNBSEQ-T20x2 does 30x genomes for under $100 in production, and Ultima Genomics announced an $80 genome in 2025), but the consumer price gap hasn't fully closed yet. For population-scale studies genotyping thousands of samples, arrays are still meaningfully cheaper per sample once you factor in compute and storage.

Legacy compatibility. Decades of GWAS, PRS, and ancestry reference databases were built on array genotypes. The entire analytical ecosystem around consumer ancestry (reference panels, ethnicity estimation algorithms, relative matching databases) was designed for array-density data. WGS can be subset to match, but the infrastructure inertia is real.

Proven track record for common-variant work. For GWAS of common diseases, polygenic risk scoring, and basic population stratification, arrays combined with imputation have been the standard for over a decade and continue to produce reliable results. The marginal gain from WGS in these applications is often modest.

What this means for ancestry analysis

For running ADMIXTURE, PCA, f-statistics, or qpAdm against the AADR, what you need is dense, high-quality genotype calls at the 1240k or Human Origins SNP positions. WGS gives you this with higher completeness, higher per-site accuracy (no hybridization artifacts, no probe-adjacent interference), and no ascertainment bias. If your goal is to do serious archaeogenetic or population-genetic analysis, WGS is the stronger starting point.

For consumer ancestry reports (the pie charts and ethnicity estimates), the gap is narrower. Those algorithms were designed around array data and work well with it. WGS still enables finer-resolution analysis: better IBD matching for relative finding, better phasing for chromosome painting, and the ability to call Y-chromosome and mtDNA haplogroups from the same dataset with full phylogenetic resolution instead of relying on a handful of diagnostic SNPs. But for a casual user who just wants a population breakdown, the array result is often good enough.

The right choice depends on what you want to do with your data. If you want a quick ancestry estimate and maybe some health reports, a $99 array test does the job. If you want research-grade genotypes compatible with the AADR, accurate IBD segments, full haplogroup resolution, and a dataset that won't become obsolete, WGS is worth the premium. The cost difference is shrinking, and the data quality difference is not.
