The spatiotemporal patterns of major human admixture events during the European Holocene

  1. Department of Molecular and Cell Biology, University of California, Berkeley
    BerkeleyUnited States
  2. Broad Institute of Harvard and MIT
    CambridgeUnited States
  3. Human Evolutionary Biology, Harvard University
    BostonUnited States
  4. Center for Computational Biology, University of California, Berkeley
    BerkeleyUnited States

Abstract

Recent studies have shown that admixture has been pervasive throughout human history. While several methods exist for dating admixture in contemporary populations, they are not suitable for sparse, low coverage ancient genomic data. Thus, we developed DATES (Distribution of Ancestry Tracts of Evolutionary Signals) that leverages ancestry covariance patterns across the genome of a single individual to infer the timing of admixture. DATES provides reliable estimates under various demographic scenarios and outperforms available methods for ancient DNA applications. Using DATES on~1100 ancient genomes from sixteen regions in Europe and west Asia, we reconstruct the chronology of the formation of the ancestral populations and the fine-scale details of the spread of Neolithic farming and Steppe pastoralist-related ancestry across Europe. By studying the genetic formation of Anatolian farmers, we infer that gene flow related to Iranian Neolithic farmers occurred before 9600 BCE, predating the advent of agriculture in Anatolia. Contrary to the archaeological evidence, we estimate that early Steppe pastoralist groups (Yamnaya and Afanasievo) were genetically formed more than a millennium before the start of Steppe pastoralism. Our analyses provide new insights on the origins and spread of farming and Indo-European languages, highlighting the power of genomic dating methods to elucidate the legacy of human migrations.

Introduction

Recent studies have shown that population mixture (or ‘admixture’) is pervasive throughout human history, including mixture between the ancestors of modern humans and archaic hominins (i.e., Neanderthals and Denisovans), as well as in the history of many contemporary human groups such as African Americans, South Asians, and Europeans bib50. Understanding the timing and signatures of admixture offers insights into the historical context in which the mixture occurred and enables the characterization of the evolutionary and functional impact of the gene flow. Many admixed groups are formed due to population movements involving ancient migrations that predate historical records. The recent availability of genomic data for a large number of present-day and ancient genomes provides an unprecedented opportunity to reconstruct population events using genetic data, providing evidence complementary to linguistics and archaeology.

To characterize patterns of admixture, genetic methods use the insight that the genome of an admixed individual is a mosaic of chromosomal segments inherited from distinct ancestral populations bib11. Due to recombination, these ancestral segments get shuffled in each generation and become smaller and smaller over time. The length of the segments is inversely proportional to the time elapsed since the mixture bib11bib40. Several genetic approaches – ROLLOFF bib40bib49, ALDER bib37, Globetrotter bib27, and Tracts bib19 – have been developed that use this insight by characterizing patterns of admixture linkage disequilibrium (LD) or haplotype lengths across the genome to infer the timing of mixture. Haplotype-based methods perform chromosome painting or local ancestry inference at each locus in the genome and characterize the distribution of ancestry tract lengths to estimate the time of mixture bib19bib27. This requires accurate phasing and inference of local ancestry, which is often difficult when the admixture events are old (as ancestry blocks become smaller over time) or when reference data from ancestral populations is unavailable. Admixture LD-based methods, on the other hand, measure the extent of the allelic correlation across markers to infer the time of admixture bib37bib40. They do not require phased data from the target or reference populations and work reliably for dating older admixture events (>100 generations). However, they tend to be less efficient in characterizing admixture events between closely related ancestral groups.

While highly accurate for dating admixture events using data from present-day samples, current methods do not work reliably for dating admixture events using ancient genomes. Ancient DNA samples often have high rates of DNA degradation, contamination (from human and other sources), and low sequencing depth, leading to a large proportion of missing variants and uneven coverage across the genome bib46. Additionally, most studies generate pseudo-haploid genotype calls – consisting of a haploid genotype determined by randomly selecting one allele at the variant site – that can lead to some issues in the inference. In such sparse datasets, estimating admixture LD can be noisy and biased (see Simulations below). Moreover, haplotype-based methods require phased data from both admixed and reference populations which remains challenging for ancient DNA specimens bib19bib27.

An extension of admixture LD-based methods, recently introduced by bib41, leverages ancestry covariance patterns that can be measured in a single sample using low coverage data. This approach measures the allelic correlation across neighboring sites, but instead of measuring admixture LD across multiple samples, it integrates data across markers within a single diploid genome. Using a set of ascertained markers that are informative for Neanderthal ancestry (where sub-Saharan Africans are fixed for the ancestral alleles and Neanderthals have a derived allele), bib41, inferred the timing of Neanderthal gene flow in Upper Paleolithic Eurasian samples and showed the approach works accurately in ancient DNA samples bib41. However, this approach is inapplicable for dating admixture events within modern human populations, as there are very few fixed differences across populations bib3.

Motivated by the single sample statistic in bib41, we developed DATES (Distribution of Ancestry Tracts of Evolutionary Signals) that measures the ancestry covariance across the genome in a single admixed individual, weighted by the allele frequency difference between two ancestral populations. This method was first introduced in bib43, where it was used to infer the date of gene flow between groups related to Ancient Ancestral South Indians, Iranian farmers, and Steppe pastoralists in ancient South and Central Asian populations bib43. In this study, we evaluate the performance of DATES by carrying out extensive simulations for a range of demographic scenarios and comparing the approach to other published genomic dating methods. We then apply DATES to infer the chronology of the genetic formation of the ancestral populations of Europeans and the spatiotemporal patterns of admixture during the European Holocene using data from ~1100 ancient DNA specimens spanning ~8000–350 BCE.

Results

Overview of DATES: model and simulations

DATES estimates the time of admixture by measuring the weighted ancestry covariance across the genome using data from a single diploid genome and two reference populations (representing the ancestral source populations). DATES works like haplotype-based methods as it is applicable to a single genome and not like admixture LD-based methods, which by definition require multiple genomes to be co-analyzed; but unlike haplotype-based methods, it is more flexible as it does not require local ancestry inference. There are three main steps in DATES: we start by first learning the genome-wide ancestry proportions by performing a simple regression analysis to model the observed genotypes in an admixed individual as a linear mix of allele frequencies from two reference populations. For each marker, we then compute the likelihood of the observed genotype in the admixed individual using the estimated ancestry proportions and allele frequencies in each reference population (this is similar in spirit to local ancestry inference). This information is, in turn, used to compute the joint likelihood of shared ancestry at two neighboring markers, accounting for the probability of recombination between the two markers. Finally, we compute the covariance across pairs of markers located at a particular genetic distance, weighted by the allele frequency differences in the reference populations (Materials and methods).

Following bib41, we bin the markers that occur at a similar genetic distance across the genome, rather than estimating admixture LD for each pair of markers, and compute the covariance across increasing genetic distance between markers. The estimated covariance is expected to decay exponentially with genetic distance, and the rate of decay is informative of the time of the mixture bib11bib40. Assuming the gene flow occurred instantaneously, we can then infer the average date of gene flow by fitting an exponential distribution to the decay pattern (Materials and methods). In cases where data for multiple individuals is available, we compute the likelihood by summing over all individuals. To make DATES computationally tractable, we implement the fast Fourier transform (FFT) for calculating ancestry covariance as described in ALDER bib37. This provides a speedup from to that reduces the typical runtimes from hours to seconds with minimal loss in accuracy (Appendix 1—figure 2).

To assess the reliability of DATES, we performed simulations where we constructed 10 admixed diploid genomes by randomly sampling haplotypes from two source populations (Materials and methods). Briefly, we simulated individual genomes with 20% European and 80% African ancestry by using phased haplotypes of northern Europeans (Utah European Americans, CEU) and west Africans (Yoruba from Nigeria, YRI) from the 1000 Genomes Project, respectively bib3. As reference populations in DATES, we used closely related surrogate populations of French and Yoruba respectively, from the Human Genome Diversity Panel (HGDP) bib35. We first investigated the accuracy of DATES by varying the time of admixture between 10 and 300 generations. For comparison, we also applied ALDER bib37 to these simulations. Both methods reliably recovered the time of admixture up to 200 generations or ~5600 years ago, assuming a generation time of 28 years bib41, though DATES was more precise than ALDER for older admixture events (>100 generations) (Figure 1—figure supplement 1, Appendix 1—table 4). Further, DATES shows accurate results even for single target samples (Figure 1A, Figure 1—figure supplement 2A) and even when few reference individuals are available for dating (Figure 1A, Figure 1—figure supplement 2B). However, the use of large numbers of reference samples, if available, can improve the inference. In DATES, allele frequencies of the reference populations are used for computing the likelihood as well as the weighted pairwise ancestry covariance across the genome (Materials and methods). With large samples, allele frequencies of the reference populations are more reliably computed, which in turn, can improve the precision of inferred dates (Figure 1A, Figure 1—figure supplement 2B).

#' @width 10
#' @height 5
library(ggplot2)
library(maps)
library(ggrepel)

library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)

par(mfrow=c(1,2))

# Panel A
data=read.table(file = "data/Figure1_data_panelA", header = T)
dd=data[data$n==1,]
plot(x=dd$admixture_time,y=dd$d_mean, col="darkorange2",pch=16,xlab = "True admixture time (gen)",
     ylab = "Estimated admixture time (gen)", ylim = c(0,max(dd$d_mean+dd$d_se)),
     main="(A) Variation in sample size",cex.axis=1.2,cex.lab=1.2); grid ();
segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean+dd$d_se),col="darkorange2");
segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean+dd$d_se),y1 = (dd$d_mean+dd$d_se),col="darkorange2");
segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean-dd$d_se),col="darkorange2");

dd=data[data$n==10,]
points(x=dd$admixture_time,y=dd$d_mean, col="green3",pch=16,xlab = "True admixture time (gen)",
     ylab = "Estimated admixture time (gen)", ylim = c(0,max(dd$d_mean+dd$d_se)),
     main="(A) Variation in sample size",cex.axis=1.2,cex.lab=1.2); grid ();
segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean+dd$d_se),col="green3");
segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean+dd$d_se),y1 = (dd$d_mean+dd$d_se),col="green3");
segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean-dd$d_se),col="green3");
lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2)
legend("topleft",legend = c("n=1","n=20"),col=c("darkorange2","green3"),pch = 16,bg = "white");

#panel B
data=read.table(file = "data/Figure1_data_panelB")
kol=c("tomato","darkviolet","forestgreen")
len1=c("missing proportion=10%","missing proportion=30%","missing proportion=60%")
len=c(10,30,60)
c=1;
dd=data[data$V2==10,]
plot(x=dd$V1,y=dd$V3, col="tomato",pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",
     ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4)),
     main="(B) Effect of data quality",cex.axis=1.2,cex.lab=1.2); grid ();
lines(x=dd$V1,y=dd$V5, col=dd$V8,pch=6,type="p")
segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="tomato");
segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col="tomato");
segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col="tomato");
c=2;
dd=data[data$V2==30,]
points(x=dd$V1+2,y=dd$V3, col="darkviolet",pch=16,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",
       ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4))); grid ();
segments(x0 =dd$V1+2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="darkviolet");
segments(x0 =dd$V1+2-2,x1 = dd$V1+2+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col="darkviolet");
segments(x0 =dd$V1+2-2,x1 = dd$V1+2+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col="darkviolet");
dd=data[data$V2==60,]
c=3
points(x=dd$V1+4,y=dd$V3, col="forestgreen",pch=17); grid ();
segments(x0 =dd$V1+4,x1 = dd$V1+4,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="forestgreen");
segments(x0 =dd$V1+4-2,x1 = dd$V1+4+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col="forestgreen");
segments(x0 =dd$V1+4-2,x1 = dd$V1+4+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col="forestgreen");
lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2)
legend("topleft",legend = len1,col=kol,pch = c(15,16,17),bg = "white");

Simulation results.

We constructed n admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry using ~380,000 genome-wide SNPs for admixture dates ranging between 10 and 200 generations. To minimize any issues with overfitting, we used French and Yoruba from the Human Genome Diversity Panel as reference populations in DATES (Distribution of Ancestry Tracts of Evolutionary Signals). We show the true time of admixture (X-axis, in generations) and the estimated time of admixture (±1 SE) (Y-axis, in generations). Standard errors were calculated using a weighted block jackknife approach by removing one chromosome in each run (Materials and methods). (A) Effect of sample size: We varied the sample size (n) of target group between 1 and 10 individuals. (B) Effect of data quality: To mimic the features of ancient genomes, we generated n=10 target individuals with pseudo-haploid genotypes and missing genotype rate as 10% (orange), 30% (purple), and 60% (green). See Figure 1—figure supplements 1!number(9) for additional simulations to test the performance of DATES.

#' @width 8
#' @height 6
data=read.table(file = "data/Figure1_Supplement1")
plot(x=data$V1,y = data$V3,col="deepskyblue",pch=15,
     xlab = "True admixture time (generations)",ylab = "Estimated admixture time (generations)",
     main = "Varying time of admixture up to 300 generations",xlim=c(0,300), ylim = c(0,max(data$V3[1:30]))); grid (nx=5,ny=4)
segments(x0 =data$V1,x1 = data$V1,y0 =(data$V3+data$V6),
         y1 = (data$V3-data$V6),col="deepskyblue")
segments(x0 =data$V1-2,x1 = data$V1+2,y0 =(data$V3+data$V6),
         y1 = (data$V3+data$V6),col="deepskyblue")
segments(x0 =data$V1-2,x1 = data$V1+2,y0 =(data$V3-data$V6),
         y1 = (data$V3-data$V6),col="deepskyblue")
lines(x=data$V1,y=data$V1,col="darkgrey",lty=2)

Varying time of admixture up to 300 generations.

We simulated data for 10 admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry and varied the time of admixture between 10 and 300 generations. The X-axis shows the true time of admixture, and the Y-axis shows the estimated time of admixture (±1 SE) inferred using DATES (Distribution of Ancestry Tracts of Evolutionary Signals).

#' @width 10
#' @height 6
par(mfrow=c(2,4))
data=read.table(file = "data/Figure1_Supplement2A")
kol=c("darkorange2","green3","violetred1","royalblue1","coral","yellow3","cyan2","darkgoldenrod1","darkorchid1","dodgerblue1",
      "paleturquoise4");
len=c("n=1","n=5","n=10","n=20")
arr=c(1,5,10,20)
c=1;
for(i in arr)
{
  k=which(arr==i);
  dd=data[data$V2==i,]
  plot(x=dd$V1,y=dd$V3, col=kol[c],pch=16,xlab = "True admixture time (gen)",
       ylab = "Estimated admixture time (gen)", ylim = c(0,max(dd$V3+dd$V4))); grid ();
  segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2)
  legend("topleft",legend = len[k],col=kol[c],pch = 16,bg = "white");
  c=c+1
}
# Panel B - reference sample size
data=read.table(file = "data/Figure1_Supplement2B")
kol=c("darkorange","green3","cyan2","violetred1","royalblue1")
len1=c("reference pop size=1","reference pop size=5","reference pop size=10","reference pop size=20")
len=c(1,5,10,20)
#par(mfrow=c(3,2))
c=1;
for(i in len)
{
  #len=1
  k=which(i==len);
  dd=data[data$V1==i,]
  plot(x=dd$V2,y=dd$V3, col=kol[c],pch=17,xlab = "True admixture time (gen)",
       ylab = "Estimated admixture time (gen)", ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4))); grid ();
  segments(x0 =dd$V2,x1 = dd$V2,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]);
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  lines(x=dd$V2,y=dd$V2,col="darkgrey",lty=2)
  legend("topleft",legend = len1[k],col=kol[c],pch = 17,bg = "white");
  c=c+1;
}
title("(A) Effect of sample size of target population", line = -1.5, outer = TRUE,cex=1.3)
title("(B) Effect of sample size of reference populations", line = -25, outer = TRUE,cex=1.3)

Impact of sample size of the target (admixed) and reference populations.

We simulated admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry and applied DATES with m reference samples of French and Yoruba ancestry. (A) Effect of sample size of target population. Each panel shows the results of simulations with n target individuals shown in the legend and m=28 French and m=21 Yoruba reference samples from each source group. (B) Effect of sample size (m) of reference populations. Each panel shows the results of simulations with n=10 target individuals and m reference samples from each source group shown in the legend. The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis.

#' @width 10
#' @height 6
par(mfrow=c(2,4))
data=read.table(file = "data/Figure1_Supplement3A")
len=c("α=0.01","α=0.05","α=0.2","α=0.4")
arr=c(0.01,0.05,2,4)
kol=c("green","darkorchid","yellowgreen","blue","deeppink","red2","purple")
#print (arr)
c=1
for(i in arr)
{
  k=which(arr==i);
  dd=data[data$V2==i,]
  plot(x=dd$V1,y=dd$V3, col=kol[c],pch=18,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", 
       ylim = c(0,max(dd$V3+dd$V4)),cex=1.5); grid ();
  #main="Admixture time for varying θ"
  segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2)
  legend("topleft",legend = len[k],col=kol[c],pch = 18,bg = "white");
  c=c+1;  
}
title("(A) Impact of admixture proportion on the estimated time of admixture", line = -1.5, outer = TRUE,cex=1.3)
# panel B- proportions of admixture
data=read.table(file = "data/Figure1_Supplement3B")
len=c("α=0.01","α=0.05","α=0.1","α=0.2","α=0.4")
c=1;
for(i in arr)
{
  #i=5
  if(i<1) {rr=i} else {rr=i/10}
  k=which(arr==i);
  dd=data[data$V2==i,]
  lim=max(dd$V6)+rr/4
  plot(x=dd$V1,y=dd$V6, col=kol[c],pch=19,xlab = "True admixture time (gen)",
       ylab = "Estimated Theta (θ)",las=1,ylim = c(0,lim)); grid ();
  #main="Admixture prorportion inference"
  segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V6+dd$V10),y1 = (dd$V6-dd$V10),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V6+dd$V10),y1 = (dd$V6+dd$V10),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V6-dd$V10),y1 = (dd$V6-dd$V10),col=kol[c]);
  lines(x=dd$V1,y=rep(rr,NROW(dd)),col="red",lty=2)
  legend("bottomright",legend = len[k],col=kol[c],pch = 19,bg = "white");
  c=c+1;
}
title("(B) Impact of admixture proportion on estimated ancestry proportion", line = -25, outer = TRUE,cex=1.3)

Impact of admixture proportion.

We simulated data for 10 admixed individuals with European (CEU) ancestry (α) in the range of 1–40% (the rest derived from Africans). We ran DATES (Distribution of Ancestry Tracts of Evolutionary Signals) to infer the time of admixture and ancestry proportion. (A) Impact on the estimated time of admixture: Each panel shows the estimated date of admixture for a different value of shown in the legend. The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. (B) Impact on estimated ancestry proportion: Each panel shows the estimated proportion of admixture for a different value of shown in the legend. The red dashed horizontal line further indicates the value of used. The true time of admixture is shown on X-axis with the inferred proportion of admixture on Y-axis.

#' @width 9
#' @height 9
par(mfrow=c(2,2))
data=read.table(file = "data/Figure1_Supplement4")
kol=c("darkorange","green3","cyan","violetred1")
len1=c("Fst(Yoruba-Yoruba)=0.000","Fst(Yoruba-BantuKenya)=0.009",
       "Fst(Yoruba-San)=0.103")
arr=c("Yoruba","BantuKenya","San")
c=1;
for(i in arr)
{
  dd=data[data$V2==i,]
  plot(x=dd$V3,y=dd$V4, col=kol[c],pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",
       ylim = c(min(dd$V4-dd$V5),220)); grid ();
  segments(x0 =dd$V3,x1 = dd$V3,y0 =(dd$V4+dd$V5),y1 = (dd$V4-dd$V5),col=kol[c]);
  segments(x0 =dd$V3-2,x1 = dd$V3+2,y0 =(dd$V4+dd$V5),y1 = (dd$V4+dd$V5),col=kol[c]);
  segments(x0 =dd$V3-2,x1 = dd$V3+2,y0 =(dd$V4-dd$V5),y1 = (dd$V4-dd$V5),col=kol[c]);
  lines(x=dd$V3,y=dd$V3,col="darkgrey",lty=2)
  legend("topleft",legend = len1[c],col=kol[c],pch = 15,bg = "white");
  c=c+1;
}
title("Impact of divergence between the ancestral population and reference populations used in DATES", line = -1.5, outer = TRUE,cex=1.3)

Impact of divergence between the ancestral population and reference populations used in DATES (Distribution of Ancestry Tracts of Evolutionary Signals).

We simulated 10 admixed individuals with 80% European (CEU) and 20% African (YRI) ancestry. We applied DATES to infer the timing of admixture using reference populations. In each panel, we show the estimated dates of admixture using French and a group that is increasingly divergent from Yoruba (shown in the legend as the FST with Yoruba). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis.

#' @width 12
#' @height 6
par(mfrow=c(2,4))
data=read.table(file = "data/Figure1_Supplement5A")
len1=c("CEU/YRI", "CEU/CHB","CEU/MXL","CEU/TSI")
len2=c("French/Yoruba(0.154)", "French/Tujia(0.110)",
       "French/Maya(0.037)","French/Italian(0.004)")
len=c("YRI","CHB","MXL","TSI")
for(i in len)
{
  # i="ITU"
  dd=data[data$V1==i,]
  k=which(i==len);
  plot(x=dd$V2,y=dd$V3, col="navy",pch=19,xlab = "True admixture time (gen)",
       ylab = "Estimated admixture time (gen)",ylim = c(0,250),
       main=paste("True source populations:",len1[k],sep=" ")); grid ();
  segments(x0 =dd$V2 ,x1 = dd$V2, y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="navy");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3+dd$V4),y1 =(dd$V3+dd$V4),col="navy");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3-dd$V4),y1 =(dd$V3-dd$V4),col="navy");
  lines(x=dd$V2,y=dd$V2,col="darkgrey")
  kol=c("white",dd$V7[1])
  legend("topleft",legend = paste(" References",len2[k],sep="="),
         col="navy",pch = c(19),bg = "white");
}
title("(A) Impact of divergence between the two source populations (n=10)", line = -1, outer = TRUE,cex=1.3)
# panel B- for target n=1
data=read.table(file = "data/Figure1_Supplement5B")
for(i in len)
{
  dd=data[data$V1==i,]
  k=which(i==len);
  plot(x=dd$V2,y=dd$V3, col="red",pch=16,xlab = "True admixture time (gen)",
       ylab = "Estimated admixture time (gen)",ylim = c(0,250),
       main=paste("True source populations:",len1[k],sep=" ")); grid ();
  segments(x0 =dd$V2 ,x1 = dd$V2, y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="red");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3+dd$V4),y1 =(dd$V3+dd$V4),col="red");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3-dd$V4),y1 =(dd$V3-dd$V4),col="red");
  lines(x=dd$V2,y=dd$V2,col="darkgrey")
  kol=c("white",dd$V8[1])
  legend("topleft",legend = paste(" References",len2[k],sep="="),
         col="red",pch = c(16),bg = "white");
}
title("(B) Impact of divergence between the two source populations (n=1)", line = -23, outer = TRUE,cex=1.3)

Impact of divergence between the two source populations.

We simulated n admixed individuals with 20% European (CEU) and 80% ancestry from a range of populations with increasing relatedness to Europeans (shown in the legend as the FST to Europeans). Specifically, the other reference population we used was either West Africans (YRI), East Asians (CHB), South Americans (MXL) or Southern Europeans (TSI). We used the following reference populations for the inference: French (for all simulations) with one of the other references as either Yoruba, Tujia, Maya, or Italian, respectively. We show results for varying target sample sizes of (A) n=10 and (B) n=1. The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. We note the inferred dates for CEU/TSI mixtures were not significant for older timescales and hence not shown.

#' @width 11
#' @height 11
par(mfrow=c(2,2))
data=read.table(file = "data/Figure1_Supplement6")
len=c(2,4,6,8)
a=c("A","B","C","D")
par(mfrow=c(2,2))
for(i in len)
{
  dd=data[data$V1==i,]
  k=which(i==len)
  rr=paste("(",paste(a[k],paste("Admixed pop (as target and reference):",i*10,"%CEU+",(10-i)*10,"%YRI"),sep=") "),sep="")
  
  plot(x=dd$V2,y=dd$V4, col="deepskyblue",pch=15,xlab = "True admixture time (gen)",
       ylab = "Estimated admixture time (gen)",
       ylim = c(0,max(data$V4+data$V5)),main=rr); grid ()
  lines(x=dd$V2,y=dd$V8, col="orange2",pch=19,type="p")
  segments(x0 =dd$V2,x1 = dd$V2,y0 =(dd$V4+dd$V5),y1 = (dd$V4-dd$V5),col="deepskyblue");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V4+dd$V5),y1 = (dd$V4+dd$V5),col="deepskyblue");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V4-dd$V5),y1 = (dd$V4-dd$V5),col="deepskyblue");
  segments(x0 =dd$V2,x1 = dd$V2,y0 =(dd$V8+dd$V9),y1 = (dd$V8-dd$V9),col="orange2");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V8+dd$V9),y1 = (dd$V8+dd$V9),col="orange2");
  segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V8-dd$V9),y1 = (dd$V8-dd$V9),col="orange2");
  
  lines(x=dd$V2,y=dd$V2,col="darkgrey",lty=2)
  legend("topleft",legend = c("Refpops:French and Admixed","Refpops:Yoruba and Admixed"),text.col = c("deepskyblue","orange2"),pch = c(15,19),
         col = c("deepskyblue","orange2"),bg = "white");
  
}

Impact of using the admixed individuals themselves as one of the reference groups in DATES (Distribution of Ancestry Tracts of Evolutionary Signals).

We simulated data for 10 admixed individuals with European (CEU) ancestry () in the range of 20–80% (the rest derived from Africans [YRI]) using CEU and YRI as reference populations. Using a non-overlapping set of CEU and YRI individuals, we generated 10 additional individuals that we used as reference samples in DATES. For each simulation, we ran DATES with Europeans (French) and a non-overlapping set of simulated admixed individuals as the reference populations (shown in blue), or Yoruba and simulated admixed individuals (shown in orange). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis.

#' @width 12
#' @height 9
par(mfrow=c(3,4))
data=read.table(file = "data/Figure1_Supplement7A")
kol=c("darkorange","green3","cyan","violetred1","royalblue1","orangered1","darkorchid1",
      "yellowgreen","slateblue2","tomato","palevioletred2","red","orange")
len1=c("missing prop=10%","missing prop=20%","missing prop=40%","missing prop=60%")
len=c(10,20,40,60)
c=1;
for(i in len)
{
  dd=data[data$V2==i,]
  k=which(i==len);
  plot(x=dd$V1,y=dd$V3, col=kol[k],pch=17,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",
       ylim = c(0,220)); grid ();
  lines(x=dd$V1,y=dd$V5, col=dd$V8,pch=6,type="p")
  segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[k]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[k]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[k]);
  lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2)
  legend("topleft",legend = len1[k],col=kol[k],pch = c(17,15),bg = "white");
  c=c+1;
}
title("(A) Diploid genotypes with missing data for n=10 admixed individuals", line = -1.5, outer = TRUE,cex=1.3)

data=read.table(file = "data/Figure1_Supplement7B")
data1=read.table(file = "data/Figure1_Supplement7C")
kol=c("darkorange","green3","cyan","violetred1","royalblue1","orangered1","darkorchid1","yellowgreen","slateblue2","tomato","palevioletred2")
c=1;
for(i in len)
{
  dd=data[data$V2==i,]
  k=which(i==len);
  plot(x=dd$V1,y=dd$V3, col=kol[c],pch=16,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",
       ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4,250))); grid ();
  segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]);
  segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]);
  lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2)
  legend("topleft",legend = c(len1[k]),col=kol[c],pch = 16,bg = "white");
  c=c+1;
}
title("(B) Pseudo-haploid genotypes with missing data for n=10 admixed individuals", line = -25, outer = TRUE,cex=1.3)

c=1;
for(i in len)
{
  dd1=data1[data1$V2==i,]
  k=which(i==len);
  plot(x=dd1$V1,y=dd1$V3, col=kol[c],pch=22,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",
       ylim = c(min(dd1$V3-dd1$V4),max(dd1$V3+dd1$V4,250))); grid ();
  segments(x0 =dd1$V1,x1 = dd1$V1,y0 =(dd1$V3+dd1$V4),y1 = (dd1$V3-dd1$V4),col=kol[c],lty=2);
  segments(x0 =dd1$V1-2,x1 = dd1$V1+2,y0 =(dd1$V3+dd1$V4),y1 = (dd1$V3+dd1$V4),col=kol[c],lty=2);
  segments(x0 =dd1$V1-2,x1 = dd1$V1+2,y0 =(dd1$V3-dd1$V4),y1 = (dd1$V3-dd1$V4),col=kol[c],lty=2);
  lines(x=dd1$V1,y=dd1$V1,col="darkgrey",lty=2)
  legend("topleft",legend = c(len1[k]),col=kol[c],pch = 16,bg = "white");
  c=c+1;
}
title("(C) Pseudo-haploid genotypes with missing data for n=1 admixed individuals", line = -47, outer = TRUE,cex=1.3)

Impact of sample size and data quality of target samples.

We simulated data for n admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry. In each panel, we varied three key features of the data from the target population, notably sample size (n=1 or 10), type of genotypes (diploid or pseudo-haploid) and missing genotype rate (between 10% and 60%). (A) Diploid genotypes with missing data for n=10 admixed individuals. Each panel shows the results of x% of missing diploid genotypes (shown in the legend). (B) Pseudo-haploid genotypes with missing data for n=10 admixed individuals. Each panel shows the results of x% of missing pseudo-haploid genotypes (shown in the legend). (C) Pseudo-haploid genotypes with missing data for n=1 admixed individuals. Each panel shows the results of x% of missing pseudo-haploid genotypes (shown in the legend). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis.

#' @width 10
#' @height 10
par(mfrow=c(3,3))
data=read.table(file = "data/Figure1_Supplement8",header=T)
kol=c("darkorange","darkorchid1","cyan","deeppink")
ref_used=c("FY","FB","FS")
len1=c("Reference missing prop=0%","Reference missing prop=20%","Reference missing prop=40%")
len=c(20,40,60)
rlen=c(20,40)
for(w in ref_used)
{
  sub=data[data$references==w,]
  for(r in len)
  {
    c=1;
    rd=sub[sub$missing_target==r,]
    w=which(r==len);
    dd=rd[rd$missing_ref==0,]
    plot(x=dd$admixture_time,y=dd$dates_n10, col=kol[c],pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",main=paste("Missing proportion in the target: ",paste(len[w],"%",sep=""),sep = ""),xlim=c(0,210),ylim = c(0,250)); grid ();
    segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]);
    segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10+dd$dates_n10_se),col=kol[c]);
    segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n10-dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]);
    lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2)
    c=c+1; err=3
    for(i in rlen)
    {
      dd=rd[rd$missing_ref==i,]
      k=which(i==len);
      points(x=dd$admixture_time+err,y=dd$dates_n10, col=kol[c],pch=15); grid ();
      segments(x0 =dd$admixture_time+err,x1 = dd$admixture_time+err,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]);
      segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10+dd$dates_n10_se),col=kol[c]);
      segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n10-dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]);
      lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2)
      c=c+1; err=err+5
    }
    legend("topleft",legend = c(len1),col=kol[1:4],pch = 15,bg = "white");
  }
}
title("(a) Reference populations of French and Yoruba (FST(true, reference) ~ 0) for target population n=10 ", line = -1, outer = TRUE,cex=1.3)
title("(b) Reference populations of French and Bantu Kenya ( FST(true, reference) ~ 0.009) for target population n=10", line = -26, outer = TRUE,cex=1.3)
title("(c) Reference populations of French and San ( FST(true, reference) ~ 0.103) for target population n=10", line = -51, outer = TRUE,cex=1.3)

Impact of data quality of target and reference populations as a function of divergence between true and reference populations used in DATES (Distribution of Ancestry Tracts of Evolutionary Signals).

We simulated data for n=10 admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry with pseudo-haploid genotypes. The reference populations used also had pseudo-haploid genotypes. We further varied three key features of the data, missing genotype rate in reference populations, missing genotype rate in target populations, and divergence between true source populations and reference population used for the analysis. In each row, we show the admixture dates using reference populations with increasing divergence to true source population (FST shown in the row title). In each column, we varied the missing genotype rate in the target population (shown in the column title). Further, each panel shows results of missing data in the reference genomes (shown in the legend). (a) Reference populations of French and Yoruba (FST(true, reference)0). (b) Reference populations of French and Bantu Kenya (_FST_(true, reference)0.009). (c) Reference populations of French and San (FST(true, reference)~0.103). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis.

#' @width 10
#' @height 10
par(mfrow=c(3,3))
data=read.table(file = "data/Figure1_Supplement9",header=T)
kol=c("darkorange","darkorchid1","cyan","deeppink")
ref_used=c("FY","FB","FS")
len1=c("Reference missing prop=0%","Reference missing prop=20%","Reference missing prop=40%")
len=c(20,40,60)
rlen=c(20,40)
for(w in ref_used)
{
  sub=data[data$references==w,]
  for(r in len)
  {
    c=1;
    rd=sub[sub$missing_target==r,]
    w=which(r==len);
    dd=rd[rd$missing_ref==0,]
    plot(x=dd$admixture_time,y=dd$dates_n1, col=kol[c],pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",
         main=paste("Missing proportion in the target: ",paste(len[w],"%",sep=""),sep = ""),xlim=c(0,210),
         ylim = c(0,250)); grid ();
    segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]);
    segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1+dd$dates_n1_se),col=kol[c]);
    segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n1-dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]);
    lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2)
    c=c+1; err=3
    for(i in rlen)
    {
      dd=rd[rd$missing_ref==i,]
      k=which(i==len);
      points(x=dd$admixture_time+err,y=dd$dates_n1, col=kol[c],pch=15); grid ();
      segments(x0 =dd$admixture_time+err,x1 = dd$admixture_time+err,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]);
      segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1+dd$dates_n1_se),col=kol[c]);
      segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n1-dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]);
      lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2)
      c=c+1; err=err+5
    }
    legend("topleft",legend = c(len1),col=kol[1:4],pch = 15,bg = "white");
  }
}
title("(a) Reference populations of French and Yoruba (FST(true, reference) ~ 0) for target population n=1 ", line = -1, outer = TRUE,cex=1.3)
title("(b) Reference populations of French and Bantu Kenya ( FST(true, reference) ~ 0.009) for target population n=1 ", line = -26, outer = TRUE,cex=1.3)
title("(c) Reference populations of French and San ( FST(true, reference) ~ 0.103) for target population n=1 ", line = -51, outer = TRUE,cex=1.3)

Impact of small sample size and data quality of target and reference populations as a function of divergence between true and reference populations used in DATES (Distribution of Ancestry Tracts of Evolutionary Signals).

We simulated data for n=1 admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry with pseudo-haploid genotypes. The reference populations used also had pseudo-haploid genotypes. We further varied three key features of the data, missing genotype rate in reference populations, missing genotype rate in target populations, and divergence between true source populations and reference population used for the analysis. In each row, we show the admixture dates using reference populations with increasing divergence to true source population (FST shown in the row title). In each column, we varied the missing genotype rate in the target population (shown in the column title). Further, each panel shows results of missing data in the reference genomes (shown in the legend). (a) Reference populations of French and Yoruba (FST(true, reference)0). (b) Reference populations of French and Bantu Kenya (_FST_(true, reference)0.009). (c) Reference populations of French and San (FST(true, reference)~0.103). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis.

Next, we tested DATES for features such as varying admixture proportions and use of surrogate populations as reference groups. By varying of European ancestry proportion between ~1% and 50% (the rest derived from west Africans), we observed DATES accurately estimated the timing in all cases (Figure 1—figure supplement 3A). However, the inferred admixture proportion was overestimated for lower admixture proportions (<10%) (Figure 1—figure supplement 3B). Thus, we caution against using DATES for estimating ancestry proportions. DATES works reliably for dating admixtures between related groups such as Europeans and Mexicans (FST~ 0.03), though it was unable to distinguish mixtures of Southern and Northern Europeans (FST< 0.005) (Figure 1—figure supplement 5).

We found DATES is robust to the use of highly divergent surrogates as reference populations. For example, the use of Khomani San as the reference population instead of the true ancestral population of Yoruba (FST ~ 0.1) provides unbiased dates of admixture (Figure 1—figure supplement 4). In this regard, for ancient DNA where sometimes only sparse data is available, one can also use present-day samples as reference populations to increase the quality and sample size of the ancestral groups. In principle, as long as the allele frequencies in the reference samples are correlated to the ancestral allele frequencies, the inference of admixture dates should remain unbiased (Materials and methods). In practice, however, recent demographic events (e.g., strong founder events or admixture from additional sources, etc.) in the history of the present-day samples could lead to significant deviation from the ancestral allele frequencies. Thus, the reference populations should be carefully chosen.

Another idea is to use the admixed populations themselves as one of the reference populations as demonstrated by the single reference setup in ALDER bib37. Admixed individuals have intermediate allele frequencies to the ancestral populations and thus weighted LD or ancestry covariance can be computed with only one reference population (albeit, with reduced power). bib37, showed that the use of admixed populations as one of the references does not bias the rate of decay of the weighted LD (i.e., time of admixture), though the amplitude of the decay curve (not used in DATES) can be biased under some scenarios. To verify DATES provides reliable results under this setup, we applied DATES with a single reference population and used the admixed population as the other reference. Like ALDER, our inferred dates of admixture were accurate and comparable to using two reference populations. (Figure 1—figure supplement 6).

An important feature of DATES is that it does not require phased data and is applicable to datasets with small sample sizes, making it in principle useful for ancient DNA applications. To test the reliability of DATES for ancient genomes, we simulated data mimicking the relevant features of ancient genomes, namely small sample sizes (n=1–20), large proportions of missing genotypes (between 10% and 60%), and pseudo-haploid genotype calls (instead of diploid genotype calls) in reference and/or target samples. DATES showed reliable results under various setups, even when only a single admixed individual was available (Figure 1B, Figure 1—figure supplements 1!number(9)). In contrast, admixture LD-based methods require more than one sample and do not work reliably with missing data. For example, ALDER estimates were very unstable for simulations with >40% missing data. For older dates (>100 generations), there was a slight bias even with >10% missing genotypes (Appendix 1—figure 5). This is expected as LD calculations leverage shared patterns across samples, thus variable missingness of genotypes across individuals leads to substantial loss of data leading to unstable and noisy inference. We also generated data for combinations of features including small sample sizes, pseudo-haploid genotypes with large proportions of missing genotypes in both target and reference samples, and use of highly divergent reference samples. We found DATES yielded reliable results with large amounts (~40–60%) of missing data, either in the target or references, even with highly divergent reference populations (Figure 1—figure supplement 8). This was also true when a single target sample was available, though as expected, the inference becomes noisier for older dates and large fractions of missing data (Figure 1—figure supplement 9). The robust performance of DATES in sparse datasets highlights a major advantage for ancient DNA applications.

DATES assumes a model of instantaneous gene flow with a single pulse of mixture between two source populations. However, many human populations have a history of multiple pulses of gene flow. To test the performance of DATES for multi-way admixture events, we generated admixed individuals with ancestry from three sources (East Asians, Africans, and Europeans) where the gene flow occurred at two distinct time points (Appendix 2—figure 1). By applying DATES with pairs of reference populations, we observed that DATES recovered both admixture times for target populations that had equal contributions from all three ancestral groups (Appendix 2—figure 2). In the case of unequal admixture proportions from three ancestral groups, DATES inferred the timing of the recent admixture event in most cases. In some cases, however, the inferred dates were intermediate to the two pulses when the ancestry proportion of the recent event was low (Appendix 2—figure 3). This confounding could be eliminated if the reference populations were set up to match the model of gene flow. For example, the inferred times of admixture were accurate if the two references used in DATES were: reference 1: the source population for the recent event and reference 2: pooled individuals from both ancestral populations that contributed to the first admixture event, or the intermediate admixed group formed after the first event (Appendix 2—table 1). This highlights how the choice of reference populations can help to tune the method to infer the timing of specific admixture events more reliably.

Finally, we explored the impact of more complex demographic events, including continuous admixture and founder events using coalescent simulations (Appendix 2). In the case of continuous admixture, DATES inferred an intermediate timing between the start and the end of the gene flow period, similar to other methods like ALDER and Globetrotter (bib27; bib37; Appendix 2—table 2). In the case of populations with founder events, we inferred unbiased dates of admixture in most cases except when the founder event was extreme (Ne ~ 10) or the population had maintained a low population size (Ne < 100) until the present (i.e., no recovery bottleneck) (Appendix 2—figure 4, Appendix 2—table 3). In humans, few populations have such extreme founder events, and thus, in most other cases, our inferred admixture dates should be robust to founder events bib55. We note that while DATES is not a formal test of admixture, in simulations, we find that in the absence of gene flow, the method does not infer significant dates of admixture even if the target has a complex demographic history (Appendix 2—figure 6, Appendix 2—figure 7).

Comparison to other methods

We assessed the reliability of DATES in real data by comparing our results with published methods: Globetrotter, ALDER, and ROLLOFF. These methods are designed for the analysis of present-day samples that typically have high-quality data with limited missing variants. In addition, Globetrotter uses phased data which is challenging for ancient DNA samples. Thus, instead of rerunning other methods, we took advantage of the published results for contemporary samples presented in bib27. Following bib27, we created a merged dataset including individuals from HGDP bib35bib5bib28 (Materials and methods). We applied DATES and ALDER to 29 target groups using the reference populations reported in Table S12 in bib27, excluding one group where the population label was unclear. Interestingly, the majority of these groups (25/29) failed ALDER’s formal test of admixture; either because the results of the single reference and two reference analyses yielded inconsistent estimates or because the target had long-range shared LD with one of the reference populations (Appendix 1—table 5). Using DATES, we inferred significant dates of admixture in 20 groups, and 14 of those were consistent with estimates based on Globetrotter. In the case of the six populations that disagreed across the two methods, most of the populations appear to have a history of multiple pulses of gene flow either involving more than two populations (e.g., Brahui bib48) or multiple instances of contact between the same two reference groups (e.g., Mandenka bib51) or the model of admixture differed (e.g., recent ancient DNA studies suggest present-day Bulgarians have ancestry from western hunter-gatherers [HGs], Near Eastern farmers, and Steppe pastoralists from Eurasia bib24 but were modeled as a mixture of Polish and Cypriots in Globetrotter). In case of complex admixture scenarios, the inconsistencies across the two methods are hard to interpret as Globetrotter and DATES could be capturing different events or the weighting of both events could differ. Finally, the estimated admixture timing based on DATES, ROLLOFF, and ALDER (assuming two-way admixture regardless of the formal test results) were found to be highly concordant (Appendix 1—table 5).

Fine-scale patterns of population mixtures in ancient Europe

Recent ancient DNA studies have shown that present-day Europeans derive ancestry from three distinct sources: (a) HG-related ancestry that is closely related to Mesolithic HGs from Europe; (b) Anatolian farmer-related ancestry related to Neolithic farmers from the Near East and associated to the spread of farming to Europe; and (c) Steppe pastoralist-related ancestry that is related to the Yamnaya pastoralists from Russia and Ukraine bib1bib24bib53. Many open questions remain about the timing and dynamics of these population interactions, in particular related to the formation of the ancestral groups (which were themselves admixed) and their expansion across Europe. To characterize the spatial and temporal patterns of mixtures in Europe in the past 10,000 years, we used 1096 ancient European samples from 152 groups from the publicly available Allen Ancient DNA Resource (AADR) spanning a time range of ~8000–350 BCE (Materials and methods, Supplementary file 1A). Using DATES, we characterized the timing of the various gene flow events, and below, we describe the key events in chronological order focusing on three main periods.

Holocene to Mesolithic

Pre-Neolithic Europe was inhabited by HGs until the arrival of the first farmers from the Near East bib23bib29. There was large diversity among HGs with four main groups – western hunter-gatherers (WHGs) that were related to the Villabruna cluster in central Europe, eastern hunter-gatherers (EHGs) from Russia and Ukraine related to the Upper Paleolithic group of Ancestral North Eurasians (ANEs), Caucasus hunter-gatherers (CHGs) from Georgia associated to the first farmers from Iran, and the GoyetQ2-cluster associated to the Magdalenian culture in Spain and Portugal bib15bib18bib30bib52bib53. Most Mesolithic HGs fall on two main clines of relatedness: one cline that extends from Scandinavia to central Europe showing variable WHG-EHG ancestry, and the other in southern Europe with WHG-GoyetQ2 ancestry bib52. The latter is already present in the 17,000 BCE El Mirón individual from Spain, suggesting that the GoyetQ2-related gene flow occurred well before the Holocene. However, the WHG-EHG cline was formed more recently during the Mesolithic period, though the precise timing remains less well understood.

To characterize the formation of the WHG-EHG cline, we used genomic data from 16 ancient HG groups (n=101) with estimated ages of ~7500–3600 BCE. We first verified the ancestry of each HG group using qpAdm that compares the allele frequency correlations between the target and a set of source populations to formally test the model of admixture and then infer the ancestry proportions for the best-fitted model bib24. For each target population, we chose the most parsimonious model, that is, fitting the data with the minimum number of source populations. Consistent with previous studies, our qpAdm analysis showed that most HGs from Scandinavia, the Baltic Sea region, and central Europe could be modeled as a two-way mixture of WHG- and EHG-related ancestry (Supplementary file 2A). To confirm that the target populations do not harbor Anatolian farmer-related ancestry (that could lead to some confounding in estimated admixture dates), we applied D-statistics of the form D(Mbuti, target, WHG, Anatolian farmers) where target = Mesolithic HGs. We observed that none of the target groups had a stronger affinity to Anatolian farmers than WHG (Supplementary file 2B). Together, these results suggest that the mixtures we date below reflect pre-Neolithic contacts between the HGs.

To infer the timing of the mixtures in the history of Mesolithic European HGs, we applied DATES to HGs from Scandinavia, the Baltic regions, and central Europe using WHG- and EHG-related groups as reference populations. DATES infers the time of admixture in generations before the sample lived. Accounting for the average sampling age of the specimens and the mean human generation time of 28 years bib41, we inferred the admixture time in years before present or in BCE (Materials and methods). We report the average dates (or median, where specified) in BCE in the main text and provide additional details in Figure 2 and Supplementary file 1B including the sample sizes, dates in generations, and BCE for each population. Among HGs, we inferred that the earliest admixture occurred in Scandinavian HGs from Norway and Sweden with a range of average dates of ~80–113 generations before the samples lived (Figure 2—figure supplement 1). This translates to admixture dates of ~10,200–8000 BCE, with the most recent dates inferred in Motala HGs from Sweden suggesting substantial substructure in HGs (Figure 2). In the Baltic region, we inferred the range of admixture dates of ~8700–6000 BCE in Latvia and Lithuania HGs, postdating the mixture in Scandinavia. In southeast Europe, the Iron Gates region of the Danube Basin shows widespread evidence of mixtures between HG groups and, in the case of some outliers, the mixture of HGs and Anatolian farmer-related ancestry as early as the Mesolithic period bib14. Further, these groups showed a strong affinity to the WHG-related ancestry in Anatolian populations, suggesting ancient interactions with Near Eastern populations bib14. We applied qpAdm to test the model of admixture in Iron Gates HG and found that the parsimonious model with WHG- and EHG-related ancestry provides a good fit to the data. Further, when we tested the model with Anatolian-related ancestry using Anatolian HG (AHG) as an additional source population, the AHG ancestry proportion was not significant (Supplementary file 2A). Applying DATES to Iron Gates HG with WHG and EHG as reference populations, we inferred this group was genetically formed in ~9200 BCE (95% confidence interval: 10,000–8400 BCE). Our samples of the Iron Gates HGs include a wide range of C14 dates between 8800 and 5700 BCE. We confirmed our dates were robust to the sampling age of the individuals as we obtained statistically consistent dates when all samples were combined as one group or when subsets of samples were grouped in bins of 500 years (Figure 2—figure supplement 2). The most recent dates of ~7500 BCE were inferred in eastern Europe in Ukraine HGs, highlighting how the WHG-EHG cline was formed over a period ~2000–3000 years (Figure 2, Supplementary file 1B).

#' @width 10
#' @height 10
sf::sf_use_s2(FALSE)
data=read.table(file = "data/Figure3_data_map",header = T)
world <- ne_countries(scale = "medium", returnclass = "sf")
Europe <- world[which(world$continent == "Europe"),]
# Hunter Gatherer distribution panel
dd=data[data$col=="deepskyblue",]
ggplot(Europe) + geom_sf() + coord_sf(xlim = c(-15,40), ylim = c(35,70), expand = FALSE) + 
  geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=dd$pc,cex=4,show.legend = F) +
  scale_fill_manual(values ="deepskyblue") +
  theme(plot.title = element_text(colour = "black"))

#Neolithic Farmer spread
dd=data[data$col=="orange2",]
ggplot(Europe) + geom_sf() + coord_sf(xlim = c(-15,40), ylim = c(35,70), expand = FALSE) + 
  geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=dd$pc,cex=4,show.legend = F) + 
  theme(plot.title = element_text(colour = "black"))

# MLBA bronze age samples
dd=rbind(data[data$col=="green3",],data[data$col=="lightpink2",])
ggplot(world) + geom_sf() + coord_sf(xlim = c(-15,90), ylim = c(30,80), expand = FALSE) + 
  geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=dd$pc,cex=3,show.legend = F) + 
  theme(plot.title = element_text(colour = "black"))


# Admixture time plots 
par(mai = c(1,1,0.1,3))
data=read.table(file = "data/Figure3_data_admixturetimes")
fossil_range=read.table(file = "data/Figure3_data_fossilages")
# Hunter Gatherer mixture panel
dd=rbind(data[data$V8=="deepskyblue",])
target_fossil=fossil_range[fossil_range$V2%in%as.character(dd$V1),]
group_names=unique(target_fossil$V2)
target_fossil$V2 <- as.character(target_fossil$V2)
target_fossil$V6 <- as.numeric(as.character(target_fossil$V3))-1950
for(i in group_names)
{
  k=which(dd$V1==i)
  dd[k,9]=min(as.numeric(target_fossil[target_fossil$V2==i,]$V6))
  dd[k,10]=max((as.numeric(target_fossil[target_fossil$V2==i,]$V6)))
  dd[k,11]=mean(as.numeric(target_fossil[target_fossil$V2==i,]$V6))
}
plot(x=seq(1,length(dd$V1),1),y=dd$V6,type = "p",
     col=as.character(dd$V8),pch=20,cex=2, ylim = c(4000,12200), xlim = c(0.5,length(dd$V1)+0.5),
     xaxt='n',yaxt='n',xlab = "",ylab = "Estimated admixture time (BCE)", main="");
grid(nx=8,ny=10);
segments(x0 =seq(1,length(dd$V1),1),x1 = seq(1,length(dd$V1),1),y0 =(dd$V6+dd$V7),
         y1 = (dd$V6-dd$V7),col=as.character(dd$V8),lty = 1)
segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6+dd$V7),
         y1 = (dd$V6+dd$V7),col=as.character(dd$V8))
segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6-dd$V7),
         y1 = (dd$V6-dd$V7),col=as.character(dd$V8))
axis(1, 1:length(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', length(dd$V1)))
name_lab=gsub(".SG","",dd$V1)
points(x=seq(1,length(dd$V1),1),y=dd$V11,type = "p",col="grey20",pch=1,cex=1.1)
text(1:length(dd$V4), rep(3500, length(dd$V4)),
     labels= name_lab, col="black", srt=25, xpd=TRUE, adj=1,cex=1)
axis(2, seq(4000,12000,1000), lty = 1,col = "black",tck="y",labels = rep('',length(seq(4000,12000,1000))))
text(rep(-0.2,max(4000,12000,500)), seq(4000,12000,1000),labels= seq(4000,12000,1000),
     col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8)
abline(v=3.5,lty=3,col="grey20")
abline(v=6.5,lty=3,col="grey20")
text(c(1,4,7),rep(12000,5), labels= c("Scandinavian HG","Baltic region HG","Central European HG"),
     col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=1)
legend(8.9,12000,legend = c("DATES in BCE ± 1SE","Average C14 radiocarbon age"),col=c("grey20","grey20"), 
       pch=c(20,1),lty=c(1,-1),xpd=T,bty='n')

# Neolithic Farmer mixture
par(mai = c(1.4,1,0.1,0.5))
dd=data[data$V8=="orange2",]
target_fossil=fossil_range[fossil_range$V2%in%as.character(dd$V1),]
group_names=unique(target_fossil$V2)
target_fossil$V2 <- as.character(target_fossil$V2)
target_fossil$V6 <- as.numeric(as.character(target_fossil$V3))-1950
for(i in group_names)
{
  k=which(dd$V1==i)
  dd[k,9]=min(as.numeric(target_fossil[target_fossil$V2==i,]$V6))
  dd[k,10]=max((as.numeric(target_fossil[target_fossil$V2==i,]$V6)))
  dd[k,11]=mean(as.numeric(target_fossil[target_fossil$V2==i,]$V6))
}
plot(x=seq(1,length(dd$V1),1),y=dd$V6,type = "p",
     col=as.character(dd$V8),pch=20,cex=2, ylim = c(2000,6800), xlim = c(1,length(dd$V1)+0.5),
     xaxt='n',yaxt='n',xlab = "",ylab = "Estimated admixture time (BCE)", main="");
grid(nx=8,ny=10);
segments(x0 =seq(1,length(dd$V1),1),x1 = seq(1,length(dd$V1),1),y0 =(dd$V6+dd$V7),
         y1 = (dd$V6-dd$V7),col=as.character(dd$V8),lty = 1)
segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6+dd$V7),
         y1 = (dd$V6+dd$V7),col=as.character(dd$V8))
segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6-dd$V7),
         y1 = (dd$V6-dd$V7),col=as.character(dd$V8))
axis(1, 1:length(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', length(dd$V1)))
points(x=seq(1,length(dd$V1),1),y=dd$V11,type = "p",col="grey20",pch=1,cex=1)
name_lab=gsub("_published","",gsub("_published.DG","",gsub(".SG","",dd$V1)))
text(1:length(dd$V4), rep(1800, length(dd$V4)),
     labels= name_lab, col="black", srt=35, xpd=TRUE, adj=1,cex=0.7, cex.lab=0.7)
axis(2, seq(2000,6800,500), lty = 1,col = "black",tck="y",labels = rep('',length(seq(2000,6800,500))))
text(rep(-3.5,max(2000,6800,500)), seq(2000,6800,500),labels= seq(2000,6800,500),
     col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8)
abline(v=2.5,lty=3,col="grey20")
abline(v=20.5,lty=3,col="grey20")
abline(v=23.5,lty=3,col="grey20")
abline(v=28.5,lty=3,col="grey20")
abline(v=36.5,lty=3,col="grey20")
abline(v=42.5,lty=3,col="grey20")
abline(v=47.5,lty=3,col="grey20")
abline(v=56.5,lty=3,col="grey20")
abline(v=63.5,lty=3,col="grey20")
text(c(-1.3,10,20.6,23.8,28.6,31.8,37,43,48,53,57,63.53),rep(6500,5), 
     labels= c("Balkans","Hungary","Czech","Germany","Poland","Ukraine",
               "France","Italy","Spain","Portugal","Britain countries","Scandinavia"),col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8)
text(c(20.6,23.6),rep(6200,5), labels= c("Republic","Austria"),
     col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8)

# Bronze age steppe mixture
par(mai = c(2.6,1,0.1,2.8))
dd=rbind(data[data$V8=="green3",],data[data$V8=="lightpink2",])
dd$V8=gsub("springgreen","green3",dd$V8)
target_fossil=fossil_range[fossil_range$V2%in%as.character(dd$V1),]
group_names=unique(target_fossil$V2)
target_fossil$V2 <- as.character(target_fossil$V2)
target_fossil$V6 <- as.numeric(as.character(target_fossil$V3))-1950
pch_str=as.numeric(dd$V9)
for(i in group_names)
{
  k=which(dd$V1==i)
  dd[k,10]=min(as.numeric(target_fossil[target_fossil$V2==i,]$V6))
  dd[k,11]=max((as.numeric(target_fossil[target_fossil$V2==i,]$V6)))
  dd[k,12]=mean(as.numeric(target_fossil[target_fossil$V2==i,]$V6))
}
plot(x=seq(1,length(dd$V1),1),y=dd$V6,type = "p",
     col=dd$V8,pch=dd$V9,cex=1.5, ylim = c(500,5000), xlim = c(1,length(dd$V1)+0.5),
     xaxt='n',yaxt='n',xlab = "",ylab = "Estimated admixture time (BCE)", main="");
grid(nx=8,ny=10);
segments(x0 =seq(1,length(dd$V1),1),x1 = seq(1,length(dd$V1),1),y0 =(dd$V6+dd$V7),
         y1 = (dd$V6-dd$V7),col=dd$V8,lty = 1)
segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6+dd$V7),
         y1 = (dd$V6+dd$V7),col=dd$V8)
segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6-dd$V7),
         y1 = (dd$V6-dd$V7),col=dd$V8)
axis(1, 1:length(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', length(dd$V1)))
points(x=seq(1,length(dd$V1),1),y=dd$V12,type = "p",col="grey20",pch=1,cex=0.8)
name_lab=gsub(".SG","",dd$V1)
text(1:length(dd$V4), rep(350, length(dd$V4)),
     labels= name_lab, col="black", srt=45, xpd=TRUE, adj=1,cex=1)
axis(2, seq(500,5000,500), lty = 1,col = "black",tck="y",labels = rep('',length(seq(500,5000,500))))
text(rep(-2,max(500,5000,500)), seq(500,5000,500),labels = c(seq(500,5000,500)),
     col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=1)
abline(v=7.5,lty=3,col="grey20")
abline(v=26.5,lty=3,col="grey20")
text(c(1.6,15,29),rep(4900,5),
     labels= c("Late Neolithic","Chacolithic to Bronze Age","Middle to Late Bronze Age"),col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8)
legend(35.5,5000,legend = c("Corded Ware complex","Bell Beaker complex"),
       col=c(rep("green3",2)),pch=c(15,17),lty=c(1,1),xpd=T,cex=1)
Timeline of admixture events in ancient Europe.

We applied DATES (Distribution of Ancestry Tracts of Evolutionary Signals) to ancient samples from Europe. In the right panel, we show the sampling locations of the ancient specimens, and in the left panel, we show the admixture dates for each target group listed on the X-axis. The inferred dates in generations were converted to dates in BCE by assuming a mean generation time of 28 years bib40 and accounting for the average sampling age (shown as gray dots) of all ancient individuals in the target group (Materials and methods). The top panel shows the formation of western hunter-gatherer (WHG)-eastern hunter-gatherer (EHG) cline (in blue) using Mesolithic hunter-gatherers (HGs) as the target and EHG and WHG as reference populations. The middle panel shows admixture dates of local HGs and Anatolian farmers (in orange) using Neolithic European groups as targets and Anatolian farmers-related groups and WHG-related groups as reference populations. The bottom panel shows the spread of Steppe pastoralist-related ancestry (in green) estimated using middle and late Neolithic, Chalcolithic, and Bronze Age samples from Europe as target populations and early Steppe pastoralist-related groups (Afanasievo and Yamnaya Samara) and a set of Anatolian farmers and WHG-related groups as reference populations. For the middle to late Bronze Age (MLBA) samples from Eurasia, we used the early Steppe pastoralist-related groups and the Neolithic European groups as reference populations. The cultural affiliation (Corded Ware Complex [CWC], Bell Beaker complex [BBC], or Steppe MLBA cultures) of the individuals is shown in the legend. See Figure 2—figure supplements 1 and !number(2) we applied DATESfor decay curves for all samples and stratified datesfor Iron Gates HGs.

#' @width 10
#' @height 10
NRMSD <- function(y, yfit,na.rm = TRUE) {
  # y is the vector of empirical values
  # yfit is the vector of fitted values
  if (length(y) != length(yfit)) stop('y and yfit should have the same length')
  if (na.rm) {
    isna = is.na(y) | is.na(yfit)
    y = y[!isna]
    yfit = yfit[!isna]
  }
  nrmsd = sqrt(mean( (yfit - y)**2, na.rm = na.rm )) * (max(yfit, na.rm = na.rm) - min(yfit, na.rm = na.rm))**(-1)
  return(nrmsd)
}
#Usage: nmrds=round(NRMSD(data$V2,data$V3),4)
admix_dates=read.table(file = "data/Figure3_supplement1_admixture_dates")
par(mfrow=c(6,4),oma=c(0,0,3.5,0))
layout(matrix(seq(1,24,1), nrow = 6), heights=c(1,1))
# HG decay curves
dates_files=read.table(file = "data/Figure3_supplement1_HG_curves")
for (i in 1:nrow(dates_files)) 
{
  file=dates_files$V1[i]
  path="data/Figure3_supplement1_decay_files/"; var=paste(path,file,sep = "/")
  mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1]
  mm1=unlist(strsplit(as.character(mm),'.',fixed=TRUE))[1]
  data=read.table(file = var)
  jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/")
  estimate=read.table(file = time)
  name=gsub("estimate_","",gsub(".fit", "", file))
  len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ")
  title_name=gsub("EHG_WHG_","",mm1)
  bce=admix_dates[admix_dates$Population==title_name,]
  main_t=paste("Target:",paste(title_name,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:EHG-WHG")
  nmrsd=round(NRMSD(data$V2,data$V3),4)
  if(nmrsd>0.7) {kol="grey20"} else {kol="deepskyblue"};
  plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
       ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.8)
  lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=2)
  len=paste(paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"),
                  paste("DATES estimate (BCE)",paste(bce$DATES_mean.BCE.,bce$DATES_SE.BCE.,sep=" ± "),sep=":"),sep = "\n"),
            paste("NRMSD",nmrsd,sep="="),sep = "\n"))
  legend("topright",legend = len,col = kol,bty='n',cex=0.7)
}
### Farmer formation 
data=read.table(file = "data/Figure3_supplement1_decay_files/AAF.fit")
jout=read.table(file = "data/Figure3_supplement1_decay_files/AAF.jout")

plot(x=data$V1,y=data$V2,type="p",col="darkorchid",pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
     las=1,main = paste("Target: Early Anatolian farmers (n=9)","References: Levant_pooled-Iran_N_pooled",sep="\n")
     ,cex=1,ylab ="",cex.main=0.8)
title(ylab = "Ancestry covariance", mgp = c(4.2, 1, 0)) 
lines(x=data$V1,y=data$V3,type = "l",col="darkorchid",lty=2)
nmrsd=round(NRMSD(data$V2,data$V3),4)
len=paste(paste(paste("DATES estimate (gen)",paste(round(jout$V2,0),round(jout$V5,0),sep = " ± "),sep = ": "),
                paste("DATES estimate (BCE)",paste(round(jout$V2,0)*28+8071,round(jout$V5,0)*28,sep = " ± "),sep = ": "),sep = "\n"),
          paste("NRMSD",nmrsd,sep="="),sep = "\n")
legend("topright",legend = len,col = "darkorchid",lty=c(2,-1,-1),cex=0.7,bty='n')

dates_files=read.table(file = "data/Figure3_supplement1_FF_curves")
mt=c("Iran_N-Anatolian_N","Iran_N-Anatolian_N")
for (i in 1:nrow(dates_files)) 
{
  file=dates_files$V1[i]
  path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/")
  mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1]
  mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1]
  data=read.table(file = var)
  jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/")
  estimate=read.table(file = time)
  name=gsub("estimate_","",gsub(".fit", "", file))
  title_name=gsub("published_merged","Anatolian Farmer",gsub("Anatolia_N_Serbia_Iron_Gates_HG_","",mm1))
  bce=admix_dates[admix_dates$Population==title_name,]
  main_t=paste("Target:",paste(title_name,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n",mt[i])
  len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ")
  nmrsd=round(NRMSD(data$V2,data$V3),4)
  if(nmrsd>0.7) {kol="grey20"} else {kol="firebrick"};
  plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
       ylab = "Ancestry covariance",las=1,main =main_t ,cex=1,cex.main=0.8)
  lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1)
  len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"),
                  paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"),
            paste("NRMSD",nmrsd,sep="="),sep = "\n")
  legend("topright",legend = len,col = kol,bty='n',cex=0.7)
}
### Farmer spread
dates_files=read.table(file = "data/Figure3_supplement1_Neolithic_curves")
for (i in 1:nrow(dates_files)) 
{
  file=dates_files$V1[i]
  path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/")
  mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1]
  mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1]
  refm=paste(unlist(strsplit(as.character(mm),'-',fixed=TRUE))[2],"-",unlist(strsplit(as.character(mm),'-',fixed=TRUE))[3])
  data=read.table(file = var)
  jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/")
  estimate=read.table(file = time)
  name=gsub("estimate_","",gsub(".fit", "", file))
  bce=admix_dates[admix_dates$Population==mm1,]
  main_t=paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:",gsub("Turkey_N","AnatolianN",refm))
  len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ")
  nmrsd=round(NRMSD(data$V2,data$V3),4)
  if(nmrsd>0.7) {kol="grey20"} else {kol="orange2"};
  plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
       ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.8)
  lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1)
  len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"),
                  paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"),
            paste("NRMSD",nmrsd,sep="="),sep = "\n")
  legend("topright",legend = len,col = kol,bty='n',cex=0.7)
}

### Steppe formation- EBA
dates_files=read.table(file = "data/Figure3_supplement1_Steppe_formation")
for (i in 1:nrow(dates_files)) 
{
  file=dates_files$V1[i]
  path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/")
  mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1]
  mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1]
  data=read.table(file = var)
  jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/")
  estimate=read.table(file = time)
  name=gsub("estimate_","",gsub(".fit", "", file))
  bce=admix_dates[admix_dates$Population==mm1,]
  main_t=gsub("_pub","",paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:Iran_N_pooled-EHG_pooled"))
  len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ")
  nmrsd=round(NRMSD(data$V2,data$V3),4)
  if(nmrsd>0.7) {kol="grey20"} else {kol="deeppink"};
  plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
       ylab = "Ancestry covariance",las=1,main = main_t,cex=1.2,cex.main=0.8)
  lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1)
  len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"),
                  paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"),
            paste("NRMSD",nmrsd,sep="="),sep = "\n")
  legend("topright",legend = len,col = kol,bty='n',cex=0.7)
}
### Steppe formation - MLBA
dates_files=read.table(file = "data/Figure3_supplement1_Steppe_MLBA")
for (i in 1:nrow(dates_files)) 
{
  file=dates_files$V1[i]
  path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/")
  mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1]
  mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1]
  data=read.table(file = var)
  jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/")
  estimate=read.table(file = time)
  name=gsub("estimate_","",gsub(".fit", "", file))
  bce=admix_dates[admix_dates$Population==mm1,]
  main_t=paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:Steppe EBA-Neolithic groups")
  len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ")
  nmrsd=round(NRMSD(data$V2,data$V3),4)
  if(nmrsd>0.7) {kol="grey20"} else {kol="lightpink"};
  plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
       ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.8)
  lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1)
  len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"),
                  paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"),
            paste("NRMSD",nmrsd,sep="="),sep = "\n")
  legend("topright",legend = len,col = kol,bty='n',cex=0.7)
}
### Steppe spread
dates_files=read.table(file = "data/Figure3_supplement1_Steppe_spread")
for (i in 1:nrow(dates_files)) 
{
  file=dates_files$V1[i]
  path="data/Figure3_supplement1_decay_files/" ; var=paste(path,file,sep = "/")
  mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1]
  mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1]
  data=read.table(file = var)
  jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/")
  estimate=read.table(file = time)
  name=gsub("estimate_","",gsub(".fit", "", file))
  title_name=gsub("Afanasievo_Anatolia_N_","",mm1)
  bce=admix_dates[admix_dates$Population==title_name,]
  main_t=paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:Steppe groups-(WHG+AnatoliaN)")
  len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ")
  nmrsd=round(NRMSD(data$V2,data$V3),4)
  if(nmrsd>0.7) {kol="grey20"} else {kol="green3"};
  plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
       ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.7)
  lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1)
  len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"),
                  paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"),
            paste("NRMSD",nmrsd,sep="="),sep = "\n")
  legend("topright",legend = len,col = kol,bty='n',cex=0.8)
}
DATES (Distribution of Ancestry Tracts of Evolutionary Signals) ancestry covariance decay curves.

We show the weighted ancestry covariance decay curves generated using DATES for all the target groups analyzed in the study. Each subplot shows the decay curve for one target population with the associated reference groups shown in the title. For each target, in the legend, we show the inferred average dates of admixture (±1 SE) in generations before the individual lived, in BCE that accounts for the average age of all the individuals in the target and the mean generation time of human populations (see Materials and methods). We also show the normalized root-mean-square deviation (NRMSD) values for all fitted curves and the plots with NRMSD >0.7 are shown in gray. For consistency, we use the same colors as Figure 2.

#' @width 9
#' @height 6
dd=read.table(file = "data/Figure3_supplement2")
plot(seq(1,NROW(dd$V1),1),dd$V8,col=as.character(dd$V10),pch=15, 
     ylim =c(3000,19000),las=1, xaxt='n',xlab="",ylab="Admixture time in years BCE",
     main = "Admixture dates in Iron_Gates samples grouped by c14 age in bins of 500 years"); grid()
segments(x0 =seq(1,NROW(dd$V1),1),x1 = seq(1,NROW(dd$V1),1),y0 =(dd$V8+2*dd$V9),
         y1 = (dd$V8-2*dd$V9),col=as.character(dd$V10),lty = 1)
segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V8+2*dd$V9),
         y1 = (dd$V8+2*dd$V9),col=as.character(dd$V10))
segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V8-2*dd$V9),
         y1 = (dd$V8-2*dd$V9),col=as.character(dd$V10))
points(seq(1,NROW(dd$V1),1),dd$V7)
segments(x0 =seq(1,NROW(dd$V1),1),x1 = seq(1,NROW(dd$V1),1),y0 =(dd$V2-1950),
         y1 = (dd$V3-1950))
segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V2-1950),
         y1 = (dd$V2-1950))
segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V3-1950),
         y1 = (dd$V3-1950))
axis(1, 1:NROW(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', NROW(dd$V1)))
text(1:NROW(dd$V1), rep(2500, NROW(dd$V4)),
     labels= gsub("IronGates-","",dd$V1), col="black", srt=25, xpd=TRUE, adj=1,cex=1)
legend("topright",legend = c("WHG-EHG admixture per c14 bin","WHG-EHG admixture in all samples","average c14 ages"),
       col = c("blue","cyan2","grey40"),lty=c(1,1,1),pch=c(15,15,1))
Timing of western hunter-gatherer (WHG) and eastern hunter-gatherer (EHG) admixture in Iron Gates hunter-gatherer (HG) samples.

The time of admixture in Iron Gates HG samples grouped in bins of C14 age of 500 years. The C14 age is shown on X-axis and the admixture time in BCE for corresponding samples is shown on the Y-axis.

Early to middle Neolithic

Neolithic farming began in the Near East – the Levant, Anatolia, and Iran – and spread to Europe and other parts of the world bib23bib34bib53. The first farmers of Europe were related to Anatolian farmers, whose origin remains unclear. The early Neolithic Anatolian farmers (Aceramic Anatolian farmers) had majority ancestry from AHG with some gene flow from the first farmers from Iran bib14. AHG, in turn, had ancestry from Levant HG (Natufians) and some mysterious HG group related to the ancestors of WHG individuals from central Europe – a gene flow event that likely occurred in the late Pleistocene bib14. Using qpAdm, we confirmed that early Anatolian farmers could be modeled as a mixture of AHG and Iran Neolithic farmer-related groups (Supplementary file 2C). To learn about the timing of the genetic formation of early Anatolian farmers, we applied DATES using Iran Neolithic farmer-related individuals and other reference as groups with AHG ancestry. Since there are limited samples of AHG ancestry, we instead used pooled individuals of WHG-related and Levant Neolithic farmer-related individuals to represent the main ancestry components of AHG. We note that the application of DATES to three-way admixed groups such as early Anatolian farmers can lead to intermediate dates between the first and second pulse of gene flow unless the reference populations are chosen carefully (Appendix 2—table 1). Our setup with pooled reference populations should recover the timing of the most recent event (in this case, the gene flow from CHG or Iran Neolithic-related groups) reliably. We infer the Iran Neolithic farmer-related gene flow occurred ~10,900 BCE (12,200–9600 BCE) (Figure 3), predating the origin of farming in Anatolia bib9. During the subsequent millennia, these early farmers further admixed with Levant Neolithic groups to form Anatolian Neolithic farmers who spread towards the west to Europe and in the east to mix with Iran Neolithic farmers, forming the Chalcolithic groups of Seh Gabi and Hajji Firuz (Supplementary file 2C). Using DATES, we inferred that these Chalcolithic groups were genetically formed in ~7600–5700 BCE (Supplementary file 1B).

#' @width 6
#' @height 12
dd=read.table(file = "data/Figure2_Data_map" ,header = T)
world <- ne_countries(scale = "medium", returnclass = "sf")
ggplot(world) + geom_sf() + coord_sf(xlim = c(-25,120), ylim = c(35,80), expand = FALSE) + 
  geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=18,cex=4,show.legend = F) +
  geom_text_repel(data=dd,aes(x=long+10,y=lat),label=dd$sample,cex=4)+
  theme_bw()
# decay curves
par(mai = c(1,1.1,0.5,0.5), mfrow=c(1,2))
data=read.table(file = "data/Figure2_Data_AnatoliaFarmer.fit", header = F)
jout=read.table(file = "data/Figure2_Data_AnatoliaFarmer.jout",header = F)
plot(x=data$V1,y=data$V2,type="p",col="darkorchid",pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
     las=1,main = "Early Anatolian farmers",cex=2,cex.main=1.5,cex.axis=1.2,cex.lab=1.2,
     ylab ="")
title(ylab = "Ancestry covariance", mgp = c(4.2, 1, 0)) 
lines(x=data$V1,y=data$V3,type = "l",col="darkorchid",lty=2)
nrmsd=round(NRMSD(data$V2,data$V3),4)
len=paste(paste(paste("DATES estimate (gen)",paste(round(jout$V2,0),round(jout$V5,0),sep = " ± "),sep = ": "),
                paste("DATES estimate (BCE)",paste(round(jout$V2,0)*28+8071,round(jout$V5,0)*28,sep = " ± "),sep = ": "),sep = "\n"),
          paste("NRMSD",nrmsd,sep="="),sep = "\n")
legend("topright",legend = len,col = "darkorchid",lty=c(2,-1,-1), cex=1,bty='n')

data=read.table(file = "data/Figure2_Data_SteppeFarmer.fit",header = F)
jout=read.table(file = "data/Figure2_Data_SteppeFarmer.jout",header = F)
plot(x=data$V1,y=data$V2,type="p",col="deeppink",pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)",
     las=1,main = "Early Steppe Pastoralists",cex=2,cex.main=1.5,cex.axis=1.2,cex.lab=1.2,ylab="")
title(ylab = "Ancestry covariance", mgp = c(4.2, 1, 0)) 
lines(x=data$V1,y=data$V3,type = "l",col="deeppink",lty=2)
nrmsd=round(NRMSD(data$V2,data$V3),4)
len=paste(paste(paste("DATES estimate (gen)",paste(round(jout$V2,0),round(jout$V5,0),sep = " ± "),sep = ": "),
                paste("DATES estimate (BCE)",paste(round(jout$V2,0)*28+2881,round(jout$V5,0)*28,sep = " ± "),sep = ": "),sep = "\n"),
          paste("NRMSD",nrmsd,sep="="),sep = "\n")
legend("topright",legend = len,col = "deeppink",lty=c(2,-1,-1), cex=1,bty='n')

Genetic formation of early Anatolian farmers and early Bronze Age Steppe pastoralists.

The top panel shows a map with sampling locations of the target groups analyzed for admixture dating. The bottom panels show the inferred times of admixture for each target using DATES (Distribution of Ancestry Tracts of Evolutionary Signals) by fitting an exponential function with an affine term , where d is the genetic distance in Morgans and = (t+1) is the number of generations since admixture (t) (Materials and methods). We start the fit at a genetic distance (d) >0.5 cM (centiMorgans) to minimize confounding with background LD and estimate a standard error by performing a weighted block jackknife removing one chromosome in each run. For each target, in the legend, we show the inferred average dates of admixture (±1 SE) in generations before the individual lived, in BCE accounting for the average age of all the individuals and the mean human generation time, and the normalized root-mean-square deviation (NRMSD) values to assess the fit of the exponential curve (Materials and methods). The bottom left shows the ancestry covariance decay curve for early Anatolian farmers inferred using one reference group as a set of pooled individuals of western hunter-gatherer (WHG)-related and Levant Neolithic farmers-related individuals as a proxy of Anatolian hunter-gatherer (AHG) ancestry and the second reference group containing Iran Neolithic farmer-related individuals. The bottom right shows the ancestry covariance decay curve for early Steppe pastoralists groups, including all Yamnaya and Afanasievo individuals as the target group and eastern hunter-gatherer (EHG)-related and Iran Neolithic farmer-related groups as reference populations.

In Europe, the Anatolian Neolithic farmers mixed with the local indigenous HGs contributing between ~40% and 98% of ancestry to the Neolithic Europeans. To elucidate the fine-scale patterns and regional dynamics of these mixtures, we applied DATES to time transect samples from 94 groups (n=657) sampled from 16 regions in Europe, ranging from ~6000 to 1900 BCE and encompassing individuals from the early Neolithic to Chalcolithic periods (Supplementary file 1A). Using qpAdm, we first confirmed that the Neolithic Europeans could be modeled as a mixture of European HG-related ancestry and Anatolian farmer-related ancestry and inferred their ancestry proportions (Supplementary file 2D). For most target populations (~80%), we found the model of gene flow between Anatolian farmer-related and WHG-related ancestry provided a good fit to the data (p-value > 0.05). In some populations, we found variation in the source of the HG-related ancestry and including either EHG- or GoyetQ2-related ancestry groups improved the fit of the model. In five groups, none of the models fit, despite excluding outlier individuals whose ancestry profile differed from the majority of the individuals in the group (Supplementary file 2E). To confirm that the target populations do not harbor Steppe pastoralist-related ancestry, we applied D-statistics of the form D(Mbuti, target, Anatolian farmers, Steppe pastoralists) where target = Neolithic European groups. We observed that four groups had a stronger affinity to Steppe pastoralists compared to Anatolian farmers, and hence we excluded these from further analysis (Supplementary file 2F). After filtering, we applied DATES to 86 European Neolithic groups using WHG-related individuals and Anatolian farmers as reference populations.

Earlier analysis has suggested that farming spread along two main routes in Europe, from southeast to central Europe (‘continental route’) and along the Mediterranean coastline to Iberia (‘coastal route') bib20bib21bib52. Consistent with this, we inferred one of the earliest timings of gene flow was in the Balkans around 6400 BCE. Using the most comprehensive time-transect in Hungary with 19 groups (n=63) spanning from middle Neolithic to late Chalcolithic, we inferred the admixture dates ranged from ~6100 to 4500 BCE. Under a model of a single shared gene flow event in the common ancestors of all individuals, we would expect to obtain similar dates of admixture (before present) after accounting for the age of the ancient specimens. Similar to bib36, we observed that the estimated dates in middle Neolithic individuals were substantially older than those inferred in late Neolithic or Chalcolithic individuals bib7. This would be expected if the underlying model of gene flow involved multiple pulses of gene flow, such that the timing in the middle Neolithic samples reflects the initial two-way mixture and the timing in the Chalcolithic samples captures both recent and older events. Interestingly, bib36, and other recent studies have documented increasing HG ancestry from ~3% to 15% from the Neolithic to Chalcolithic period bib24bib36bib52, suggesting that there was additional HG gene flow after the initial mixture. This highlights that the interactions between local HGs and incoming Anatolian farmers were complex with multiple gene flow events or continuous admixture between these two groups, which explains the increasing HG ancestry and more recent dates in Chalcolithic individuals (Supplementary file 2D).

Mirroring the pattern in Hungary, we documented the resurgence of HG ancestry in the Czech Republic, France, Germany, and southern Europe. In central Europe, we inferred that the Anatolian farmer-related gene flow ranged between ~5600 and 5000 BCE across Germany and Czech Republic, with some exceptions. For instance, in the Blätterhöhle site from Germany, the inferred dates were more recent (~4000 BCE), consistent with the occupation of both HGs and farmers in this region until the late Neolithic bib36. In eastern Europe, using samples related to the Funnel Beaker culture (TRB; from German Trichterbecher) from Poland, we dated the Anatolian farmer-related gene flow occurred on average ~4700 BCE (5300–4200 BCE). Following the TRB decline, the Baden culture and the Globular Amphora culture appeared in many areas of Poland and Ukraine bib15. These cultures had close contact with the Corded Ware complex (CWC) and Steppe pastoralists’ societies, though we found a parsimonious model without Steppe pastoralist-related ancestry provides a good fit to the GAC individuals (Supplementary file 2D). Applying DATES, we inferred the Anatolian farmer and HG-related mixture in GAC ranged between ~4700 and 3900 BCE, predating the spread of Steppe pastoralists to eastern Europe bib1bib24.

Along the Mediterranean route, we characterized Anatolian farmer-related gene flow in Italy, Iberia, France, and the British Isles. Using samples from five groups in Italy, we inferred the earliest dates of gene flow of ~6100 BCE, and within the millennium, the Anatolian farmer-related ancestry spread from Sardinia to Sicily (Figure 2). In Iberia, the Anatolian farmer-related mixture ranged from ~5700 to 4300 BCE and showed evidence for an increase in HG ancestry from ~9% to 20% after the initial gene flow. In France, previous studies have shown that Anatolian farmer-related ancestry came from both routes, along the continental route in the north and along the costal route in the south bib52. This is reflected in the source of the HG ancestry, which is predominantly EHG and WHG-related in the north and includes WHG and Goyet-Q2 ancestry in the south bib52. Consistently, we also observed that the admixture dates in France were structured along these routes, with the median estimate of ~5100 BCE in the east and much older ~5500 BCE in the south (Supplementary file 1B). In Scandinavia, we inferred markedly more recent dates of admixture of ~4300 BCE using samples from Sweden associated with the TRB culture and Ansarve Megalithic tombs, consistent with a late introduction of farming to Scandinavia bib39.

Finally, we inferred recent dates of admixture in Neolithic samples from the British Isles (England, Scotland, and Ireland) with the median timing of ~5000 BCE across the three regions. Interestingly, unlike in western and southern Europe, we obtained overlapping dates across eight groups including early to late Neolithic samples from British Isles. This is consistent with previous studies that suggest there was no resurgence in HG ancestry during the Neolithic in Britain bib8. Thus our dates can be interpreted as the time of the main mixture of HGs and Anatolian farmers in this region, implying that the farmer-related ancestry reached Britain a millennium after its arrival in continental Europe. By 4300 BCE, we find that Anatolian farmer-related ancestry is present in nearly all regions in Europe.

Late Neolithic to Bronze Age

The beginning of the Bronze Age (BA) was a period of major cultural and demographic change in Eurasia, accompanied by the spread of Yamnaya Steppe pastoralist-related ancestry from Pontic-Caspian steppes across Europe and South Asia bib24. The archaeological record documents that the early Steppe pastoralists cultures of Yamnaya and Afanasievo, with characteristic burial styles and pottery, appeared ~3300–2600 BCE bib42. These groups were formed as a mixture of EHG-related individuals and CHG-related groups associated with the first farmers from Iran bib30bib43bib56. Using qpAdm, we first tested how well this model fits the data from eight early Steppe pastoralist groups, including seven groups associated with Yamnaya culture and one group related to the Afanasievo culture (Materials and methods). For all but two Yamnaya groups (from Hungary Baden and Russia Kalmykia), we found this model provides a good fit to the data (Supplementary file 2G). We note that the samples from Kalmykia in our dataset were shotgun sequenced, and in the qpAdm analysis, we are mixing shotgun and capture data that could potentially lead to technical issues. To understand the timing of the formation of the early Steppe pastoralist-related groups, we applied DATES using pooled EHG-related and pooled Iranian Neolithic farmer-related individuals. Focusing on the groups with the largest sample sizes, Yamnaya Samara (n=10) and Afanasievo (n=19), we inferred the admixture occurred between 40 and 45 generations before the individuals lived, translating to an admixture timing of ~4100 BCE (Supplementary file 1B). We obtained qualitatively similar dates across four Yamnaya and one Afanasievo groups, consistent with the findings that these groups descend from a recent common ancestor (we note for the Ozera samples from Ukraine, the dates were not significant). This is also further supported by the insight that the genetic differentiation across early Steppe pastoralist groups is very low (FST ~ 0.000–0.006) (Supplementary file 2H). Thus, we combined all early Steppe pastoralist individuals in one group to obtain a more precise estimate for the genetic formation of proto-Yamnaya of ~4400–4000 BCE (Figure 3). These dates are noteworthy as they predate the archaeological evidence by more than a millennium bib2 and have important implications for understanding the origin of proto-Pontic Caspian cultures and their spread to Europe and South Asia.

Over the following millennium, the Yamnaya-derived ancestry spread across Europe through CWC and Bell Beaker complex (BBC) cultures. Present-day Europeans derive between ~10% and 60% Steppe pastoralist-related ancestry, which was not seen in Neolithic samples. To obtain a precise chronology of the spread of Steppe pastoralist-related ancestry across Europe, we analyzed 109 late Neolithic, Chalcolithic, and BA samples dated between 3000 and 750 CE from 18 regions, including samples associated with the CWC and BBC cultures. We first confirmed that most target samples had Steppe pastoralist-related ancestry, in addition to European HG-related and Anatolian farmer-related ancestry using qpAdm. We excluded 20 groups that could not be parsimoniously modeled as a three-way mixture even after removing individual outliers. After filtering, we retained 79 groups for dating Steppe pastoralist-related gene flow across Europe (Supplementary file 2I and J). As BA Europeans have ancestry from three distinct groups, we applied DATES using the following two reference populations, one group including early Steppe pastoralists (Yamnaya and Afanasievo) and the other group that is a the proxy for the ancestral Neolithic Europe population using pooled samples of WHG-related and Anatolian farmer-related individuals.

To learn about the spread of CWC culture across Europe, we used seven late Neolithic and Bronze age groups, including five associated with CWC artifacts. Using DATES, we inferred that the oldest date of Steppe pastoralists gene flow in Europe was ~3200 BCE in Scandinavia in samples associated with Battle Axe Culture in Sweden and Single Grave Culture in Denmark that were both contemporary to CWC. The samples from Scandinavia showed large heterogeneity in ancestry, including some individuals with majority Steppe pastoralist-related ancestry (and negligible amounts of Anatolian farmer-related ancestry), consistent with patterns expected from recent gene flow bib38. Strikingly, we inferred the timing of admixture in central Europe (Germany and the Czech Republic) and eastern Europe (Estonia and Poland) to be remarkably similar. These dates fall within a narrow range of ~3000–2900 BCE across diverse regions, suggesting that the mixed population associated with the Corded Ware culture formed over a short time and spread across Europe rapidly with very little further mixture (Supplementary file 1B).

Following the Corded Ware culture, from around 2800 to 2300 BCE, Bell Beaker pottery became widespread across Europe bib17. Using 19 Chalcolithic and BA samples, including 10 associated with Beaker-complex artifacts, we inferred the dynamics of the spread of the Beaker complex across Europe. We inferred the oldest date of Steppe pastoralist-related admixture was ~3200 BCE (3600–2800 BCE) in early Bronze Age (EBA) Mallorca samples from Iberia. We note the EBA Mallorca sample is not directly associated with Beaker culture, but qpAdm modeling suggests that this individual is clade with the small subset of Iberian Beaker-complex-associated individuals who carried Steppe pastoralist-related ancestry bib16. Most individuals from Iberia, however, had negligible Steppe pastoralist-related ancestry suggesting the Beaker culture was not accompanied by major gene flow in Iberia despite the earliest dates (Supplementary file 2I). In central and western Europe, where Steppe pastoralist gene flow was more pervasive, we inferred the median date of the mixture was ~2700 BCE with the oldest dates in the Netherlands, followed by Germany and France (Figure 2). There was, however, large heterogeneity in the dates across Europe and even within the same region. For example, comparing two BA groups from the Netherlands suggests a wide range of dates ~3000 BCE and 2500 BCE, and four groups from Germany indicate a range of ~2900–2700 BCE. From central Europe, the Steppe pastoralist-related ancestry spread quickly to the British Isles, where people with Steppe pastoralist ancestry replaced 90% of the genetic ancestry of individuals from Britain. Our estimates for the time of gene flow in Bell Beakers samples from England suggest that the gene flow occurred ~2700 BCE (2770–2550 BCE). Our estimated dates of admixture are older than the dates of arrival of this ancestry in Britain bib44 and, interestingly, overlap the dates in central Europe. Given that a significant fraction of the Beaker individuals were recent migrants from central Europe, we interpret our dates reflect the admixture into ancestors of the British Beaker people, occurring in mainland Europe bib44.

The middle to late Bronze Age (MLBA) led to the final integration of Steppe pastoralist-related ancestry in Europe. In southern Europe, EBA samples had limited Steppe pastoralist-related ancestry, though present-day individuals harbor between ~5% and 30% of this ancestry bib24. Using pooled samples of MLBA from Spain, we inferred major mixture occurred ~2500 BCE in Iberia. We inferred a similar timing in Italy using individuals associated with the Bell Beaker culture and EBA samples from Sicily (Supplementary file 1B). In Sardinia, a majority of the BA samples do not have Steppe pastoralist-related ancestry. In a few individuals, we found evidence for Steppe pastoralist-related ancestry, though in most cases, this ancestry proportion overlapped 0 and the inferred dates of admixture were very noisy (Supplementary file 2I). Using Iron Age samples from Sardinia, we inferred the gene flow occurred ~2600 BCE, though there is a large uncertainty associated with this estimate (3700–1490 BCE). In other parts of continental Europe and the British Isles, the Steppe pastoralist-related ancestry got diluted over time, as evidenced by more recent dates in LBA (late Bronze Age) than EBA or MBA (middle Bronze Age) samples in Germany, England, and Scotland, and an increase in Neolithic farmer ancestry during this period (bib45; Supplementary file 1B).

Finally, the CWC expanded to the east to form the archaeological complexes of Sintashta, Srubnaya, Andronovo, and the BA cultures of Kazakhstan. Samples associated with these cultures harbor mixed ancestry from the Yamnaya Steppe pastoralist-related groups (CWC, in some cases) and Neolithic individuals from central Europe (Supplementary file 2K; bib43). Applying DATES to eight MLBA Steppe pastoralist groups, we inferred the precise timing for the formation of these groups beginning in the third millennium BCE. These groups were formed chronologically, with the date of genetic formation of ~3200 BCE for Sintashta culture, followed by ~2900 BCE for Srubnaya and Andronovo cultures. In the central Steppe region (present-day Kazakhstan), we obtained median dates of ~2800 BCE for the expansion of Steppe pastoralist-related ancestry in four Kazakh cultures of Maitan Alakul, Aktogai, and Kairan. By ~2700 BCE, most of these cultures had almost 60–70% Yamnaya Steppe pastoralist-related ancestry (Supplementary file 1B). These groups, in turn, expanded eastwards, transforming the genetic composition of populations in South Asia.

Discussion

We developed DATES that measures ancestry covariance patterns in a single diploid individual genome to estimate the time of admixture. Using extensive simulations, we show that DATES provides accurate estimates of the timing of admixture across a range of demographic scenarios. Application of DATES to present-day samples shows that the results are concordant with published methods – ROLLOFF, ALDER, and Globetrotter. For sparse datasets, DATES outperforms published methods as it does not require phased data and works reliably with limited samples, large proportions of missing variants, as well as pseudo-haploid genotypes. This makes DATES ideally suited for the analysis of ancient DNA samples. We illustrate the application of DATES by reconstructing population movements and admixtures during the European Holocene. We confirm and extend signals that were previously identified such as the resurgence of HG ancestry during the Neolithic and provide new details about the genetic formation of the ancestral populations of Europeans and the spread of CWC and BBC cultures across Europe. Together, our analysis provides a detailed timeline and insights into the dynamics of the Neolithization of Europe and the spread of Steppe pastoralist-related ancestry across Europe.

First, we document that the Mesolithic HGs formed as a mixture of WHG and EHG ancestry ~10,200–7400 BCE. These dates are consistent with the archaeological evidence for the appearance of lithic technology associated with eastern HGs in Scandinavia and the Baltic regions bib22bib32. Next, we studied the timing of the genetic formation of Anatolian farmers. The earliest evidence of agriculture comes from the Fertile Crescent, the southern Levant, and the Zagros Mountains of Iran and dated to around 10,000 BCE. In central Anatolia, farming has been documented c. 8300 BCE bib4bib6. It has been long debated if Neolithic farming groups from Iran and the Levant introduced agriculture to Anatolia or HGs in the region locally adopted agricultural practices. The early Anatolian farmers can be modeled as a mixture of local HGs related to Caucasus HGs or the first farmers from Iran bib14. By applying DATES (assuming a single instantaneous admixture), we inferred that the Iran Neolithic gene flow occurred around 10,900 BCE (~12,200–9600 BCE). An alternate possibility is that there was a long period of gradual gene flow between the two groups and our dates reflect intermediate dates between the start and end of the gene flow. An upper bound for such a mixture comes from the lack of Iran Neolithic ancestry in AHGs at 13,000 BCE, and a lower bound comes from the C14 dates of early Anatolian farmers, one of which is directly dated at 8269–8210 BCE bib14. In either case (instantaneous admixture or gradual gene flow), the genetic mixture that formed Anatolian farmers predates the advent of agriculture in this region bib4bib6. This supports the model that AHGs locally transitioned to agricultural subsistence, and most probably, there was cultural diffusion from other regions in Near East (Iran and Levant) bib14. Future studies with more dense temporal sampling will shed light on the demographic processes that led to the transition from foraging to farming in the Near East, and in turn, elucidate the relative roles of demic and cultural diffusion in the dispersal of technologies like agriculture across populations.

Using data from 16 regions in Europe, we reconstruct a detailed chronology and dynamics of the expansion of Anatolian farmers during the Neolithic period. We infer that starting in ~6400 BCE, gene flow from Anatolian farmers became widespread across Europe, and by ~4300 BCE, it was present in almost all parts of continental Europe and the British Isles. These dates are significantly more recent than the estimates of farming based on archaeological evidence in some parts of Europe, suggesting that the local HGs and farmers coexisted for more than a millennium before the mixture occurred bib24bib36. In many regions, after the initial mixture, there was a resurgence of HG ancestry, highlighting the complexities of these ancient interactions. We note that our results are consistent with two previous genetic studies, bib36, and bib52, that applied genetic dating methods to a subset of samples we used in our analysis. bib36, used a modified version of ALDER to infer the timing of admixture in three regions (n=151). We obtained statistically consistent results for all overlapping samples (within two standard errors) (Appendix 1—table 6). An advantage of our approach over the modified ALDER approach is that we do not rely on helper samples (higher coverage individuals combined with the target group) for dating; unless these have a similar ancestry profile, they could bias the inferred dates. Our results are concordant with bib52, that used a previous version of DATES to infer the timing of Neolithic gene flow in 32 groups (vs. 86 groups in our study). We find the performance of both versions of DATES is similar, though some implementation details have improved (Appendix 1—table 1).

The second major migration occurred when populations associated with the Yamnaya culture in the Pontic-Caspian steppes expanded across Europe. Our analysis reveals the precise timing of the genetic formation of the early Steppe pastoralist groups – Yamnaya and Afanasievo – occurred ~4400–4000 BCE. This estimate predates the archaeological evidence by more than a millennium bib2 and suggests the presence of an ancient ‘ghost’ population of proto-Yamnaya around this time. Understanding the source and location of this ghost population will provide deep insights into the history of Pontic-Caspian cultures and the origin of Indo-European languages that have been associated to have spread with Steppe pastoralists ancestry to Europe and South Asia bib24bib33. Starting in ~3200 BCE, the Yamnaya-derived cultures of CWC and BBC spread westwards, bringing Steppe pastoralist-related ancestry to Europe. Our analysis reveals striking differences in the spread of these two cultures: the CWC formation is similar across diverse regions separated by thousands of kilometers, suggesting a rapid spread after the initial formation of this group, while the spread of BBC culture was more complex and heterogeneous across regions. We find the earliest evidence of Steppe pastoralist-related ancestry in Iberia around 3200 BCE, though this ancestry only becomes widespread after 2500 BCE. In central Europe, the gene flow occurred simultaneously with archaeological evidence and was coexisting with the CWC in some parts bib57bib44. Finally, in the British Isles, the Bell Beaker culture spreads rapidly from central Europe and replaces almost 90% of the ancestry of individuals in this region bib44.

Recent analysis has shown remarkable parallels in the history of Europe and South Asia; with both groups deriving ancestry from local indigenous HGs, Near Eastern farmers, and Steppe pastoralist-related groups bib43. Interestingly, however, the timing of the two major migrations events differs across the two subcontinents. Both mixtures occurred in Europe almost a millennium before they occurred in South Asia. In Europe, the Neolithic migrations primarily involved Anatolian farmers, while the source of Neolithic ancestry is closer to Iran Neolithic farmers in South Asia. The Steppe pastoralist-related gene flow occurred in the context of the spread of CWC and BBC cultures in Europe around 3200–2500 BCE; in South Asia, this ancestry arrived with Steppe MLB A cultures that were formed much later in 1800–1500 BCE bib43. The Steppe MLBA groups have ancestry from Steppe pastoralist derived groups and European Neolithic farmers following the eastward expansion of CWC groups between ~3200 and 2700 BCE. Understanding the origin and migration paths of the ancestral groups thus helps to illuminate the differences in the timeline of the spread of Steppe pastoralists across the two subcontinents of Eurasia.

Genomic dating methods like DATES provide an independent and complementary approach for reconstructing population history. By focusing on the genetic clock based on recombination rate, we provide an independent estimate of the timing of evolutionary events up to several thousands of years. Our analysis also has advantages over the temporal sampling of ancient DNA, in that we can obtain direct estimates of when a population was formed, rather than inferring putative bounds for the timing based on the absence/presence of a particular ancestry signature (which may be sensitive to sampling choice or density). Genetic approaches provide complementary evidence to archaeology and linguistics as they date the time of admixture and not migration. Both dates are similar in many contemporary populations like African Americans and Latinos, though this may not be generally true bib27. This is underscored by our dates for the Anatolian farmer-related mixture, which postdates evidence of material culture related to agriculture by almost two millennia in some regions. This suggests that European HGs and farmers resided side by side for several thousand years before mixing bib7bib54. This underscores how genetic dates can provide complementary evidence to archaeology and help to build a comprehensive picture of population origins and movements.

Materials and methods

Dataset

We analyzed 1096 ancient European samples from 152 groups restricting to data from 1,233,013 autosomal SNP positions that were genotyped using the Affymetrix Human Origins array (the V44.3 release of the AADR; https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data). We filtered this dataset to remove samples that were marked as contaminated, low coverage, outliers, duplicates, or first- or second-degree relatives. We grouped individuals together from a particular culture or region. Details of sample affiliation and grouping used is described in Supplementary file 1A.

Modeling admixture history

We applied qpAdm from ADMIXTOOLS to identify the best fitting model and estimate the ancestry proportions in a target population modeled as a mixture of n ‘reference’ populations using a set of ‘Outgroup’ populations bib24. We set the details: YES parameter, which reports a normally distributed Z-score to evaluate the goodness of fit of the model (standard errors were estimated with a Block Jackknife). For each target population, we chose the most parsimonious model, that is, fitting the data with the minimum number of source populations. We excluded models where the p-value < 0.05 indicating a poor fit to the data. Details of the qpAdm analysis for each group are reported in Supplementary file 2. We also applied D-statistics in some cases using qpDstat in ADMIXTOOLS with default parameters.

DATES: model and implementation

DATES leverages the weighted ancestry covariance patterns across the genome of an admixed individual to infer the time of admixture. This method extends the idea introduced in ROLLOFF and ALDER and bib41 to be applicable to dating admixture events between modern human populations using a single genome.

Basic model and notation

Assume we have an admixed individual C with ancestry from source populations A and B, with ancestry proportion of and , respectively. This mixture occurred generations ago. First, we model the genotypes of C as a linear mix of allele frequencies of populations A and B. For any SNP i, let the genotype of C be and allele frequency in A and B be and We can then infer the mixing fraction from population A by solving the simple linear regression by minimizing the residuals.

Let be the probability of observing in C given the observed genotype in A, and be the probability of observing in C given the observed genotype in B

We can then compute the likelihood Li of observing a genotype in the admixed individual

For a pair of neighboring markers S1, S2 located at a genetic distance of d Morgans, the probability of no recombination between the two markers is given by Accounting for recombination, the log-likelihood that the two markers have the same ancestry is then given by:

Let Ki represent the ancestry at marker Si. Expanding as a power series in θ, the coefficient of is QK1K2, where

We can compute the ancestry covariance, A(d), across pairs of markers S1, S2 separated by distance d as

where S(d) is a set of markers S1, S2 located d Morgans apart.

The ancestry covariance is expected to follow an exponential decay with d with the rate of decay depending on the time since admixture ().

The factor of (t+1) comes from the insight that in the first generation following admixture, the admixed population derives one chromosome from each ancestral group. The mixing of chromosomes only begins in the following generations as the chromosomes recombine. This means that if we fit t generations, we are likely to underestimate the time of admixture. We note that previous methods like ALDER and ROLLOFF, however, incorrectly fit t generations to infer the time of mixture. In practice, however, this has little effect on the inference except maybe in case of very recent admixture dates. We infer the time of the mixture by fitting an exponential distribution with affine term using least squares. DATES is applicable for dating admixture in a single individual. When multiple individuals from an admixed population are available, DATES computes the log-likelihood by summing over all individuals.

Application to real data

We applied DATES using genome-wide SNP data from the target population and two reference populations. To infer the allele frequency in the ancestral populations more reliably, where specified, we pooled individuals deriving the majority of their ancestry from the population of interest (Supplementary file 1A). We computed the weighted ancestry covariance between 0.45 cM (centiMorgans) (to minimize the impact of background LD) and 100 cM, with a bin size of 0.1 cM. We plotted the weighted covariance with genetic distance and obtained a date by fitting an exponential function with an affine term , where d is the genetic distance in Morgans and = (t+1) is the number of generations since admixture (t). We computed standard errors using weighted block jackknife, where one chromosome was removed in each run bib10. Following bib55, we examined the quality of the exponential fit by computing the normalized root-mean-square deviation (NRMSD) between the empirical ancestry covariance values and the fitted ones , across all the genetic distance bins (Appendix 1).

The estimated dates of admixture were considered significant if the (a) Z-score > 2, (b) < 200 generations and (c) NRMSD < 0.7. We converted the inferred dates from generations to years by assuming a mean generation time of 28 years bib41. For ancient samples, we added the sampling age of the ancient specimen (Supplementary file 1A). When multiple individuals were available, we used the average sampling ages to offset the admixture dates. We report dates in BCE by assuming the 1950 convention.

Comparison of old and new version of DATES

An earlier version of DATES (version v753) was released in bib43. The current method (version 4010) released in this study differs in some key aspects of the implementation as described below.

  1. Use of regression model vs. likelihood approach: In v753, we used a regression model to infer the residuals at each site in the genotype by conditioning on the allele frequency in the reference population and the genome-wide estimate of the admixture proportion bib43. In contrast, in the current version (v4010), we use a more rigorous likelihood framework where we infer the probability of ancestry from each reference population at each site in the genome (Equation 3).
  2. Rate of decay of exponential fit: In v753, like ALDER and ROLLOFF, we fit an exponential decay with the rate of t generations. However, this assumes that mosaic chromosomes are formed in the generation when the gene flow occurs. However, in reality, the mixing of ancestry only begins in the following generations as the chromosomes of distinct ancestry recombine. To correctly account for this effect, we fit an exponential with the rate of (t+1) in DATES v4010. In practice, this has a minor effect on the dates reported earlier, as in most cases the uncertainty is much larger than one generation.
  3. Goodness of fit test: In v4010, we implemented the NRMSD to assess the fit of the exponential curve. NRMSD computes the deviation between the empirical estimate and fitted data in order to provide a statistical way to characterize the noisiness of the fitted curve. Lower values of NRMSD suggest a better fit, however, there is no clear interpretation of the absolute value of NRMSD. Based on the empirical distribution of NRMSD values in our study samples (Appendix 1—figure 3), we infer a conservative threshold of 0.7 to define a ‘good’ fit. We caution that users should adjust this threshold based on their application and always visually inspect their exponential fits to ensure reliable results.
  4. Support for arbitrary number of chromosomes: Unlike v753 that was optimized for parameters in humans, the new version supports an arbitrary number of chromosomes (inputted by the user) so DATES can be used in any species.

A comparison of the two version of DATES using simulated data (Appendix 1—table 1) and empirical data (Appendix 1—table 2, Appendix 1—table 3) yields qualitatively similar results.

Simulations

We constructed admixed genomes following the approach described in bib40. This method requires phased haplotypes from two source populations and uses two key parameters to simulate data from admixed individuals, (a) the mixture proportion () that represents the probability that a particular sampled haplotype comes from one of the reference panels, namely source1 and source2, and (b) the time of mixture () which is the number of generations since mixture. To simulate an admixed individual, we begin at the start of the chromosome and sample a haplotype from either source1 with a probability () and source2 with a probability (). At each subsequent marker, we check if there was a recombination event between the two neighboring markers. A recombination event occurs with a probability of (), where g is the genetic distance in Morgans. We use the time of generations to account for the fact that in the first generation following admixture, the offspring inherits one chromosome of each ancestry. In the next generation, the crossovers lead to a mixing of ancestry. Thus, when a recombination event occurs, we resample the ancestry between source1 and source2, otherwise, we copy the haplotype from the same source population. (Note, a recombination event can lead to a switch to a haplotype of the same ancestry.) Once the ancestry is chosen, we randomly pick a haplotype from the ancestral pool (without replacement) and copy its sequence to the genome of the admixed individual. This process is continued until we reach the end of the chromosome. Using this approach, we generate the genomes of admixed individuals. The simulated haploid chromosomes are merged at random to construct diploid admixed individuals. This algorithm requires more than ancestral haplotypes for generating data for diploid admixed individuals bib40. For more than two reference populations, the same algorithm is repeated iteratively.

We used 111 CEU and 112 YRI phased 1000 genomes phase 3 dataset bib3 for generating 10 admixed genomes for ~380,000 SNPs (unless otherwise stated). For the inference, we used French and Yoruba from HGDP bib35. We generated data for various demographic scenarios, where we varied the time the admixture (), proportion of mixture (), sample size in the reference and target populations, divergence between the ancestral and reference populations used and studied their impact on the estimated dates. We also characterized the impact of features of ancient DNA such as missing data, pseudo-haploid genotypes, and limited sample size. In order to simulate pseudo-haploid genotypes, we randomly sampled an allele at each heterozygous site and assigned it as the homozygous genotype at that site bib26. To generate missing data, we set the genotype call at a site as ‘missing’ or ‘unknown’ (in eigenstrat format as 9) where the proportion of missing genotypes ranged between 5% and 60% in our simulations. We also evaluated the impact of the choice of reference populations used in DATES in case of simple and multiple pulses of admixture.

To study the impact of complex scenarios of admixture involving founder events and continuous gene flow, we used a coalescent simulator, MaCs bib12. We simulated 100 Mb of three populations with an effective population size of 12,500, mutation rate of 1.2 × 10–8 and recombination rate 1 × 10–8 per base pair per generation, respectively bib25bib31. We assumed the admixture occurred continuously over a period of time or was followed by the bottleneck. In case of the latter, the duration of the bottleneck was 1–10 generations with reduction in effective population size from 12,500 to 10–1000 and the population recovered to its original size after the bottleneck or maintained a small size until present (no recovery founder event). For each simulation, we generated data for two haploid chromosomes and combined these to generate one diploid chromosome.

Software availability

The executable and source code for DATES will be available on GitHub: https://github.com/MoorjaniLab/DATES_v4010 (copy archived at swh:1:rev:e034dc0d6fe8d41a828796f07791d50011b6bb04; bib13).