The Eurasian house mouse Mus musculus is useful for tracing prehistoric human movement related to the spread of farming. We determined whole mitochondrial DNA (mtDNA) sequences (ca. 16,000 bp) of 98 wild-derived individuals of two subspecies, M. m. musculus (MUS) and M. m. castaneus (CAS). We revealed directional dispersals from their homelands reaching as far as the Japanese Archipelago. Our phylogenetic analysis indicated that the eastward movement of MUS was characterised by five step-wise regional extension events: 1) broad spatial expansion into eastern Europe and the western part of western China, 2) dispersal to the eastern part of western China, 3) dispersal to northern China, 4) dispersal to the Korean Peninsula and 5) colonisation and expansion in the Japanese Archipelago. These events were estimated to have occurred during the last 2,000–18,000 years. The dispersal of CAS was characterised by three events: initial divergences (ca. 7,000–9,000 years ago) of haplogroups in northernmost China and the eastern coast of India, followed by two population expansion events that likely originated in the Yangtze River basin and spread to broad areas of South and Southeast Asia, including Sri Lanka, Bangladesh and Indonesia (ca. 4,000–6,000 years ago), and to Yunnan, southern China and the Japanese Archipelago (ca. 2,000–3,500 years ago). This study provides a solid framework for the spatiotemporal movement of human-associated organisms in Holocene Eastern Eurasia using whole mtDNA sequences, reliable evolutionary rates and accurate branching patterns. The information obtained here contributes to the analysis of a variety of animals and plants associated with prehistoric human migration.
We used a total of 98 house mouse samples in this study. Most of our samples overlap with those used by Suzuki et al. (2013). See Li et al. (in press) for the sample codes and the localities where the samples were collected.
DNA extraction and variant calling
We determined the whole mitogenome sequences (ca. 16,000 bp) of the 98 house mouse samples. Samples that met the required concentration and volume were sent to BGI (Shenzhen, China) for whole-genome sequencing. Libraries were constructed for each sample with index sequences, and paired-end reads of 100 bp were sequenced by BGI using the BGISEQ-500 platform. For each sample, ~1 billion clean reads were obtained. We mapped the raw reads to the GRCm38 (mm10) house mouse reference genome sequence, including the mitogenome, using the BWA-MEM method (Li and Durbin 2009) with the ‘-M’ command option. Samblaster (Faust and Hall 2014) (https://github.com/GregoryFaust/samblaster) with the ‘-M’ command option was used to identify duplicates in read-id groups for exclusion from downstream analysis. The average median coverage of the whole genome sequence was 30.4 per sample. When reads were mapped simultaneously to the nuclear genome and the mitogenome, reads that mapped to mitogenome regions highly similar or identical to regions of the nuclear genome (owing to nuclear mtDNA segments) yielded very low mapping quality (MQ) scores. To recalibrate the MQ scores, we remapped all mapped mitogenome reads to the C57BL6 complete mitogenome (NC_005089.1) using BWA-MEM and recalculated the MQ scores. Single-nucleotide polymorphisms and indels were obtained using the GATK4 HaplotypeCaller program (McKenna et al. 2010) following the 'Best Practice' pipeline instructions. The gVCF files of all samples were combined using GenotypeGVCFs to call the genotypes of all samples simultaneously. To identify low-depth uncalled sites, we created a consensus sequence in FASTA format using bcftools consensus with the '-M' option to mark missing genotypes.
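As a rough illustration of the final step described above, the following Python sketch shows how missing genotypes can be marked in a consensus sequence so that uncalled sites stay visible downstream. It is not the bcftools implementation. The function names (`build_consensus`, `missing_positions`, `to_fasta`), the `calls` mapping and the use of `N` as the missing character are assumptions made for illustration, and only single-base substitutions are handled (indels are ignored).

```python
from typing import List, Mapping, Optional


def build_consensus(reference: str,
                    calls: Mapping[int, Optional[str]],
                    missing_char: str = "N") -> str:
    """Apply per-site calls to a reference sequence.

    `calls` maps a 1-based position to the called base, or to None when the
    genotype is missing (for example, because read depth was too low).
    Positions absent from `calls` keep the reference base.
    """
    consensus = list(reference)
    for pos, base in calls.items():
        if not 1 <= pos <= len(reference):
            raise ValueError(f"position {pos} is outside the reference")
        if base is None:
            # Missing genotype: write the placeholder instead of the reference base.
            consensus[pos - 1] = missing_char
        elif len(base) == 1:
            consensus[pos - 1] = base
        else:
            raise ValueError("only single-base substitutions are supported")
    return "".join(consensus)


def missing_positions(sequence: str, missing_char: str = "N") -> List[int]:
    """Return the 1-based positions that carry the missing-data character."""
    return [i + 1 for i, char in enumerate(sequence) if char == missing_char]


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy example: one substitution at position 2, one missing genotype at position 5.
toy_reference = "ACGTACGT"
toy_calls = {2: "T", 5: None}
toy_consensus = build_consensus(toy_reference, toy_calls)
print(to_fasta("sample1", toy_consensus), end="")
print(missing_positions(toy_consensus))
```

Writing a placeholder at missing sites, rather than silently keeping the reference base, prevents uncalled positions from being read as matches to the reference in later phylogenetic analyses.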
The repository consists of 13 files used in this study, in addition to the Readme file.
- #1_Li_n98_NN_16038bp_nex. This file is the nexus file including the entire set of our original mitogenome sequences (16038 bp) of Mus musculus. The data were used in the BEAST analysis shown in Fig. 2 and in constructing the Neighbor Net network shown in Supplementary Fig. S2.
- #2_Li_n99_ML_meg.meg. This file is the MEGA file including the entire set of mitogenome sequences together with that of Mus spretus (15,185 bp). The data were used in constructing the ML tree shown in Supplementary Fig. 3.
- #3_Li_MUS_n48_meg.meg. This file is the MEGA file of mitogenome sequences (16038 bp) including 47 MUS sequences and a DOM sequence used as outgroup. The data were used in constructing the ML tree shown in Supplementary Fig. 3A.
- #4_Li_CAS_n45_16038bp.meg. This file is the MEGA file of mitogenome sequences (16038 bp) including 44 CAS sequences and a DOM sequence used as outgroup. The data were used in constructing the ML tree shown in Supplementary Fig. 3B.
- #5_Li_MUS1_n42_meg.meg. This file is the MEGA file of mitogenome sequences (16038 bp) containing 42 MUS sequences used in the BEAST analysis shown in Fig. 4A.
- #6_CAS1_n32_meg.meg. This file is the MEGA file of mitogenome sequences (16038 bp) containing 32 CAS sequences used in the BEAST analysis shown in Fig. 5A.
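Before analysis, it can be useful to confirm that a downloaded alignment holds the number of sequences and the alignment length stated above. The sketch below is a minimal, standard-library reader for MEGA-format text. It assumes the common layout of a `#mega` header, `!` command statements ending in `;`, and sequence names introduced by `#`; it does not handle bracketed comments or other MEGA features. The function names `parse_mega` and `check_alignment` are hypothetical.

```python
from typing import Dict, List


def parse_mega(text: str) -> Dict[str, str]:
    """Read sequences from MEGA-format text (simplified, see assumptions above)."""
    chunks: Dict[str, List[str]] = {}
    current = None
    in_command = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if in_command:
            # Continuation of a multi-line command statement such as !Format.
            in_command = not line.endswith(";")
            continue
        if line.startswith("!"):
            in_command = not line.endswith(";")
            continue
        if line.startswith("#"):
            if line.lower() == "#mega":
                continue
            parts = line[1:].split(None, 1)
            if not parts:
                continue
            current = parts[0]
            chunks.setdefault(current, [])
            if len(parts) > 1:
                # Sequence data on the same line as the name.
                chunks[current].append("".join(parts[1].split()))
            continue
        if current is not None:
            chunks[current].append("".join(line.split()))
    return {name: "".join(parts) for name, parts in chunks.items()}


def check_alignment(seqs: Dict[str, str], n_expected: int, length_expected: int) -> bool:
    """Return True if the sequence count and every sequence length match."""
    lengths_ok = all(len(seq) == length_expected for seq in seqs.values())
    return len(seqs) == n_expected and lengths_ok


# Toy alignment standing in for a real file.
toy_text = """#mega
!Title toy alignment;
!Format DataType=DNA;

#seq1
ACGT
ACGT
#seq2 ACGTACGA
"""
toy_seqs = parse_mega(toy_text)
print(sorted(toy_seqs), check_alignment(toy_seqs, 2, 8))
```

Applied to the text of the file listed as #3, for example, one would expect `check_alignment(seqs, 48, 16038)` to hold, given the 47 MUS sequences plus the DOM outgroup described above.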
The following 6 nexus files were used in the Arlequin analysis shown in Table 2. See Table 2 for details.