Detection and analysis of complex structural variation in human genomes across populations and in brains of donors with psychiatric disorders
Data files
Oct 07, 2024 version files (736.32 MB total)
- Dataset_S1_all_ARC-SV_cxSV_calls.zip (91.54 MB)
- Dataset_S2_all_ARC-SV_simpleSV_calls.zip (644.78 MB)
- README.md (11.08 KB)
Abstract
Complex structural variations (cxSVs) are often overlooked in genome analyses due to detection challenges. We developed ARC-SV, a probabilistic and machine-learning-based method that enables accurate detection and reconstruction of cxSVs from standard whole-genome sequencing datasets. By applying ARC-SV across 4,262 genomes representing all continental populations, we identified cxSVs as a significant source of natural human genetic variation. The 4,262 individual genomes are sourced from the 1000 Genomes Project, the Human Genome Diversity Project, and the Simons Genome Diversity Project. We also applied ARC-SV to Neanderthal genomes, a number of benchmarking genomes including CHM13-T2T, HG002, HuRef, PGP1, and HepG2 (cancer), as well as 119 postmortem brains (79 from the CommonMind Consortium and 40 from the National Institute of Mental Health Human Brain Collection Core). Most brain samples are from donors with major psychiatric disorders. The high-confidence cxSV calls for all samples (including dot plot visualizations) are compiled into Dataset S1. The high-confidence simple SV calls produced by ARC-SV for all samples are also included and compiled into Dataset S2. In our study (Zhou et al., Cell 2024), our analysis of these datasets revealed that rare cxSVs have a propensity to occur in neural genes and loci that underwent rapid human-specific evolution, including those regulating corticogenesis. By performing single-nucleus multiomics in postmortem brains, we discovered cxSVs associated with differential gene expression and chromatin accessibility across various brain regions and cell types. Additionally, cxSVs detected in brains of psychiatric cases are enriched for linkage with psychiatric GWAS risk alleles detected in the same brains.
Furthermore, our analysis revealed significantly decreased brain-region- and cell-type-specific expression of cxSV genes, specifically for psychiatric cases, implicating cxSVs in the molecular etiology of major neuropsychiatric disorders.
README: Automatic detection of complex genome structural variation across human populations and in brains of individuals with psychiatric disorders
https://doi.org/10.5061/dryad.z08kprrpc
Description of the data and file structure
Description of the output of SV calls:
For each cluster of candidate breakpoints, ARC-SV attempts to resolve the local structure of both haplotypes. The output file arcsv_out.tab contains one line for each non-reference haplotype called. A call typically consists of a single SV (simple or complex), but some contain multiple variants that were called together.
Where multiple values are given, as in svtype, the order is left to right in the alternate haplotype, which is shown in the rearrangement column.
All genomic positions in arcsv_out.tab are 0-indexed for compatibility with BED files.
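
Because the positions are already 0-indexed, the span of a call can be written out as a BED interval without shifting the start coordinate. The following is a minimal sketch, not part of ARC-SV itself. It assumes that arcsv_out.tab is tab-delimited with a header row naming the fields listed in the table below, and it treats maxbp as the exclusive BED end, which this README does not state explicitly. The function name `calls_to_bed` and the example rows are hypothetical.

```python
import csv
import io

def calls_to_bed(tab_handle):
    """Yield (chrom, start, end, name) tuples from an arcsv_out.tab-style stream.

    Assumes a tab-delimited stream whose header row names the columns
    chrom, minbp, maxbp, and id (an assumption, not confirmed by the README).
    """
    reader = csv.DictReader(tab_handle, delimiter="\t")
    for row in reader:
        start = int(row["minbp"])  # already 0-indexed, so no shift is needed
        end = int(row["maxbp"])    # treated as the exclusive BED end (assumption)
        yield (row["chrom"], start, end, row["id"])

# Made-up rows for illustration only; these are not real calls.
example = io.StringIO(
    "chrom\tminbp\tmaxbp\tid\n"
    "chr1\t1000\t2500\tregion_1\n"
    "chr2\t300\t900\tregion_2\n"
)

bed_rows = list(calls_to_bed(example))
for chrom, start, end, name in bed_rows:
    print(f"{chrom}\t{start}\t{end}\t{name}")
```

For a real file, the `io.StringIO` object would be replaced by a handle from `open(path, newline="")`.
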
Output field | Description |
---|---|
chrom | chromosome name |
minbp | position of first novel adjacency |
maxbp | position of last novel adjacency |
id | identifier consisting of the region in which the event was called |
svtype | classification of each simple SV/complex breakpoint in this event |
complextype | complex SV classification |
num_sv | number of simple SVs + complex SV breakpoints in this call |
bp | all breakpoints, i.e., boundaries of the blocks in the "reference" column (including the flanking blocks) |
bp_uncertainty | width of the uncertainty interval around each breakpoint in bp. For odd widths, there is 1 bp more uncertainty on the right side of the breakpoint |
reference | configuration of genomic blocks in the reference. Blocks are named A through Z, then a through z, then A1 through Z1, etc. |
rearrangement | predicted configuration of genomic blocks in the sample. Inverted blocks are followed by a tick mark, e.g., A', and insertions are represented by underscores _ |
len_affected | length of reference sequence affected by this rearrangement (plus the length of any novel insertions). For complex SVs with no novel insertions, this is often smaller than maxbp - minbp, i.e., the "span" of the rearrangement in the reference |
filter | currently, this is INSERTION if there is an insertion present, otherwise PASS |
sv_bp | breakpoint positions for each simple SV/complex breakpoint in the event (there are num_sv pairs of non-adjacent reference positions, each one describing a novel adjacency) |
sv_bp_uncertainties | breakpoint uncertainties for each simple SV/complex breakpoint in the event |
gt | genotype [either HET or HOM] |
af | allele fraction for the called variant [either 0.5 or 1.0, unless --allele_fraction_list was set] |
inslen | length of each insertion in the call |
sr_support | number of supporting split reads for each simple SV and complex breakpoint (length = num_sv) |
pe_support | number of supporting discordant pairs for each simple SV and complex breakpoint (length = num_sv) |
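score_vs_ref | log-likelihood ratio score for the call: `log( p(data \| called haplotypes) / p(data \| reference haplotypes) )` |
score_vs_next | log-likelihood ratio score for the call vs the next best call: `log( p(data \| called haplotypes) / p(data \| next best call) )` |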
rearrangement_next | configuration of genomic blocks for the next best call (may contain more blocks than the "reference" and "rearrangement" columns) |
num_paths | number of paths through this portion of the adjacency graph. The called haplotype corresponds to one such path |
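
The block notation and the bp_uncertainty rule in the table above can be illustrated with a few small helpers. This is a sketch under stated assumptions rather than ARC-SV's own code: each block name is taken to be one letter optionally followed by digits (as in A1), each underscore is taken to be one insertion, and the uncertainty interval is taken to extend floor(width / 2) bp to the left of the breakpoint with the remainder on the right. The function names are hypothetical.

```python
import re

# A token is either an insertion marker "_" or a block name
# (one letter, optionally followed by digits as in A1) with an optional tick.
TOKEN = re.compile(r"_|([A-Za-z]\d*)('?)")

def parse_rearrangement(text):
    """Split a block string such as "AC'_D" into (name, orientation) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.group(0) == "_":
            tokens.append(("_", "insertion"))
        else:
            orientation = "inverted" if match.group(2) else "forward"
            tokens.append((match.group(1), orientation))
        pos = match.end()
    return tokens

def missing_blocks(reference, rearrangement):
    """List reference blocks that do not appear in the rearrangement."""
    present = {name for name, _ in parse_rearrangement(rearrangement)}
    return [name for name, _ in parse_rearrangement(reference) if name not in present]

def uncertainty_interval(bp, width):
    """Return (low, high) around a breakpoint; odd widths lean 1 bp to the right."""
    left = width // 2
    right = width - left
    return (bp - left, bp + right)

# Made-up values for illustration only.
print(parse_rearrangement("AC'_D"))
print(missing_blocks("ABCD", "AC'_D"))
print(uncertainty_interval(100, 5))
```

With reference `ABCD` and rearrangement `AC'_D`, block C is inverted, a novel insertion follows it, and block B is absent from the sample haplotype.
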
Dot plots include all validated cxSVs from Table S1:
- Alignment-based validation
  - Human Pangenome References (https://doi.org/10.1038/s41586-023-05896-x)
  - Ebert et al., 2021 (https://doi.org/10.1126/science.abf7117)
  - Garg et al., 2021 (https://doi.org/10.1038/s41587-020-0711-0)
- Sanger validated
  - HuRef
  - HepG2
  - Fetal brain somatic cxSVs (http://www.genome.org/cgi/doi/10.1101/gr.262667.120)
Files and variables
File: Dataset_S1_all_ARC-SV_cxSV_calls.zip
Description: ARC-SV cxSV calls from Zhou et al. "Automatic detection of complex genome structural variation across human populations and in brains of individuals with psychiatric disorders"
File: Dataset_S2_all_ARC-SV_simpleSV_calls.zip
Description: ARC-SV simple SV calls from Zhou et al. "Automatic detection of complex genome structural variation across human populations and in brains of individuals with psychiatric disorders"
Code/software
ARC-SV https://github.com/SUwonglab/arcsv
Access information
Data was derived from the following sources:
- 1000 Genomes 30x on GRCh38 (https://www.internationalgenome.org/data-portal/data-collection/30x-grch38)
- Human Genome Diversity Project (https://www.internationalgenome.org/data-portal/data-collection/hgdp)
- Simons Genome Diversity Project (https://www.internationalgenome.org/data-portal/data-collection/sgdp)
- CHM13 WGS (NCBI SRA: SRR2088062, SRR2088063)
- PGP1 WGS (NCBI SRA: SRR14718703)
- HG002 WGS (NCBI SRA: SRR14724532)
- HG005 WGS (NCBI SRA: SRR14724528)
- HuRef WGS (Zhou et al. https://www.nature.com/articles/sdata2018261)
- HepG2 WGS (Zhou et al. encodeproject.org: ENCFF356NCL, ENCFF726BIF)
- NA12878 Illumina WGS (basespace.illumina.com/datacentral Run ID: NextSeq 500 v2: TruSeq Nano350, (NA12878)_H3GYCBGXX)
- NA12878 DNBSEQ-T7 WGS (China National GeneBank DataBase, db.cngb.org, CNGBdb: CNR0497793)
- Element Biosciences AVITI HG002 WGS (NCBI SRA: SRX17079410)
- GTEx (DBGap: phs000424.v8.p2)
- Altai Neanderthal genome (https://www.eva.mpg.de/genetics/genome-projects/neandertal)
- Fetal brain WGS (NIH National Institute of Mental Health Data Archive: ID #2330, study ID #496, and DOI: 10.15154/1410419)
- PsychENCODE (WGS of HBCC and CMC brains) (www.synapse.org, sample accessions in Table S4 of Zhou et al. "Automatic detection of complex genome structural variation across human populations and in brains of individuals with psychiatric disorders")
- Illumina WGS of a female bonobo (Carbone #601152) (NCBI BioProject: PRJNA526933)
Methods
Structural variation (SV) calls from standard whole-genome sequencing (WGS) datasets were made via ARC-SV (https://github.com/SUwonglab/arcsv). Dot plots were generated using LAST (https://github.com/lpryszcz/last).