# Data for: Population analysis of retrotransposons in giraffe genomes supports RTE decline and widespread LINE1 activity in Giraffidae

This repository contains data packages for the following parts of the analyses presented in the study by Petersen M, Winter S, Coimbra RTF, Kapitonov VV, and Nilsson MA (2021):

1. Repeat annotation of the Kordofan giraffe genome assembly
2. Clustering analysis of giraffe SINEs
3. Transposable element insertion calling
4. Population genetics analysis

The downstream filtering and analysis in R, auxiliary helper scripts, and the MELT pipeline scripts are available in the GitLab repository for this study at https://gitlab.com/mpetersen/giraffe-tes.

Unfortunately, interface changes during the development of the R package ggtree necessitate a setup with specific versions of some packages. We provide a Conda environment definition in the form of a YAML file that can be used to set up a reproducible environment to run the analyses in. A session info document of a working R session is also provided in the GitLab repository.

- env.yaml
- sessionInfo.txt

The analysis is completely contained in the RMarkdown document `study.Rmd`. Its inputs and analysis settings are parameterized (with the exception of the filter parameters) in the YAML file `study.params.yaml`. The required input files for the analysis are all available in the GitLab repository, with the exception of the VCF files:

- study.Rmd
- study.params.yaml
- kordofan-giraffe_assembly.fa.fai -- Fasta index of the reference genome, generated by `samtools faidx`
- genetic_distance.tree -- phylogenetic tree of all individuals under study
- giraffe_genomes.xlsx -- information on the giraffe genomes, most importantly which sample belongs to which species
- the VCF files from package #3

Here we provide the larger data files that are not in the GitLab repository.

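The Fasta index listed among the inputs above is a plain tab-separated table whose first two columns are the sequence name and its length in base pairs. As a minimal sketch of how such an index could be inspected outside of R, the following Python snippet reads those two columns. The function name `read_fai` and the two-scaffold example are hypothetical and are not part of the study's code.

```python
import csv
import io


def read_fai(handle):
    """Return {sequence name: length in bp} from a samtools faidx index.

    Only the first two of the five tab-separated columns
    (name, length, offset, linebases, linewidth) are used.
    """
    lengths = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        lengths[row[0]] = int(row[1])
    return lengths


# Made-up two-scaffold index, for illustration only.
example = io.StringIO(
    "scaffold_1\t1000\t12\t60\t61\n"
    "scaffold_2\t500\t1040\t60\t61\n"
)
lengths = read_fai(example)
total = sum(lengths.values())
print(f"{len(lengths)} sequences, {total} bp in total")
```

To apply this to the real index, open `kordofan-giraffe_assembly.fa.fai` with `open()` and pass the file handle to `read_fai`.
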
## Package 1: Repeat annotation of the Kordofan giraffe genome assembly

These are the files used by and produced by RepeatMasker version open-4.0.9 on the Kordofan giraffe genome assembly (NCBI accession number ASM1828223v1). Note that we used the Cetartiodactyla-specific section of the Repbase library that is available at https://www.girinst.org/repbase. Repbase requires a licence to use its sequences, therefore we include here only the giraffe-specific LINE1 and RTE consensus sequences that we identified in the Kordofan giraffe genome using RepeatModeler and our own analysis (see Methods). These consensus sequences can be added to the Repbase Cetartiodactyla library to perform the same TE annotation as in our study.

- giraffe_LINEs.fa -- giraffe-specific LINE1 and RTE consensus sequences
- kordofan-giraffe_assembly.fa.cat.gz -- main RepeatMasker output
- kordofan-giraffe_assembly.fa.out -- table with all TE annotations
- kordofan-giraffe_assembly.fa.tbl -- summary table: repeat content by TE type

Both a GFF file and the .align file (to be used for the repeat landscape) can be generated from the .cat.gz file by utility scripts in the RepeatMasker package.

## Package 2: Clustering analysis of giraffe SINEs

These are the results of the bovine SINE clustering analysis in the giraffe genome. Included are clusters with 100% intra-cluster identity over at least 98% of the length (see Methods). There were no novel giraffe-specific SINEs.

- Bov-A2_Gir.fasta -- giraffe-specific Bov-A2, 85 clusters
- Bov-tA_Gir.fasta -- giraffe-specific Bov-tA, 23 clusters

In the sequence headers, Bov-A2.N.M stands for the Bov-A2 cluster number N, which is composed of M identical sequences. Segmental duplications have been excluded.

## Package 3: Transposable element insertion calling

These are the input and output files for MELT version 2.2.0 (Gardner et al. 2017) that is available at https://melt.igs.umaryland.edu/index.php.
Helper scripts to run the pipeline are available in the GitLab repository at https://gitlab.com/mpetersen/giraffe-tes.

The MEI (mobile element insertion) files contain the TE consensus sequence, a map of existing insertions in the reference genome, and parameters for the search. There is one MEI zip file for each TE.

- LINE1v3_MELT.zip
- RTEv3_MELT.zip

These are the resulting VCF output files from the MELT-Split and the MELT-Deletion pipelines. The first pipeline identifies insertions that are not in the reference genome; the second identifies insertions that are in the reference genome but not in the individual genomes.

- LINE1v3.final_comp.vcf
- RTEv3.final_comp.vcf
- DEL.final_LINEcomp.vcf

These files are partially filtered by the MELT pipeline and are additionally filtered prior to downstream analysis steps (see paper and GitLab repository for details).

## Package 4: Population genetics analysis

These are the input files for the population genetics analysis. The analysis requires SNP calls for the 48 individuals, which are generated with ANGSD (http://popgen.dk/angsd/index.php/SNP_calling) from the same BAM files that are also used by MELT (package 3). There is a short script in the GitHub repository for the study by Coimbra et al. (2021) that performs this step; see here: https://github.com/rtfcoimbra/Coimbra-et-al-2021_CurrBiol/blob/main/snp_calling.sh

The population genetics analysis is unfortunately not as portable as the rest of the study. It includes a script that converts the VCF of SNP calls to PLINK format (code/thin_vcf_and_convert_to_plink.sh), and a set of R commands that use the package SambaR. The analysis requires these input files __in the working directory__:

- dm0row_diploid.txt
- label2subspecies.tsv (this can be generated from giraffe_genomes.xlsx in the GitLab repository)
- giraffe_snp_heterozygosity.txt (this comes from Coimbra et al. (2021))
- mysnps.thinned.40000.bim
- mysnps.thinned.40000.fam
- mysnps.thinned.40000.raw
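
Because the population genetics analysis expects its inputs in the working directory, it can help to check that they are all present before starting the R session. The following Python snippet is a minimal sketch of such a check. The file names are taken from the list above, while the helper name `missing_inputs` is hypothetical and not part of the study's code.

```python
from pathlib import Path

# File names as listed for package 4 in this README.
REQUIRED_FILES = [
    "dm0row_diploid.txt",
    "label2subspecies.tsv",
    "giraffe_snp_heterozygosity.txt",
    "mysnps.thinned.40000.bim",
    "mysnps.thinned.40000.fam",
    "mysnps.thinned.40000.raw",
]


def missing_inputs(directory="."):
    """Return the required file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


# Check the current working directory and report the result.
missing = missing_inputs()
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All population genetics input files are present.")
```

Run the script from the directory in which the SambaR commands will be executed; an empty result means that all six files were found.
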