Overturning circulation structures the microbial functional seascape of the South Pacific
Data files
Jul 08, 2025 version files 215.25 MB
- README.md (14.81 KB)
- SD1_genome_classification_and_coverage.csv (2.11 MB)
- SD2_16S_counts_silva138_r1.csv (82.41 MB)
- SD3_18S_counts_pr2_r1.csv (128.53 MB)
- SD4_environmental_metadata_r1.csv (79.58 KB)
- SD5_unique_ASV_percents_16S_18S.csv (13.13 KB)
- SD6_KO_functional_zones.tsv (901.67 KB)
- SD7_euk_LPI_genes_water_masses_sig_ko_fisher.tsv (14.72 KB)
- SD8_wm_KO_enrichment.tsv (67.42 KB)
- SD9_CAZy_enrichments_by_wm.csv (1.10 MB)
Abstract
Global overturning circulation partitions the deep ocean into regions with unique physicochemical characteristics, but the extent to which these water masses represent distinct ecosystems remains unknown. Here, we integrate extensive genomic information with hydrography and water mass age to delineate microbial taxonomic and functional boundaries across the South Pacific. Prokaryotic richness steeply increases with depth in the surface ocean, forming a “phylocline”, below which richness is consistently high, dipping slightly in highly aged water. Reconstructed genomes self-organize into six spatially-distinct taxonomic cohorts and ten functionally-distinct biomes that are primarily structured by wind-driven circulation at the surface and density-driven circulation at depth. Overall, water physicochemistry, modulated at depth by water age, drives microbial diversity and functional potential in the pelagic ocean.
This dataset supports the publication titled "Overturning circulation structures the microbial functional seascape of the South Pacific" by Kolody et al. (DOI 10.1126/science.adv6903); full methods are available in the supplementary text. It presents high-resolution genomic and amplicon data from the South Pacific Ocean in the context of the environmental data sampled along the GO-SHIP Pacific 18 (P18) transect. The data include prokaryotic metagenome-assembled genomes (MAGs), 16S and 18S rRNA gene amplicon data, functional annotations (KEGG, CAZy), and physical/chemical oceanographic metadata. These files collectively enabled the study of microbial diversity, biogeography, and functional potential as shaped by ocean overturning circulation and water mass age.
Description of the data and file structure
The dataset is divided into nine main data tables (Data S1–S9), each described below. All files are provided as .csv or .tsv tables and are linked to the analytical framework outlined in the main text and supplementary methods of the publication. Most large files are best explored in R or Python.
Data S1: Genome Abundance and Metadata
File: SD1_genome_classification_and_coverage.csv
This table contains information for curated marine genomes (n = 307 prokaryotic MAGs). Columns are:
- genome: genome identifier. These are composed of the following fields, separated by underscores: library name, ggkbase taxonomic annotation, approximate % GC, and approximate genome coverage in the library the genome was isolated from.
- library: sequencing library representing the sample for which coverage is given in the following column. Library names have the following fields, separated by underscores: cruise line, cruise station, CTD cast number, niskin number, molecular sample identifier, size class from serial filtration (L = large, 0.5 µm filter; S = small, 0.22 µm filter), and collection depth in meters. Libraries from the second sequencing effort are appended with “_2”.
- mean.coverage: mean normalized coverage of the given genome for the sample (library) in the preceding column
- GTDB.Tk.classification: Taxonomic assignment of the given genome via GTDB-Tk
- percent.completeness: % genome completeness from checkM
- percent.contamination: estimated % genome contamination from checkM
- GC: genome GC content
- genome.size: genome size in base pairs (bp)
- contigs: number of contigs comprising the given genome
- longest.contig.length: length of the longest contig in the genome
- mean.contig.length: mean length of contigs in the genome
- coding.density: genome coding density
- translation.table: translation table used for gene prediction
- predicted.genes: number of genes predicted in the genome
- genome.quality: genome quality. Medium is ≥ 50% completeness, < 10% contamination; High is > 90% completeness, < 5% contamination
- cohort: WGCNA module (biogeographic cohort) membership of genome
- age_bin: Fisher's Exact test-based assignment of genome to a water age group (< 50 years, 50–1000 years, > 1000 years)
- age_bin_FDR: FDR-corrected p-value for Fisher's Exact test-based enrichment of genomes across water age groups
- iRep: iRep genome replication rate estimates (NA = insufficient coverage)
- iRep_r_squared: R-squared value for iRep replication rate estimate (NA = insufficient coverage)
- Water_Mass: Fisher's Exact test-based assignment of genome to a water mass [Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m)]
- Water_Mass_enrichment_odds_ratio: odds ratio associated with enrichment of genomes across water masses
- Water_Mass_enrichment_FDR: FDR-corrected p-value for Fisher's Exact test-based enrichment of genomes across water masses
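The library naming convention and the genome quality thresholds described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the example library name in the usage note is invented, and the tier labels "High" and "Medium" are assumed to match the values in the genome.quality column.

```python
from typing import NamedTuple, Optional


class LibraryName(NamedTuple):
    cruise_line: str
    station: str
    cast: str
    niskin: str
    molecular_sample_id: str
    size_class: str   # "L" (large fraction) or "S" (small fraction)
    depth_m: str      # kept as text; exact formatting is not assumed
    second_run: bool  # True when the name carries the trailing "_2"


def parse_library_name(name: str) -> LibraryName:
    """Split a library name into the seven underscore-separated fields
    listed for the `library` column, plus the optional "_2" marker."""
    parts = name.split("_")
    second_run = len(parts) == 8 and parts[-1] == "2"
    if second_run:
        parts = parts[:-1]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {name!r}")
    return LibraryName(*parts, second_run)


def quality_tier(completeness: float, contamination: float) -> Optional[str]:
    """Apply the thresholds given for `genome.quality`:
    High is > 90% complete and < 5% contaminated;
    Medium is >= 50% complete and < 10% contaminated."""
    if completeness > 90 and contamination < 5:
        return "High"
    if completeness >= 50 and contamination < 10:
        return "Medium"
    return None  # below the Medium threshold
```

For example, parse_library_name("P18_100_1_12_345_S_500_2") (an invented name) would report size class "S", depth "500", and second_run set to True. Real names should be checked against the file before relying on a fixed field count.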
Data S2: 16S rRNA Amplicon Counts and Annotations
File: SD2_16S_counts_silva138_r1.csv
This table describes the taxonomy and biogeography of prokaryotes detected via 16S rRNA metabarcoding on the P18 transect. Columns are:
- Feature.ID: unique identifier for each organism (amplicon sequence variant: ASV)
- molecular_sample_id: unique identifier for molecular sample collected
- Confidence_silva: confidence score for SILVA v138 taxonomic annotations
- raw_counts: non-normalized read copies of given ASV
- copies_per_mL: estimated ASV copies contained in each milliliter of seawater filtered (see methods)
- Taxon_silva: SILVA v138 taxonomic annotations of ASVs
- domain, phylum, class, order, family, genus: un-nested SILVA taxonomic annotations
- cohort: WGCNA module (biogeographic cohort) assignment for given ASV
Best viewed using R due to file size.
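For Python users, a file of this size can be streamed row by row rather than loaded whole. The sketch below sums copies_per_mL per molecular sample using only the standard library. It assumes the table is in long format (one row per ASV per sample, as the column list suggests) and that missing values are written as NA or left empty; the function names are hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def total_copies_per_sample(rows: Iterable[dict]) -> Dict[str, float]:
    """Sum `copies_per_mL` for each `molecular_sample_id`.

    `rows` can be any iterable of dict-like records, such as a
    csv.DictReader, so the file never has to fit in memory at once.
    """
    totals = defaultdict(float)
    for row in rows:
        value = row["copies_per_mL"]
        if value in ("", "NA"):  # assumed missing-value markers
            continue
        totals[row["molecular_sample_id"]] += float(value)
    return dict(totals)


def summarize_file(path: str) -> Dict[str, float]:
    """Stream a CSV from disk and return per-sample totals."""
    with open(path, newline="") as handle:
        return total_copies_per_sample(csv.DictReader(handle))
```

Calling summarize_file("SD2_16S_counts_silva138_r1.csv") would return a dictionary keyed by sample identifier. The same pattern applies to any other per-row aggregation.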
Data S3: 18S rRNA Amplicon Counts and Annotations
File: SD3_18S_counts_pr2_r1.csv
This table describes the taxonomy and biogeography of eukaryotes detected via 18S rRNA metabarcoding on the P18 transect. It includes:
- Feature.ID: unique identifier for each organism (amplicon sequence variant: ASV)
- molecular_sample_id: unique identifier for molecular sample collected
- Confidence_pr2: confidence score for Protist Ribosomal Reference (PR2) taxonomic annotations
- raw_counts: non-normalized read copies of given ASV
- copies_per_mL: estimated ASV copies contained in each milliliter of seawater filtered (see methods)
- Taxon_pr2: Protist Ribosomal Reference (PR2) taxonomic annotations
- kingdom, supergroup, division, class, order, family, genus, species: un-nested PR2 taxonomic annotations
- cohort: WGCNA module (biogeographic cohort) assignment for given ASV
Best viewed in R due to large file size.
Data S4: Environmental Metadata
File: SD4_environmental_metadata_r1.csv
This table describes the physical and chemical environment at the locations where molecular samples were collected. Full environmental metadata for the GO-SHIP P18 cruise are available through CCHDO. NA indicates that data were not collected for the given sample. Columns are:
- molecular_sample_id: unique identifier for molecular sample collected
- station_number, cast_number, niskin_id: station, CTD-cast number (in case of multiple casts at a given station), and niskin bottle identifiers given by the GO-SHIP program to differentiate samples
- LATITUDE: latitude (degrees)
- CTD_pressure: ambient pressure measurement from CTD probe (dbar)
- DATE: date as year, month, day
- TIME: time of CTD cast
- LONGITUDE: longitude (degrees)
- water_column_depth: water column depth in meters
- CTD_temperature: ambient temperature of sample from CTD probe (ITS-90)
- CTD_salinity: salinity measurement from CTD probe (PSS-78)
- OXYGEN: oxygen (µmol/kg)
- SILCATE: silicate (µmol/kg)
- NITRAT: nitrate (µmol/kg)
- NITRIT: nitrite (µmol/kg)
- PHSPHT: phosphate (µmol/kg)
- CFC.11: trichlorofluoromethane (CFC-11) in pmol/kg
- CFC.12: dichlorodifluoromethane (CFC-12) in pmol/kg
- SF6: sulfur hexafluoride concentration (fmol/kg)
- total_carbon: total carbon (µmol/kg)
- alkalinity: alkalinity (µmol/kg)
- pH_seawater_sensor: pH measured by the seawater sensor
- dissolved_organic_carbon: dissolved organic carbon
- total_dissolved_nitrogen: total dissolved nitrogen
- DELC14: δ¹⁴C (per mille)
- DELC13: δ¹³C (per mille)
- cal_BP: cal BP (calibrated years before the present)
- radioc_years_before_2017: water age in years before present (defined as 2017, the year of collection) derived from radiocarbon data (see methods)
- CFC11_age: water age (years before 2017) estimated from CFC-11 data (see methods)
- CFC12_age: water age (years before 2017) estimated from CFC-12 data (see methods)
- SF6_age: water age (years before 2017) estimated from SF6 data (see methods)
- avg_atm_tracer_age: average atmospheric tracer age (years before 2017)
- consensus_age: consensus age across atmospheric tracers and radiocarbon data (years before 2017)
- avg_atm_tracer_age_interpolated: average atmospheric tracer age interpolated to cover more samples (years before 2017)
- consensus_age_interpolated: consensus age across atmospheric tracers and radiocarbon data interpolated to cover more samples (years before 2017)
- CCL4: carbon tetrachloride (pmol/kg)
- N2O: nitrous oxide (nmol/kg)
- total_dissolved_phosphorus: total dissolved phosphorus
- AABW_fraction: mixing fraction of Antarctic Bottom Water
- NPIW_fraction: mixing fraction of North Pacific Intermediate Water
- LCDW_fraction: mixing fraction of Lower Circumpolar Deep Water
- UCDW_fraction: mixing fraction of Upper Circumpolar Deep Water
- AAIW_fraction: mixing fraction of Antarctic Intermediate Water
- PDW_fraction: mixing fraction of Pacific Deep Water
- closest_end_member: water mass with the highest mixing fraction for the given sample
- closest_end_member_percent: percent of water in the sample estimated to originate from the closest end member (water mass)
- Prochlorococcus_cell_counts: cell counts of Prochlorococcus from flow cytometry. Not collected for molecular samples 42, 45, 214, and 215
- Synechococcus_cell_counts: cell counts of Synechococcus from flow cytometry. Not collected for molecular samples 42, 45, 214, and 215
- sample_depth: depth at which the sample was collected (m)
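The relationship between the six mixing-fraction columns and closest_end_member can be illustrated with a short Python sketch: the closest end member is the water mass with the highest mixing fraction in a row. The function name is hypothetical, and treating NA or empty cells as missing is an assumption. The fraction is returned unchanged because the scale of the stored values (0 to 1 versus percent) is not stated here.

```python
from typing import Dict, Optional, Tuple

# Mixing-fraction columns listed for Data S4.
FRACTION_COLUMNS = [
    "AABW_fraction",
    "NPIW_fraction",
    "LCDW_fraction",
    "UCDW_fraction",
    "AAIW_fraction",
    "PDW_fraction",
]


def closest_end_member(row: Dict[str, str]) -> Optional[Tuple[str, float]]:
    """Return (water mass, fraction) for the largest mixing fraction,
    or None when no fraction is available for the sample."""
    fractions = {}
    for column in FRACTION_COLUMNS:
        value = row.get(column, "NA")
        if value not in ("", "NA"):  # assumed missing-value markers
            water_mass = column[: -len("_fraction")]
            fractions[water_mass] = float(value)
    if not fractions:
        return None
    name = max(fractions, key=fractions.get)
    return name, fractions[name]
```

Comparing this result with the closest_end_member column for a few rows is a quick consistency check after import.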
Data S5: ASV Endemism
File: SD5_unique_ASV_percents_16S_18S.csv
This table gives the number and percentage of 16S rRNA and 18S rRNA amplicon sequence variants (ASVs) unique to each sample. Missing 18S rRNA values correspond to samples for which 16S samples passed quality control thresholds but 18S rRNA samples did not (see Methods). Columns are:
- molecular_sample_id: unique identifier for molecular sample collected
- total_ASVs_16S: total 16S rRNA ASVs
- unique_ASVs_16S: number of 16S rRNA ASVs found only in the given sample
- percent_ASVs_unique_16S: % of 16S rRNA ASVs unique to the given sample
- total_ASVs_18S: total 18S rRNA ASVs
- unique_ASVs_18S: number of 18S rRNA ASVs found only in the given sample
- percent_ASVs_unique_18S: % of 18S rRNA ASVs unique to the given sample
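To make the definitions of these columns concrete, the sketch below derives the same three quantities (total ASVs, ASVs found only in that sample, and the percentage) from long-format count records. It is an illustration rather than the authors' pipeline: the function name is hypothetical, and treating an ASV as present when its count is greater than zero is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def unique_asv_percents(
    records: Iterable[Tuple[str, str, float]]
) -> Dict[str, Tuple[int, int, float]]:
    """Map each sample to (total ASVs, unique ASVs, percent unique).

    `records` yields (asv_id, sample_id, count) tuples. An ASV counts
    as unique to a sample when it is present in no other sample.
    """
    samples_by_asv = defaultdict(set)
    for asv, sample, count in records:
        if count > 0:  # assumed presence criterion
            samples_by_asv[asv].add(sample)

    totals = defaultdict(int)
    uniques = defaultdict(int)
    for samples in samples_by_asv.values():
        for sample in samples:
            totals[sample] += 1
        if len(samples) == 1:
            uniques[next(iter(samples))] += 1

    return {
        sample: (totals[sample], uniques[sample],
                 100.0 * uniques[sample] / totals[sample])
        for sample in totals
    }
```

Records for this function could be built from the Feature.ID, molecular_sample_id, and raw_counts columns of Data S2 or S3.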
Data S6: KEGG Orthologs in Functional Zones
File: SD6_KO_functional_zones.tsv
This table describes the KEGG Orthologies (KOs) pertaining to each functional zone defined in the manuscript. Columns are:
- kegg_id: KO identifier
- kegg_hit: description of KO functions
- genomes: number of genomes each KO is found in.
- total.library.normalized.coverage: Library-normalized cumulative coverage of each KO across genomes.
- Functional_Zone: Which of the 10 functional zones (determined via WGCNA) each KO pertains to.
This is a large file best viewed in R or similar rather than Excel. Where a gene was annotated with multiple KOs, the KOs are separated by commas and their descriptions by semicolons. Because of these commas, it is recommended to import the file into R using the following options: sd6 <- read.table("SD6_KO_functional_zones.tsv", sep = "\t", header = TRUE, comment.char = "", quote = "")
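For Python users, a rough equivalent of the import options above is to read the file as tab-delimited with quote handling switched off, so that punctuation inside descriptions is kept literally. The sketch below also pairs multi-KO identifiers with their descriptions. The function names are hypothetical, and the fallback for mismatched counts is an assumption, since a description may itself contain a semicolon.

```python
import csv
from typing import Iterable, Iterator, List, Tuple


def read_tsv(lines: Iterable[str]) -> Iterator[dict]:
    """Parse tab-separated lines without interpreting quote characters,
    mirroring quote = "" in the R call."""
    return csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)


def split_annotations(kegg_id: str, kegg_hit: str) -> List[Tuple[str, str]]:
    """Pair comma-separated KO identifiers with semicolon-separated
    descriptions. If the two counts differ, the entry is returned
    unsplit rather than guessing at the pairing."""
    ids = [item.strip() for item in kegg_id.split(",")]
    hits = [item.strip() for item in kegg_hit.split(";")]
    if len(ids) != len(hits):
        return [(kegg_id, kegg_hit)]
    return list(zip(ids, hits))
```

When reading from disk, open the file with open(path, newline="") and pass the handle to read_tsv.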
Data S7: Eukaryotic KO Enrichment by Water Mass
File: SD7_euk_LPI_genes_water_masses_sig_ko_fisher.tsv
This table gives the statistical enrichment/depletion of KOs from eukaryotic scaffolds across five water masses: Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m). FDR-adjusted p-values were determined using Fisher's exact tests. Because of punctuation present in KO descriptions, it is recommended to import the file into R using the following options: sd7 <- read.table("SD7_euk_LPI_genes_water_masses_sig_ko_fisher.tsv", sep = "\t", header = TRUE, comment.char = "", quote = "")
Columns are:
- KO: KO identifier
- KO_description: description of KO functions
- Water_Mass: Water mass that the given KO was found to have a significant association with
- Odds_Ratio: Odds ratio of association between KO and water mass
- Association: Whether the KO is enriched or depleted in the given water mass
- Adjusted_P_Value: FDR-adjusted p-value from Fisher's test
Data S8: Prokaryotic KO Enrichment by Water Mass
File: SD8_wm_KO_enrichment.tsv
This table gives the statistical enrichment/depletion of KOs from prokaryotic genomes across five water masses: Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m). FDR-adjusted p-values were determined using Fisher's exact tests. Because of punctuation present in KO descriptions, it is recommended to import the file into R using the following options: sd8 <- read.table("SD8_wm_KO_enrichment.tsv", sep = "\t", header = TRUE, comment.char = "", quote = "")
Columns are:
- kegg_hit: description of KO functions
- kegg_id: KO identifier
- Water_Mass: Water mass that the given KO was found to have a significant association with
- Odds_Ratio: Odds ratio of association between KO and water mass
- Association: Whether the KO is enriched or depleted in the given water mass
- Adjusted_P_Value: FDR-adjusted p-value from Fisher's test
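A typical first step with the enrichment tables (Data S7 and S8) is to pull out the KOs associated with one water mass and rank them by odds ratio. The Python sketch below shows one way to do that on rows already read into dictionaries. The function name is hypothetical, the 0.05 cutoff is an illustrative choice rather than a value taken from the study, and the exact spellings of the Water_Mass and Association values are assumptions to be checked against the file.

```python
from typing import Dict, Iterable, List


def significant_kos(
    rows: Iterable[Dict[str, str]],
    water_mass: str,
    association: str = "enriched",
    alpha: float = 0.05,
) -> List[Dict[str, str]]:
    """Select rows for one water mass and direction of association with
    an FDR-adjusted p-value below `alpha`, sorted by descending odds
    ratio."""
    selected = []
    for row in rows:
        if row["Water_Mass"] != water_mass:
            continue
        if row["Association"].lower() != association.lower():
            continue
        if float(row["Adjusted_P_Value"]) < alpha:
            selected.append(row)
    # float() also accepts "Inf", which can occur for odds ratios.
    return sorted(selected, key=lambda r: float(r["Odds_Ratio"]), reverse=True)
```

The same function works for Data S9 if the rows carry the same Water_Mass, Odds_Ratio, Association, and Adjusted_P_Value columns.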
Data S9: CAZy Enrichment by Water Mass
File: SD9_CAZy_enrichments_by_wm.csv
This table has Carbohydrate-Active enZYme (CAZy) gene families that are significantly enriched or depleted in prokaryotic genomes pertaining to any of the five water masses: Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m). CAZys were annotated using DRAM. FDR-adjusted p-values were determined using Fisher's tests.
Columns are:
- cazy: description of Carbohydrate-Active enZYme (CAZy). In some cases multiple CAZys will be given as a single entry (separated by a semicolon) because genes had multiple annotations
- Water_Mass: Water mass that the given CAZy was found to have a significant association with
- Odds_Ratio: Odds ratio of association between CAZy and water mass
- Association: Whether the CAZy is enriched or depleted in the given water mass
- Adjusted_P_Value: FDR-adjusted p-value from Fisher's test
Sharing/Access Information
Other public data access links:
- Full environmental metadata for the GO-SHIP P18 cruise: CCHDO P18 cruise data
Primary source data used in this study:
- Physical and chemical parameters: CTD profiles and bottle data from publicly available CCHDO data (see above).
- Genome sequences and metagenomic assemblies: Generated in this study from raw reads of metagenomes sequenced on Illumina NovaSeq.
- Ribosomal RNA amplicon data: 16S and 18S amplicons sequenced using Illumina MiSeq and HiSeq platforms.