Global overturning circulation partitions the deep ocean into regions with unique physicochemical characteristics, but the extent to which these water masses represent distinct ecosystems remains unknown. Here, we integrate extensive genomic information with hydrography and water mass age to delineate microbial taxonomic and functional boundaries across the South Pacific. Prokaryotic richness steeply increases with depth in the surface ocean, forming a “phylocline”, below which richness is consistently high, dipping slightly in highly aged water. Reconstructed genomes self-organize into six spatially-distinct taxonomic cohorts and ten functionally-distinct biomes that are primarily structured by wind-driven circulation at the surface and density-driven circulation at depth. Overall, water physicochemistry, modulated at depth by water age, drives microbial diversity and functional potential in the pelagic ocean.

This dataset supports the publication titled "Overturning circulation structures the microbial functional seascape of the South Pacific" by Kolody et al. (DOI 10.1126/science.adv6903); full methods are available in the supplementary text of that publication. It presents high-resolution genomic and amplicon data from the South Pacific Ocean in the context of the environmental data sampled along the GO-SHIP Pacific 18 (P18) transect. The data include prokaryotic metagenome-assembled genomes (MAGs), 16S and 18S rRNA gene amplicon data, functional annotations (KEGG, CAZy), and physical/chemical oceanographic metadata. Together, these files enabled the study of microbial diversity, biogeography, and functional potential as shaped by ocean overturning circulation and water mass age.

Description of the data and file structure

The dataset is divided into nine main data tables (Data S1–S9), each described below. All files are provided as .csv or .tsv tables and are linked to the analytical framework outlined in the main text and supplementary methods of the publication. Most large files are best explored in R or Python.

Data S1: Genome Abundance and Metadata

File: SD1_genome_classification_and_coverage.csv

This table contains information for curated marine genomes (n = 307 prokaryotic MAGs). Columns are:

genome: genome identifier. These are composed of the following fields, separated by underscores: library name, ggkbase taxonomic annotation, approximate % GC, and approximate genome coverage in the library the genome was isolated from.
library: sequencing library representing the sample for which coverage is given in the following column. Library names have the following fields, separated by underscores: cruise line, cruise station, CTD cast number, Niskin number, molecular sample identifier, size class from serial filtration (L = large, 0.5 µm filter; S = small, 0.22 µm filter), and collection depth in meters. Libraries from the second sequencing effort are appended with “_2”.
mean.coverage: mean normalized coverage of the given genome in the sample (library) named in the preceding column
GTDB.Tk.classification: Taxonomic assignment of the given genome via GTDB-Tk
percent.completeness: % genome completeness from checkM
percent.contamination: estimated % genome contamination from checkM
GC: genome GC content
genome.size: genome size in base pairs (bp)
contigs: number of contigs comprising the given genome
longest.contig.length: length of the longest contig in the genome
mean.contig.length: mean length of contigs in the genome
coding.density: genome coding density
translation.table: translation table used for gene prediction
predicted.genes: number of genes predicted in the genome
genome.quality: genome quality. Medium is ≥ 50% completeness, < 10% contamination; High is > 90% completeness, < 5% contamination
cohort: WGCNA module (biogeographic cohort) membership of genome
age_bin: Fisher's Exact test-based assignment of genome to a water age group (< 50 years, 50-1000 years, > 1000 years)
age_bin_FDR: FDR-corrected p-value for Fisher's Exact test-based enrichment of genomes across water age groups
iRep: iRep genome replication rate estimates (NA = insufficient coverage)
iRep_r_squared: R-squared value for iRep replication rate estimate (NA = insufficient coverage)
Water_Mass: Fisher's Exact test-based assignment of genome to a water mass [Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m)]
Water_Mass_enrichment_odds_ratio: odds ratio associated with enrichment of genomes across water masses
Water_Mass_enrichment_FDR: FDR-corrected p-value for Fisher's Exact test-based enrichment of genomes across water masses
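
The library naming scheme described for the library column can be unpacked programmatically. The following is a minimal Python sketch, assuming every library name has exactly seven underscore-separated fields plus an optional trailing "_2". The function name parse_library, the Library tuple, and the example name are hypothetical and are not taken from the dataset.

```python
from typing import NamedTuple


class Library(NamedTuple):
    cruise_line: str
    cruise_station: str
    ctd_cast: str
    niskin: str
    molecular_sample_id: str
    size_class: str                  # "L" (0.5 um filter) or "S" (0.22 um filter)
    depth_m: str                     # kept as text; the exact formatting is an assumption
    second_sequencing_effort: bool   # True when the name ends in "_2"


def parse_library(name: str) -> Library:
    """Split a library name into the fields listed in the README."""
    parts = name.split("_")
    # An eighth field equal to "2" marks the second sequencing effort.
    second = len(parts) == 8 and parts[-1] == "2"
    if second:
        parts = parts[:-1]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {name!r}")
    return Library(*parts, second)


# Hypothetical library name, for illustration only.
print(parse_library("P18_100_1_12_45_S_250_2"))
```

Genome identifiers embed a library name as their first field, so they contain more underscores than the four fields alone would suggest; splitting them reliably probably requires matching against the known library names rather than splitting on underscores.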

Data S2: 16S rRNA Amplicon Counts and Annotations

File: SD2_16S_counts_silva138_r1.csv

This table describes the taxonomy and biogeography of prokaryotes detected via 16S rRNA metabarcoding on the P18 transect. Columns are:

Feature.ID: unique identifier for each organism (amplicon sequence variant: ASV)
molecular_sample_id: unique identifier for molecular sample collected
Confidence_silva: confidence score for SILVA v138 taxonomic annotations
raw_counts: non-normalized read copies of given ASV
copies_per_mL: estimated ASV copies contained in each milliliter of seawater filtered (see methods)
Taxon_silva: SILVA v138 taxonomic annotation of the given ASV
domain, phylum, class, order, family, genus: un-nested SILVA taxonomic annotations
cohort: WGCNA module (biogeographic cohort) assignment for given ASV

Best viewed using R due to file size.
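
For readers working in Python, a large table like this one can be summarized row by row without loading it fully into memory. The sketch below assumes the file is in long format (one row per ASV and sample) with the column names listed above; the function name sum_copies_by_phylum and all example values are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def sum_copies_by_phylum(handle):
    """Sum copies_per_mL per (molecular_sample_id, phylum), one row at a time."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        value = row["copies_per_mL"]
        if value in ("", "NA"):  # skip missing values
            continue
        totals[(row["molecular_sample_id"], row["phylum"])] += float(value)
    return dict(totals)


# Invented rows in the SD2 column layout (not real data).
example = io.StringIO(
    "Feature.ID,molecular_sample_id,raw_counts,copies_per_mL,phylum\n"
    "asv_a,42,20,10.0,Proteobacteria\n"
    "asv_b,42,10,5.0,Proteobacteria\n"
    "asv_c,42,4,NA,Crenarchaeota\n"
    "asv_a,45,8,2.5,Proteobacteria\n"
)
totals = sum_copies_by_phylum(example)
print(totals)

# For the real file, pass an open handle instead, e.g.:
# with open("SD2_16S_counts_silva138_r1.csv", newline="") as handle:
#     totals = sum_copies_by_phylum(handle)
```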

Data S3: 18S rRNA Amplicon Counts and Annotations

File: SD3_18S_counts_pr2_r1.csv

This table describes the taxonomy and biogeography of eukaryotes detected via 18S rRNA metabarcoding on the P18 transect. It includes:

Feature.ID: unique identifier for each organism (amplicon sequence variant: ASV)
molecular_sample_id: unique identifier for molecular sample collected
Confidence_pr2: confidence score for Protist Ribosomal Reference (PR2) taxonomic annotations
raw_counts: non-normalized read copies of given ASV
copies_per_mL: estimated ASV copies contained in each milliliter of seawater filtered (see methods)
Taxon_pr2: Protist Ribosomal Reference (PR2) taxonomic annotations
kingdom, supergroup, division, class, order, family, genus, species: un-nested PR2 taxonomic annotations
cohort: WGCNA module (biogeographic cohort) assignment for given ASV

Best viewed in R due to large file size.

Data S4: Environmental Metadata

File: SD4_environmental_metadata_r1.csv

This table describes the physical and chemical environment at the locations where molecular samples were collected. Full environmental metadata for the GO-SHIP P18 cruise are available through CCHDO. NA indicates that data were not collected for the given sample. Columns are:

molecular_sample_id: unique identifier for molecular sample collected
station_number, cast_number, niskin_id: station, CTD-cast number (in case of multiple casts at a given station), and niskin bottle identifiers given by the GO-SHIP program to differentiate samples
LATITUDE: latitude (degrees)
CTD_pressure: ambient pressure measurement from CTD probe (dbar)
DATE: date as year, month, day
TIME: time of CTD cast
LONGITUDE: longitude (degrees)
water_column_depth: water column depth in meters
CTD_temperature: ambient temperature of sample from CTD probe (ITS-90)
CTD_salinity: salinity measurement from CTD probe (PSS-78)
OXYGEN: oxygen (µmol/kg)
SILCATE: silicate (µmol/kg)
NITRAT: nitrate (µmol/kg)
NITRIT: nitrite (µmol/kg)
PHSPHT: phosphate (µmol/kg)
CFC.11: trichlorofluoromethane (CFC-11) in pmol/kg
CFC.12: dichlorodifluoromethane (CFC-12) in pmol/kg
SF6: sulfur hexafluoride concentration (fmol/kg)
total_carbon: total carbon (µmol/kg)
alkalinity: alkalinity (µmol/kg)
pH_seawater_sensor: pH measured by the seawater sensor
dissolved_organic_carbon: dissolved organic carbon
total_dissolved_nitrogen: total dissolved nitrogen
DELC14: Δ¹⁴C (per mille)
DELC13: δ¹³C (per mille)
cal_BP: cal BP (calibrated years before the present)
radioc_years_before_2017: water age in years before present (present defined as 2017, the year of collection) derived from radiocarbon data (see methods)
CFC11_age: water age (years before 2017) estimated from CFC-11 data (see methods)
CFC12_age: water age (years before 2017) estimated from CFC-12 data (see methods)
SF6_age: water age (years before 2017) estimated from SF6 data (see methods)
avg_atm_tracer_age: average atmospheric tracer age (years before 2017)
consensus_age: consensus age across atmospheric tracers and radiocarbon data (years before 2017)
avg_atm_tracer_age_interpolated: average atmospheric tracer age interpolated to cover more samples (years before 2017)
consensus_age_interpolated: consensus age across atmospheric tracers and radiocarbon data interpolated to cover more samples (years before 2017)
CCL4: carbon tetrachloride (pmol/kg)
N2O: nitrous oxide (nmol/kg)
total_dissolved_phosphorus: total dissolved phosphorus
AABW_fraction: mixing fraction of Antarctic Bottom Water
NPIW_fraction: mixing fraction of North Pacific Intermediate Water
LCDW_fraction: mixing fraction of Lower Circumpolar Deep Water
UCDW_fraction: mixing fraction of Upper Circumpolar Deep Water
AAIW_fraction: mixing fraction of Antarctic Intermediate Water
PDW_fraction: mixing fraction of Pacific Deep Water
closest_end_member: water mass with the highest mixing fraction for the given sample
closest_end_member_percent: percent of water in the sample estimated to originate from the closest end member (water mass)
Prochlorococcus_cell_counts: cell counts of Prochlorococcus from flow cytometry. Not collected for molecular samples 42, 45, 214, and 215
Synechococcus_cell_counts: cell counts of Synechococcus from flow cytometry. Not collected for molecular samples 42, 45, 214, and 215
sample_depth: depth at which the sample was collected (m)
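
The relationship between the six mixing-fraction columns and closest_end_member can be illustrated with a short Python sketch. It assumes that rows are read as dictionaries of strings (as csv.DictReader would produce) and that missing values appear as "NA"; the function name closest_end_member and the example row are hypothetical, and the authors' actual procedure is described in the methods.

```python
FRACTION_COLUMNS = [
    "AABW_fraction", "NPIW_fraction", "LCDW_fraction",
    "UCDW_fraction", "AAIW_fraction", "PDW_fraction",
]


def closest_end_member(row):
    """Return (water_mass, fraction) for the largest non-missing mixing fraction."""
    fractions = {}
    for column in FRACTION_COLUMNS:
        value = row.get(column, "NA")
        if value not in ("", "NA"):
            # Strip the "_fraction" suffix to recover the water mass name.
            fractions[column[: -len("_fraction")]] = float(value)
    if not fractions:
        return None  # no mixing fractions available for this sample
    water_mass = max(fractions, key=fractions.get)
    return water_mass, fractions[water_mass]


# Invented values, for illustration only.
example_row = {
    "molecular_sample_id": "42",
    "AABW_fraction": "0.10", "NPIW_fraction": "NA", "LCDW_fraction": "0.55",
    "UCDW_fraction": "0.30", "AAIW_fraction": "0.05", "PDW_fraction": "0.00",
}
print(closest_end_member(example_row))  # ('LCDW', 0.55)
```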

Data S5: ASV Endemism

File: SD5_unique_ASV_percents_16S_18S.csv

This table gives the number and percentage of 16S rRNA and 18S rRNA amplicon sequence variants (ASVs) unique to each sample. Missing 18S rRNA values correspond to samples for which 16S rRNA samples passed quality control thresholds but 18S rRNA samples did not (see Methods). Columns are:

molecular_sample_id: unique identifier for molecular sample collected
total_ASVs_16S: total 16S rRNA ASVs
unique_ASVs_16S: number of 16S rRNA ASVs found only in the given sample
percent_ASVs_unique_16S: % of 16S rRNA ASVs unique to the given sample
total_ASVs_18S: total 18S rRNA ASVs
unique_ASVs_18S: number of 18S rRNA ASVs found only in the given sample
percent_ASVs_unique_18S: % of 18S rRNA ASVs unique to the given sample
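
As a consistency check, the percentage columns should follow from the count columns. The Python sketch below assumes the percentage is simply 100 × unique / total; the function name percent_unique is hypothetical, and missing 18S rRNA values are represented here as None.

```python
def percent_unique(unique_asvs, total_asvs):
    """Percent of ASVs found only in one sample; None when data are missing."""
    if unique_asvs is None or total_asvs in (None, 0):
        return None
    return 100.0 * unique_asvs / total_asvs


# Invented counts, for illustration only.
print(percent_unique(5, 200))     # 2.5
print(percent_unique(None, None)) # None (e.g. 18S sample failed quality control)
```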

Data S6: KEGG Orthologs in Functional Zones

File: SD6_KO_functional_zones.tsv

This table describes the KEGG Orthologies (KOs) pertaining to each functional zone defined in the manuscript. Columns are:

kegg_id: KO identifier
kegg_hit: description of KO functions
genomes: number of genomes each KO is found in.
total.library.normalized.coverage: Library-normalized cumulative coverage of each KO across genomes.
Functional_Zone: Which of the 10 functional zones (determined via WGCNA) each KO pertains to.

This is a large file that is best viewed in R or similar software rather than Excel. Where a gene was annotated with multiple KOs, the KOs are separated by commas and their descriptions are separated by semicolons. Because of these commas, we recommend importing the file into R with the following options: sd6 <- read.table("SD6_KO_functional_zones.tsv", sep = "\t", header = TRUE, comment.char = "", quote = "")
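
A rough Python counterpart to these import options is sketched below, using only the standard library. It disables quote handling (csv.QUOTE_NONE), treats tabs as the only delimiter, and splits multi-KO entries into lists. The function name read_ko_table, the added keys kegg_ids and kegg_hits, and the example row are assumptions made for illustration.

```python
import csv
import io


def read_ko_table(handle):
    """Read the tab-separated KO table without interpreting quote characters."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Multiple KOs are comma-separated; their descriptions are semicolon-separated.
        row["kegg_ids"] = [k.strip() for k in row["kegg_id"].split(",")]
        row["kegg_hits"] = [d.strip() for d in row["kegg_hit"].split(";")]
        rows.append(row)
    return rows


# Invented row in the SD6 column layout (placeholders, not real KOs).
example = io.StringIO(
    "kegg_id\tkegg_hit\tgenomes\ttotal.library.normalized.coverage\tFunctional_Zone\n"
    "KO_A,KO_B\tfirst \"quoted\" description; second description\t3\t1.5\tzone_1\n"
)
rows = read_ko_table(example)
print(rows[0]["kegg_ids"], rows[0]["kegg_hits"])

# For the real file:
# with open("SD6_KO_functional_zones.tsv", newline="") as handle:
#     rows = read_ko_table(handle)
```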

Data S7: Eukaryotic KO Enrichment by Water Mass

File: SD7_euk_LPI_genes_water_masses_sig_ko_fisher.tsv

This table gives the statistical enrichment/depletion of KOs from eukaryotic scaffolds across five water masses: Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m). FDR-adjusted p-values were determined using Fisher's exact tests. Because KO descriptions contain punctuation, we recommend importing the file into R with the following options: sd7 <- read.table("SD7_euk_LPI_genes_water_masses_sig_ko_fisher.tsv", sep = "\t", header = TRUE, comment.char = "", quote = "").

Columns are:

KO: KO identifier
KO_description: description of KO functions
Water_Mass: Water mass that the given KO was found to have a significant association with
Odds_Ratio: Odds ratio of association between KO and water mass
Association: Whether the KO is enriched or depleted in the given water mass
Adjusted_P_Value: FDR-adjusted p-value from Fisher's test
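
To make the Odds_Ratio and Adjusted_P_Value columns concrete, the sketch below works through a single 2×2 contingency table in Python. It is an illustration only: it assumes a two-sided Fisher's exact test, the simple sample odds ratio (some implementations report a conditional maximum-likelihood estimate instead), and Benjamini-Hochberg as the FDR correction, none of which is confirmed by this README. The counts and function names are invented.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c) if b * c else float("inf")


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value via the hypergeometric distribution."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-7))
    return min(p_value, 1.0)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented counts: rows = KO present / absent, columns = in / out of a water mass.
print(odds_ratio(3, 1, 1, 3))
print(fisher_exact_two_sided(3, 1, 1, 3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

An odds ratio above 1 corresponds to enrichment in the water mass and a value below 1 to depletion, which is presumably how the Association column should be read.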

Data S8: Prokaryotic KO Enrichment by Water Mass

File: SD8_wm_KO_enrichment.tsv

This table gives the statistical enrichment/depletion of KOs from prokaryotic genomes across five water masses: Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m). FDR-adjusted p-values were determined using Fisher's exact tests. Because KO descriptions contain punctuation, we recommend importing the file into R with the following options: sd8 <- read.table("SD8_wm_KO_enrichment.tsv", sep = "\t", header = TRUE, comment.char = "", quote = "").

Columns are:

kegg_hit: description of KO functions
kegg_id: KO identifier
Water_Mass: Water mass that the given KO was found to have a significant association with
Odds_Ratio: Odds ratio of association between KO and water mass
Association: Whether the KO is enriched or depleted in the given water mass
Adjusted_P_Value: FDR-adjusted p-value from Fisher's test

Data S9: CAZy Enrichment by Water Mass

File: SD9_CAZy_enrichments_by_wm.csv

This table lists Carbohydrate-Active enZYme (CAZy) gene families that are significantly enriched or depleted in prokaryotic genomes pertaining to any of the five water masses: Antarctic Bottom Water (AABW), Antarctic Intermediate Water (AAIW), Lower Circumpolar Deep Water (LCDW), Upper Circumpolar Deep Water (UCDW), and Upper Water (above 500 m). CAZy families were annotated using DRAM. FDR-adjusted p-values were determined using Fisher's exact tests.

Columns are:

cazy: description of Carbohydrate-Active enZYme (CAZy). In some cases multiple CAZys will be given as a single entry (separated by a semicolon) because genes had multiple annotations
Water_Mass: Water mass that the given CAZy was found to have a significant association with
Odds_Ratio: Odds ratio of association between CAZy and water mass
Association: Whether the CAZy is enriched or depleted in the given water mass
Adjusted_P_Value: FDR-adjusted p-value from Fisher's test
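
A common use of this table is to list the CAZy families enriched in each water mass. The Python sketch below assumes that the Association column contains a label beginning with "enrich" for enriched families (the exact labels are not stated here) and splits multi-annotation entries on semicolons; the function name enriched_by_water_mass and the example rows are invented.

```python
import csv
import io
from collections import defaultdict


def enriched_by_water_mass(handle, alpha=0.05):
    """Map each water mass to the sorted CAZy families enriched in it."""
    result = defaultdict(set)
    for row in csv.DictReader(handle):
        # Assumed label: anything starting with "enrich" counts as enriched.
        if not row["Association"].strip().lower().startswith("enrich"):
            continue
        if float(row["Adjusted_P_Value"]) >= alpha:
            continue
        # Genes with several annotations are given as one semicolon-separated entry.
        for family in row["cazy"].split(";"):
            result[row["Water_Mass"]].add(family.strip())
    return {water_mass: sorted(families) for water_mass, families in result.items()}


# Invented rows in the SD9 column layout (placeholders, not real CAZy families).
example = io.StringIO(
    "cazy,Water_Mass,Odds_Ratio,Association,Adjusted_P_Value\n"
    "FAM_A; FAM_B,AABW,3.2,enriched,0.01\n"
    "FAM_C,AABW,0.4,depleted,0.02\n"
    "FAM_A,Upper Water,2.0,enriched,0.2\n"
)
enriched = enriched_by_water_mass(example)
print(enriched)  # {'AABW': ['FAM_A', 'FAM_B']}
```

Since the table is described as containing only significant associations, the alpha argument mainly serves to apply a stricter threshold than the one used to build the file.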

Sharing/Access Information

Other public data access links:

Full environmental metadata for the GO-SHIP P18 cruise:
CCHDO P18 cruise data

Primary source data used in this study:

Physical and chemical parameters: CTD profiles and bottle data from publicly available CCHDO data (see above).
Genome sequences and metagenomic assemblies: Created from this study, using raw reads from metagenomes sequenced on Illumina NovaSeq.
Ribosomal RNA amplicon data: 16S and 18S amplicons sequenced using Illumina MiSeq and HiSeq platforms.
