Data from: Digital PCR quantification of ultrahigh ERBB2 copy number identifies poor breast cancer survival after trastuzumab
Data files
Mar 05, 2024 version files (132.67 KB):
- Meng.ERBB2.SCANB.682.TPM.csv
- README.md
Abstract
HER2/ERBB2 evaluation is necessary for treatment decision-making in breast cancer (BC), however current methods have limitations and considerable variability exists. DNA copy number (CN) evaluation by droplet digital PCR (ddPCR) has complementary advantages for HER2/ERBB2 diagnostics. In this study, we developed a single-reaction multiplex ddPCR assay for determination of ERBB2 CN in reference to two control regions, CEP17 and a copy-number-stable region of chr. 2p13.1, validated CN estimations to clinical in situ hybridization (ISH) HER2 status, and investigated the association of ERBB2 CN with clinical outcomes. 909 primary BC tissues were evaluated and the area under the curve for concordance to HER2 status was 0.93 and 0.96 for ERBB2 CN using either CEP17 or 2p13.1 as reference, respectively. The accuracy of ddPCR ERBB2 CN was 93.7% and 94.1% in the training and validation groups, respectively. Positive and negative predictive value for the classic HER2 amplification and non-amplification groups was 97.2% and 94.8%, respectively. An identified biological “ultrahigh” ERBB2 ddPCR CN group had significantly worse survival within patients treated with adjuvant trastuzumab for both recurrence-free survival (hazard ratio, HR: 3.3; 95% CI 1.1–9.6; p = 0.031, multivariable Cox regression) and overall survival (HR: 3.6; 95% CI 1.1–12.6; p = 0.041). For validation using RNA-seq data as a surrogate, in a population-based SCAN-B cohort (NCT02306096) of 682 consecutive patients receiving adjuvant trastuzumab, the ultrahigh-ERBB2 mRNA group had significantly worse survival. Multiplex ddPCR is useful for ERBB2 CN estimation and ultrahigh ERBB2 may be a predictive factor for decreased long-term survival after trastuzumab treatment.
README: Digital PCR quantification of ultrahigh ERBB2 copy number identifies poor breast cancer survival after trastuzumab: SCAN-B RNA-seq data
https://doi.org/10.5061/dryad.rv15dv4dm
Description of the SCAN-B RNA-seq data and file structure
RNA-sequencing data for breast tumors were generated within the SCAN-B initiative. The processed data file contains original, log2-transformed, and mean-adjusted ERBB2 transcripts per kilobase million (TPM) values, as well as all clinical variables used in the analysis. The mean adjustment corrects for the two library-protocol groups, dUTP and TruSeq_*.
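The idea of a two-group mean adjustment can be illustrated with a short Python sketch. This is only an assumption about what such a correction might look like (shifting each protocol group so that its mean equals the overall mean); the procedure actually used to produce the adjusted values is described in the Methods section of the associated publication and may differ. The function names and the example values are hypothetical.

```python
from statistics import mean

def protocol_group(library_protocol):
    """Collapse the three protocol names into the two groups: dUTP and TruSeq_*."""
    if library_protocol == "dUTP":
        return "dUTP"
    if library_protocol.startswith("TruSeq_"):
        return "TruSeq"
    raise ValueError(f"unexpected library protocol: {library_protocol!r}")

def mean_adjust(values, protocols):
    """Shift each group's values so that its mean matches the overall mean.

    ASSUMPTION: plain mean-centering for illustration only; not confirmed
    to be the procedure behind the adjusted column in the data file.
    """
    overall = mean(values)
    groups = [protocol_group(p) for p in protocols]
    group_means = {
        g: mean(v for v, gg in zip(values, groups) if gg == g)
        for g in set(groups)
    }
    return [v - group_means[g] + overall for v, g in zip(values, groups)]

# Invented log2 values, not taken from the dataset.
logs = [5.0, 7.0, 8.0, 10.0]
prots = ["dUTP", "dUTP", "TruSeq_mRNA", "TruSeq_NeoPrep"]
adjusted = mean_adjust(logs, prots)
print(adjusted)
```

After this kind of adjustment both groups share the same mean, so a difference between samples no longer reflects the average offset between library protocols.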
For the data file, the column definitions are as follows:
Ordinal : incrementing dataset number from 1 to 682
Sample : external SCAN-B ID for the breast tumor tissue sample
Case : external SCAN-B ID for the breast cancer diagnosis case
Patient : external SCAN-B ID for the patient
Diagnosis_year : year of breast cancer diagnosis
SpecimenType : type of sample specimen (all are from the primary tumor)
BiopsyType : method of specimen collection (all are from the surgical operation for tumor removal)
GEX.assay : external SCAN-B ID for the RNA-sequencing gene expression dataset
LibraryProtocol : RNA-seq library preparation protocol name (dUTP, TruSeq_mRNA, or TruSeq_NeoPrep)
PM_READS : number of read pairs passing mask filters
ERBB2_tpm : ERBB2 TPM value
ERBB2_tpm_log : log2-transformed ERBB2 TPM value (an offset of 0.1 was added to all expression measurements prior to log2 transformation)
ERBB2_mean_adj_log : mean-adjusted log2-transformed ERBB2 TPM value (after batch correction between the two LibraryProtocol groups, dUTP and TruSeq_*; see the Methods section in the associated publication)
age : age of patient at diagnosis (in 5-year bins)
tumor_size : tumor size (mm)
NHG : Nottingham histological grade, 3 groups (G1, G2, G3)
ER_10perc : estrogen receptor status using the Swedish cutoff of 10% positive cells (pos=positive, neg=negative)
PgR_10perc : progesterone receptor status using the Swedish cutoff of 10% positive cells (pos=positive, neg=negative)
Ki67_perc : percentage of tumor cells positive for Ki67
lymphNode3 : number of positive lymph nodes, 3 groups (0, 1to3, 4toX)
HER2_IHC : HER2 IHC score, 3 groups (0-1+, 2+, or 3+)
HER2_ISH : HER2 ISH result, 2 groups (amplified or normal)
OS_days : number of days of overall survival
OS_event : overall survival event (0=alive, 1=dead)
RFS_days : number of days of recurrence-free survival
RFS_event : recurrence-free survival event (0=no event, 1=relapse or death)
"NA" in any cell indicates data missing or not available.
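As a minimal sketch of how the column definitions above might be used, the following Python snippet reads rows in the layout of the data file with the standard-library `csv` module and maps "NA" cells to `None`. It assumes the file is comma-separated with a header row matching the column names; the helper name `parse_rows` and the choice of which columns to treat as numeric are illustrative assumptions, and the inline sample rows are invented and not taken from the dataset.

```python
import csv
import io

# Columns assumed to hold numeric values, based on the definitions above.
NUMERIC_COLUMNS = {
    "PM_READS", "ERBB2_tpm", "ERBB2_tpm_log", "ERBB2_mean_adj_log",
    "tumor_size", "Ki67_perc", "OS_days", "OS_event", "RFS_days", "RFS_event",
}

def parse_rows(handle):
    """Read CSV rows into dicts, mapping "NA" to None and numeric columns to float."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for name, value in raw.items():
            if value == "NA":
                row[name] = None          # missing or not available
            elif name in NUMERIC_COLUMNS:
                row[name] = float(value)
            else:
                row[name] = value         # keep categorical values as text
        rows.append(row)
    return rows

# Invented sample rows (NOT real data), using a subset of the columns.
SAMPLE = """Ordinal,LibraryProtocol,ERBB2_tpm,HER2_ISH,OS_days,OS_event
1,dUTP,120.5,amplified,2000,0
2,TruSeq_mRNA,35.0,NA,1500,1
"""

rows = parse_rows(io.StringIO(SAMPLE))
deaths = sum(1 for r in rows if r["OS_event"] == 1.0)
print(len(rows), deaths, rows[1]["HER2_ISH"])
```

To read the actual file, the same function could be called on an open handle, for example `with open("Meng.ERBB2.SCANB.682.TPM.csv", newline="") as fh: rows = parse_rows(fh)`.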
Information about the SCAN-B RNA-seq data generation and processing
PROTOCOLS
Library construction protocol: Poly(A) mRNA is isolated from the total RNA in up to 96-well microtiter plate format by two rounds of purification with Dynabeads Oligo (dT)25 (Invitrogen) using a KingFisher Flex magnetic particle processor (ThermoScientific). Zinc-mediated fragmentation (Ambion) is performed and the fragmented mRNA retrieved using column purification (Zymo-Spin I-96 plates; Zymo). The sequencing library generation protocol is a modification of the dUTP method, which importantly retains the directionality (stranded-ness) of the sequenced RNA molecules. First strand cDNA synthesis is performed using random hexamers and standard dNTP mix followed by cleanup using Sephadex gel filtration (Illustra AutoScreen-96A plates; GE Healthcare), and second strand cDNA synthesis is performed using dUTP in place of dTTP in the dNTP-mix and cleanup using Zymo-Spin I-96 plates. The cDNA is end-repaired and A-tailed, and diluted TruSeq adapters with barcodes are ligated using a modified protocol (Illumina). Adapter-ligated cDNA is then size-selected to remove short oligonucleotides using carboxylic acid (CA) paramagnetic beads (Invitrogen) and polyethylene glycol (PEG), similar to the previously described methods, and automated on the KingFisher Flex. The second cDNA strand is digested using uracil-DNA glycosylase and the product is enriched by 12 PCR cycles (Illumina). The PCR product undergoes two cycles of size selection using CA-beads and varying concentrations of PEG, first to exclude DNA fragments >700 bp and then to exclude fragments <200 bp. Quality control is performed on control libraries using Qubit fluorometric measurement (Life Technologies) and Caliper LabChip XT microcapillary gel electrophoresis. Typically, 10-24 barcoded libraries are included in a pool and each pool is sequenced in at least one lane across dual flowcells. Paired-end sequencing of 50 bp read-length is performed on an Illumina HiSeq 2000 or NextSeq 500 instrument.
RNA-sequencing: RNA-sequencing was performed using a modified Illumina stranded dUTP method or the Illumina stranded TruSeq mRNA protocol, implemented either on the KingFisher or on the Illumina NeoPrep system.
DATA PROCESSING PIPELINE
- Step 1: Base-calling using the manufacturer's on-instrument software.
- Step 2: Demultiplexing with Picard versions 1.120 or 1.128. IlluminaBasecallsToFastq parameters used were ADAPTERS_TO_CHECK=INDEXED, ADAPTERS_TO_CHECK=PAIRED_END, INCLUDE_NON_PF_READS=false.
- Step 3: Filtering to remove reads that align (using Bowtie 2 with default parameters except -k 1 --phred33 --local) to ribosomal RNA/DNA (GenBank loci NR_023363.1, NR_003285.2, NR_003286.2, NR_003287.2, X12811.1, U13369.1), phiX174 Illumina control (NC_001422.1), and sequences contained in the UCSC RepeatMasker track (downloaded March 14, 2011).
- Step 4: Fragment size distribution (mean and width) for the alignment step was estimated for each sample using Bowtie 2 version 2.2.3 or 2.2.5. Parameters set during estimation were --fr, -k 1, --phred33, --local, and -u 100000, using human genome assembly GRCh38.
- Step 5: The demultiplexed and pass-mask filtered reads were aligned using HISAT2 v2.1.0 to the human genome reference GRCh38/hg38 together with 104,133 transcript annotations from the UCSC knownGenes table (downloaded September 22, 2014) using the GENCODE release 27 transcriptome model, with default parameters except --no-unal --non-deterministic --novel-splicesite-outfile ${SPLICEFILE} --rna-strandness RF. HISAT2 indexes were created using the --snp parameter and dbSNP build 150.
- Step 6: Gene expression data in FPKM (fragments per kilobase of transcript per million mapped reads) and TPM (transcripts per kilobase million) were generated using StringTie v1.3.3b (default parameters including --rf -e) using protein coding transcripts from GENCODE release 27 as transcriptome model. Novel transcripts were discarded. A TPM gene expression matrix was generated from the .tsv files. To avoid zero values, an offset of 0.1 was added to all expression measurements, followed by log2 transformation.
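The final transformation in Step 6 can be sketched in Python as follows. The offset of 0.1 and the log2 transformation come from the description above; the function names and the example values are hypothetical and are not taken from the dataset.

```python
import math

OFFSET = 0.1  # offset stated above, added before the log2 transformation

def log2_tpm(tpm, offset=OFFSET):
    """Return log2(tpm + offset); the offset keeps zero TPM values finite."""
    if tpm < 0:
        raise ValueError("TPM values cannot be negative")
    return math.log2(tpm + offset)

def inverse_log2_tpm(value, offset=OFFSET):
    """Recover the TPM value from its offset log2 representation."""
    return 2.0 ** value - offset

# Illustrative values only.
for tpm in (0.0, 0.9, 100.0):
    print(tpm, round(log2_tpm(tpm), 4))
```

Because of the offset, a TPM of zero maps to log2(0.1), which is roughly -3.32, rather than to negative infinity, and the transformation stays invertible.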