Reproducible data, example subsets, and analysis pipeline for the extended TAaCGH study of breast cancer genomic and transcriptomic profiles

Ochoa Zavalza, Salvador 1; Pouokam, Maxime 1; Kruse, Masato 1; Gonzalez-Isunza, Georgina 1; Sazdanovic, Radmila 2; Arsuaga, Javier 1

Research facility: University of California, Davis

Published Jan 03, 2026 on Dryad. https://doi.org/10.5061/dryad.0zpc867b7

Data files

Jan 03, 2026 version files (48.35 KB total)

README.md (5.42 KB)
TAaCGH-Extension-main.zip (42.93 KB)

Abstract

This repository contains the R-based computational framework used to implement and extend the TAaCGH (Topological Analysis of array CGH) pipeline for the associated study. The materials provide a reproducible workflow that takes the user from data preprocessing through survival analysis and subtype-specific modeling. When applied to the referenced publicly available datasets, the scripts reproduce all analyses reported in the manuscript, including maximally selected rank statistic (MaxStat) calculations and region-to-gene mapping steps. The workflow is designed to run end-to-end and produces a consistent set of outputs, such as multiple-testing-adjusted significance tables, genomic interval annotations, and gene biotype summaries. The pipeline also generates diagnostic plots and gene metadata that support interpretation of subtype-specific genomic patterns. Instructions describing the directory structure and execution steps are provided in the accompanying README.

TAaCGH-Extension-main.zip contains the complete codebase and documentation needed to run the extended TAaCGH pipeline and to regenerate the analyses, results, and figures presented in the associated study. A public, version-controlled copy of this repository is also available on GitHub at https://github.com/Arsuaga-Vazquez-Lab/TAaCGH-Extension. Each file, the folder structure, and the workflow components are explained in the included README.

Repository Overview

Horlings et al. dataset. Original data from Horlings et al. (2010) are available in the supplementary material of the publication at https://doi.org/10.1158/1078-0432.CCR-09-0709. A script to preprocess the dataset from the supplementary material is provided in the corresponding Horlings folder. To ensure proper execution, users must download all supplementary files associated with the publication and place them in the Horlings directory before running the preprocessing script.

TCGA dataset. Open-access data were obtained from the NCI Genomic Data Commons (GDC). The corresponding folder contains scripts to download the official files directly from the GDC API and to reproduce all preprocessing steps required to generate the matrices used in this study. Users should download the full datasets directly from the GDC source using the provided scripts.

METABRIC dataset. Clinical and copy-number data were obtained from the Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016) study available at the cBioPortal for Cancer Genomics. The two files generated by the script, brca_metabric_clinical_data.tsv and METABRICCNADiscovery.txt, were used together to extract and analyze copy-number alterations on chromosome 5 at the PURPL (LINC01021) locus. Users should download the full datasets directly from cBioPortal and place them in the METABRIC folder before running the analysis scripts.

SCAN-B (GSE96058) dataset. Normalized RNA-seq expression data and corresponding clinical metadata were obtained from the NCBI Gene Expression Omnibus (GEO; accession GSE96058). The scripts in this repository demonstrate how to retrieve, clean, and process the official data files following the R data package workflow described in https://12379monty.github.io/GSE96058/. The complete dataset must be downloaded directly from the GEO repository. Mutation data were incorporated using files downloaded from the SCAN-B Mutation Explorer: https://oncogenomics.bmc.lu.se/MutationExplorer/.

File Structure and Functionality

Files ending in _survival.R perform Kaplan–Meier survival analyses using either copy-number or gene expression data, with p-values adjusted by the Benjamini–Hochberg (BH) false discovery rate correction. Files ending in _data_download.R retrieve official datasets from their original repositories (such as GEO, Mutation Explorer, or GDC) and/or prepare them for use in this study. The Data/Horlings, Data/METABRIC, Data/SCAN-B, and Data/TCGA directories contain scripts and/or serve as placeholders for downloading the original datasets and preprocessing them for downstream analyses. The script maxstat_pvalues.R approximates the p-value of the maximally selected rank statistic (M value), which tests the independence of two variables, using the Lau92, exactGauss, and condMC methods.
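As a rough illustration of what a _survival.R-style analysis involves, the sketch below compares Kaplan–Meier groups with a log-rank test for two simulated features and then applies the BH correction. Everything in it is an assumption made for illustration: the data are simulated, the names (sim, logrank_p, probe_A, probe_B) are hypothetical, and the median split is only one simple way to form two groups; the repository's scripts may define groups differently. It relies on the survival package, which is distributed with R.

```r
library(survival)

set.seed(42)
n <- 80

# Simulated cohort (not real patient data): follow-up time, event indicator,
# and two hypothetical continuous features.
sim <- data.frame(
  time    = rexp(n, rate = 0.1),
  status  = rbinom(n, 1, 0.7),
  probe_A = rnorm(n),
  probe_B = rnorm(n)
)

# Log-rank p-value for a feature dichotomized at its median (an assumption).
logrank_p <- function(x, time, status) {
  group <- factor(x > median(x), labels = c("low", "high"))
  fit <- survdiff(Surv(time, status) ~ group)
  pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
}

features <- c("probe_A", "probe_B")
raw_p <- sapply(features, function(f) logrank_p(sim[[f]], sim$time, sim$status))

# Benjamini-Hochberg adjustment across all tested features.
adj_p <- p.adjust(raw_p, method = "BH")

# Kaplan-Meier estimate for one feature; print() reports group sizes,
# event counts, and median survival.
sim$group_A <- factor(sim$probe_A > median(sim$probe_A), labels = c("low", "high"))
km_fit <- survfit(Surv(time, status) ~ group_A, data = sim)
print(km_fit)
print(data.frame(feature = features, raw_p = raw_p, adj_p = adj_p))
```

The BH-adjusted values are never smaller than the raw p-values, so a feature that is not significant before adjustment cannot become significant after it.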

Main Analysis: Extended TAaCGH Pipeline

The central script extending the TAaCGH workflow is cnv_region_probe2gene.R. After TAaCGH has identified significant regions, this script tests the individual probes within those regions and locates the genes at or near the statistically significant probes. Running this pipeline generates a Results/ directory organized by breast cancer subtype or characteristic (e.g., Luminal A, Luminal B, Basal, HER2, PI3K).

Each subtype folder includes a table of BH-adjusted (FDR) p-values for each probe in the region. Optional Bonferroni-corrected values (FWER) are also provided but are not emphasized in this study. Each folder also contains a file of patient profiles relevant to that subtype and a list of significant probes from the previously identified TAaCGH regions, annotated by BH or Bonferroni significance.
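The relationship between the BH and Bonferroni annotations can be sketched in base R as follows. The probe IDs and p-values are made up for illustration, and the column names and the 0.05 threshold are assumptions rather than the layout of the pipeline's actual output tables.

```r
# Hypothetical per-probe p-values for one TAaCGH region (illustrative only).
probe_tests <- data.frame(
  probe_id = paste0("probe_", 1:5),
  p_value  = c(0.001, 0.012, 0.02, 0.03, 0.5),
  stringsAsFactors = FALSE
)

alpha <- 0.05  # assumed significance threshold

# BH controls the false discovery rate; Bonferroni controls the
# family-wise error rate and is the stricter of the two.
probe_tests$p_bh   <- p.adjust(probe_tests$p_value, method = "BH")
probe_tests$p_bonf <- p.adjust(probe_tests$p_value, method = "bonferroni")

# A probe that passes Bonferroni always passes BH, so three labels suffice.
probe_tests$significance <- ifelse(
  probe_tests$p_bonf < alpha, "BH and Bonferroni",
  ifelse(probe_tests$p_bh < alpha, "BH only", "not significant")
)

print(probe_tests)
```

With these made-up values, four probes pass BH at 0.05 while only the first also passes Bonferroni, which shows why a BH-based list is typically longer than a Bonferroni-based one.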

For each probe identified as significant by BH correction, a dedicated subfolder is created. The probes adjacent to the significant probe define the genomic interval surrounding it. The Ensembl genome browser is then queried to retrieve annotation data, including HGNC symbols, Ensembl gene IDs, gene biotypes, and genomic coordinates, for all genes located within or near that interval. The resulting files summarize the genes found in each region and indicate which genes lie exactly at the probe location. Each folder also contains boxplots illustrating copy-number or expression differences by subtype (e.g., Luminal A, Luminal B, Basal, HER2, PI3K), along with a companion file reporting the statistical results for these plots. In addition, a biotype summary file describes the composition and characteristics of the genes located between the adjacent probes.
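The interval logic described above can be sketched in base R. This is a simplified illustration, not the code in cnv_region_probe2gene.R: the probe positions, gene names, coordinates, and biotypes are invented, the helper names (interval_for_probe, genes_in_interval) are hypothetical, and a small in-memory gene table stands in for the Ensembl query so that the example runs offline. All probes are assumed to lie on one chromosome.

```r
# Hypothetical probes on a single chromosome (positions are made up).
probes <- data.frame(
  probe_id = c("p1", "p2", "p3", "p4"),
  chr      = "5",
  pos      = c(1000, 2500, 4000, 6000),
  stringsAsFactors = FALSE
)

# Stand-in for an Ensembl annotation table (all values are made up).
genes <- data.frame(
  gene    = c("geneA", "geneB", "geneC", "geneD"),
  chr     = "5",
  start   = c(500, 2400, 3900, 7000),
  end     = c(1200, 2600, 4100, 8000),
  biotype = c("protein_coding", "lncRNA", "protein_coding", "lncRNA"),
  stringsAsFactors = FALSE
)

# Interval spanned by the probes immediately before and after the target;
# at the ends of the list the target probe itself bounds the interval.
interval_for_probe <- function(probes, id) {
  probes <- probes[order(probes$pos), ]
  i <- match(id, probes$probe_id)
  list(
    chr       = probes$chr[i],
    start     = probes$pos[max(i - 1, 1)],
    end       = probes$pos[min(i + 1, nrow(probes))],
    probe_pos = probes$pos[i]
  )
}

# Genes overlapping the interval, flagged when they cover the probe itself.
genes_in_interval <- function(genes, iv) {
  hit <- genes$chr == iv$chr & genes$start <= iv$end & genes$end >= iv$start
  out <- genes[hit, ]
  out$at_probe <- out$start <= iv$probe_pos & out$end >= iv$probe_pos
  out
}

iv   <- interval_for_probe(probes, "p2")
hits <- genes_in_interval(genes, iv)
biotype_summary <- table(hits$biotype)

print(hits)
print(biotype_summary)
```

For probe p2 the interval runs from p1 to p3, so geneA, geneB, and geneC are returned and only geneB is flagged as lying at the probe position. In a real run the genes table would come from an Ensembl lookup rather than being typed by hand.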