Genome and developmental transcriptome of the forensically important blowfly, Phormia regina

Lin, Sheng-Hao 1; Bellantuono, Anthony 1; Wells, Jeffrey 1; DeGennaro, Matthew 1

Published Jul 08, 2025; Updated Oct 27, 2025 on Dryad. https://doi.org/10.5061/dryad.w6m905r04

Abstract

Forensic entomologists often rely on the morphological traits of insects, such as length and weight, to estimate the time of death. Blowflies like Phormia regina are particularly significant in North American forensic investigations. However, age estimation of older P. regina maggots becomes challenging due to limited morphological changes during the lengthy L3 larval phase of development. To address this gap, we used transcriptomic profiling of blowfly maggots to generate molecular markers that specify their age. We first characterized maggot weight, behavior, and mRNA in 10-hour increments during development. At 27.5°C, the weight of the maggots increased from when first recorded at 70 through 100 hours and then remained stable from 110 hours to pupation. The behavioral transition between the feeding and the post-feeding wandering stage usually took place between 100 and 120 hours. Second, we built a chromosomal-scale Phormia regina genome annotated with long mRNA reads to provide a reliable database to uncover transcriptomic signatures during larval development. We applied differential gene expression analysis (DEGs), weighted gene co-expression network analysis (WGCNA), and the generalized linear model (GLM) to identify nine candidate genes that all three statistical analyses indicated are useful for delineating the age of otherwise indeterminate L3 maggots. In turn, these genes could be used to design a quantitative PCR protocol for more accurate estimates of the time of death.

Dataset DOI: 10.5061/dryad.w6m905r04

Description of the data and file structure

Blowflies from the Calliphoridae family are drawn to decaying organic matter. Female blowflies lay eggs on corpses, which hatch into larvae within hours to a day. These larvae grow through several stages before becoming adults, aiding in estimating the postmortem interval based on their development stage. Forensic entomologists use traits like length and weight for time of death estimation, but older larvae pose challenges due to limited morphological changes during their lengthy L3 phase.

This dataset supports time-of-death estimation by identifying genes whose expression changes as Phormia regina maggots age. We characterized maggot development and identified a transition between feeding and wandering stages at 100-120 hours at 27.5°C. We built a chromosomal-scale genome annotated with long and short mRNA reads, and applied differential gene expression analysis, weighted gene co-expression network analysis, and generalized linear models to identify transcripts that showed robust change. We found nine candidate genes (Y5078, Y5076, agt, ech1, dhb4, asm, gabd, acohc, Ivd) that were significant in all three statistical approaches and are likely to be useful for age estimation in L3 maggots. The dataset also contains other genes that may be used to characterize transcriptional changes in L3 blowfly maggots.
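
Keeping only the genes flagged by all three analyses amounts to a set intersection. The following is a minimal Python sketch of that idea; the gene names are made-up placeholders (not results from this dataset), and `shared_candidates` is a hypothetical helper name.

```python
# Placeholder gene lists for illustration only; these are NOT the study's results.
deg_genes = {"geneA", "geneB", "geneC", "geneD"}
wgcna_genes = {"geneB", "geneC", "geneD", "geneE"}
glm_genes = {"geneC", "geneD", "geneF"}


def shared_candidates(*gene_sets):
    """Return the sorted list of genes present in every supplied collection."""
    if not gene_sets:
        return []
    return sorted(set.intersection(*map(set, gene_sets)))


candidates = shared_candidates(deg_genes, wgcna_genes, glm_genes)
print(candidates)
```

The real gene lists for each analysis can be taken from the "Figure 9A" tab of Lin_et_al_dryad_data.xlsx.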

Files and variables

File Structure

All files are provided in .zip format and named according to the figure or table they support in the publication. Each .zip includes a menu.xlsx describing its internal files and purpose.
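
As a quick check after download, the contents of any archive can be listed without extracting it. The sketch below uses only Python's standard `zipfile` module; `summarize_archive` is a hypothetical helper, and it assumes menu.xlsx may sit either at the top level or inside a subfolder, since this README does not say which.

```python
import zipfile


def summarize_archive(zip_path):
    """List member names and report whether a menu.xlsx file is present."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    # Compare only the final path component so nested menu.xlsx files count too.
    has_menu = any(name.rsplit("/", 1)[-1] == "menu.xlsx" for name in names)
    return {"members": names, "has_menu": has_menu}


# Example call (requires the archive to be in the working directory):
# info = summarize_archive("Datainput_Figure3A.zip")
# print(info["has_menu"], len(info["members"]))
```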

Input Files

File: Datainput_Figure3A.zip

Description: Input files used for genome assembly statistics and sequencing coverage analysis. Contains 8 subfolders:

· blast_add_blobtool/: Contains output from blast_add_blobtool.sh

o ASSEMBLY_NAME.ncbi.blastn.out: This file contains the BLASTN alignment results of the assembly sequences against the NCBI nucleotide database, used by blast_add_blobtool to assign taxonomic classification and detect possible contamination.

· blastn/: Input for blastn.sh

o Pregina.p_ctg.purged.fa: This file contains the purged primary contig assembly in FASTA format, used as the query input for BLASTN to identify sequence similarity against a reference database.

· blobtool_create/: Input for blobtool_create.sh

o P_HiCGenome.yml: This YAML file contains the metadata and configuration information (including taxonomic and coverage data) required for initializing the BlobToolKit dataset.

o out_JBAT.FINAL.fa: This file contains the final genome assembly in FASTA format, providing the sequence data to be analyzed and visualized by BlobToolKit.

· busco_genome/: Input for busco_genome.sh

o out_JBAT.FINAL.fa: This file contains the assembled genome in FASTA format, which is analyzed by BUSCO to assess genome completeness based on the presence of conserved single-copy orthologs.

· buscoAdd/: Input for buscoAdd.sh

o full_table.tsv: This file contains the full summary output from BUSCO, listing the status (complete, duplicated, fragmented, or missing) of each BUSCO gene in the analyzed genome assembly.

o missing_busco_list.tsv: This file contains the list of BUSCO orthologs that were identified as missing from the assembly, used by BlobToolKit to refine completeness estimates and taxonomic assignments.

· hifiasm/: Input for hiasm.sh

o m64140_200721_055642.Q20.fastq: Raw HiFi reads (available at PRJNA990781).

o Pregina.asm.bp.p_ctg.gfa: This file contains the primary contig assembly graph in GFA format generated by Hifiasm, representing the sequence assembly structure and contig connectivity.

· purge_dups_step1/: Input for purge_dups_step1.sh

o config.Pregina.json: This JSON configuration file contains the paths and parameters required by the first step of Purge_Dups to identify and remove redundant haplotypic duplications from the genome assembly.

· purge_dups/: Input for purge_dups.sh

o config.Pregina.json: This JSON configuration file specifies the input assembly, coverage information, and parameters needed by the run_purge_dups.py script to identify and remove duplicated haplotypic regions from the genome assembly.

File: DataInput_Figure3B.zip

Description: Input files for Hi-C genome scaffolding and contact map generation. Contains 5 subfolders:

· 01_mapping_arima/: From 01_mapping_arima.sh

o HHY_1_R1.bam, HHY_1_R2.bam: Paired-end Hi-C reads from a single male Phormia regina (PRJNA990781).

· faidx/: From faidx.sh

o Pregina.p_ctg.removedBac.fa: Final genome assembly after purging duplicates and removing bacterial contamination identified via BlobToolKit.

· step_1_hic/: From step_1_hic.sh

o yahs.out.bin: Binary output for Hi-C scaffold layout.

o yahs.out_scaffolds_final.agp: AGP file showing scaffold order and orientation.

o yahs.out_scaffolds_final.fa.fai: FASTA index for scaffolded genome.

· step2_hic/: From step_2_hic.sh

o out_JBAT.txt: Assembly layout summary from Juicebox.

o out_JBAT.hic: Binary Hi-C contact map for Juicebox visualization.

· yahs/: From yahs.sh

o Pregina.p_ctg.removedBac.fa: Purged and decontaminated genome.

o all1.bam: Hi-C reads mapped to the genome (BAM format).
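
The scaffold order and orientation recorded in an AGP file such as yahs.out_scaffolds_final.agp can be read with a few lines of Python. This is a minimal sketch that assumes the standard tab-separated AGP layout (nine columns, with component type N or U marking gaps); the scaffold and contig names in the example are invented, and `parse_agp` is a hypothetical helper.

```python
def parse_agp(lines):
    """Map each scaffold to its ordered list of (contig, orientation) components."""
    layout = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete AGP record
        scaffold, component_type = fields[0], fields[4]
        if component_type in ("N", "U"):
            continue  # gap record: columns 6-9 describe the gap, not a contig
        layout.setdefault(scaffold, []).append((fields[5], fields[8]))
    return layout


# Invented records for illustration only.
example_agp = [
    "scaffold_1\t1\t500\t1\tW\tctgA\t1\t500\t+",
    "scaffold_1\t501\t700\t2\tN\t200\tscaffold\tyes\tproximity_ligation",
    "scaffold_1\t701\t1000\t3\tW\tctgB\t1\t300\t-",
    "scaffold_2\t1\t400\t1\tW\tctgC\t1\t400\t+",
]
layout = parse_agp(example_agp)
print(layout)
```

With a real file, pass an open file handle instead of the example list.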

File: Datainput_Figure4A-4F_Figure5A_Figure6A-6F.zip

Description: Input for differential gene expression analysis. Contains 3 subfolders:

· debrowser2/: From debrowser2.R

o 7.csv: This file contains the TPM expression values for all genes across all treatment conditions and samples, serving as the primary input for visualization and differential expression analysis in DEBrowser. The trailing number (1–6) indicates the biological replicate.

o meta_data.csv: This file provides the sample metadata (including behavior, age, batch, and other experimental variables) used by DEBrowser to group and interpret the expression data in 7.csv.

· salmon/: From salmon.sh

o {sample}_01.fq.gz, {sample}_02.fq.gz: These files contain the paired-end RNA-seq data for a given sample, used for transcript quantification with Salmon; the RNA-seq libraries are available in the NCBI SRA under BioProject accession PRJNA990781 (sample details provided in Supplemental Table 1).

o TruSeq3-PE-2.fa: This file provides the adapter sequences used for trimming Illumina TruSeq paired-end reads, ensuring clean input data for downstream transcript quantification.

o Phormia_regina.mrna-transcripts.fa: This FASTA file contains the predicted mRNA transcript sequences from the Phormia regina genome, generated using the Funannotate pipeline, and serves as the reference transcriptome for Salmon indexing and quantification.

· tximport/: From tximport.R

o Subfolders 80F1 to 130F3: Contain quant.sf files from maggots displaying feeding behavior, aged from 80 to 130 hours, with replicates 1 to 3.

o Subfolders 80W1 to 130W3: Contain quant.sf files from maggots displaying wandering behavior, aged from 80 to 130 hours, with replicates 1 to 3.

o Phormia_regina.gff3: Genome annotation file for the Hi-C genome assembly, generated using the Funannotate pipeline and used to extract transcript-to-gene relationships for gene-level summarization.

o quant_files.list: A text file listing the paths to all quant.sf files used as input for tximport.
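
To illustrate how transcript-level quant.sf values relate to gene-level values, the sketch below sums TPM per gene using a transcript-to-gene map. It is a simplified stand-in for what tximport does (tximport also handles lengths and counts, which are omitted here) and is not the authors' script. It assumes the standard Salmon quant.sf header (Name, Length, EffectiveLength, TPM, NumReads); the transcript and gene IDs are invented.

```python
import csv
import io
from collections import defaultdict


def gene_level_tpm(quant_sf_handle, tx2gene):
    """Sum transcript TPM values per gene; transcripts missing from tx2gene are skipped."""
    reader = csv.DictReader(quant_sf_handle, delimiter="\t")
    totals = defaultdict(float)
    for row in reader:
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue
        totals[gene] += float(row["TPM"])
    return dict(totals)


# Invented quant.sf content and transcript-to-gene map, for illustration only.
example_quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t850.0\t10.0\t100\n"
    "tx2\t1200\t1050.0\t5.0\t60\n"
    "tx3\t900\t750.0\t2.5\t20\n"
)
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_tpm = gene_level_tpm(example_quant, tx2gene)
print(gene_tpm)
```

In practice the transcript-to-gene map would be derived from Phormia_regina.gff3, and each path in quant_files.list would be opened in turn.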

File: Datainput_Figure7A-B.zip

Description: Input files for WGCNA gene co-expression analysis. Contains 1 subfolder:

· WGCNA/: From WGCNA.R

o update10.xls: This Excel file contains the TPM expression values of all genes across all treatment conditions and replicates, used as input for constructing gene co-expression networks in WGCNA. In the sample labels, F denotes feeding and W denotes wandering behavior; the numbers 80–130 denote age cohorts from 80 to 130 hours in 10-hour increments; and the trailing number (1–6) identifies the biological replicate.
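
Sample labels can be split into behavior, age, and replicate programmatically. The sketch below assumes labels of the form F80_1 or W130_6, as used in the "RNA TPM values" tab of Lin_et_al_dryad_data.xlsx; other files in this dataset may use a different form (for example, the tximport subfolders are named like 80F1), so the pattern is an assumption to adjust as needed.

```python
import re

# Assumed label form: behavior letter, age in hours, underscore, replicate number.
LABEL_PATTERN = re.compile(r"^([FW])(\d+)_(\d+)$")
BEHAVIOR = {"F": "feeding", "W": "wandering"}


def parse_sample_label(label):
    """Split a label such as 'F80_1' into behavior, age in hours, and replicate."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognized sample label: {label!r}")
    code, age, replicate = match.groups()
    return {
        "behavior": BEHAVIOR[code],
        "age_hours": int(age),
        "replicate": int(replicate),
    }


print(parse_sample_label("W130_6"))
```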

File: Datainput_Figure8A-B.zip

Description: Input files for linear regression modeling. Contains 1 subfolder:

· linearregression2/: From linearregression2.R

o time_mean3.txt: This file contains the mean TPM expression values for each gene across replicates within each treatment condition, serving as the basis for modeling gene expression trends over time. The numbers 80–130 denote age cohorts from 80 to 130 hours in 10-hour increments.

o time_slope2.txt: This file contains the calculated slopes of linear regression lines derived from the mean TPM values in time_mean3.txt, representing the direction and rate of gene expression change across treatments.
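
The relationship between the two files can be illustrated with a small least-squares fit. This is a minimal sketch assuming ordinary least squares on per-age mean TPM values; it is not the authors' linearregression2.R, and the TPM values are invented.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


ages_hours = [80, 90, 100, 110, 120, 130]
mean_tpm = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]  # invented values
slope, intercept, r_squared = fit_line(ages_hours, mean_tpm)
print(slope, intercept, r_squared)
```

A positive slope corresponds to expression that rises with age and a negative slope to expression that falls; R² indicates how well a straight line describes the trend.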

File: DataInput_Table1.zip

Description: Input for genome annotation and comparative genomics. Contains 9 subfolders:

· repeatmodelerrun/: From repeatmodelerrun.sh

o Pregina.p_ctg.removedBac.fa: Genome assembly post-duplicate and bacterial sequence removal.

· repeatmasker/: From repeatmasker.sh

o Pregina.p_ctg.removedBac.fa: Input genome from post-duplicate and bacterial sequence removal for repeat masking.

o Phormia-families.fa: Identified repeat families generated by repeatmodeler.

· funannotate_train/: From funannotate_train.sh

o out_JBAT.FINAL.fa: Hi-C scaffolded genome.

o allIsoseq.fq: Concatenated Iso-Seq reads for training.

· funannotate_predict/: From funannotate_predict.sh

o masked.fasta: Repeat-masked genome from “Pregina.p_ctg.removedBac.fa”.

o transcripts.fa: Reference transcripts from Isoseq libraries.

o funannotate_train.coordSorted.bam: Aligned evidence BAM from isoseq to genome.

o funannotate_train.pasa.gff3: PASA-aligned gene structures.

· Isoseq3/: From Isoseq3.sh

o m64140_210607_173521.ccs.bam: Raw Iso-Seq reads (PRJNA990781).

· funannotate_update/: From funannotate_update.sh

o Phormia_regina.gbk: Annotated genome in GenBank format from funannotate_predict.sh step.

o masked.fasta: Repeat-masked genome from Pregina.p_ctg.removedBac.fa.

o Phormia_regina.gff3: Updated GFF3 annotation from funannotate_predict.sh step.

o allIsoseq.fq: Iso-Seq reads.

o pasa.alignAssembly.Template.txt: PASA template configuration.

· interproscan/: From interproscan.sh

o Phormia_regina.proteins.fa: Predicted proteins for domain analysis from “funannotate_update.sh”.

· phobius/: From phobius.sh

o Phormia_regina.proteins.fa: Predicted proteins from “funannotate_update.sh”, used to predict signal peptides and transmembrane domains.

· funannotate_annotate/: From funannotate_annotate.sh

o Phormia_regina.gbk: GenBank file from the previous “funannotate_update.sh” step.

o newisoseq.fa.xml: InterProScan XML annotation.

o newlongreads.txt: Phobius output.

· Data input for Figures 1 and S2: Provided in the Figure 1A tab of Lin_et_al_dryad_data.xlsx.

· Data input for Figures S3–S4: Provided in Datainput_Figure4A-4F_Figure5A_Figure6A-6F.zip.

· Data input for Figure S9: Provided in Datainput_Figure8A-B.zip.

Output Files

File: Dataoutput_Figure3A.zip

Description: Genome assembly and quality metrics.

· HHY1.sorted_cov.json: JSON file containing scaffold-level sequence coverage statistics used to assess uniformity and depth of genome coverage.

· diptera_odb10_busco.json: BUSCO assessment results in JSON format based on the Diptera ortholog dataset, reporting genome completeness using conserved genes.

· full_table.tsv: Tab-separated file listing detailed BUSCO results for each ortholog, including classification (Complete, Fragmented, or Missing).

· gc.json: JSON file with scaffold-level GC content data used for visualizing base composition in quality plots.

· identifiers.json: JSON file mapping internal BlobToolKit identifiers to scaffold names, aiding traceability between datasets.

· length.json: JSON file containing scaffold lengths used in plotting cumulative genome size and contiguity metrics.

· meta.json: Metadata file summarizing assembly properties, project information, and configuration settings for BlobToolKit.

· missing_busco_list.tsv: List of BUSCO genes that were not detected in the assembly, useful for identifying missing genomic content.

· ncount.json: JSON file reporting the number of ambiguous bases (Ns) per scaffold, highlighting assembly gaps.

· short_summary.txt: Human-readable summary of BUSCO completeness results including percentages of Complete, Fragmented, and Missing genes.
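
The per-ortholog statuses in full_table.tsv can be tallied to reproduce the kind of summary reported in short_summary.txt. The sketch below assumes the usual BUSCO table layout (comment lines starting with "#", then tab-separated rows whose first two columns are the BUSCO ID and its status, with duplicated orthologs listed once per copy); the IDs and coordinates are invented.

```python
from collections import Counter


def busco_status_counts(lines):
    """Count BUSCO orthologs per status, counting each BUSCO ID once."""
    status_by_id = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        # Duplicated orthologs appear on several rows; the dict keeps one entry per ID.
        status_by_id[fields[0]] = fields[1]
    return Counter(status_by_id.values())


# Invented rows for illustration only.
example_rows = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n",
    "busco_0001\tComplete\tscaffold_1\t100\t900\t+\t512.3\t260\n",
    "busco_0002\tDuplicated\tscaffold_1\t1500\t2300\t-\t430.1\t255\n",
    "busco_0002\tDuplicated\tscaffold_3\t50\t850\t+\t425.8\t254\n",
    "busco_0003\tFragmented\tscaffold_2\t10\t300\t+\t120.0\t95\n",
    "busco_0004\tMissing\n",
]
counts = busco_status_counts(example_rows)
total = sum(counts.values())
# Completeness here counts both single-copy and duplicated orthologs as found.
completeness = (counts["Complete"] + counts["Duplicated"]) / total
print(dict(counts), completeness)
```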

File: Dataoutput_Figure3B.zip

Description: Hi-C contact matrix outputs.

· out_JBAT (1).assembly: A text file describing the Hi-C–based scaffold assembly layout, showing the order, orientation, and placement of contigs or scaffolds after scaffolding.

· out_JBAT (1).hic: A binary Hi-C contact map file generated by Juicer Tools, used to visualize chromosome-scale interaction patterns in Juicebox.

File: DataOutput_Table1.zip

Description: Comparative statistics and annotations.

· Phormia_regina.mrna-transcripts.fa: FASTA file containing the predicted mRNA transcript sequences from the current genome assembly, used for comparison against the published 2016 genome.

· blastn_gene.txt: BLASTN results of the current genome or transcript sequences queried against the NCBI nucleotide (nt) database, used to assess annotation quality and compare gene-level matches to those reported in the 2016 Phormia regina genome.

· Phormia_regina.stats.json: JSON-formatted summary reporting key statistics of the current genome and transcriptome (e.g., total transcript number, N50, GC content), used for comparison against the 2016 assembly metrics.
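
For readers unfamiliar with the N50 statistic mentioned above, the following minimal sketch shows the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the combined length. The example lengths are invented, and the internal structure of Phormia_regina.stats.json is not assumed here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


example_lengths = [10, 8, 6, 4, 2]  # invented values
print(n50(example_lengths))
```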

Descriptions of Tabs in Lin_et_al_dryad_data.xlsx

Each tab corresponds to data used in figures or tables in the manuscript. All column names are explained for clarity and reuse.

File: Lin_et_al_dryad_data.xlsx

Description: A multi-tab Excel file containing the data corresponding to all figures and analyses.

Variables

1. March 11 2020, Oct 8 2020, Dec 18 2020, March 21 2021
- Purpose: Records of maggot sampling at different developmental stages, documenting behavior (feeding vs. wandering) and weight.
- Columns:
  - Hrs(age): Age of maggots at the time of collection (in hours).
  - Feeding: Number of feeding maggots collected.
  - Wandering: Number of wandering maggots collected.
  - Tota numbers: Total number of maggots collected at each time point.
  - Feeding Age(hrs): Age assigned to feeding maggots for recording purposes.
  - Feeding weight (mg): Average weight (in milligrams) of feeding maggots.
  - Wandering Age(hrs): Age assigned to wandering maggots for recording purposes.
  - Wandering weight (mg): Average weight (in milligrams) of wandering maggots.

2. RNA seq & long reads metadata
- Purpose: Metadata describing the biological and technical details of RNA-seq and Iso-Seq libraries used in the study.
- Columns:
  - sample_name: Unique identifier for each biological replicate.
  - bioproject_accession: NCBI BioProject accession number (PRJNA990781).
  - organism: Species name (Phormia regina).
  - isolate: Description of sample source (e.g., single maggot).
  - breed: Laboratory strain used in the study.
  - isolation_source: Origin of the biological sample (e.g., lab colony).
  - collection_date: Date of sample collection.
  - geo_loc_name: Geographic location of sample origin (if applicable).
  - tissue: Tissue type used for sequencing (e.g., whole tissue).
  - age: Age of the maggot in hours (e.g., "90 hrs").
  - notes: Explanation of sample naming (e.g., F = feeding behavior).
  - Weight(mg): Measured weight of each maggot in milligrams.
  - File name 1–13 on SRA PRJNA990781: File names for sequencing reads deposited in NCBI SRA.
3. Blast uniprot results
- Purpose: Functional annotation of transcripts based on BLASTN matches to the UniProt protein database.
- Columns:
  - qseqid: Transcript ID from the de novo transcriptome.
  - sseqid: UniProt protein ID matched to the transcript.
  - pident: Percentage of nucleotide identity in the alignment.
  - alignment length: Length of aligned region (in base pairs).
  - Mismatch: Number of mismatches between query and subject.
  - gapopen: Number of gap openings in the alignment.
  - qstart: Start position of alignment in the query transcript.
  - qend: End position of alignment in the query transcript.
  - sstart: Start position in the subject (UniProt protein).
  - qend (subject): End position in the subject.
  - evalue: Expectation value indicating the statistical significance of the match.
  - bitscore: Alignment score reflecting the quality of the match.
4. RNA TPM values
- Purpose: Transcript-level expression values (TPM) across all experimental samples.
- Columns:
  - Gene: Gene name.
  - F80_1 to W130_6: TPM (Transcripts Per Million) values for each sample; "F" = feeding behavior, "W" = wandering behavior, numbers indicate age (in hours) and replicate number.
5. Figure 1A
- Purpose: Body weight measurements for feeding and wandering maggots across developmental stages.
- Columns:
  - Feeding Age: Age of feeding maggots.
  - Weight(mg): Individual weight in milligrams of feeding maggots.
  - Wandering Age: Age of wandering maggots.
  - Weight(mg): Individual weight in milligrams of wandering maggots.
6. Figure 1B
- Purpose: Count summary of maggots by age and behavioral status.
- Columns:
  - Age: Age group in hours.
  - Feeding: Number of feeding maggots.
  - Wandering: Number of wandering maggots.
  - Total: Combined total of feeding and wandering maggots.
7. Figure 4A – Figure 4F
- Purpose: Differential expression results of genes between feeding and wandering maggots.
- Columns:
  - F90_1 to W90_6: TPM values for each biological replicate.
  - padj: Adjusted p-value (FDR-corrected) for differential expression.
  - log2FoldChange: Log2-transformed fold change between conditions.
  - pvalue: Unadjusted p-value.
  - stat: Test statistic from the differential expression model.
  - foldChange: Non-log-transformed fold change.
  - log10padj: -log10 of the adjusted p-value (used for visualization).
8. Figure 5A
- Purpose: Clustering of feeding and wandering samples along principal components (PC1 and PC2) used to visualize variance and correlations of replicates.
- Columns:
  - sample: Biological replicate ID (e.g., F80_1, W100_4).
  - PC1: First principal component score for each sample.
  - PC2: Second principal component score for each sample.
  - Behavior: Condition label for each sample (Feeding or Wandering).
9. Figure 5B
- Purpose: Compare candidate genes associated with feeding and wandering behavior in blowfly maggots across studies.
- Columns:
  - Tarone et al 2011: Candidate genes reported in Tarone AM, Foran DR. Gene expression during blow fly development: improving the precision of age estimates in forensic entomology. Journal of Forensic Sciences. 2011;56(S1):S112–S122.
  - Sze et al 2012: Candidate genes reported in Sze SH, Dunham JP, Carey B, Chang PL, Li F, Edman RM, Fjeldsted C, Scott MJ, Nuzhdin SV, Tarone AM. A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms and transcript expression estimates. Insect Molecular Biology. 2012;21(2):205–221.
  - Pimsler et al 2021: Candidate genes reported in Pimsler ML, Hjelmen CE, Jonika MM, Sharma A, Fu S, Bala M, Sze SH, Tomberlin JK, Tarone AM. Sexual dimorphism in growth rate and gene expression throughout immature development in wild type Chrysomya rufifacies (Diptera: Calliphoridae) Macquart. Frontiers in Ecology and Evolution. 2021;9:696638.
  - Pimsler et al 2021 ∩ Lin et al: Genes shared between Pimsler et al. (2021) and this study (intersection).
  - Lin et al: Candidate genes identified in this study.
10. Figure 6A – Figure 6F
- Purpose: Differential expression results of genes across aging maggots.
- Columns:
  - F80_1 to W130_6: TPM values for each biological replicate.
  - padj: Adjusted p-value (FDR-corrected) for differential expression.
  - log2FoldChange: Log2-transformed fold change between conditions.
  - pvalue: Unadjusted p-value.
  - stat: Test statistic from the differential expression model.
  - foldChange: Non-log-transformed fold change.
  - log10padj: -log10 of the adjusted p-value (used for visualization).
11. Figure 7A
- Purpose: Expression values of genes grouped by WGCNA modules for heatmap visualization.
- Columns:
  - gene_id: Gene symbol/identifier for each feature tested.
  - module: The WGCNA module to which the gene was assigned.
  - kME: Module membership, i.e., the Pearson correlation between the gene’s normalized expression profile and the eigengene of its assigned module. Ranges from −1 to 1. Larger absolute values mean the gene follows the module’s overall pattern more closely; |kME| ≥ ~0.8 is often considered a “hub-like” member.
  - p_value: Two-sided significance for the kME correlation (H₀: kME = 0), computed by WGCNA’s Student t-test for correlations. Values are in scientific notation (e.g., 2.91e-13). Consider controlling FDR across genes if you use this for thresholding.
12. Figure 7B
- Purpose: Assignment of genes to WGCNA modules and corresponding expression values.
- Columns:
  - gene_id: Gene symbol/identifier for each feature tested.
  - treatment: Sample label combining stage and timepoint with a replicate ID. F = feeding stage; W = wandering stage. Numbers (e.g., 80, 90, 100, …, 130) indicate the age cohort used in the study. Suffixes _1–_3 are feeding replicates; _4–_6 are wandering replicates (e.g., F90_2, W100_5).
  - normalized_expression: Normalized expression value from DESeq2. These values are on a log2 scale and are comparable across samples for the same gene.
  - module: WGCNA module assignment.
13. Figure 8A – 8B
- Purpose: Age-related expression trends of significantly upregulated (8A) and downregulated (8B) genes.
- Columns:
  - 80–130: Mean TPM values for each age group (in hours).
  - 80STD–130STD: Standard deviation of TPM values at each age.
  - Slope: Slope of a linear regression line fitted to expression values over time.
  - RSquare: R² value indicating the goodness of fit of the linear model.
14. Figure 9A
- Purpose: Venn diagram data showing gene overlap across three analyses (DEG, GLM, and WGCNA).
- Columns:
  - Names: Source of gene list (e.g., DEG, GLM, WGCNA).
  - total: Number of overlapping genes.
  - elements: Names of the overlapping genes.
15. Figure 9B – 9C
- Purpose: TPM profiles of top upregulated (9B) and downregulated (9C) candidate genes identified in overlapping analyses.
- Columns:
  - Hours: Timepoint of sample collection (in hours).
  - Gene columns (e.g., dhb4, asm, gabd, ivd): Gene names.
  - Mean TPM: Average TPM values across replicates.
  - STDEV: Standard deviation of TPM values across replicates.

16. Figure S2
- Purpose: Comparison of medians between the two behavioral groups.
- Columns:
  - Table Analyzed / Col: Defines groups being compared (Column A = Feeding, Column B = Wandering).
  - P value: Probability of observing a difference at least as extreme as the one measured if the two groups did not actually differ.
  - Exact or approximate P value?: Indicates whether the P value was computed exactly or estimated.
  - P value summary: Significance code (**** = highly significant).
  - Significantly different (P < 0.05)?: States whether the groups differ statistically.
  - One- or two-tailed P value?: Specifies whether the test checked both directions (here: two-tailed).
  - Sum of ranks in column A, B: Rank totals assigned to each group.
  - Mann–Whitney U: The test statistic used for comparison.
  - Median of column A: Median weight of Feeding group (n = 502).
  - Median of column B: Median weight of Wandering group (n = 307).
  - Difference: Actual: Observed difference between medians.
  - Difference: Hodges–Lehmann: Estimated median difference using Hodges–Lehmann method.
17. Figure S3A
- Purpose: Differential expression results of genes between feeding and wandering maggots on the heatmap at 90 hours old.
- Columns:
  - F90_1 to W90_6: TPM values for each biological replicate.
  - padj: Adjusted p-value (FDR-corrected) for differential expression.
  - log2FoldChange: Log2-transformed fold change between conditions.
  - pvalue: Unadjusted p-value.
  - stat: Test statistic from the differential expression model.
  - foldChange: Non-log-transformed fold change.
  - log10padj: -log10 of the adjusted p-value (used for visualization).
18. Figure S3B
- Purpose: Differential expression results of genes between feeding and wandering maggots on the heatmap from all aging groups.
- Columns:
  - F90_1 to W130_6: TPM values for each biological replicate.
  - padj: Adjusted p-value (FDR-corrected) for differential expression.
  - log2FoldChange: Log2-transformed fold change between conditions.
  - pvalue: Unadjusted p-value.
  - stat: Test statistic from the differential expression model.
  - foldChange: Non-log-transformed fold change.
  - log10padj: -log10 of the adjusted p-value (used for visualization).
19. Figure S4A – Figure S4C
- Purpose: Differential expression results of genes across aging maggots on the heatmap.
- Columns:
  - F80_1 to W130_6: TPM values for each biological replicate.
  - padj: Adjusted p-value (FDR-corrected) for differential expression.
  - log2FoldChange: Log2-transformed fold change between conditions.
  - pvalue: Unadjusted p-value.
  - stat: Test statistic from the differential expression model.
  - foldChange: Non-log-transformed fold change.
  - log10padj: -log10 of the adjusted p-value (used for visualization).

20. Figure S5 – S8
- Purpose: Gene Ontology (GO) enrichment analysis results for DEGs.
- Columns:
  - FDR: False discovery rate for GO term enrichment.
  - Counts: Number of genes in the input list associated with the GO term.
  - Pathway Genes: Number of background genes associated with the GO term.
  - Fold Enrichment: Ratio of observed to expected gene counts.
  - Pathway: GO term name.
  - URL: Link to the GO term online.
  - Genes: List of genes contributing to the enrichment.
  - gene_ratio: Proportion of input genes associated with the term.
  - Sum: Total number of enriched terms or possibly a summary score.
- GO Categories:
  - Biological Process: Functional biological goals (e.g., cell division).
  - Molecular Function: Biochemical activities (e.g., ATP binding).
  - Cellular Component: Locations within the cell (e.g., nucleus).
21. Figure S9
- Purpose: TPM profiles of housekeeping candidate genes identified in the GLM analysis.
- Columns:
  - 80–130: Mean TPM values for each age group (in hours).
  - 80STD–130STD: Standard deviation of TPM values at each age.
  - Slope: Slope of a linear regression line fitted to expression values over time.
  - RSquare: R² value indicating the goodness of fit of the linear model.
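
For the differential-expression tabs described above (Figures 4, 6, S3, and S4), the foldChange and log10padj columns follow from log2FoldChange and padj as those columns are defined in this README. The sketch below shows that arithmetic and a simple up/down call; the cutoffs (|log2FoldChange| ≥ 1, padj < 0.05) are illustrative assumptions, not thresholds taken from the study.

```python
import math


def derive_columns(log2_fold_change, padj):
    """Return (foldChange, -log10(padj)) from log2FoldChange and padj."""
    fold_change = 2.0 ** log2_fold_change
    neg_log10_padj = -math.log10(padj) if padj > 0 else math.inf
    return fold_change, neg_log10_padj


def classify(log2_fold_change, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label a gene 'up', 'down', or 'not significant' (cutoffs are assumptions)."""
    if padj < padj_cutoff and log2_fold_change >= lfc_cutoff:
        return "up"
    if padj < padj_cutoff and log2_fold_change <= -lfc_cutoff:
        return "down"
    return "not significant"


fold_change, neg_log10_padj = derive_columns(3.0, 0.001)
print(fold_change, neg_log10_padj, classify(3.0, 0.001))
```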

File: Phormia_regina_genome_annotation_file.zip

Description: Genome annotation files in NCBI .asn format (Pregina.asn) and genome assembly .fasta file (P_regina_genome.fa) for submission and archival.

File: Phormia_regina_transcriptome_files.zip

Description: P. regina mRNA transcript file in .fa format (Phormia_regina.mrna-transcripts.fa) and BLASTN results against the UniProt database (uniprot_blast.csv; see “3. Blast uniprot results” in Lin_et_al_dryad_data.xlsx for details).

File: Supplemental_Table1.xlsx

Description: Supplemental_Table1.xlsx contains tabular metadata describing the biological and technical details of the RNA-seq and Iso-Seq libraries used in this study. Each row represents a biological replicate used for transcriptomic analysis.

Variables

sample_name: Unique identifier for each biological replicate.
bioproject_accession: NCBI BioProject accession number (PRJNA990781).
organism: Species name (Phormia regina).
isolate: Description of sample source (e.g., single maggot).
breed: Laboratory strain used in the study.
isolation_source: Origin of the biological sample (e.g., lab colony).
collection_date: Date of sample collection.
geo_loc_name: Geographic location of sample origin (if applicable).
tissue: Tissue type used for sequencing (e.g., whole tissue).
age: Age of the maggot in hours (e.g., "90 hrs").
notes: Explanation of sample naming (e.g., F = feeding behavior).
Weight(mg): Measured weight of each maggot in milligrams.
File name 1–13 on SRA PRJNA990781: File names for sequencing reads deposited in NCBI SRA.

File: Supplemental_Table_2.csv

Description: This file contains per-gene module membership statistics from the WGCNA analysis. Each row corresponds to one gene that passed the variance filter and was included in network construction.

Variables
• gene_id: Gene symbol/identifier for the feature (e.g., ECH1).
• module: WGCNA module assignment.
• kME: Module membership (Pearson correlation, r) between the gene’s normalized expression profile and the eigengene of its assigned module; ranges −1 to 1, larger |r| indicates stronger membership.
• p_value: Two-sided p-value for the kME correlation (reported in scientific notation).
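
The kME value is a Pearson correlation, and the Student t-test for a correlation uses t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The sketch below computes r and t in plain Python for one gene against a module eigengene. Both vectors are invented, the eigengene is taken as given rather than computed from a module, and converting t to a p-value is left out because the standard library has no t-distribution function.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))


def kme_and_t(gene_expression, eigengene):
    """Return (kME, t statistic) for one gene against a module eigengene."""
    r = pearson_r(gene_expression, eigengene)
    n = len(gene_expression)
    denominator = 1.0 - r * r
    if denominator <= 0:
        return r, math.copysign(math.inf, r)
    return r, r * math.sqrt((n - 2) / denominator)


gene_expression = [1.0, 2.0, 3.0, 4.0, 5.0]  # invented values
module_eigengene = [2.0, 4.0, 5.0, 4.0, 5.0]  # invented values
kme, t_stat = kme_and_t(gene_expression, module_eigengene)
print(kme, t_stat)
```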

The long-read and Iso-Seq libraries are connected to Figures 3A–B.

The RNA-seq libraries are connected to Figures 4–9.

Code/software

Description of the Software and File Structure

The dataset includes zipped folders containing scripts and command files used to run all bioinformatic and statistical analyses in the study. Each .zip file corresponds to a figure or table in the manuscript and contains scripts organized by analysis step:

File: Software_Figure3A.zip

Description: Contains shell scripts for genome assembly and quality assessment.

o hiasm.sh: Assembles the genome using HiFi reads.

o purge_dups_step1.sh and purgedups.sh: Remove haplotypic duplications.

o blobtool_create.sh: Generates initial BlobToolKit datasets.

o makedb_blastn.sh and blastn.sh: Run BLASTN against NCBI nt for contaminant detection.

o blast_add_blobtool.sh: Integrates BLAST results into BlobToolKit.

o busco_genome.sh: Runs BUSCO to assess genome completeness.

o buscoAdd.sh: Adds BUSCO results to BlobToolKit for visualization.

File: Software_Figure3B.zip

Description: Contains scripts for Hi-C–based genome scaffolding and contact map generation.

o faidx.sh: Indexes the genome using Samtools.

o 01_mapping_arima.sh: Maps Hi-C reads to the genome.

o yahs.sh: Performs Hi-C scaffolding using YAHS.

o step_1_hic.sh and step_2_hic.sh: Generate Hi-C contact maps and finalize scaffolds.

File: Software_Figure4A-4F_Figure5A_Figure6A-6F.zip

Description: Contains scripts for RNA-seq data processing and differential expression analysis.

o Preprocessing: Adapter trimming with Trimmomatic.

o Quantification: Transcript mapping with Salmon (salmon.sh).

o tximport.R: Imports quantification results and creates gene-level TPM and TMM matrices.

o debrowswer2.R: Performs differential expression analysis using DEBrowser.

File: Software_Figure7A-B.zip

Description: Contains the WGCNA.R script for weighted gene co-expression network analysis (WGCNA), used to identify modules of co-expressed genes related to maggot age and behavior.

File: Software_Figure8A-B_Software_Figure9B-C.zip

Description: Contains linearregression2.R for general linear regression modeling, used to analyze gene expression trends (TPM values) across developmental age treatments.

File: Software_Table1.zip

Description: Contains scripts for genome annotation and supporting evidence generation.

o repeatmodelerrun.sh: Runs RepeatModeler for repeat identification.

o repeatmasker.sh: Masks repeats using RepeatMasker.

o funannotate_train.sh, funannotate_predict.sh: Perform training and prediction via Funannotate.

o Isoseq3.sh: Processes Iso-Seq data to provide transcript evidence for genome annotation refinement.

o funannotate_update.sh: Updates the annotation with Iso-Seq evidence via Funannotate.

o interproscan.sh: Runs InterProScan for protein domain and function annotation.

o Phobius.sh: Predicts signal peptides and transmembrane domains using Phobius.

o funannotate_annotate.sh: Performs the final annotation step via Funannotate, using the results from Phobius and InterProScan.

The Slurm scripts for the genome project were executed on the Slurm-managed high-performance computing (HPC) systems at Florida International University. R scripts can be run locally on a personal computing device.

Access information

Other publicly accessible locations of the data:

The RNA and DNA sequencing data files are available for download from the NCBI Sequence Read Archive under BioProject ID PRJNA990781.

Change Log

13-Sep-2025

Figure and file updates

Updated figure numbering throughout to match the revised manuscript.
- Example: data previously under Figure 2B–2C are now referenced as Figure 3A–3B, with all downstream analyses renumbered accordingly.
Added or updated figure input/output folders:
- Datainput_Figure3A.zip – added updated genome file (P_HiCGenome.yml) under blobtool_create/, and updated BUSCO files (full_table.tsv, missing_busco_list.tsv) under buscoAdd/.
- Dataoutput_Figure3A.zip – includes updated assembly quality metric .json files and new coverage file (HHY1.sorted_cov.json) for the Hi-C genome assembly.

Excel data file (Lin_et_al_dryad_data.xlsx)

Added new tabs:
- Figure 5A – PCA output files.
- Figure 5B – input/output files comparing candidate genes associated with feeding and wandering behavior across studies.
Updated tabs:
- Figure 7A – (previously Figure 5A) updated output of expression values grouped by WGCNA modules to improve figure resolution.
- Figure 7B – (previously Figure 5B) updated output of WGCNA module assignments with corresponding expression values to improve figure resolution.
Expanded descriptions for supplemental figure tabs:
- Figure S2 – statistical results comparing medians between behavioral groups.
- Figure S3A–S3B – differential expression results of feeding vs. wandering maggots on heatmaps.
- Figure S4A–S4C – differential expression results across aging maggots on heatmaps.
- Figures S5–S8 – (previously listed as Figures 6A–6D in the old version) now shifted to supplemental figures.
- Figure S9 – TPM profiles of housekeeping genes.

Expression analysis clarifications

Expanded descriptions of replicate design (Feeding vs. Wandering, 80–130 h cohorts).

Supplemental materials

Added Supplemental Table 2 – genes associated with WGCNA color modules and p-values.
Added Phormia_regina_genome_annotation_file.zip and Phormia_regina_transcriptome_files.zip – explicitly listing updated genome assembly FASTAs, genome annotation files, transcriptome FASTAs, and transcript BLAST results.

Identifying the transition between feeding and wandering behavior

We maintained a laboratory colony of a highly inbred line of Phormia regina. The colony was sourced from Dr. Amanda Roe at the University of Nebraska-Lincoln, and developmental rate data for this line were available [1]. All rearing containers were maintained at 27.5 ± 0.5 °C under a 16:8 light:dark cycle in one SMY04-1 DigiTherm CirKinetics Incubator (TriTech Research, Inc., Los Angeles, California, USA). The incubator was equipped with uniform lighting, additional fans, and a port for thermometer access.

Each equal-age cohort was reared in a plastic 72 × 72 × 100 mm insect breeding box (Gyeonggi-do, Korea). Each box contained 0.5 cm of sawdust and 30 g of fresh chicken liver in a suspended 4 oz paper cup. Eggs were obtained on a paper towel soaked with chicken liver blood, which was placed in a cage of P. regina adults for 30 minutes before removal. The removal time was considered age zero for those insects. About 500–1000 newly deposited eggs were placed on wet paper in a covered petri dish. The eggs were kept at high humidity over an open water container at 27.5 ± 0.5 °C. After 24 hours, fifteen newly hatched maggots were transferred onto each aliquot of fresh liver inside a rearing box (see Fig 1A). All boxes were kept in the same incubator at 27.5 ± 0.5 °C. For each of four trials, 28 boxes with maggots were set up. At each sampling time, four rearing boxes were randomly removed from the incubator. We classified a maggot as feeding if it was still on the liver and as wandering if it was in the sawdust. Sampling involved removal of an entire cohort at a preselected age. Each sampled maggot was placed individually in a 1.5 mL microcentrifuge tube with 1 mL of Thermofisher RNAlater (Invitrogen, MA, USA) at 4 °C. Each tube was labeled with a number for RNA analysis. After 24 hours, the RNAlater was removed from the tube and the insect was stored at −80 °C.

Build a new P. regina genome assembly

To create a high-quality reference assembly for P. regina, a single adult male was sequenced on a PacBio Sequel II, and we performed de novo assembly using hifiasm v0.16.1 with purge mode to generate a male P. regina haplotype and diplotype assembly [2]. Duplicate contigs were further eliminated with purge_dups v1.2.6 [3]. Contaminated contigs were identified by NCBI BLAST against a publicly available nucleotide database and filtered with BlobToolKit v2.6.4 [4]. Repeat element boundaries and a repeat database were generated de novo from the clean contigs using RepeatModeler v2.0.2 [5–13]. A single adult male fly was sequenced on an Illumina NovaSeq 6000 to create Hi-C sequencing libraries. The Hi-C reads were first mapped to the clean contigs with Arima mapping, and the resulting .bam file was used for chromosome-scale scaffolding with YaHS v1.1 [14]. A Hi-C heat map was generated with Juicebox v2.20.00 [15]. Curation and chromosomal boundaries were manually edited following the Genome Assembly Cookbook [16]. The chromosomal genome was annotated using the Funannotate v1.8.13 annotation pipeline, with Iso-Seq libraries providing support for evidence-based gene prediction during annotation (see Isoseq mRNA analysis of P. regina) [17–22].

RNA extraction

For each age cohort, one labeled maggot was selected based on the average weight of the cohort. The specimen was homogenized with a pestle in 200 µL of Ambion TRIzol reagent (MA, USA). RNA was extracted using a QIAGEN RNeasy Plus extraction kit (MA, USA). RNA concentration was measured with an Invitrogen Qubit Fluorometer using a Qubit RNA HS Assay kit (MA, USA). RNA integrity was evaluated with a Bioanalyzer using an Agilent RNA 6000 Pico chip. The process was repeated to obtain three replicates for each age cohort (Hong Kong, China). A total of 33 RNA samples from maggots of known ages were sent to BGI (CA, USA) for Illumina RNA sequencing, and 2 RNA samples from adult flies were sent to Genewiz (NJ, USA).

Isoseq mRNA analysis of P. regina

A single virgin adult male, a single juvenile virgin adult female, a 110-hour feeding third-instar maggot, and a 110-hour wandering maggot were submitted to PacBio (CA, USA) for Iso-Seq on the PacBio Sequel II platform. Consensus sequences were generated from the SMRTbell libraries and collapsed into isoforms with Isoseq v3.0 [23]. The long-read libraries were used as evidence for genome annotation.

Gene expression analysis

The quality of the RNA-seq libraries was initially assessed with FastQC v0.11.9 [24]. Adapter sequences were trimmed from the short reads with Trimmomatic v0.39 [25]. The clean short-read libraries were mapped to the newly annotated genome using Salmon v1.8.0 for quantification [26]. Count matrices were obtained with tximport [27]. Differential expression analysis was performed with DESeq2 v1.34.0 (Bioconductor) [28]. Gene names were manually curated based on BLASTn results against orthologous entries in UniProt v2023_02. Pairwise comparisons were performed among all treatments to identify differentially expressed transcripts (> 2-fold change, FDR < 0.01) using DEBrowser v3.18 [29]. Gene co-expression analysis was performed with DESeq2 v1.34.0 followed by tidyverse v2.0.0, magrittr v2.0.3, and WGCNA v1.72.1 in R v2022.12.0+353 [30–31]. Gene ontology enrichment analysis was performed using ShinyGO v0.76 with an FDR-corrected cutoff of 0.05 [32].
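The differential expression thresholds above (more than 2-fold change, FDR below 0.01) can be illustrated with a short sketch. It is not the DEBrowser or DESeq2 implementation: the gene names, fold changes, and p-values are hypothetical, and the Benjamini-Hochberg procedure is assumed here as the FDR correction.

```python
from math import log2

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical results: (gene, fold change, raw p-value).
results = [
    ("gene_a", 4.0, 0.0001),
    ("gene_b", 0.2, 0.004),
    ("gene_c", 1.5, 0.03),
    ("gene_d", 3.0, 0.5),
]

fdr = benjamini_hochberg([p for _, _, p in results])

# Keep genes changing more than 2-fold in either direction with FDR < 0.01.
significant = [
    gene
    for (gene, fold_change, _), q in zip(results, fdr)
    if abs(log2(fold_change)) > 1 and q < 0.01
]
print(significant)
```

Working on the log2 scale treats a 2-fold increase and a 2-fold decrease symmetrically, so a single |log2 fold change| > 1 test covers both directions.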

Finding candidate genes for predicting larval age

Raw reads from the aging maggot developmental series were normalized to TPM values with DESeq2 v1.34.0 [28]. A linear regression model was applied to the mean TPM values of each age cohort to define housekeeping genes (s < −2, all values greater than 0), upregulated genes (b₁ > 1 and R² between 0.9 and 1), and downregulated genes (b₁ < −1 and R² between 0.9 and 1). The correlation plots were created using ggplot2 [33]. The Venn diagram list was created with the VennDiagram v1.7.3 package in RStudio 0.15.0 and drawn in Procreate v5.3.7 (Savage Interactive Pty Ltd) [34–35]. Linear regression figures were created in Prism 10 v10.2.2 (GraphPad Software).
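The slope and R² criteria for upregulated and downregulated genes can be sketched as follows. This is an illustration under stated assumptions rather than the contents of linearregression2.R: the TPM values are hypothetical, age is assumed to be expressed in hours (so b₁ is in TPM per hour), and the housekeeping criterion is not implemented because its definition is not fully specified here.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return b0, b1, r_squared

def classify(b1, r_squared, min_r2=0.9):
    """Apply the slope and R-squared thresholds described in the text."""
    if r_squared >= min_r2 and b1 > 1:
        return "upregulated"
    if r_squared >= min_r2 and b1 < -1:
        return "downregulated"
    return "unclassified"

# Assumed age cohorts in hours and hypothetical mean TPM per cohort.
ages_h = [80, 90, 100, 110, 120, 130]
mean_tpm = {
    "gene_rising": [10, 32, 49, 71, 90, 108],
    "gene_falling": [150, 128, 111, 88, 70, 52],
    "gene_flat": [40, 42, 39, 41, 40, 38],
}

labels = {}
for gene, tpm in mean_tpm.items():
    _, b1, r2 = fit_line(ages_h, tpm)
    labels[gene] = classify(b1, r2)
    print(f"{gene}\tb1={b1:.2f}\tR2={r2:.3f}\t{labels[gene]}")
```

Requiring both a steep slope and a high R² keeps only genes whose expression changes substantially and near-linearly with age, which is the property that makes a gene useful as an age marker.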

1. Roe AL. Development Modeling of Lucilia sericata and Phormia regina (Diptera: Calliphoridae). University of Nebraska-Lincoln; 2014.

2. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18: 170–175.

3. Guan D, McCarthy SA, Wood J, Howe K, Wang Y, Durbin R. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 2020;36: 2896–2898.

4. Challis R, Richards E, Rajan J, Cochrane G, Blaxter M. BlobToolKit – Interactive Quality Assessment of Genome Assemblies. G3 Genes|Genomes|Genetics. 2020;10: 1361–1374.

5. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A. 2020;117: 9451–9457.

6. Bao Z, Eddy SR. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002;12: 1269–1276.

7. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21 Suppl 1: i351–8.

8. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27: 573–580.

9. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9: 18.

10. Ou S, Jiang N. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 2018;176: 1410–1422.

11. Katoh K, Misawa K, Kuma K-I, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30: 3059–3066.

12. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22: 1658–1659.

13. Wheeler TJ. Large-Scale Neighbor-Joining with NINJA. Algorithms in Bioinformatics. 2009; 375–389.

14. Zhou C, McCarthy SA, Durbin R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics. Edited by C. Alkan; 2023.

15. Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 2016;3: 99–101.

16. Durand NC, Shamim MS, Machol I, Rao SSP, Huntley MH, Lander ES, Dudchenko O, Batra SS, Omer AD, Nyquist SK, Hoeger M, Aiden EL. Genome Assembly Cookbook. The Center for Genome Architecture, Baylor College of Medicine & Rice University.

17. Palmer JM. Funannotate: pipeline for genome annotation. 2016.

18. Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 2005;33: W465–7.

19. Bromberg Y, Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 2007;35: 3823–3835.

20. Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9: R7.

21. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29: 644–652.

22. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31: 5654–5666.

23. Zhou G, Chen X, Pang J, Srinives P. Domestication of Agronomic Traits in Legume Crops. Frontiers Media SA; 2021.

24. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010.

25. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120.

26. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14: 417–419.

27. Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 2015;4: 1521.

28. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15: 550.

29. Kucukural A, Yukselen O, Ozata DM, Moore MJ, Garber M. DEBrowser: interactive differential expression analysis and visualization tool for count data. BMC Genomics. 2019;20: 6.

30. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9: 559.

31. Bache SM, Wickham H. magrittr: a forward-pipe operator for R. R package version.

32. Ge SX, Jung D, Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36: 2628–2629.

33. Wickham H. Getting Started with ggplot2. In: Wickham H, editor. ggplot2: Elegant Graphics for Data Analysis. Cham: Springer International Publishing; 2016. pp. 11–31.

34. Wickham H. Getting started with ggplot2. Use R! Cham: Springer International Publishing; 2016. pp. 11–31.

35. R Core Team, R. R: A language and environment for statistical computing. 2013.
