Synthetic eco-evolutionary dynamics in simple molecular environment
Data files
Feb 27, 2024 version files 13.74 GB
- oligo1_R1.fastq
- oligo1_R2.fastq
- R1_06_R1.fastq
- R1_06_R2.fastq
- R12_06_R1.fastq
- R12_06_R2.fastq
- R15_06_R1.fastq
- R15_06_R2.fastq
- R18_06_R1.fastq
- R18_06_R2.fastq
- R2_06_R1.fastq
- R2_06_R2.fastq
- R21_06_R1.fastq
- R21_06_R2.fastq
- R24_06_R1.fastq
- R24_06_R2.fastq
- R3-06_R1.fastq
- R3-06_R2.fastq
- R6_06_R1.fastq
- R9_06_R1.fastq
- R9_06_R2.fastq
- README.md
Abstract
The understanding of eco-evolutionary dynamics, and in particular of the mechanism by which species emerge, is still fragmentary and in need of test-bench model systems. To this end, we developed a variant of SELEX in vitro selection to study the evolution of a population of ∼10^15 single-strand DNA oligonucleotide ‘individuals’. We begin with a seed of random sequences, which we select via affinity capture from ∼10^12 DNA oligomers of fixed sequence (‘resources’) over which they compete. At each cycle (‘generation’), the ecosystem is replenished via PCR amplification of survivors. Massively parallel sequencing indicates that across generations the variety of sequences (‘species’) drastically decreases, while some of them become populous and dominate the ecosystem. The simplicity of our approach, in which survival is granted by hybridization, enables a quantitative investigation of fitness through a statistical analysis of binding energies. We find that the strength of individual-resource binding dominates the selection in the first generations, while inter- and intra-individual interactions become important in later stages, in parallel with the emergence of prototypical forms of mutualism and parasitism.
README: Synthetic Eco-Evolutionary Dynamics in Simple Molecular Environment
https://doi.org/10.5061/dryad.5tb2rbpbs
Dataset: experimental files for the "oligo1" dataset, in fastq format.
Description of the data and file structure
Each file is named as follows:
RX_06_RY.fastq
X: the round (from 1 to 24)
Y: either 1 or 2, for reverse/forward sequences
The oligo1_R1 and oligo1_R2 files contain the data for cycle 0.
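The naming scheme above can be turned into a small parser. The following is a minimal sketch, not part of the released code: the function name `parse_fastq_name` and the regular expressions are illustrative assumptions inferred from the file list (including the one pair that is listed with a hyphen, `R3-06_...`).

```python
import re

# Patterns inferred from the file list: "R<round>_06_R<read>.fastq".
# One listed pair uses a hyphen ("R3-06_..."), so both separators are accepted.
ROUND_FILE = re.compile(r"^R(\d+)[_-]06_R([12])\.fastq$")
CYCLE0_FILE = re.compile(r"^oligo1_R([12])\.fastq$")


def parse_fastq_name(name):
    """Return (round, read) for a dataset file name, or None if it does not match."""
    m = ROUND_FILE.match(name)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = CYCLE0_FILE.match(name)
    if m:
        return 0, int(m.group(1))  # the oligo1_* files hold cycle 0
    return None


names = ["R12_06_R2.fastq", "oligo1_R1.fastq", "R3-06_R1.fastq", "README.md"]
parsed = {name: parse_fastq_name(name) for name in names}
```

Sorting the parsed tuples gives the files in chronological order of the experiment, with cycle 0 first.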
Sharing/Access information
Analysis code and other materials can also be found at https://github.com/francescomambretti/stat_phys_synthetic_biodiversity/
Code/Software
This code was written by Francesco Mambretti (2021-2023). It is meant to analyze experimental FASTQ files from the SEDES experiment.
------------------------------------- REQUIREMENTS -------------------------------------
- python3 with numpy, matplotlib, itertools, more_itertools, biopython, pandas, difflib
- C++ installed, with a C++ compiler supporting (at least) C++11
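A quick way to check the Python side of these requirements is sketched below. It is only an illustration and not part of the released code; the helper name `missing_modules` is an assumption. Note that itertools and difflib ship with the Python standard library, and that biopython is imported under the name `Bio`.

```python
import importlib.util

# Import names for the packages listed above; biopython is imported as "Bio".
REQUIRED = ["numpy", "matplotlib", "itertools", "more_itertools",
            "Bio", "pandas", "difflib"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required Python modules are available.")
```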
------------------------------------- input_params.py -------------------------------------
- first, modify input_params.py, setting:
  - key1: "oligo1", "oligo2", "negative" or "seriesN" -> decide which dataset (others can be added)
  - key2: "R1", "R2", "R1R2" -> different reading directions (should give similar but not identical results, due to experimental imperfections)
  - key3: "fw", "rev" or "all" -> select only forward/reverse/all sequences
  - key_filter: True/False -> whether to apply (True) or not (False) a special criterion to filter data. Default criterion here is to exclude PCR by-products.
  - key_no_cut: True/False -> whether to print either full sequences (True) or cut sequences, deleting the primer bases (False)
These options can be set either manually or via an external script such as the loop.py included here. To use loop.py, edit input_params.py and set:
key1="$KEY1$"
key2="$KEY2$"
key3="$KEY3$"
key_filter=$KEY_FILTER$
key_no_cut=$KEY_NO_CUT$
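The placeholder syntax suggests a plain text substitution. The sketch below shows one way such a substitution could work; it is a guess at the mechanism, not the actual loop.py, and the names `TEMPLATE` and `fill_template` are hypothetical.

```python
import itertools

# Hypothetical template mirroring the placeholder lines in input_params.py.
TEMPLATE = (
    'key1="$KEY1$"\n'
    'key2="$KEY2$"\n'
    'key3="$KEY3$"\n'
    'key_filter=$KEY_FILTER$\n'
    'key_no_cut=$KEY_NO_CUT$\n'
)


def fill_template(template, values):
    """Replace each $NAME$ placeholder with the string form of its value."""
    text = template
    for name, value in values.items():
        text = text.replace(f"${name}$", str(value))
    return text


# One filled-in parameter file per (key2, key3) combination.
configs = [
    fill_template(TEMPLATE, {"KEY1": "oligo1", "KEY2": k2, "KEY3": k3,
                             "KEY_FILTER": True, "KEY_NO_CUT": False})
    for k2, k3 in itertools.product(("R1", "R2"), ("fw", "rev"))
]
```

Each entry of `configs` is valid Python source that could be written to input_params.py before launching one analysis run.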
Optionally, other parameters can be modified:
- colors of RSA histograms
- True/False for creating (or not) abundance histograms for unique strands
- min quality of the reads
- l -> resource length (defaults to 20 bases)
- subset_steps -> to analyze only the first subset_steps for faster analysis on incomplete datasets
- use_stop -> whether to actually stop after subset_steps (True/False)
- n -> number of top-n strands for the related analysis of dominant individuals
- random_seq=50 -> number of random nucleotides, by default; currently not used, apart from the ordinary definition of L
- cap_size=25 -> size of fixed sequences at the two ends; not really used
- extra_end=1 -> sometimes there is an extra base; old code versions needed it, currently it is ignored
- L=random_seq+cap_size+extra_end -> max length, with cap and last one - length of predators -> can be edited (e.g. for N series)
- lower_bound=L-6 -> discard strands with fewer than lower_bound bases -> can be edited
Another editable parameter is results_folder, which can be changed in case one needs to save some results separately.
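With the default values listed above, the derived length parameters work out as in the sketch below. The variable names follow input_params.py as described here; the helper `keep_by_length` is a hypothetical illustration of the lower-bound rule, not the actual filtering code.

```python
# Default values as listed in the parameter description above.
random_seq = 50   # number of random nucleotides
cap_size = 25     # size of the fixed sequences at the ends
extra_end = 1     # optional extra base

L = random_seq + cap_size + extra_end   # 50 + 25 + 1 = 76
lower_bound = L - 6                     # 76 - 6 = 70


def keep_by_length(sequences, minimum=lower_bound):
    """Keep only strands with at least `minimum` bases (illustrative helper)."""
    return [seq for seq in sequences if len(seq) >= minimum]


example = ["A" * 76, "A" * 70, "A" * 69]
kept = keep_by_length(example)  # the 69-base strand is discarded
```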
------------------------------------- compilation -------------------------------------
- run make all to generate the C++ executables (C++17 is used, but C++11 compatibility should be enough)
------------------------------------- read_fastq.py -------------------------------------
- execute python3 read_fastq.py, which processes the FASTQ files and generates text files and plots with the outcomes of the performed analyses.

read_fastq.py itself calls:
- find_MCO_serial.x: executable of the corresponding C++ code for Maximum Consecutive Overlap (MCO) calculation between strands; see https://www.mdpi.com/1099-4300/24/4/458 for its definition and related discussions.
- find_equal_pair.x: detects the number of consecutive identical bases between two strands passed on the command line. It is a simplified version based on the same routines as find_MCO_serial, used to detect aliens.
- module_functions.py: process FASTQ files, filter sequences, sort them by abundance, reverse and complement strands, track the abundance of the top-n most abundant ones across cycles and compute their cross-MCO matrix.
- main_plot.py: generate text files and plots for RSA histograms, the Shannon entropy associated with them, the evolution of top-n strands, the fraction of the total population covered by top-n individuals and the 2D histogram of (MCO, MCO_2nd). It calls module_plots.py.
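The abundance-based quantities mentioned above can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the code in module_functions.py or main_plot.py: the function names are hypothetical, the entropy is computed on the relative frequencies of unique sequences with the natural logarithm, and the repository may define or normalize it differently.

```python
import math
from collections import Counter


def rank_by_abundance(sequences):
    """Return (sequence, count) pairs sorted from most to least abundant."""
    return Counter(sequences).most_common()


def shannon_entropy(counts):
    """Shannon entropy (natural log) of the relative frequencies of the counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def top_n_fraction(ranked, n):
    """Fraction of the total population covered by the n most abundant species."""
    total = sum(count for _, count in ranked)
    return sum(count for _, count in ranked[:n]) / total


# Toy population: three "species" with 4, 2 and 2 individuals.
reads = ["ACGT"] * 4 + ["TTGA"] * 2 + ["GGCC"] * 2
ranked = rank_by_abundance(reads)
entropy = shannon_entropy([count for _, count in ranked])
fraction = top_n_fraction(ranked, 1)
```

For the toy population the frequencies are 1/2, 1/4 and 1/4, so the entropy is 1.5 ln 2 and the most abundant species covers half of the population. A drop in entropy across cycles corresponds to the loss of sequence variety described in the abstract.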
Methods
Please see the paper, where all the methods are explained. Our experimental design takes advantage of a selective capture mechanism in which magnetic beads carrying single-stranded DNA filaments of fixed length and sequence target DNA individuals present in a DNA library, based on their level of complementarity. Sequences are selected, amplified via PCR, sequenced, and analysed with the in-house code also available in this repository.
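As a rough illustration of what "level of complementarity" can mean computationally, the sketch below measures the longest run of consecutive Watson-Crick pairs between a resource and an individual in antiparallel orientation. This is an assumption made for illustration only: the function names are hypothetical, and the exact MCO definition used by the C++ code should be taken from the paper cited above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_common_run(a, b):
    """Length of the longest stretch of consecutive identical bases shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for base_a in a:
        curr = [0] * (len(b) + 1)
        for j, base_b in enumerate(b, start=1):
            if base_a == base_b:
                curr[j] = prev[j - 1] + 1  # extend the run ending at the previous bases
                best = max(best, curr[j])
        prev = curr
    return best


def max_complementary_run(resource, individual):
    """Longest run of consecutive Watson-Crick pairs, antiparallel orientation.

    A run of complementary bases between two antiparallel strands is a run of
    identical bases between one strand and the reverse complement of the other.
    """
    return longest_common_run(resource, reverse_complement(individual))


overlap = max_complementary_run("AAACCCGGG", "TTCCGGGT")
```

In this toy case the individual can pair with the resource over the stretch ACCCGG, so `overlap` is 6.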