Synthetic eco-evolutionary dynamics in simple molecular environment

Mambretti, Francesco 1; Casiraghi, Luca 2; Tovo, Anna 3; Paraboschi, Elvezia Maria 4; Suweis, Samir 3; Bellini, Tommaso 2

Published Feb 27, 2024 on Dryad. https://doi.org/10.5061/dryad.5tb2rbpbs

Abstract

The understanding of eco-evolutionary dynamics, and in particular the mechanism of emergence of species, is still fragmentary and in need of test bench model systems. To this aim, we developed a variant of SELEX in-vitro selection to study the evolution of a population of ∼ 10^15 single-strand DNA oligonucleotide ‘individuals’. We begin with a seed of random sequences which we select via affinity capture from ∼ 10^12 DNA oligomers of fixed sequence (‘resources’) over which they compete. At each cycle (‘generation’), the ecosystem is replenished via PCR amplification of survivors. Massive parallel sequencing indicates that across generations the variety of sequences (‘species’) drastically decreases, while some of them become populous and dominate the ecosystem. The simplicity of our approach, in which survival is granted by hybridization, enables a quantitative investigation of fitness through a statistical analysis of binding energies. We find that the strength of individual-resource binding dominates the selection in the first generations, while inter- and intra-individual interactions become important in later stages, in parallel with the emergence of prototypical forms of mutualism and parasitism.

Dataset: experimental files for the "oligo1" dataset, in fastq format.

Description of the data and file structure

The name of each file is as follows:

RX_06_RY.fastq

X: the round (from 1 to 24)

Y: either 1 or 2, for reverse/forward sequences

The oligo1_R1 and oligo1_R2 files contain the data for cycle 0.
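
The naming scheme above can be turned into a small helper. This is a minimal sketch, not part of the published code: the function name fastq_name is hypothetical, and it assumes that round numbers are written without zero padding and that the cycle-0 files carry the same .fastq extension, neither of which is stated in this README.

```python
# Hypothetical helper: build the expected FASTQ file name for a round and read.
# Assumptions (not confirmed by the README): round numbers are not zero-padded,
# and the cycle-0 files are named oligo1_R1.fastq / oligo1_R2.fastq.

def fastq_name(round_number, read):
    """Return the file name for a given round (0 to 24) and read (1 or 2)."""
    if read not in (1, 2):
        raise ValueError("read must be 1 or 2")
    if round_number == 0:
        # Cycle 0 uses the oligo1_R1/R2 naming instead of RX_06_RY.
        return f"oligo1_R{read}.fastq"
    if not 1 <= round_number <= 24:
        raise ValueError("round must be between 0 and 24")
    return f"R{round_number}_06_R{read}.fastq"


if __name__ == "__main__":
    for r in (0, 1, 24):
        print(fastq_name(r, 1), fastq_name(r, 2))
```

If the actual files use a different padding or extension, only the two format strings need to change.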

Sharing/Access information

Analysis code and other materials can also be found here: https://github.com/francescomambretti/stat_phys_synthetic_biodiversity/

Code/Software

This code was written by Francesco Mambretti (2021-2023). It is meant to analyze experimental FASTQ files from the SEDES experiment.

------------------------------------- REQUIREMENTS -------------------------------------

python3 with numpy, matplotlib, itertools, more_itertools, biopython, pandas, difflib
a C++ compiler supporting at least C++11

------------------------------------- input_params.py -------------------------------------

First, modify input_params.py, setting:

key1: "oligo1", "oligo2", "negative" or "seriesN" -> decide which dataset (others can be added)
key2: "R1", "R2", "R1R2" -> different reading directions (should give similar but not identical results, due to experimental imperfections)
key3: "fw", "rev" or "all" -> select only forward/reverse/all sequences
key_filter: True/False -> whether to apply (True) or not (False) a special criterion to filter data. Default criterion here is to exclude PCR by-products.
key_no_cut: True/False -> whether to print either full sequences (True) or cut sequences, deleting the primer bases (False)

These options can be either set manually or via an external script such as the loop.py included here. To use loop.py, edit input_params.py and set:

key1="$KEY1$"
key2="$KEY2$"
key3="$KEY3$"
key_filter=$KEY_FILTER$
key_no_cut=$KEY_NO_CUT$
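
To illustrate the idea of driving input_params.py from a script, here is a minimal sketch of placeholder substitution. It is not the actual loop.py, whose internals are not described here: the names TEMPLATE, fill_template and all_configs are hypothetical, and the sketch assumes that the $NAME$ markers are replaced by plain text substitution.

```python
# Sketch of placeholder substitution; this is NOT the actual loop.py.
# Assumption: input_params.py is treated as a text template with $NAME$ markers.
from itertools import product

TEMPLATE = (
    'key1="$KEY1$"\n'
    'key2="$KEY2$"\n'
    'key3="$KEY3$"\n'
    'key_filter=$KEY_FILTER$\n'
    'key_no_cut=$KEY_NO_CUT$\n'
)


def fill_template(template, values):
    """Replace each $NAME$ marker with the string form of values[NAME]."""
    text = template
    for name, value in values.items():
        text = text.replace(f"${name}$", str(value))
    return text


def all_configs(key1_options, key2_options, key3_options,
                key_filter=True, key_no_cut=False):
    """Yield one filled-in parameter text per combination of the three keys."""
    for k1, k2, k3 in product(key1_options, key2_options, key3_options):
        yield fill_template(TEMPLATE, {
            "KEY1": k1, "KEY2": k2, "KEY3": k3,
            "KEY_FILTER": key_filter, "KEY_NO_CUT": key_no_cut,
        })


if __name__ == "__main__":
    for text in all_configs(["oligo1"], ["R1", "R2"], ["fw", "rev", "all"]):
        print(text)
```

Each generated text could then be written to input_params.py before launching read_fastq.py; that step is left out because it depends on the local setup.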

Optionally, other parameters can be modified:

colors of RSA histograms
True/False for creating (or not) abundance histograms for unique strands
min quality of the reads
l -> resource length (defaults to 20 bases)
subset_steps -> to analyze only the first subset_steps for faster analysis on incomplete datasets
use_stop -> whether to actually apply the subset_steps limit (True/False)
n -> number of top-n strands for the related analysis of dominant individuals
random_seq=50 -> number of random nucleotides, by default; currently used only in the definition of L
cap_size=25 -> size of fixed sequences at the two ends; not really used
extra_end=1 -> sometimes there is an extra base; old code versions needed it, currently it is ignored
L=random_seq+cap_size+extra_end -> max length, with cap and last one - length of predators -> can be edited (e.g. for N series)
lower_bound=L-6 -> discard strands with fewer than lower_bound bases -> can be edited

Another editable parameter is results_folder, which can be changed in case one needs to save some results separately.
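
With the default values listed above, L = 50 + 25 + 1 = 76 and lower_bound = 76 - 6 = 70. The sketch below shows how these parameters and the minimum read quality could be applied to FASTQ records. It is an illustration only: parse_fastq and keep_read are hypothetical names, the quality encoding is assumed to be Phred+33, and the minimum quality is assumed to apply to every base of a read (the real code may use a different criterion).

```python
# Sketch only: how the length parameters combine and how reads might be filtered.
# Assumptions: Phred+33 quality encoding, and a per-base minimum quality.
random_seq = 50                         # number of random nucleotides
cap_size = 25                           # size of the fixed sequences
extra_end = 1                           # possible extra base
L = random_seq + cap_size + extra_end   # 76 with the defaults
lower_bound = L - 6                     # 70 with the defaults


def parse_fastq(lines):
    """Yield (sequence, qualities) pairs from FASTQ text, 4 lines per record."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        seq = lines[i + 1]
        quals = [ord(c) - 33 for c in lines[i + 3]]  # Phred+33 decoding
        yield seq, quals


def keep_read(seq, quals, min_quality, min_length=lower_bound):
    """Keep a read if it is long enough and every base meets min_quality."""
    if len(seq) < min_length:
        return False
    return all(q >= min_quality for q in quals)


if __name__ == "__main__":
    record = ["@read1", "A" * 72, "+", "I" * 72]
    for seq, quals in parse_fastq(record):
        print(len(seq), keep_read(seq, quals, min_quality=30))
```

For real datasets, Biopython (listed in the requirements) provides a FASTQ parser; the hand-written one here only keeps the example free of dependencies.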

------------------------------------- compilation -------------------------------------

Run make all to generate the C++ executables (C++17 is used, but C++11 compatibility should be enough).

------------------------------------- read_fastq.py -------------------------------------

Execute python3 read_fastq.py, which processes the FASTQ files and generates text files and plots with the outcomes of the performed analyses. read_fastq.py in turn calls:

find_MCO_serial.x: executable of the corresponding C++ code for Maximum Consecutive Overlap (MCO) calculation between strands (see https://www.mdpi.com/1099-4300/24/4/458 for its definition and related discussions).
find_equal_pair.x: detects the number of consecutive identical bases between two strands passed on the command line. It is a simplified version based on the same routines as find_MCO_serial, and it is used to detect aliens.
module_functions.py: process FASTQ files, filter sequences, sort them by abundance, reverse and complement strands, track the abundance of the top-n most abundant ones across cycles and compute their cross-MCO matrix
main_plot.py: generate text files and plots for RSA histograms, the Shannon entropy associated with them, the evolution of top-n strands, the fraction of total population covered by top-n individuals and the 2D histogram of (MCO,MCO_2nd). It calls module_plots.py.
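
Some of the operations named in the list above can be sketched in a few lines of plain Python. This is an illustration, not the code in module_functions.py or the C++ executables: all function names are hypothetical, the entropy is assumed to be computed over species frequencies with the natural logarithm, and "consecutive identical bases" is interpreted as the longest contiguous stretch shared by the two strands at any offset. The MCO itself is defined in the paper linked above and is not reproduced here.

```python
# Illustrative sketches; names and conventions are assumptions, not the real code.
from collections import Counter
from math import log

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def abundance_table(sequences):
    """Return (sequence, count) pairs sorted from most to least abundant."""
    return Counter(sequences).most_common()


def shannon_entropy(counts):
    """Shannon entropy (natural log) of the frequencies given by the counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def top_n_fraction(counts, n):
    """Fraction of the total population covered by the n most abundant species."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


def longest_identical_run(a, b):
    """Length of the longest contiguous stretch of bases shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the run that ended at the previous base of both strands.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"]
    table = abundance_table(reads)
    counts = [count for _, count in table]
    print(table)
    print(shannon_entropy(counts), top_n_fraction(counts, 1))
    print(reverse_complement("AACG"), longest_identical_run("ACGTAC", "TTGTAA"))
```

The dynamic-programming loop in longest_identical_run scales with the product of the two strand lengths, which is acceptable for strands of about 76 bases.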
