Data from: Size distribution of function-based human gene sets and the split-merge model

Li, Wentian; Fontanelli, Oscar; Miramontes, Pedro

Published Jul 05, 2016 on Dryad. https://doi.org/10.5061/dryad.rc1kv

Cite this dataset

Li, Wentian; Fontanelli, Oscar; Miramontes, Pedro (2016). Data from: Size distribution of function-based human gene sets and the split-merge model [Dataset]. Dryad. https://doi.org/10.5061/dryad.rc1kv

Abstract

The sizes of paralogues (gene families produced by ancestral duplication) are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of HUGO Gene Nomenclature Committee (HGNC) gene sets deviates from the power law and can be fitted much better by a beta rank function. We propose a simple mechanism to break a power-law size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes are recognized. These operations are not uncommon for a curator of gene sets. A simulation shows that iterating these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a beta rank function. We further illustrate the application of the beta rank function with the distribution of transcription factors and drug target genes among HGNC gene families.
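The split and merge operations described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' simulation code: the function names, the uniformly random split point, the pairwise merging and the `merge_fraction` parameter are all assumptions. The `beta_rank` helper uses the form C (N + 1 - r)^b / r^a, which is how the beta rank function is commonly written and which reduces to a power law when b = 0.

```python
import random


def beta_rank(rank, n, a, b, c=1.0):
    """Beta rank function: c * (n + 1 - rank)**b / rank**a (power law when b == 0)."""
    return c * (n + 1 - rank) ** b / rank ** a


def split_merge_step(sizes, merge_fraction=0.05, rng=random):
    """One iteration: split the largest set in two, then merge a few other sets pairwise.

    All parameter choices here are assumptions made for illustration.
    """
    sizes = sorted(sizes, reverse=True)
    largest = sizes.pop(0)

    # Split the largest set at a random point (a set of size 1 cannot be split).
    if largest >= 2:
        cut = rng.randint(1, largest - 1)
        parts = [cut, largest - cut]
    else:
        parts = [largest]

    # Pick a small, even number of the remaining sets and merge them in pairs.
    n_merge = int(len(sizes) * merge_fraction)
    n_merge -= n_merge % 2
    chosen = set(rng.sample(range(len(sizes)), n_merge))
    picked = [sizes[i] for i in chosen]
    rest = [s for i, s in enumerate(sizes) if i not in chosen]
    merged = [picked[i] + picked[i + 1] for i in range(0, len(picked), 2)]

    # Return sizes in rank order (largest first); the total gene count is unchanged.
    return sorted(rest + merged + parts, reverse=True)
```

Calling `split_merge_step` repeatedly on a list of family sizes (for example, column 2 of ensembl-nov2015.txt) gives a sequence of rank-size distributions that can be compared against `beta_rank`.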

Usage notes

ensembl-nov2015

data for Fig.1: ensembl-nov2015.txt
column 1: Ensembl gene family ID
column 2: number of genes in the (Ensembl) gene family
column 3: description of the (Ensembl) gene family

gf-size-feb2016

data for Fig.2: gf-size-feb2016.txt
column 1: HGNC gene family ID or name
column 2: number of genes in the (HGNC) gene family (possibly including pseudogenes)
column 3: number of genes in the (HGNC) gene family (excluding pseudogenes)
column 4: description of the (HGNC) gene family

gene-fam-with-TF

data for Fig.4: gene-fam-with-TF.txt
column 1: HGNC gene family ID or name
column 2: number of transcription factors in the (HGNC) gene family
column 3: number of genes in the (HGNC) gene family (possibly including pseudogenes)
column 4: description of the (HGNC) gene family

gene-fam-with-drug-target

data for Fig.5: gene-fam-with-drug-target.txt
column 1: HGNC gene family ID or name
column 2: number of drug target genes (gene products) in the (HGNC) gene family
column 3: number of genes in the (HGNC) gene family (possibly including pseudogenes)
column 4: description of the (HGNC) gene family
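A minimal sketch of reading one of these files and ranking families by size is given below. It assumes the files are tab-separated with no header row, which the notes above do not state, so the delimiter may need adjusting. The column names and both function names are hypothetical labels chosen for this example.

```python
import io

# Hypothetical column labels for gf-size-feb2016.txt, following the usage notes.
GF_SIZE_COLUMNS = ["family", "n_genes_with_pseudogenes", "n_genes", "description"]


def read_family_table(handle, columns, delimiter="\t"):
    """Parse delimited lines into dicts; columns named 'n_*' are converted to int.

    Assumes no header row and a tab delimiter (an assumption, not documented).
    """
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split at most len(columns) - 1 times so the description stays whole.
        fields = line.split(delimiter, len(columns) - 1)
        row = {}
        for name, value in zip(columns, fields):
            row[name] = int(value) if name.startswith("n_") else value
        rows.append(row)
    return rows


def rank_sizes(rows, key):
    """Return (rank, size) pairs with the largest family at rank 1."""
    sizes = sorted((row[key] for row in rows), reverse=True)
    return list(enumerate(sizes, start=1))


# Made-up rows for demonstration only; they are not taken from the dataset.
sample = io.StringIO("famA\t12\t10\tExample family A\nfamB\t3\t3\tExample family B\n")
ranked = rank_sizes(read_family_table(sample, GF_SIZE_COLUMNS), "n_genes")
```

For the real data, the `io.StringIO` object would be replaced by an opened file, e.g. `open("gf-size-feb2016.txt")`, and the column list adapted to the file being read.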