The functional dark side

Gene prediction algorithms have limitations and can yield inaccurate ORFs predictions leading to spurious proteins, which can lead to spurious protein families. We decided to track the presence and the distribution of both “spurious” and “shadow” predicted ORFs in our clusters.

Methods

To detect eventual spurious ORFs, we screened our data set against the AntiFam database [1], which contains Pfam protein families “composed solely of spurious open reading frames (ORFs)”.
The shadow ORFs are artifacts produced during the identification of the coding region that defines an ORF. During this process, constraints are applied that can result in intervals and subintervals that overlap the coding region of the ORF, in one of the five possible reading frames (two on the same strand and three on the opposite strand) [2]. We identified the shadow ORFs in our dataset using the criteria applied in Yooseph et al., 2018 [3]. i) Two ORFs on the same strand are considered shadows if they overlap by at least 60 bps. ii) ORFs on opposite strands are identified as shadows if they overlap by at least 50 bps, and their ends of 3 ‘are within the intervals of the others, or if they overlap by at least 120 bps and the end of 5 ‘of one is in the interval of the other.

Results

Spurious ORFs

TOTAL: 53,324 (0.02%)

Distribution of spurious ORFs in the different data sets.

TARA	Malaspina	GOS	OSD	HMP
4,203	2,298	4,939	1,620	40,264

Shadows ORFs

TOTAL: 611,774 (0.2%)

Distribution of shadows ORFs in the different data sets.

TARA	Malaspina	GOS	OSD	HMP
157,688	40,762	66,245	70,632	276,447

Spurious and shadow ORFs in the clusters

We detected a total of 53,324 (0.02%) spurious ORFs distributed in 6,228 (0.02%) clusters.

Number of spurious ORFs in the clusters and in each project.

Spurious in clusters ≥ 10 members	Spurious in clusters < 10 members > 1	Spurious in singletons
44,205	6,784	2,335

We identified 611,774 (0.2%) shadow ORFs distributed in 357,329 (1%) clusters.

Number of shadow ORFs in the clusters and in each project.

Shadows in clusters ≥ 10 members	Shadows in clusters < 10 members > 1	Shadows in singletons
290,077	144,571	177,126

Code and description

The scripts spur_shadow_orfs.sh and shadow_orfs.r identified the spurious and shadow ORFs in our dataset applying the criteria described above. The output is a tab-separated file containing the following fields:

orf_name
orf_length
cl_name
cl_size
prop_shadow_in_the_cluster
is.shadow
is.spurious

Identification of spurious and shadow ORFs

Methods

Results

Spurious and shadow ORFs in the clusters

Let's Get In Touch!