Gene prediction algorithms have limitations and can yield inaccurate ORF predictions. The resulting spurious proteins can in turn give rise to spurious protein families. We therefore tracked the presence and distribution of both “spurious” and “shadow” predicted ORFs in our clusters.
Spurious ORFs
TOTAL: 53,324 (0.02%)
Distribution of spurious ORFs in the different data sets.
TARA | Malaspina | GOS | OSD | HMP |
---|---|---|---|---|
4,203 | 2,298 | 4,939 | 1,620 | 40,264 |
Shadow ORFs
TOTAL: 611,774 (0.2%)
Distribution of shadow ORFs in the different data sets.
TARA | Malaspina | GOS | OSD | HMP |
---|---|---|---|---|
157,688 | 40,762 | 66,245 | 70,632 | 276,447 |
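As a quick sanity check on the two tables above, the per-data-set counts can be summed and compared with the reported totals. The sketch below is only an illustration: the dictionary names and the `share_per_dataset` helper are hypothetical and not part of the original scripts, and the counts are copied by hand from the tables.

```python
# Per-data-set counts copied from the two tables above.
spurious = {"TARA": 4203, "Malaspina": 2298, "GOS": 4939, "OSD": 1620, "HMP": 40264}
shadow = {"TARA": 157688, "Malaspina": 40762, "GOS": 66245, "OSD": 70632, "HMP": 276447}


def share_per_dataset(counts):
    """Return each data set's percentage of the summed counts (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# The per-data-set counts should add up to the reported totals.
assert sum(spurious.values()) == 53324
assert sum(shadow.values()) == 611774

for label, counts in (("spurious", spurious), ("shadow", shadow)):
    print(label, share_per_dataset(counts))
```

The shares are relative to the flagged ORFs only. They say nothing about the fraction of each data set that was flagged, which would require the total ORF count per data set.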
We detected a total of 53,324 (0.02%) spurious ORFs distributed in 6,228 (0.02%) clusters.
Number of spurious ORFs by cluster size.
Spurious in clusters ≥ 10 members | Spurious in clusters with 2 to 9 members | Spurious in singletons |
---|---|---|
44,205 | 6,784 | 2,335 |
We identified 611,774 (0.2%) shadow ORFs distributed in 357,329 (1%) clusters.
Number of shadow ORFs by cluster size.
Shadows in clusters ≥ 10 members | Shadows in clusters with 2 to 9 members | Shadows in singletons |
---|---|---|
290,077 | 144,571 | 177,126 |
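The three size classes used in the tables above (at least 10 members, more than 1 but fewer than 10 members, and singletons) can be written as a small binning function. This is a minimal sketch, not the code used by the workflow: `size_category`, `tally_by_category`, and the toy input pairs are hypothetical names and made-up data for illustration only.

```python
from collections import Counter


def size_category(n_members):
    """Map a cluster size to one of the three classes used in the tables."""
    if n_members < 1:
        raise ValueError("a cluster must have at least one member")
    if n_members == 1:
        return "singleton"
    if n_members < 10:
        return "2-9 members"
    return ">= 10 members"


def tally_by_category(flagged):
    """Sum flagged ORFs per size class.

    `flagged` is an iterable of (cluster_size, n_flagged_orfs) pairs.
    """
    totals = Counter()
    for size, n_flagged in flagged:
        totals[size_category(size)] += n_flagged
    return totals


# Toy input (made up): five clusters with their sizes and flagged ORF counts.
toy = [(1, 1), (4, 2), (9, 1), (10, 3), (250, 7)]
print(dict(tally_by_category(toy)))

# The reported per-class counts add up to the reported totals.
assert 44205 + 6784 + 2335 == 53324
assert 290077 + 144571 + 177126 == 611774
```

The class boundaries sit at 1 and 10, so a cluster with exactly 10 members falls in the largest class and a cluster with 9 members in the middle one.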
We used the scripts spur_shadow_orfs.sh and shadow_orfs.r to identify the spurious and shadow ORFs in our dataset by applying the criteria described above. The output is a tab-separated file containing the following fields:
More info in the README.
[1] R. Y. Eberhardt, D. H. Haft, M. Punta, M. Martin, C. O’Donovan, and A. Bateman, “AntiFam: a tool to help identify spurious ORFs in protein annotation,” Database: the journal of biological databases and curation, vol. 2012, p. bas003, Mar. 2012.
[2] S. Yooseph et al., “The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families,” PLoS biology, vol. 5, no. 3, p. 16, 2007.
[3] S. Yooseph, W. Li, and G. Sutton, “Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering,” BMC bioinformatics, vol. 9, p. 182, Apr. 2008.