Gene prediction algorithms have limitations and can yield inaccurate ORF predictions. These produce spurious proteins, which in turn can give rise to spurious protein families. We therefore tracked the presence and distribution of both “spurious” and “shadow” predicted ORFs in our clusters.
Scripts and description: The scripts spur_shadow_orfs.sh and shadow_orfs.r identify the spurious and shadow ORFs in our dataset by applying the criteria described above. The output is a tab-separated file containing the following fields:
TOTAL: 53,324 (0.02%)
Distribution of spurious ORFs in the different data sets.
TOTAL: 611,774 (0.2%)
Distribution of shadow ORFs in the different data sets.
We detected a total of 53,324 (0.02%) spurious ORFs distributed in 6,228 (0.02%) clusters.
Number of spurious ORFs in the clusters and in each project.
| Spurious in clusters with ≥ 10 members | Spurious in clusters with > 1 and < 10 members | Spurious in singletons |
We identified 611,774 (0.2%) shadow ORFs distributed in 357,329 (1%) clusters.
Number of shadow ORFs in the clusters and in each project.
| Shadows in clusters with ≥ 10 members | Shadows in clusters with > 1 and < 10 members | Shadows in singletons |
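The breakdown by cluster size used in the two tables above (clusters with at least 10 members, clusters with more than 1 and fewer than 10 members, and singletons) could be tallied along the following lines. This is only a minimal sketch, not the logic of spur_shadow_orfs.sh or shadow_orfs.r. The record layout `(orf_id, cluster_size, flag)`, the function names, and the example values are all hypothetical, since the actual output fields are not listed here.

```python
from collections import Counter


def size_category(cluster_size):
    """Map a cluster size to one of the three bins used in the tables."""
    if cluster_size >= 10:
        return ">=10 members"
    if cluster_size > 1:
        return "2-9 members"
    return "singleton"


def tally_flagged_orfs(records):
    """Count flagged ORFs per (flag, size category).

    `records` is an iterable of (orf_id, cluster_size, flag) tuples, where
    flag is "spurious", "shadow", or None for unflagged ORFs.
    This layout is an assumption made for illustration only.
    """
    counts = Counter()
    for _orf_id, cluster_size, flag in records:
        if flag is not None:
            counts[(flag, size_category(cluster_size))] += 1
    return counts


# Made-up records, not real data from the study.
example = [
    ("orf_1", 25, "spurious"),
    ("orf_2", 4, "shadow"),
    ("orf_3", 1, "shadow"),
    ("orf_4", 12, None),
    ("orf_5", 1, "shadow"),
]
counts = tally_flagged_orfs(example)
for (flag, category), n in sorted(counts.items()):
    print(f"{flag}\t{category}\t{n}")
```

Keying the counter on the `(flag, category)` pair keeps the spurious and shadow tallies in one pass over the data. A per-project breakdown would only need the project name added to the key.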
R. Y. Eberhardt, D. H. Haft, M. Punta, M. Martin, C. O’Donovan, and A. Bateman, “AntiFam: a tool to help identify spurious ORFs in protein annotation,” Database: The Journal of Biological Databases and Curation, vol. 2012, p. bas003, Mar. 2012.
S. Yooseph, W. Li, and G. Sutton, “Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering,” BMC Bioinformatics, vol. 9, p. 182, Apr. 2008.