Gene prediction

We used the official assemblies from the metagenomic projects TARA, OSD2014, Malaspina, HMP-I/II and GOS to test our approach. We predicted genes in the metagenomic dataset with Prodigal (v2.6.3) [1] in metagenomic mode. We identified potentially spurious genes using the AntiFam database. Furthermore, we screened for ‘shadow’ genes using the procedure described in Yooseph et al. [2].



- For more information regarding the identification of spurious and shadow genes, check here.
- A description of the data used for the manuscript can be found here.
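
As an illustration of the spurious-gene screening, the sketch below collects the ORF identifiers that have at least one hit in an HMMER tabular output (`--tblout`), such as the one produced when predicted proteins are searched against the AntiFam profiles. That the workflow uses this exact output format is an assumption, and the function name, ORF identifiers and model names in the example are hypothetical.

```python
def spurious_orf_ids(tblout_text):
    """Return the set of target (ORF) names listed in an HMMER --tblout table.

    In this format the first whitespace-separated column is the target name,
    and lines starting with '#' are comments.
    """
    hits = set()
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hits.add(line.split()[0])
    return hits


# Made-up rows for illustration only (truncated to the first few columns).
EXAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias
orf_0001 - hypothetical_model_A ANF_X 1.2e-20 70.1 0.3
orf_0042 - hypothetical_model_B ANF_Y 3.4e-11 40.5 0.0
orf_0001 - hypothetical_model_B ANF_Y 5.0e-09 33.2 0.1
"""

flagged = spurious_orf_ids(EXAMPLE_TBLOUT)
print(sorted(flagged))
```

An ORF that matches several models is reported once, since the result is a set.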



We identified a total of 322,248,552 predicted ORFs for the metagenomic dataset (Table 1) and 93,723,190 genes for GTDB (Table 2).



| Data set | Number of contigs | Number of genes |
|---|---|---|
| TARA | 62,404,654 | 111,903,261 |
| Malaspina | 9,330,293 | 20,574,033 |
| OSD | 4,127,095 | 7,015,383 |
| GOS | 12,672,518 | 20,068,580 |
| HMP | 80,560,927 | 162,687,295 |

Table 1. Number of contigs and genes predicted by Prodigal per data set.
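
The metagenomic total can be checked by summing the per-data-set gene counts from Table 1. The snippet below is only a worked calculation; the variable names are illustrative.

```python
# Gene counts per data set, copied from Table 1.
genes_per_dataset = {
    "TARA": 111_903_261,
    "Malaspina": 20_574_033,
    "OSD": 7_015_383,
    "GOS": 20_068_580,
    "HMP": 162_687_295,
}

total_genes = sum(genes_per_dataset.values())
print(f"{total_genes:,}")  # 322,248,552
```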



We compiled the gene completeness categories reported by Prodigal for the metagenomic dataset (Table 2): 00 is a complete gene with both start and stop codons identified; 01 has an incomplete right boundary; 10 has an incomplete left boundary; and 11 has both left and right boundaries incomplete.



| Dataset | “00” | “10” | “01” | “11” | Total |
|---|---|---|---|---|---|
| Metagenomic | 118,717,690 | 106,031,163 | 102,966,482 | 75,694,123 | 322,248,552 |

Table 2. Number of predicted genes per completeness category.
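
The completeness categories correspond to the `partial=` field that Prodigal writes in the header of each predicted sequence. The following minimal sketch tallies them from FASTA header lines; it assumes the default Prodigal header layout, and the function name and example records are made up for illustration.

```python
import re
from collections import Counter

# Prodigal headers carry a field such as "partial=10" among the
# semicolon-separated attributes.
PARTIAL_RE = re.compile(r"partial=([01]{2})")


def count_completeness(fasta_lines):
    """Count genes per completeness category (00, 01, 10, 11)."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            match = PARTIAL_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Made-up records for illustration only.
example = [
    ">c1_1 # 2 # 181 # 1 # ID=1_1;partial=10;start_type=Edge;gc_cont=0.400",
    "KTAYIAKQRQISFVKSHFSRQ",
    ">c1_2 # 200 # 500 # -1 # ID=1_2;partial=00;start_type=ATG;gc_cont=0.410",
    "MSEQNNTEMTFQIQRIYTKDI",
    ">c2_1 # 1 # 90 # 1 # ID=2_1;partial=11;start_type=Edge;gc_cont=0.380",
    "LLKRAGVDTPE",
]

counts = count_completeness(example)
print(dict(counts))
```

On real data the lines would come from iterating over the protein FASTA file written by Prodigal rather than from an in-memory list.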



Only 37% of the genes predicted by Prodigal for the metagenomic dataset were complete (00). After the gene prediction, the workflow proceeds with the Pfam annotation step.
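
The proportion of complete genes follows directly from the "00" count and the total in Table 2, as this short worked calculation shows (variable names are illustrative):

```python
complete_genes = 118_717_690  # category "00" in Table 2
total_genes = 322_248_552     # all predicted genes in the metagenomic dataset

fraction_complete = complete_genes / total_genes
print(f"{fraction_complete:.1%}")  # 36.8%
```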



The script gene_prediction.sh takes as input contigs from genomes or metagenomes in FASTA format and returns the amino acid sequences of the predicted ORFs and a summary in GFF format. The ORF headers are created using the script rename_orfs.awk.
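
For orientation, the sketch below shows how a Prodigal call producing these two outputs could be assembled. It is not the content of gene_prediction.sh: the function names and file names are hypothetical, and only standard Prodigal options are used (`-i` input, `-a` protein translations, `-o` output file, `-f` output format, `-p` procedure, `-q` quiet).

```python
import subprocess


def prodigal_command(contigs, proteins_out, gff_out, mode="meta"):
    """Build a Prodigal command line; mode is 'meta' or 'single'."""
    if mode not in ("meta", "single"):
        raise ValueError("mode must be 'meta' or 'single'")
    return [
        "prodigal",
        "-i", contigs,       # input contigs (FASTA)
        "-a", proteins_out,  # predicted protein sequences (FASTA)
        "-o", gff_out,       # gene coordinates
        "-f", "gff",         # write the coordinates in GFF format
        "-p", mode,          # 'meta' for metagenomic contigs
        "-q",                # suppress progress messages
    ]


def run_prodigal(contigs, proteins_out, gff_out, mode="meta"):
    """Run Prodigal; requires the prodigal executable on the PATH."""
    subprocess.run(
        prodigal_command(contigs, proteins_out, gff_out, mode), check=True
    )


# Hypothetical file names, used only to display the resulting command.
cmd = prodigal_command("contigs.fasta", "orfs.faa", "orfs.gff")
print(" ".join(cmd))
```

For isolate genomes, `mode="single"` would select Prodigal's default single-genome procedure instead of the metagenomic one.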

[1] Hyatt, D., Chen, G.-L., LoCascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11(1), 119.
[2] Yooseph, S., Sutton, G., Rusch, D. B., Halpern, A. L., Williamson, S. J., Remington, K., Eisen, J. A., Heidelberg, K. B., Manning, G., Li, W., Jaroszewski, L., Cieplak, P., Miller, C. S., Li, H., Mashiyama, S. T., Joachimiak, M. P., Van Belle, C., Chandonia, J.-M., Soergel, D. A., … Venter, J. C. (2007). The Sorcerer II global ocean sampling expedition: Expanding the universe of protein families. PLoS Biology, 5(3), 0432–0466.