We implemented a computational workflow (Agnostos) to structure and explore the large pool of genes of unknown function found in microbial genomes and metagenomes. Using a protein domain-based approach, we partitioned more than 400 million predicted genes from 1,628 metagenomes and 28,941 genomes into categories of known and unknown function.
Brief schematic of the workflow
The workflow is built on Snakemake, so that large datasets can be processed reproducibly. It provides three ways to analyze data. The DB-creation module builds the gene cluster database, then validates the gene clusters (GCs) and partitions them into the main functional categories. The DB-update module integrates new sequences (either contigs or predicted genes) into an existing gene cluster database. Finally, a profile-search function quickly screens query sequences against the gene cluster PSSM profiles in the database.
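To make the idea of domain-based partitioning concrete, here is a deliberately simplified Python sketch. It is not the Agnostos implementation: the function name `partition_cluster`, the two-way known/unknown split, and the rule that labels starting with `DUF` (domain of unknown function) count as unknown are all assumptions made for illustration. The actual workflow validates clusters and uses its own set of categories, described in the linked pages.

```python
from collections import Counter


def partition_cluster(member_domains):
    """Assign a toy category to one gene cluster.

    member_domains: list of sets, one set per gene in the cluster, holding
    the domain labels annotated on that gene (empty set if none).
    The labels and the two categories are hypothetical.
    """
    # Keep only the genes that carry at least one domain annotation.
    annotated = [domains for domains in member_domains if domains]
    if not annotated:
        return "unknown"

    # Assumed convention: labels starting with "DUF" mark domains of
    # unknown function, so a cluster with only such labels stays unknown.
    all_labels = set().union(*annotated)
    if all(label.startswith("DUF") for label in all_labels):
        return "unknown"
    return "known"


# Made-up clusters, for illustration only.
clusters = {
    "gc_1": [{"PF00005"}, {"PF00005"}, set()],
    "gc_2": [{"DUF1234"}, set()],
    "gc_3": [set(), set()],
}

# Tally how many clusters fall into each toy category.
counts = Counter(partition_cluster(members) for members in clusters.values())
print(counts)
```

A real partitioning step would also need to check that the annotations within a cluster are consistent before assigning a category; this sketch skips that check entirely.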
Follow the links for a detailed description of the methods and results for each of the steps in the workflow: