A workflow to unify the Known and Unknown

We implemented a computational workflow (Agnostos) to structure and explore the large pool of genes with unknown functions found in microbial genomes and metagenomes. We used a protein domain-based approach to partition more than 400 million predicted genes from 1,628 metagenomes and 28,941 genomes into the different categories of known and unknown.

workflow.jpg Brief schematic of the workflow

The workflow is based on Snakemake for the easy processing of large datasets in a reproducible manner. It provides three different strategies to analyze the data. The module DB-creation creates the gene cluster database, validates and partitions the gene clusters (GCs) in the main functional categories. The module DB-update allows the integration of new sequences (either at the contig or predicted gene level) in the existing gene cluster database. In addition, the workflow has a profile-search function to quickly screen the gene cluster PSSM profiles in the database

Follow the links for a detailed description of the methods and results for each of the steps in the workflow:

  1. Gene prediction
  2. Pfam annotations
  3. Deep clustering
  4. Gene cluster validation
  5. Gene cluster refinement
  6. Gene cluster classification
  7. Gene cluster category refinement
  8. Gene cluster communities inference
You can try the workflow here.
A description of the data used for the manuscript can be found here.