Functional validation scripts usage

Validation of clusters annotated to Pfam domains, in terms of intra-cluster functional homogeneity.

Required R packages:

tidyverse
data.table
proxy
stringr
textreuse
parallel
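
Before running the script, it can help to confirm that the packages listed above are available. The following is a minimal sketch, not part of the original scripts; the helper name `missing_packages` is hypothetical. Note that `parallel` ships with base R, so it normally does not need a separate installation.

```r
# Hypothetical helper (not part of the original scripts):
# returns the names of the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) requireNamespace(p, quietly = TRUE),
    logical(1)
  )
  pkgs[!installed]
}

# Packages listed in this README
required <- c("tidyverse", "data.table", "proxy",
              "stringr", "textreuse", "parallel")

not_found <- missing_packages(required)
if (length(not_found) > 0) {
  message("Missing packages: ", paste(not_found, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

Any package reported as missing can then be installed in the usual way for your R setup.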

Additional data required (found in this folder):

"files/pfam_shared_all": a list of Pfam terminal or middle domains belonging to the same proteins

Usage

Rscript eval_shingl_jacc.r "data/annot_and_clust/marine_hmp_db_03112017_clu_ge10_annot.tsv" "data/cluster_validation/functional/shingl_jacc_val_annot.tsv"
  • Output: tab-separated table with 7 fields:
    • Cluster's old (MMseqs2) representative
    • Average Jaccard similarity, not scaled by the number of annotated members/ORFs in the cluster
    • Average Jaccard similarity, scaled by the number of annotated members/ORFs in the cluster
    • Type of annotation (completely homogeneous; not homogeneous, mono-domain only; not homogeneous, multi-domain and single-domain in the same cluster)
    • Proportion of that type of annotation in the cluster
    • Proportion of partial/complete ORFs in the cluster
    • Category based on the annotation type: HA, MoDA or MuDA
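
To illustrate what an average Jaccard similarity within a cluster means, here is a minimal base-R sketch. It is an assumption-laden illustration, not the code in eval_shingl_jacc.r: it compares plain sets of domain labels rather than shingled annotation strings, it omits the scaling by the number of annotated members, and the function names and the labels `domA` and `domB` are hypothetical placeholders.

```r
# Jaccard similarity between two sets of domain labels:
# |intersection| / |union|
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  u <- length(union(a, b))
  if (u == 0) return(NA_real_)  # both sets empty: similarity undefined
  length(intersect(a, b)) / u
}

# Mean of the pairwise Jaccard similarities over all annotated
# members of one cluster (a list of character vectors).
mean_pairwise_jaccard <- function(annots) {
  n <- length(annots)
  if (n < 2) return(1)  # a single member is trivially homogeneous
  pairs <- combn(n, 2)  # each column holds one pair of member indices
  sims <- apply(pairs, 2, function(ix) {
    jaccard(annots[[ix[1]]], annots[[ix[2]]])
  })
  mean(sims)
}

# Toy cluster with placeholder domain labels (not real data)
cluster <- list(
  orf1 = c("domA", "domB"),
  orf2 = c("domA", "domB"),
  orf3 = c("domA")
)

mean_pairwise_jaccard(cluster)
```

In this toy cluster the three pairwise similarities are 1, 0.5 and 0.5, so the average is 2/3. A cluster in which every member carries the same annotation would score 1.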