Functional validation scripts usage
Validates clusters annotated with Pfam domains by measuring their intra-cluster functional homogeneity.
Required R packages:
tidyverse
data.table
proxy
stringr
textreuse
parallel
Additional data required (found in this folder):
"files/pfam_shared_all": a list of Pfam domains that are terminal or middle domains of the same proteins
Usage
Rscript eval_shingl_jacc.r "data/annot_and_clust/marine_hmp_db_03112017_clu_ge10_annot.tsv" "data/cluster_validation/functional/shingl_jacc_val_annot.tsv"
- output: tab-separated table with 7 fields:
  - Old (MMseqs2) cluster representative
  - Average Jaccard similarity, not scaled by the number of annotated members/ORFs in the cluster
  - Average Jaccard similarity, scaled by the number of annotated members/ORFs in the cluster
  - Type of annotation (completely homogeneous; not homogeneous, mono-domain only; not homogeneous, multi-domain and single-domain in the same cluster)
  - Proportion of that type of annotation in the cluster
  - Proportion of partial/complete ORFs in the cluster
  - Category derived from the annotation type: HA, MoDA or MuDA
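
To make the unscaled average Jaccard similarity field more concrete, here is a minimal base-R sketch of the idea on a toy cluster. It is not the code in eval_shingl_jacc.r, which relies on the packages listed above. The ORF names, the Pfam accessions and the "|" separator are assumptions made for illustration, and the scaling by the number of annotated members is not reproduced.

```r
# Hypothetical toy cluster: one Pfam annotation string per annotated ORF.
# ORF names, accessions and the "|" separator are illustrative assumptions.
annot <- c(orf1 = "PF00005|PF00664",
           orf2 = "PF00005|PF00664",
           orf3 = "PF00005")

# Split each annotation string into its set of domain accessions.
domain_sets <- lapply(strsplit(annot, "|", fixed = TRUE), unique)

# Jaccard similarity between two sets: size of intersection over size of union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Enumerate all unordered pairs of cluster members (one pair per column).
member_pairs <- combn(length(domain_sets), 2)

# Similarity for each pair of members.
pair_sim <- apply(member_pairs, 2, function(p) {
  jaccard(domain_sets[[p[1]]], domain_sets[[p[2]]])
})

# Unscaled average similarity for the cluster.
mean_jaccard <- mean(pair_sim)
print(mean_jaccard)
```

In this toy cluster the three pairwise similarities are 1, 0.5 and 0.5, so the average is 2/3. A cluster whose members all carry the same annotation would score 1, which corresponds to the completely homogeneous annotation type.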