Gene Cluster refinement

Methods

After the validation, we proceeded with the retrieval of a subset of high-quality clusters. At first, we removed the clusters containing ≥10% bad-aligned ORFs and a Jaccard similarity index < 1 and the clusters with ≥ 30% shadow ORFs. From the remaining set of clusters, we removed the single shadow, spurious and/or bad-aligned ORFs. Steps:

I) Filter out “bad” clusters (≥ 10% bad-aligned ORFs & Jaccard similarity index <1) II) Filter out “shadow” clusters (≥ 30% shadow ORFs) III) Remove the single rejected/bad ORFs (shadow, spurious and bad-aligned)

Results

From the set of 3,003,897 clusters, we removed 57,052 clusters classified as “bad” after the validation. From the remaining 2,946,845 clusters we removed 6,252 clusters with more than 30% shadow ORFs. At the end from each of the left 2,940,593 clusters, we removed a total of 2.7 million single shadow, spurious and bad-aligned ORFs, and we obtained a set of 2,940,592 refined clusters with a total of 260,142,446 ORFs. In this last step we lost 336 clusters: 244 resulted composed of only spurious and bad aligned ORFs, one in the annotated set of clusters and 243 in the not annotated set, and 92 clusters were discarded/moved to the singletons set because left with only one sequence. Moreover, 1,190 annotated clusters became non annotated after the refinement/removal of the single ”unwanted”/”rejected” ORFs, which represented the only annotated ORFs in those clusters. Steps in numbers are shown in the tables below:

Steps of the cluster refinement both in terms of number of clusters and number of ORFs (kept and removed).

refinement_steps_clu.png

refinement_steps_ORFs.png


Summarising: we removed 63,640 clusters and a total of 8,325,409 ORFs; they constitute the 2% and the 3% of the initial cluster and ORF sets respectively. The majority of the clusters (98%) resulted, therefore, having good quality concerning homogeneity and real/actual ORFs content. The 2.9 million refined clusters are made of ~993K annotated clusters, containing ~174M ORFs, and 1.9M unannotated clusters with about 86M ORFs.




Cluster refinement:
refinement.sh. It takes the output from the cluster validation and the shadows and spurious ORFs, and returns a refined set of clusters (tables plus ffindex databases). More info in the README