Output dataset¶

Output format overview¶

ProkFunFind will output files with the same prefix. annot.gff, json, tsv, pkl formatted files will be outputed every time. Other files will be reported based on search methods used.

blast.m6¶

The blast output file will only be generated if BLAST was used as part of the search approach. The BLAST output consists of a tablular blast output table. Column 1 is the protein sequences of genomes; column 2 is the query sequences; the table format is the BLAST -m 6 format.

GUT_GENOME143137_00182      ecoli_garR      52.203  295     134     3       2       292     3       294     3.80e-104       297
GUT_GENOME143137_00182      cclostridioforme_GarR   51.890  291     139     1       2       291     3       293     6.49e-104       296
GUT_GENOME143137_00419      ecoli_garD      30.208  384     244     7       5       368     122     501     2.39e-52        173
GUT_GENOME143137_00419      cclostridioforme_GarD   33.113  302     195     4       68      368     188     483     5.51e-51        169
GUT_GENOME143137_00648      cclostridioforme_gudA   28.041  296     205     3       82      372     46      338     1.34e-36        126
GUT_GENOME143137_00649      cclostridioforme_gudC   27.273  154     98      4       4       145     8       159     3.05e-06        35.0
GUT_GENOME143137_00650      cclostridioforme_gudB   41.837  490     277     4       13      497     17      503     3.93e-117       345
GUT_GENOME143137_00901      cclostridioforme_gudA   25.200  250     143     6       72      301     46      271     1.78e-12        57.4
GUT_GENOME143137_00903      cclostridioforme_gudB   34.647  482     311     3       17      496     19      498     2.31e-92        281
GUT_GENOME143137_00918      cclostridioforme_GarL   22.378  286     216     4       3       284     6       289     4.03e-21        81.3

hmmtblout¶

The hmmscan output will be generated if hmmscan is used as part of the search approach. This table consists of the targets (profile HMM IDs) in column 1, the gene IDs in column 3, and evalues in column 5.

#                                                                --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name            accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------  -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HNDD                 -          GCF_000478885.1_00103 -            0.00023    9.8   9.6    0.0039    5.7   9.6   1.9   1   1   0   1   1   1   1 -
HNDD                 -          GCF_000478885.1_00145 -            2.4e-05   13.1  22.2   0.00015   10.4   4.5   2.8   2   2   0   2   2   2   2 -
DEVR                 -          GCF_000478885.1_00174 -            3.6e-07   19.5   0.0   5.8e-07   18.8   0.0   1.2   1   0   0   1   1   1   1 -

annot.gff¶

The annotated gff file provides a summary of the hits to the different ProkFunFind queries along with their genomic locations. The genes that pass the filtering criteria for each search approach are listed in this file. This file can be imported into programs like Geneious for subsequent visualiation and curation.

GUT_GENOME143137_1  ProkFunFind     CDS     187675  188575  .       -       .       ID=GUT_GENOME143137_00182;Name=garR;Parent=Cl_0;Target=ecoli_garR 2 294;pct_identity=52.203;evalue=3.8e-104
GUT_GENOME143137_2  ProkFunFind     CDS     38455   39622   .       +       .       ID=GUT_GENOME143137_00419;Name=garD;Parent=Cl_0;Target=ecoli_garD 121 501;pct_identity=30.208;evalue=2.39e-52
GUT_GENOME143137_2  ProkFunFind     CDS     296187  297321  .       +       .       ID=GUT_GENOME143137_00648;Name=gudA;Parent=Cl_1;Target=cclostridioforme_gudA 45 338;pct_identity=28.041;evalue=1.34e-36
GUT_GENOME143137_2  ProkFunFind     CDS     297794  299288  .       +       .       ID=GUT_GENOME143137_00650;Name=gudB;Parent=Cl_1;Target=cclostridioforme_gudB 16 503;pct_identity=41.837;evalue=3.93e-117
GUT_GENOME143137_3  ProkFunFind     CDS     239793  240903  .       +       .       ID=GUT_GENOME143137_00901;Name=gudA;Parent=Cl_0;Target=cclostridioforme_gudA 45 271;pct_identity=25.2;evalue=1.78e-12
GUT_GENOME143137_3  ProkFunFind     CDS     241398  242916  .       +       .       ID=GUT_GENOME143137_00903;Name=gudB;Parent=Cl_0;Target=cclostridioforme_gudB 18 498;pct_identity=34.647;evalue=2.31e-92
GUT_GENOME143137_3  ProkFunFind     CDS     262245  263112  .       +       .       ID=GUT_GENOME143137_00918;Name=garL;Parent=Cl_1;Target=cclostridioforme_GarL 5 289;pct_identity=22.378;evalue=4.03e-21
GUT_GENOME143137_4  ProkFunFind     CDS     87018   88551   .       -       .       ID=GUT_GENOME143137_01073;Name=gudB;Parent=Cl_0;Target=cclostridioforme_gudB 18 507;pct_identity=36.531;evalue=6.07e-95
GUT_GENOME143137_4  ProkFunFind     CDS     89063   90152   .       -       .       ID=GUT_GENOME143137_01075;Name=gudA;Parent=Cl_0;Target=cclostridioforme_gudA 47 301;pct_identity=26.515;evalue=1.06e-15
GUT_GENOME143137_5  ProkFunFind     CDS     36139   37639   .       -       .       ID=GUT_GENOME143137_01304;Name=gudB;Parent=Cl_0;Target=cclostridioforme_gudB 3 480;pct_identity=40.167;evalue=3.42e-120

yaml¶

The yaml file is similar to the input system definition that is provided in the config.yaml file. file with the “genes” added and the “completeness” of each subsystem is added to each component. This file acts an overall summary of the search results.

name: Equol Gene Cluster
components:
- name: Equol Production Pathway
  presence: essential
  components:
  - geneID: DZNR
    description: Daidzein reductase
    presence: essential
    terms:
    - id: K00219
      method: kofamscan
    genes:
    - GCF_000478885.1_00950
    - GCF_000478885.1_01879
    - GCF_000478885.1_02274
  ...
    completeness:
      essential: 4
      nonessential: 0
      essential_presence: 4
      nonessential_presence: 0

tsv¶

A tab separated file with three columns, summarizing the hits and gene clusters identified.

Name	Description
Gene_Name	The name of gene
Cluster_ID	Genes within the same genomic neighborhood are assigned the same cluster ID. ‘NA’ means that the gene was not found to be in the same neigborhood as any other hits.
Functions	What search terms and components a given hit was associated with.

Gene_Name   Cluster_ID      Functions
GUT_GENOME143137_00182      GUT_GENOME143137_1:Cl_0 Mucic_and_Saccharic_Acid/garR
GUT_GENOME143137_00419      GUT_GENOME143137_2:Cl_0 Mucic_and_Saccharic_Acid/garD
GUT_GENOME143137_00648      GUT_GENOME143137_2:Cl_1 Mucic_and_Saccharic_Acid/gudA
GUT_GENOME143137_00650      GUT_GENOME143137_2:Cl_1 Mucic_and_Saccharic_Acid/gudB
GUT_GENOME143137_00901      GUT_GENOME143137_3:Cl_0 Mucic_and_Saccharic_Acid/gudA
GUT_GENOME143137_00903      GUT_GENOME143137_3:Cl_0 Mucic_and_Saccharic_Acid/gudB
GUT_GENOME143137_00918      GUT_GENOME143137_3:Cl_1 Mucic_and_Saccharic_Acid/garL
GUT_GENOME143137_01073      GUT_GENOME143137_4:Cl_0 Mucic_and_Saccharic_Acid/gudB
GUT_GENOME143137_01075      GUT_GENOME143137_4:Cl_0 Mucic_and_Saccharic_Acid/gudA

pkl¶

Pickle object of the output Genome object, which can be loaded to python for further analysis. This can be de-serialized using the python pickle module to access and interact with the genome object data.

What to do next?¶

Import the prefix.annot.gff to genome analysis and visualization software to curate and visualize the results.

Re-run prokfunfind.py to test other parameters and optimize your search.