Input dataset¶
Command-line options¶
usage: prokfunfind [-h] -f -o [-p] -g Identify genes related functions of interest optional arguments: -h, --help show this help message and exit -f , --function Path to configuration file -o , --outputprefix The output file name prefix -p , --processes Number of genomes to process concurrently (default=1) -g , --gtab Table of genomes to search
A typical ProkFunFind command looks like the following:
prokfunfind -f queries/fun/config.yaml -g ./genome-list.tsv -o ./out/search-out.
The options provide the following information:
Option |
Description |
---|---|
-f, –function |
This option is used to provide the path to the configuration yaml file. |
-o, –outputprefix |
This option is used to give the output file name prefix. |
-g, –gtab |
This option is used to provide the path to an input search genome table. |
Input genomic data sets¶
The genome input data should be organized so that all of the files associated with one genome are in the same directory. Multiple genomes can be stored in the same directory.
The directoy should include the genome sequences(${prefix}.fna
),
protein sequences(${prefix}.faa
), annotation(${prefix}.gff
) and
associated annotation files.
The annotation files can include InterProScan tsv annotations (${prefix}.tsv
),
KOfamScan tsv output (${prefix}.kofam.tsv
), or EGGNog-mapper tsv output
(${prefix}.emapper.annotations
). Note not all annotation files are required
to run ProkFunFind, for example if a search is being performed only using protein
sequences and profile HMMs, then no additional annotation files are required (see
below for more information).
The annotation files can be generated using the EGGNog-mapper, KOFamScan, and InterProScan programs. The output form these programs does not require any additional formatting to be used by ProkFunFind. Example commands to generate the anntoation files are:
interproscan -t p -iprlookup --goterms --pathways -f tsv {fasta}.faa IPR
exec_annotation --tmp-dir {path to tmp directory} -f detail-tsv \
-p {path to KofamScan profiles} -o {path to output file} {fasta}.faa
emapper.py -i $infile -o {prefix} --temp_dir {path to tmp directory} \
--data_dir {path to EGGNog-mapper data} --tax_scope {taxonomic scope} \
--override --cpu {CPUS} --output_dir {path to output}
See the example below for the general format of the genome input files:
$ ls input.folder/
MGYG-HGUT-03390.faa # protein sequences
MGYG-HGUT-03390.fna # genome sequences
MGYG-HGUT-03390.gff # annotations
MGYG-HGUT-03390_InterProScan.tsv # Interproscan tsv output
MGYG-HGUT-03390.kofam.tsv # KOfamScan annotations
MGYG-HGUT-03390.emapper.annotations # EGGNog-mapper annotations
The genomes input is provided to the ProkFunFind program through a tab separated two column table that includes the genome file prefixes and the paths to the genome directories. The path can be given as an absolute path, the full path to the genome file directory, or as a relative path from the gtab table file, specified as a relative path starting with either ‘./’ or ‘../’.
This table should be provided through the –gtab argument:
MGYG-HGUT-03390 ./input.directory/
MGYG-HGUT-03391 ./input.directory/
Any number of genomes inputs can be provided in this file and ProkFunFind will run all searches on the provided genomes sequentially.
These input data can be generated using the ProkFunAnnotate snakemake pipeline. This pipeline can be run to generate the EggNOG-mapper and KOFamScan data used by snakemake from a set of input genomes. The input file needed to run the snakemake pipeline is the same one used by the ProkFunFind –gtab argument. This annotation pipeline can be used to generate the annotation files for genomes starging from just a genome sequence fasta file, or a genome sequence fasta file with a GenBank file, like can be downloaded from NCBI’s Genbank database.
Input function configuration¶
-f
should be followed by the path to the configuration yaml file (${function}
).
$ ls data/Mucic_and_Saccharic_Acid/
config.yaml
query.fa
query.fa.phr
query.fa.pin
query.fa.psq
query.hmm
Note
Please remember to make fasta file blastable by running the command makeblastdb -in {query.fasta} -dbtype prot
Similarly if profile HMMs are being used in the search they need to prepared for the search using the command hmmpress {query.hmm}
Configuration File¶
The configuration file (config.yaml
) is the main input for each query. This file is used to
provide general settings like file extensions and filtering thresholds, and to
provide the definition of the function being searched for. This file is split
into two sections separated by ‘—’ with the first section containing
the
configuration section¶
The configuration section of the config.yaml
file is where the settings for the ProkFunFind
search are specified. This file is made up of a main section and multiple other
sections related to the specific search approaches and filtering.
---
main:
cluster_tool: DBSCAN
faa_suffix: .faa
gff_suffix: .gff
fna_suffix: .fna
DBSCAN:
cluster_eps: 4
cluster_min_samples: 2
hmmer:
hmmer_query: ./query.hmm
hmmer_exec: hmmscan
hmmer_threads: 1
evalue: 1e-3
blast:
blast_query: ./query.fa
blast_exec: blastp
blast_threads: 1
evalue: 1e-3
kofamscan:
annot_suffix: .kofam.tsv
threshold: 0.5
emapper:
annot_suffix: .emapper.annotations
interproscan:
annot_suffix: _InterProScan.tsv
main¶
The main section of the configuration file contains general information about the annotation file suffixes and points to the feature model file and search terms table.
main:
cluster_tool: DBSCAN
faa_suffix: .faa
gff_suffix: .gff
fna_suffix: .fna
Name |
Description |
---|---|
search_terms |
The name of the file that relates search term IDs and query IDs (see below) |
cluster.tool |
The method used to cluster the genes options:
|
system.file |
The name of the file that describe the structure of the function system |
faa_suffix |
The suffix of the fasta file that contains the predicted amino acid gene sequences |
fna_suffix |
The suffix of the fasta file that contains the genome sequence(s) |
gff_suffix |
The suffix of the file that contains the GFF gene annotations for the genome |
DBSCAN¶
If multiple hits are found in the genomes during the ProkFunFind searches, the hits will be checked to see if they are in the same genomic region. This is done using Density-Based Spatial Clustering of Applications with Noise (DBSCAN). For more information on the scikit-learn DBSCAN implementation see DBSCAN.
DBSCAN:
cluster_eps: 4
cluster_min_samples: 2
Name |
Description |
---|---|
cluster.eps |
How close two genes should be in order for them to be considered to be in the same cluster. Distance is in number of genes. |
cluster.min_samples |
Minimum number of genes of interest within range set by cluster.eps required for a given gene to be considered a core member of a cluster. |
Search Approach Settings¶
The remaining sections of the configuration file are used to defined search approach specific settings. The settings allowed in each section are detailed below.
‘blast’¶
blast:
blast_query: ./bait.fa
blast_exec: blastp
blast_evalue: 1e-4
blast_threads: 1
evalue: 1e-3
ident_pct: 30
Name |
Description |
---|---|
blast_query |
The name of the protein fasta file containing the query sequences. This fasta file needs to be indexed using the ‘makeblastdb’ command. This can be given as an absolute path to the query sequence file, or as a relative path starting with ‘./’ or ‘../’ |
blast_exec |
The executable tool will be passed to the cmd to run blast. Currently blastp is the only supported blast method. |
blast_evalue |
The evalue will be passed to the cmd to run blast. Only hits below this will be returned from the blast program. Default is 10. |
blast_threads |
The number of threads will be passed to the cmd to run blast. Default is 1. |
evalue |
The evalue threshold used to filter the blast results after they are generated. This does not affect the raw BLAST output, but is instead used to filter the results after they are generated. Default is 0.01 |
ident_pct |
The identity threshold used to filter blast hits. The default value is 30 (30% identity). |
‘hmmer’¶
hmmer:
hmmer_query: ./Hdc.hmm
hmmer_exec: hmmscan
hmmer_evalue: 1e-4
hmmer_threads: 1
evalue: 1e-3
bitscore: 0
Name |
Description |
---|---|
hmmer.query |
The name of the profile HMM file. The HMM file should be indexed with hmmpress. The path can be given as an absolute path or a relative path starting with ‘./’ or ‘../’. |
hmmer.exec |
The executable tool will be passed to the cmd to run blast. Currently hmmscan is the only supported HMMER method. |
hmmer.evalue |
The evalue will be passed to the cmd to run hmmscan. Only hits below this will be returned from the hmmscan program. Default is 10. |
hmmer.threads |
The number of threads will be passed to the cmd to run hmmscan. Default is the number of cpu cores detected on your machine. |
evalue |
The evalue threshold used to filter the hmmscan results after they are generated. This does not affect the raw hmmscan output, but is instead used to filter the results after they are generated. Default is 0.01 |
bitscore |
The bitscore threshold used to filter blast hits. The default value is 0. |
‘kofamscan’¶
kofamscan:
annot_suffix: .kofam.tsv
evalue: 1e-3
threshold: 1
Name |
Description |
---|---|
annot_suffix |
The file extension for the kofamscan prediction output. |
evalue |
The evalue threshold used to filter the kofamscan results. Default is 0.01 |
threshold |
The threshold value is used to adjust the score thresholds which are used to determine if a kofamscan prediction is significant or not. Kofamscan assigns a prediction score to each protein query for each KO number. If the score is above a predetermined value for that KO, then the protein is putatively assigned to that KO. This score can be adjusted using this threshold setting, which will be used to multiply the score needed to make it more or less strict. Example: .. code-block: K00001 gene1 score: 10 KO_value: 12
- if the threshold is set to 1, then this gene would not be assigned to K00001
- if the threshold is set to 0.5, then the KO_value needed would be adjusted to 6 (12*0.5), resulting in the gene being
assigned to K00001
|
KofamScan Annotation File¶
The KofamScan annotation file is a tabular file generated using teh kofam_scan tool provided here (https://github.com/takaram/kofam_scan). The tabular oputput generated by KofamScan consists of seven columns where each line represents a putative annotation for a gene with certain lines being marked with a ‘*’ if they pass predetermined score threshold for that KO. An example KofamScan annotation file can be seen below. In this example multiple possible annotations are reproted for GCA_001563995.1_00002, with only one, K07333, being considered signfiicant based on the score being higher than the threshold for that KO. Adjusting the threshold parameter in the ProkFunFind configuration file will multiple the value for a signifcant hit by the value given in the configuration file. For example a threshold setting of 0.5, would make it so that the required threshold for K12511 in the output below change to 64.735 (i.e., 129.47 * 0.5). This can potentially affect what annotations are considered signficant in the annotation file.
# gene name KO thrshld score E-value "KO definition"
# --------- ------ ------- ------ --------- -------------
* GCA_001563995.1_00002 K07333 99.57 216.9 2.1e-65 "archaeal flagellar protein FlaJ"
GCA_001563995.1_00002 K12511 129.47 79.5 1e-23 "tight adherence protein C"
GCA_001563995.1_00002 K12510 83.50 50.3 1e-14 "tight adherence protein B"
...
‘interproscan’¶
interproscan:
annot_suffix: _InterProScan.tsv
evalue: 1e-3
Name |
Description |
---|---|
annot_suffix |
the file extension for the InterProScan annotation file. |
evalue |
The evalue threshold used to filter the InterProScan results. Default is 0.01 |
InterProScan Annotation File¶
The InterProScan annotation file is a tab separated table generated by the InterProScan command line tool. Information on installing and running InterProScan is available here: https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html The annotation file generated by InterProScan consists of 13 columns, with the key ones for ProkFunFind being the gene ID (1), type of annotation (4), annotation (5), and evalue (9). An example of an InterProScan annotaiton file can be seen below:
GCA_001563995.1_00003 1fde1e3b11f2b914cce637e9d04ad07a 140 Pfam PF07790 Archaeal Type IV pilin, N-terminal 4 48 1.6E-5 T 02-05-2023 IPR012859 Archaeal Type IV pilin, N-terminal
GCA_001563995.1_00003 1fde1e3b11f2b914cce637e9d04ad07a 140 TIGRFAM TIGR02537 arch_flag_Nterm: archaeal flagellin N-terminal-like domain 3 25 2.3E-6 T 02-05-2023 IPR013373 Flagellin
...
‘prokka’¶
prokka:
annot_suffix: .prokka.tsv
Name |
Description |
---|---|
annot_suffix |
The file extension for the Prokka tabular annotation file output. |
Prokka Annotation File¶
Prokka can provide preliminary annotations including COG assignments for a genome (see the Prokka tool and documentation here: https://github.com/tseemann/prokka). The Prokka annotation file is a genreated as part of the Prokka pipeline with a ‘.tsv’ extension. If you are using multiple annotaiton files with ProkFunFind, be careful about possible overlapping generic file extensions like ‘.tsv’. It would be recommended to rename the file to something more descriptive like ‘.prokka.tsv’. The Prokka annotaiton file consists of 7 columns including the gene name (1), and COG (6). No e-values or bit scores are provided for these annotaitons, so no filtering of these results is possible in ProkFunFind. An example Prokka annotaiton file can be seen below:
IMMGDCFF_00012 CDS 876 yisK COG0179 putative protein YisK
IMMGDCFF_00013 CDS 1308 asnS_1 6.1.1.22 COG0017 Asparagine--tRNA ligase
‘bakta’¶
bakta:
annot_suffix: .bakta.tsv
Name |
Description |
---|---|
annot_suffix |
The file extension for the Bakta tabular annotation file output. |
Bakta Annotation File¶
Bakta can provide preliminary anntoation files including KEGG KOs and COGs. The Bakta tool is available here: https://github.com/oschwengers/bakta The ‘.tsv’ output file from bakta provides information on the called genes and their putative annotations. The gene IDs are provided in column 5 while the annotations can be found in a comma separated list in column 7. An example bakta annotaiton file can be seen below:
contig_11 cds 18802 22275 - KBDKFM_00100 bcsC cellulose synthase complex outer membrane protein BcsC COG:COG5010, COG:UW, GO:0019867, GO:0030244, KEGG:K20543, RefSeq:WP_167582918.1, SO:0001217, UniParc:UPI001430EF41, UniRef:UniRef100_A0A7D7DNT3, UniRef:UniRef50_P37650, UniRef:UniRef90_P37650
contig_11 cds 22257 23363 - KBDKFM_00105 bcsZ cellulose synthase complex periplasmic endoglucanase BcsZ COG:COG3405, COG:G, EC:3.2.1.4, GO:0005576, GO:0008810, GO:0030245, KEGG:K20542, RefSeq:WP_167582921.1, SO:0001217, UniParc:UPI00142F9A65, UniRef:UniRef100_A0A7H9QRD3, UniRef:UniRef50_P37651, UniRef:UniRef90_P37651
‘emapper’¶
emapper:
annot_suffix: .emapper.annotations
evalue: 1e-3
Name |
Description |
---|---|
annot_suffix |
The file extension for the EGGNog-mapper prediction output. |
evalue |
The evalue threshold used to filter the EGGNog-mapper results. Default is 0.01 |
EggNOG-mapper Annotation File¶
The EggNOG-mapper anontation tool produces a tabular output that provides annotaiton information including COG assignments for a given genome. Details on how to install and run EggNOG-mapper can be found here: https://github.com/eggnogdb/eggnog-mapper The EggNog mapper annotation file includes the gene IDs in column 1, the e-values in column 3, and the COG annotaitons in column 5. The COG annotations include the base level COGs as well as COGs defined at different taxonomic levels. ProkFunFind searches can be performed with any of these COGs at any taxonomic level. An example EggNOG-mapper annotation file can be seen below:
##
#query seed_ortholog evalue score eggNOG_OGs max_annot_lvl COG_category Description Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs
GCA_001563995.1_00002 694440.JOMF01000005_gene106 3.8e-27 129.4 COG2064@1|root,arCOG01808@2157|Archaea,2XT4Q@28890|Euryarchaeota,2NAP1@224756|Methanomicrobia 2157|Archaea N Type II secretion system (T2SS), protein F - - - ko:K07333 - - - - ko00000,ko02035,ko02044 - - -T2SSF
GCA_001563995.1_00012 1343739.PAP_05820 4e-147 528.5 COG1855@1|root,arCOG04116@2157|Archaea,2XSZY@28890|Euryarchaeota,242R1@183968|Thermococci 2157|Archaea VK homology RNA-binding domain - GO:0005575,GO:0005618,GO:0005623,GO:0030312,GO:0044464,GO:0071944 - ko:K06865 - - - - ko00000 -- - Intein_splicing,KH_1,KH_2,LAGLIDADG_3,PIN,PIN_4,T2SSE
Function definition¶
The second part of the configuration file contains the definition of the function of interest. Functions are defined in the YAML format in a hierarchical structure. An example of a function definition can be seen below:
---
name: Equol Gene Cluster
components:
- name: Equol Production Pathway
presence: essential
components:
- geneID: DHDR
description: Dihydrodaidzein reductase
presence: essential
terms:
- id: GCF_000422625.1_00043
method: blast
ident_pct: 90
evalue: 0.00001
- geneID: THDR
description: Tetrahydrodaidzein reductase
presence: essential
terms:
- id: COG1053
method: emapper
Functions are defined in a nested structure. Each component is defined with a name property, an optional description property, and a presence property which defines if that component is essential or nonessential for the overall function.
Name |
Description |
---|---|
name/geneID:(str) |
The name of the components/ The gene ID |
components:(list) |
The list of subcomponents |
presence:(option) |
“essential”, “nonessential” |
analogs:(dict) |
Followed an equivalent component |
Components are ultimately associated with geneIDs, which have the same set of properties as higher level components, but also have search terms associated with them. In the example below the geneID ‘DHDR’ is associated with a sequence as a search term:
- geneID: DHDR
description: Dihydrodaidzein reductase
presence: essential
terms:
- id: GCF_000422625.1_00043
method: blast
ident_pct: 90
evalue: 0.00001
Search terms consist of a search term ID, the method associated with searching for this term, and additional filtering parameters. Any of the filtering parameters applicable to a given search term can be set for individual search terms in this way. See the configuration settings in the above sections for info on what filtering parameters are applicable for each approach.