Queries and Search Genomes

This tutorial section covers how to format queries and how to organize the information for the genomes that you are going to search.

Queries

The query format used by ProkFunFind is designed around the concept of what we refer to as a feature model. This feature model is a collection of search terms that are associated with a biological function of interest. These terms are organized into a hierarchical structure that is used to represent the relationships between different components of the biological system.

The tutorial materials for this section are located in the ./queries/ directory.

[Figure: Function-based searches]

Feature Model Definition

The feature model is defined in the second portion of the central config.yaml configuration file and is written in YAML format. This definition is where the organization of the function is described and where each gene in the function is associated with search terms. An example of a function definition can be seen in the queries/examples/config.yaml example file:

---
name: Equol Gene Cluster
components:
- name: Equol Production Pathway
  presence: essential
  components:
  - geneID: DZNR
    description: Daidzein reductase
    presence: essential
    terms:
    - id: DZNR
      method: hmmer
  - geneID: DHDR
    description: Dihydrodaidzein reductase
    presence: essential
    terms:
    - id: GCF_000422625.1_00043
      method: blast
      ident_pct: 90
  - geneID: THDR
    description: Tetrahydrodaidzein reductase
    presence: essential
    terms:
    - id: COG1053
      method: emapper
      evalue: 0.001
  - geneID: DDRC
    description: Dihydrodaidzein racemase
    presence: essential
    terms:
    - id: DDRC
      method: hmmer

The feature definition is organized as a series of nested components. The outermost component is the overall function definition. In the example above that would be the ‘Equol Gene Cluster’. The function is then organized in a nested structure with subcomponents that are ultimately associated with search terms. Each of these portions of the feature will be explained in the subsequent sections.
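The nesting described above can be sketched in Python. This is only an illustration of the data structure, not ProkFunFind's own code: the helper name iter_genes is hypothetical, and the model is written as a plain Python dictionary that mirrors part of the YAML example rather than being loaded from config.yaml.

```python
# Illustrative sketch only: a trimmed copy of the example feature model,
# written as nested Python dicts instead of being parsed from YAML.
FEATURE_MODEL = {
    "name": "Equol Gene Cluster",
    "components": [
        {
            "name": "Equol Production Pathway",
            "presence": "essential",
            "components": [
                {"geneID": "DZNR", "presence": "essential",
                 "terms": [{"id": "DZNR", "method": "hmmer"}]},
                {"geneID": "DHDR", "presence": "essential",
                 "terms": [{"id": "GCF_000422625.1_00043",
                            "method": "blast", "ident_pct": 90}]},
                {"geneID": "THDR", "presence": "essential",
                 "terms": [{"id": "COG1053", "method": "emapper",
                            "evalue": 0.001}]},
                {"geneID": "DDRC", "presence": "essential",
                 "terms": [{"id": "DDRC", "method": "hmmer"}]},
            ],
        }
    ],
}


def iter_genes(node, path=()):
    """Yield (path, gene) for every entry with a geneID below `node`.

    `path` is the tuple of component names leading down to the gene.
    (Hypothetical helper, not part of ProkFunFind.)
    """
    if "geneID" in node:
        yield path, node
        return
    here = path + (node["name"],)
    for child in node.get("components", []):
        yield from iter_genes(child, here)


GENES = list(iter_genes(FEATURE_MODEL))
for path, gene in GENES:
    print(" > ".join(path), "->", gene["geneID"])
```

Each gene is reported together with the chain of components above it, which is how the outermost function, its subcomponents, and the genes relate to one another in the definition.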

Types of Search Terms

ProkFunFind supports multiple types of queries, and support for additional search approaches is under active development.

You can perform searches using ProkFunFind with the following kinds of search terms:

Type of Search         Search Term
Protein Sequence       Amino Acid Sequence
Hidden Markov Model    Protein Profile HMMs
Protein Domains        Supported domains predicted by InterProScan (see: InterProScan), including Pfam and TIGRFAM
Ortholog Groups        KEGG Orthology (KO) and Clusters of Orthologous Groups (COGs)

Multiple search terms can be associated with a single geneID through the terms property in the YAML function definition:

- geneID: THDR
  description: Tetrahydrodaidzein reductase
  presence: essential
  terms:
  - id: COG1053
    method: emapper
    evalue: 0.001
  - id: GCF_000422625.1_00043
    method: blast
    ident_pct: 90
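As a rough illustration of what several terms on one geneID imply, the sketch below groups a gene's term IDs by search method. Knowing which methods are in play can help when checking which method-specific sections the configuration file needs. The function name terms_by_method is hypothetical and the gene entry is a hand-written Python copy of the YAML above.

```python
from collections import defaultdict

# Hand-written copy of the THDR entry from the YAML example.
GENE = {
    "geneID": "THDR",
    "description": "Tetrahydrodaidzein reductase",
    "presence": "essential",
    "terms": [
        {"id": "COG1053", "method": "emapper", "evalue": 0.001},
        {"id": "GCF_000422625.1_00043", "method": "blast", "ident_pct": 90},
    ],
}


def terms_by_method(gene):
    """Map each search method to the list of term IDs that use it.

    (Hypothetical helper for illustration, not a ProkFunFind function.)
    """
    grouped = defaultdict(list)
    for term in gene.get("terms", []):
        grouped[term["method"]].append(term["id"])
    return dict(grouped)


print(terms_by_method(GENE))
```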

Search Configuration

The configuration file is where all of the search parameters are defined. This central file should be named ‘config.yaml’. The file is broken up into sections in which filtering thresholds and file naming patterns are defined. An example configuration can be found in the config.yaml file in the examples directory.

---
main:
  cluster_tool: DBSCAN
  faa_suffix: .faa
  gff_suffix: .gff
  fna_suffix: .fna
DBSCAN:
  cluster_eps: 4
  cluster_min_samples: 2
hmmer:
  hmmer_query: query.hmm
  hmmer_exec: hmmscan
  hmmer_threads: 1
  evalue: 1e-3
blast:
  blast_query: query.fa
  blast_exec: blastp
  blast_threads: 1
  evalue: 1e-3
kofamscan:
  annot_suffix: .kofam.tsv
  threshold: 0.5
emapper:
  annot_suffix: .emapper.annotations
interproscan:
  annot_suffix: _InterProScan.tsv
prokka:
  annot_suffix: .prokka.tsv
bakta:
  annot_suffix: .bakta.tsv

The ‘main’ section is used to provide the file extensions for the genome input files. This section is also used to set the clustering tool that determines whether the hits from the search fall into any clusters within the genome. Currently only the DBSCAN algorithm is supported for gene clustering.

The ‘DBSCAN’ section is used to set the parameters used in the DBSCAN clustering to determine if multiple genes are present in the same clusters in the genome. The ‘cluster_eps’ setting controls how far apart two observations can be while still being considered part of the same cluster. The ‘cluster_min_samples’ parameter determines how many genes must be in the same region for them to be considered a cluster. See DBSCAN for more information on the DBSCAN implementation.
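To give a feel for how the two parameters interact, here is a small, self-contained textbook DBSCAN over one-dimensional positions. It is a sketch under assumptions, not the implementation ProkFunFind calls: the function name dbscan_1d is hypothetical, and the positions are made-up gene indices along a contig.

```python
def dbscan_1d(positions, eps, min_samples):
    """Return a cluster label per position; -1 marks noise.

    Textbook DBSCAN on 1-D data. A point's neighbourhood includes the
    point itself, so min_samples=2 means "at least one other hit nearby".
    """
    n = len(positions)
    neighbours = [
        [j for j in range(n) if abs(positions[i] - positions[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep growing
    return labels


# Hypothetical gene positions (ordinal indices of hits on one contig).
HITS = [10, 12, 15, 40, 100, 103]
LABELS = dbscan_1d(HITS, eps=4, min_samples=2)
print(LABELS)  # [0, 0, 0, -1, 1, 1]
```

With eps set to 4 and min_samples set to 2, as in the example configuration, the hits at 10, 12, and 15 chain into one cluster, the isolated hit at 40 is treated as noise, and the hits at 100 and 103 form a second cluster.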

The other sections defined in the configuration file are specific to each search approach. These sections define the file extensions for the annotation or query files as well as the filtering parameters for that search approach. See the other search-specific tutorial sections for examples and the ‘inputs’ section of the documentation for a complete table of all settings allowed for each search approach.

Search Space

The last component of the ProkFunFind approach is the set of genomes being searched. The information needed for each genome depends on what kinds of searches are being performed. At a minimum, each genome needs a genome fasta file, a GFF file containing the predicted genes, and a protein fasta file of the predicted protein sequences. With just this information, searches can be performed using BLAST or HMMER. To search using additional features, files containing the results of running eggNOG-mapper, InterProScan, or KofamScan also need to be present. See the Inputs and PFA Tutorial sections for more information on these annotation formats and the ProkFunAnnotate pipeline that can be used to generate them.
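A quick pre-flight check along these lines can catch missing inputs before a search is run. The sketch below assumes that each file is named as a genome prefix followed by the suffix from the ‘main’ section of the configuration. That naming rule, the helper name missing_core_files, and the prefix genomeA are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Suffixes copied from the 'main' section of the example configuration.
MAIN = {"faa_suffix": ".faa", "gff_suffix": ".gff", "fna_suffix": ".fna"}


def missing_core_files(genome_dir, prefix, main=MAIN):
    """List the expected core files for `prefix` that are not in genome_dir.

    Assumes files are named <prefix><suffix>; this is an illustrative
    helper, not part of ProkFunFind.
    """
    genome_dir = Path(genome_dir)
    keys = ("fna_suffix", "gff_suffix", "faa_suffix")
    expected = [genome_dir / f"{prefix}{main[key]}" for key in keys]
    return [path.name for path in expected if not path.is_file()]


# Demo on a throwaway directory holding only two of the three files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "genomeA.fna").touch()
    (Path(tmp) / "genomeA.gff").touch()
    DEMO_MISSING = missing_core_files(tmp, "genomeA")

print(DEMO_MISSING)  # ['genomeA.faa']
```

The same pattern could be extended to the annot_suffix values of the annotation-based sections when those searches are used.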