Mixed Searches

This tutorial section will cover how to perform searches using multiple types of search terms. In the previous tutorial examples you performed searches using only one type of supported search term in each search. While these types of searches can be useful, in multiple of the tutorial examples you saw how using just one type of search term can result in false positive or false negative search results. A benefit of using the ProkFunFind search approach comes from the ability to utilize multiple different types of queries during a search. For example one gene may be best represented by a profile HMM while another may be part of a well defined KO or COG group.

Mixed Search Terms

In ProkFunFind mixed searches you can incorporate any of the support search term types in any combination. The only limitation is based on what files you have available for each genome. For example in order to perform searches based on protein sequences and COG identifiers you will need to have the protein fasta files and EGGNog-mapper results available for each genome being searched.

Queries

The two most important files for configuring a mixed search in ProkFunFind are the configuration file and search-terms file. The file is going to be formatted like it has been in previous examples, but it needs to include configuration sections for each of the search term types that are being used. An example can be seen in the ./queries/mixed-search/config.yaml file:

---
main:
  cluster_tool: DBSCAN
  system_file: systems.json
  search_terms: search-terms.tsv
  faa_suffix: .faa
  gff_suffix: .gff
  fna_suffix: .fna
DBSCAN:
  cluster_eps: 4
  cluster_min_samples: 2
hmmer:
  hmmer_query: query.hmm
  hmmer_exec: hmmscan
  hmmer_threads: 1
  evalue: 1e-3
blast:
  blast_query: query.fa
  blast_exec: blastp
  blast_threads: 1
  evalue: 1e-3
kofamscan:
  annot_suffix: .kofam.tsv
  threshold: 0.5
emapper:
  annot_suffix: .emapper.annotations
interproscan:
  annot_suffix: _InterProScan.tsv

The other important file is the search-terms file. In this file each query ID is associated with the individual search terms and the search approaches. In this file multiple search terms can be associated with the same query, allowing for queries to be identified through multiple approaches simultaneously. The search terms file for this tutorial search can be seen in the config.yaml file:

name: Equol Gene Cluster
components:
- name: Equol Production Pathway
  presence: essential
  components:
  - geneID: DZNR
    description: Daidzein reductase
    presence: essential
    terms:
    - id: DZNR
      method: hmmer
  - geneID: DHDR
    description: Dihydrodaidzein reductase
    presence: essential
    terms:
    - id: GCF_000422625.1_00043
      method: blast
  - geneID: THDR
    description: Tetrahydrodaidzein reductase
    presence: essential
    terms:
    - id: COG1053
      method: emapper
  - geneID: DDRC
    description: Dihydrodaidzein racemase
    presence: essential
    terms:
    - id: DDRC
      method: hmmer

This search is going to use a mix of 3 profile HMMs, 1 protein sequence, 2 KOs, 4 domain signatures, and 4 COGs.

Output

The output is the same as what is produced by other searches. Because the search is done using multiple search terms it can also be useful to check the output to see what search terms are producing hits to certain genes in the results. This information can be found in the gff output of the search. For this search the output can be seen in the ./out/mixed-search/mixed.GTDB18040.annot.gff file:

GCF_000478885.1_1     ProkFunFind     CDS     7382    8305    .       -       .       ID=GCF_000478885.1_00007;Name=HYDE;ClusterID=Cl_NA;Target=PF04055;evalue=3.2e-19
GCF_000478885.1_1     ProkFunFind     CDS     28201   29646   .       +       .       ID=GCF_000478885.1_00024;Name=HYDE;ClusterID=Cl_NA;Target=PF04055;evalue=2e-15
GCF_000478885.1_1     ProkFunFind     CDS     261233  262642  .       +       .       ID=GCF_000478885.1_00150;Name=HYDE;ClusterID=Cl_NA;Target=PF04055;evalue=7.7e-22
GCF_000478885.1_1     ProkFunFind     CDS     288712  290358  .       -       .       ID=GCF_000478885.1_00174;Name=DEVR;ClusterID=Cl_NA;Target=DEVR;evalue=1.2e-07

In this output the Target property in column 9 provides what specific search term ID produced the hit to that gene.