ProkFunAnnotate

The ProkFunAnnotate snakemake pipeline generates the annotation data used by ProkFunFind. It is available here: https://github.com/nlm-irp-jianglab/ProkFunAnnotate

The following tutorial section walks through how to generate annotation files for a genome using the ProkFunAnnotate (PFA) pipeline.

Example Data

The genome used in this tutorial is contained in the ./pfa-genomes/ directory of the prokfunfind-tutorial repository. The input file is the ‘./pfa-list.tsv’ file.

To get started with the annotation pipeline, clone a copy of the scripts using the following command:

git clone https://github.com/nlm-irp-jianglab/ProkFunAnnotate.git

This will make a directory called ProkFunAnnotate which contains the associated snakemake files.

Downloading the Annotation Data

Both eggNOG-mapper and KofamScan require annotation databases to annotate a new genome. These databases can be downloaded manually by following the directions in the two programs' respective manuals, or they can be downloaded using the download_snakemake pipeline provided in the repository.

The following command can be used to automatically download the files and store them in a designated data folder:

snakemake -s ./ProkFunAnnotate/download_snakemake --use-singularity \
--cores 4 --config "data_dir=data"

The data_dir=data argument provided to the snakemake command designates what output directory the downloaded database files will be stored in. In this case it is a directory called ‘data/’.

The --use-singularity option tells the snakemake pipeline to use singularity to run a designated image that has eggNOG-mapper and KofamScan preinstalled. This gives access to the accessory scripts that download and format the data without having the programs installed globally on the computer.

The first time running this command might take a few extra minutes to download and launch the singularity image, but subsequent runs of this script and of the annotation script will be able to use the downloaded image without redownloading it.

This command will make a directory called ‘data/’ that contains two subdirectories: egg_data/, which stores the eggNOG-mapper database, and kofam_data/, which stores the KofamScan database files.
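
Before starting an annotation run it can be worth confirming that both database directories were actually populated. The function below is a minimal sketch of such a check; the name check_data_dir is hypothetical and not part of ProkFunAnnotate, and the check only looks for non-empty directories, not for specific database files.

```shell
# check_data_dir [DIR]: succeed only if DIR (default: data) holds
# non-empty egg_data/ and kofam_data/ subdirectories.
# Hypothetical helper, not part of ProkFunAnnotate.
check_data_dir() {
    data_dir=${1:-data}
    status=0
    for sub in egg_data kofam_data; do
        if [ -d "$data_dir/$sub" ] && [ -n "$(ls -A "$data_dir/$sub" 2>/dev/null)" ]; then
            echo "ok: $data_dir/$sub"
        else
            echo "missing or empty: $data_dir/$sub" >&2
            status=1
        fi
    done
    return "$status"
}
```

After the download finishes, the check can be run as:

check_data_dir data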

Running the Annotation Pipeline from Genome Fasta Files

To run the annotation pipeline you first need a genome or set of genomes to annotate. These genomes are expected to be in fasta format with the file extension ‘.fna’. For this test a genome file is provided at ./pfa-genomes/GTDB26128.fna.
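
When preparing your own genomes, a quick sanity check on the extension and the first line of each file can catch simple mistakes early. This is a rough sketch only: check_fna is a hypothetical name, and the test relies on the general fasta convention that a record starts with a ‘>’ header line, not on anything specific to the pipeline.

```shell
# check_fna FILE: succeed if FILE ends in .fna and its first line is a
# fasta header (starts with ">"). Hypothetical helper for illustration.
check_fna() {
    f=$1
    case $f in
        *.fna) ;;
        *) echo "wrong extension: $f" >&2; return 1 ;;
    esac
    [ -f "$f" ] || { echo "no such file: $f" >&2; return 1; }
    first=$(head -n 1 "$f")
    case $first in
        '>'*) return 0 ;;
        *) echo "first line is not a fasta header: $f" >&2; return 1 ;;
    esac
}
```

For the tutorial genome this would be:

check_fna ./pfa-genomes/GTDB26128.fna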

The input file for the annotation pipeline is ‘./pfa-list.tsv’. This file is set up to run the pipeline on the one genome stored in the ./pfa-genomes/ directory. Its single line gives the genome name followed by the directory containing the genome file:

GTDB26128       pfa-genomes/
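
For more than one genome, the list file can be generated instead of typed by hand. The sketch below assumes the format is what the example line suggests: one genome per line, with the file name minus ‘.fna’ in the first column and the containing directory in the second, separated by a tab. The function name make_genome_list is hypothetical and not part of ProkFunAnnotate.

```shell
# make_genome_list DIR: print one "<name><TAB><dir>/" line per .fna file in DIR.
# Assumes the two-column, tab-separated layout suggested by the example line.
make_genome_list() {
    dir=${1%/}                      # drop a trailing slash if present
    for f in "$dir"/*.fna; do
        [ -e "$f" ] || continue     # skip the unexpanded pattern when nothing matches
        name=$(basename "$f" .fna)
        printf '%s\t%s/\n' "$name" "$dir"
    done
}
```

A list for the tutorial directory could then be written with:

make_genome_list pfa-genomes > pfa-list.tsv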

To run the annotation script the following command can be used:

snakemake -s ./ProkFunAnnotate/annotate_snakemake --cores 32 --use-singularity \
--singularity-args "\-B ./data/egg_data/:/opt/egg_data/,./data/kofam_data/:/opt/kofam_data/" \
--config "genomes=./pfa-list.tsv" --latency-wait 10

This command will launch the snakemake pipeline, which will perform the following steps:

1. Predict the gene content using Prokka
2. Annotate the genes using eggNOG-mapper
3. Annotate the genes using KofamScan

After the pipeline is finished there will be five new files in the directory containing the genome file: a ‘.faa’ file containing the predicted protein sequences, a ‘.gff’ file containing the predicted gene locations, a ‘.emapper.tsv’ file containing the eggNOG-mapper annotations, a ‘.kofam.tsv’ file containing the KofamScan annotations, and a ‘.prokka.tsv’ file containing the annotations generated by Prokka itself. A new subdirectory named after the genome will also be created, containing the additional files generated by Prokka during gene prediction.
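
A short check can confirm that all five files were written for a genome. This sketch assumes each output is named after the genome plus the listed extension (for example GTDB26128.faa), which the description implies but does not state outright; check_outputs is a hypothetical name.

```shell
# check_outputs NAME DIR: report which of the five expected output files
# exist for genome NAME in DIR; succeed only if none are missing.
# Assumes files are named NAME.<extension>. Hypothetical helper.
check_outputs() {
    name=$1
    dir=${2%/}
    missing=0
    for ext in faa gff emapper.tsv kofam.tsv prokka.tsv; do
        if [ -f "$dir/$name.$ext" ]; then
            echo "found:   $dir/$name.$ext"
        else
            echo "missing: $dir/$name.$ext"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For the tutorial genome:

check_outputs GTDB26128 pfa-genomes/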

Running the Annotation Pipeline from a GenBank File and a Genome Fasta File

The ProkFunFind annotation pipeline can also be run using a GenBank file and a genome fasta file, like the ones that can be downloaded for any genome in the NCBI GenBank database. For example, see the files associated with this genome: https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP029205.1

The downloaded fasta file should contain the genome contig fasta sequences and have the extension ‘.fasta’ (the default extension when downloading a fasta file from GenBank). The GenBank file should have the extension ‘.gb’ (the default extension when downloading a GenBank file from GenBank).
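
Because this mode needs two files per genome, it is easy to end up with a fasta file whose GenBank partner is missing. The sketch below lists such cases. It assumes the two files share a base name and sit in the same directory, which is an assumption not confirmed by the text above; check_gb_pairs is a hypothetical name.

```shell
# check_gb_pairs DIR: print a line for every .fasta file in DIR that has no
# matching .gb file; succeed only if every fasta file is paired.
# Assumes NAME.fasta and NAME.gb live side by side. Hypothetical helper.
check_gb_pairs() {
    dir=${1%/}
    unpaired=0
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue     # skip the unexpanded pattern when nothing matches
        name=$(basename "$f" .fasta)
        if [ ! -e "$dir/$name.gb" ]; then
            echo "no GenBank file for $name"
            unpaired=$((unpaired + 1))
        fi
    done
    [ "$unpaired" -eq 0 ]
}
```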

The rest of the input is the same as when running the pipeline on just a genome fasta file, and the output will be the same. The only difference is that the pipeline should be run with the annotate_gb_snakemake snakemake file instead of annotate_snakemake.
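
As a sketch of what that looks like in practice, the function below wraps the annotation command from the previous section with only the snakefile swapped. The wrapper name run_gb_annotation is hypothetical. The snakefile location inside ./ProkFunAnnotate/ and the ./data default are assumptions based on the clone and download steps above, and --cores is the spelling of the cores option used in the download command.

```shell
# run_gb_annotation LIST [DATA_DIR]: run the GenBank variant of the pipeline.
# All options mirror the annotate_snakemake command; only the snakefile differs.
# The snakefile path and the ./data default are assumptions.
run_gb_annotation() {
    genomes=$1
    data_dir=${2:-./data}
    snakemake -s ./ProkFunAnnotate/annotate_gb_snakemake --cores 32 --use-singularity \
        --singularity-args "\-B $data_dir/egg_data/:/opt/egg_data/,$data_dir/kofam_data/:/opt/kofam_data/" \
        --config "genomes=$genomes" --latency-wait 10
}
```

With a list file prepared for the GenBank genomes, the call would be:

run_gb_annotation ./pfa-list.tsv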