PathogenFinder 2

Welcome to the web application PathogenFinder2.

PathogenFinder2 is a novel deep learning model that predicts the capacity of a bacterium to be pathogenic to humans, using only its genome. PathogenFinder2 can also report the proteins that mattered most for the prediction, as well as the embeddings that locate a bacterial genome in the Pathogenic Bacteria Landscape.
For more detailed information, please see the article "Whole-genome prediction of bacterial pathogenic capacity on novel bacteria using protein language models, with PathogenFinder2".

BETA version: Note that the current version of PathogenFinder 2 is still in beta and you may encounter issues. If you do, please let us know via the contact page.

Instructions

Version

You can choose which version of the software you want to use from the drop-down list.

Input data type

Select the format of your input file. For now, PathogenFinder2 only accepts input files in FASTA format.
The FASTA file must contain the genomic data of one bacterial isolate. For more than one input, consider using the GitHub repository locally. The files must not be compressed.
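
The requirements above can be checked locally before uploading. The following is a minimal sketch, not part of PathogenFinder2: the function names `check_fasta_lines` and `check_fasta` are hypothetical, and the check only covers what can be verified mechanically (the file is not gzip-compressed, it starts with a FASTA header, and it contains sequence lines). It cannot tell whether the contigs belong to a single isolate.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip file


def check_fasta_lines(lines):
    """Inspect an iterable of byte lines; return a list of problems found."""
    first = None      # first non-empty line
    n_seq_lines = 0   # number of non-header, non-empty lines
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if first is None:
            first = line
        if not line.startswith(b">"):
            n_seq_lines += 1

    if first is None:
        return ["file is empty"]
    problems = []
    if not first.startswith(b">"):
        problems.append("first non-empty line is not a FASTA header")
    if n_seq_lines == 0:
        problems.append("no sequence lines found")
    return problems


def check_fasta(path):
    """Check a file on disk; an empty list means it looks like plain FASTA."""
    with open(path, "rb") as handle:
        if handle.read(2) == GZIP_MAGIC:
            return ["file appears to be gzip-compressed"]
        handle.seek(0)
        return check_fasta_lines(handle)
```

An empty list from `check_fasta("my_isolate.fasta")` (a hypothetical file name) means none of these basic problems were detected. It does not guarantee that the server will accept the file.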

To avoid problems caused by file names, only a limited set of ASCII characters is allowed in them: a-z, A-Z, 0-9, "_" (underscore), "-" (hyphen), "." (full stop).
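
The character rule above can be expressed as a regular expression. This is a small sketch for checking or cleaning a name before upload. The helper names `is_allowed_filename` and `sanitize_filename` are hypothetical, and replacing disallowed characters with an underscore is an assumption made here, not something the service does.

```python
import re

# Allowed characters as listed above: a-z, A-Z, 0-9, underscore, hyphen, full stop.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")
DISALLOWED_CHAR = re.compile(r"[^A-Za-z0-9_.-]")


def is_allowed_filename(name):
    """True if every character in the (non-empty) name is allowed."""
    return ALLOWED_NAME.fullmatch(name) is not None


def sanitize_filename(name, replacement="_"):
    """Replace each disallowed character with `replacement` (an assumption)."""
    return DISALLOWED_CHAR.sub(replacement, name)
```

For example, `is_allowed_filename("my genome.fasta")` is false because of the space, while `sanitize_filename("my genome.fasta")` gives `my_genome.fasta`.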

Upload and submit job

After attaching the files, click the 'Submit job' button to submit your job. A waiting page is displayed and refreshed until the job finishes, at which point the server output page appears in your browser. You can also enter your email address to be notified as soon as your results are ready. The data remain available for one week from the moment the results are created.

Output

The PathogenFinder2 prediction comes from an ensemble of four neural networks. Four predictions are therefore reported, each a number between 0 (without pathogenic capacity) and 1 (with pathogenic capacity). This number does not measure the degree of pathogenic capacity. Instead, it reflects how sure the network is of its prediction, that is, how close the bacterium is to the decision border of PathogenFinder2 (the closer to 0.5, the less sure the neural network is about the pathogenic capacity). Using the mean of the four values is valid, but we recommend also taking the separate predictions into account when making decisions about the nature of the bacterium.
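
As an illustration of reading the four values together, here is a minimal sketch. It assumes a decision border of 0.5, following the statement above that values near 0.5 are the least certain. The function name `summarize_ensemble` and the returned fields are hypothetical and are not columns of "results.tsv".

```python
from statistics import mean


def summarize_ensemble(predictions, threshold=0.5):
    """Summarize four per-network predictions, each between 0 and 1."""
    if len(predictions) != 4:
        raise ValueError("expected exactly four predictions")
    if any(not 0.0 <= p <= 1.0 for p in predictions):
        raise ValueError("predictions must lie between 0 and 1")

    # One vote per network: True if it falls on the pathogenic side.
    votes = [p > threshold for p in predictions]
    return {
        "mean": mean(predictions),
        "pathogenic_votes": sum(votes),
        # True when all four networks fall on the same side of the border.
        "unanimous": all(votes) or not any(votes),
        # Distance to the border of the least certain network.
        "min_margin": min(abs(p - threshold) for p in predictions),
    }
```

A high mean combined with a non-unanimous vote or a small `min_margin` is the kind of case where looking at the separate predictions matters.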
By default, PathogenFinder2 reports a results file ("results.tsv"), as well as the embeddings file and the attention scores file ("embeddings.npz" and "attentions.npz", respectively). Intermediate files, such as the predicted proteins and/or the embeddings file, are also reported if they were produced during the run.
If the option for mapping the top proteins highlighted by the attention scores to UniRef50 is selected, a table with the results is also displayed (unavailable at the moment) and can be downloaded ("meh.tsv"). If the option for mapping the embeddings to the Pathogenic Bacteria Landscape is selected, the image and the closest neighbours are available for download.
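
The names of the arrays stored in "embeddings.npz" and "attentions.npz" are not documented here, so the sketch below only lists them and assumes none. It relies on the fact that a `.npz` file is a zip archive holding one `.npy` member per array. The helper names `array_names` and `list_npz_arrays` are hypothetical.

```python
import zipfile


def array_names(member_names):
    """Turn zip member names such as 'x.npy' into array names such as 'x'."""
    return [name[:-4] for name in member_names if name.endswith(".npy")]


def list_npz_arrays(path):
    """List the arrays stored in a .npz file without loading them."""
    with zipfile.ZipFile(path) as archive:
        return array_names(archive.namelist())
```

With NumPy installed, `numpy.load(path).files` returns the same names, and indexing the loaded object by one of those names returns the corresponding array.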

PathogenFinder2 has four outputs, two standard and two supplementary:

Bacterial Pathogenic Capacity prediction: Shows the predictions of the neural networks on bacterial pathogenic capacity. This is always part of the output.

Highlighted proteins during the pathogenic capacity prediction: Shows the UniRef50 matches of the 20 proteins that were most relevant to each neural network's pathogenicity prediction. This is only part of the output if the "Map the 20 most relevant proteins to UniRef50" option has been selected.

Bacterial Pathogenic Landscape & Closest Bacteria in the Bacterial Pathogenic Landscape: Shows the location of the sample in the pathogenic landscape, as well as the 10 bacteria closest to your sample in it. This is only part of the output if the "Map your sequence to the Pathogenic Bacteria Landscape" option has been selected.

Downloads: Section where you can download all the files produced during the PathogenFinder2 run. This is always part of the output, but the number of files depends on the options selected.

Input data type:

Select input type

Right now, the program only accepts bacterial genomes in FASTA format (not compressed).

Run extra phenotyping analysis:

This will delay the results notably.

Map your sequence to the Pathogenic Bacteria Landscape.

Map the 20 most relevant proteins to UniRef50.

Upload and submit job:

Email (optional, to be notified when the job has finished):

Files (the sum of uploaded file sizes cannot exceed 1 GB):

Citations

If you use and/or publish results obtained by the service, please cite the article below.

Ferrer Florensa, A., Almagro Armenteros, J. J., Kaas, R. S., Clausen, P. T., Nielsen, H., Rost, B., Aarestrup, F. M. (2025). Whole-genome prediction of bacterial pathogenic capacity on novel bacteria using protein language models, with PathogenFinder2. bioRxiv, 2025-04.

Center for Genomic Epidemiology
