# PODNAME: Palantir::Manual
# ABSTRACT: User Guide for Palantir

__END__

=for todo
	TODO: Nothing right now


=head1 Aim and features

C<Palantir> (Post-processing Analysis tooL for ANTIsmash Reports) is a toolbox
for supporting genome mining analyses based on antiSMASH reports. antiSMASH is
one of the most comprehensive and up-to-date pipelines available for the
detection of secondary metabolism pathways. This package offers two different
sets of functionalities. On the one hand, C<Palantir> offers methods that help
the user manipulate and analyze biosynthetic gene cluster (BGC) information in
small and large-scale genome mining projects: (1) FASTA sequence extraction at
any BGC level, (2) PDF/Word reporting and (3) SQL table generation for advanced
data management.

On the other hand, C<Palantir> aims to achieve a more complete and accurate 
I<in silico> characterization of NRPS/PKS enzymatic systems with several 
methods: (4) module delineation, (5) gap-filling for completing the BGC 
annotation, and (6) the dynamic elongation of their core domain sequences.
Moreover, (7) a visualization functionality allows the user to easily check the
refinements applied to the BGC domain architecture and compare them with the
antiSMASH version. Finally, (8) an "exploratory mode", devised to interpret the
architecture from scratch, i.e., without any bias from a previously defined
consensus, is also provided.


=head1 Usage


=head2 Installation

C<Palantir> is written in I<Modern Perl> but relies on several external
dependencies (see below). You should download and install the corresponding
binaries in whichever way is most appropriate for your system:

=over

=item - HMMER3 (L<http://hmmer.org/download.html>)

(This might be installed with C<sudo apt-get install hmmer>.)

=item - sqlite3 (L<https://www.sqlite.org/download.html>)

(Needed to use F<export_bgc_sql_tables.pl>; this might be installed with
C<sudo apt-get install sqlite3>.)

=item - Inkscape (L<https://inkscape.org/release/>)

(Needed to use F<generate_bgc_report.pl>; this might be installed with
C<sudo apt-get install inkscape>.)

=item - pandoc (L<https://pandoc.org/installing.html>)

(Needed to use F<generate_bgc_report.pl>; this might be installed with
C<sudo apt-get install pandoc>.)

=back
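
As a quick sanity check after installing these tools, the following shell
sketch reports which executables cannot be found on the C<PATH>. The function
name C<missing_tools> is hypothetical, and the executable names are
assumptions: C<hmmsearch> is one of the programs shipped with HMMER3, and the
other names are those usually installed by the packages listed above.

```shell
# Print the name of every given executable that is not found on the PATH,
# one per line. Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for the external dependencies listed above.
missing_tools hmmsearch sqlite3 inkscape pandoc
```

An empty output means that every executable was found.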


The C<libgd-dev> package must also be installed for the Perl module
C<GD::Simple> to work.

	sudo apt-get install libgd-dev

Most other dependencies can be handled automatically by using C<cpanm> in a
I<Perlbrew> environment (L<https://perlbrew.pl/>). Below is a set of commands
to set up such an environment on I<Ubuntu>.

    # install development tools
    $ sudo apt-get update
    $ sudo apt-get install build-essential

    # download the perlbrew installer...
    $ wget -O - http://install.perlbrew.pl | bash

    # initialize perlbrew
    $ source ~/perl5/perlbrew/etc/bashrc
    $ perlbrew init

    # search for a recent stable version of the perl interpreter
    $ perlbrew available
    # install the last even version (e.g., 5.24.x, 5.26.x, 5.28.x)
    # (this will take a while)
    $ perlbrew install perl-5.26.2
    # install cpanm (for Perl dependencies)
    $ perlbrew install-cpanm

    # enable the just-installed version
    $ perlbrew list
    $ perlbrew switch perl-5.26.2

    # make perlbrew always available
    # if using bash (be sure to use double >> to append)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bashrc
    # if using zsh (only the destination file changes)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.zshrc

Major C<Palantir> dependencies are the C<Bio::MUST> series of modules. 
Install them as follows.

    $ cpanm Bio::FastParsers
    $ cpanm Bio::MUST::Core

Since C<Bio::MUST> modules rely on external bioinformatics programs and come
with complex test suites, they sometimes raise errors during installation. If
you encounter any such error, consider enabling the C<--force> and/or
C<--notest> options of C<cpanm>.

    $ cpanm --force Bio::MUST::Core

Install C<Palantir> itself. All remaining dependencies can also be taken care 
of by C<cpanm>.

    $ cpanm Bio::Palantir


=head2 Input

Palantir accepts report files from antiSMASH versions 3 and 4 (F<biosynML.xml>)
and from the newer version 5 (F<regions.js>). The F<biosynML.xml> reports are
not generated by default in antiSMASH 4; the C<--enable-biosynml> option is
needed for them to be written to the results directory. The F<regions.js> file
can be obtained from the results downloaded from the antiSMASH web server
(L<https://antismash.secondarymetabolites.org>) or with the standalone version.

Also, FASTA files containing BGC sequences are used by F<explore_bgc_domains.pl>.


=head2 Binaries for the management of BGC data


=head3 F<extract_bgc_sequences.pl> - (1) FASTA sequence extraction at any BGC level

(This script uses Palantir functionalities 1, 4, 5 and 6)

Protein sequences are useful in most downstream analyses performed on identified
BGCs. F<extract_bgc_sequences.pl> gives easy access to this information.

The most basic usage extracts the Palantir annotation for every BGC gene present
in the report:

	extract_bgc_sequences.pl --report-file=antismash5_report/regions.js

The extracted gene sequences will be stored by default in the
F<bgc_sequences.fasta> file. The output filename can be specified with the
C<--outfile> option.

If you want to extract specific BGC types, you can use the C<--types> option.

You can find the list of types by using:

	extract_bgc_sequences.pl --help

Depending on the analysis you intend to do, it is also possible to specify which 
scale you would like to use for extracting sequences: cluster, gene, module or
domain (by default: gene).

N.B.: module and domain scales are only allowed for NRPS and type 1 PKS based
enzymes.

Here is an example of a more specific command line:

	extract_bgc_sequences.pl --report-file=antismash5_report/regions.js \
	--types=nrps --scale=domain --outfile=strain1_domains.fasta

Furthermore, if you are only interested in the antiSMASH annotation, you can use
the C<--annotation=antismash> option.

You can choose the cutting mode applied for the delineation of modules with
the C<--module-delineation> option. Two modes are available: condensation and
substrate-selection (default: substrate-selection).

Finally, to add information to the sequence IDs of the FASTA file,
you can use C<--prefix>. For instance, C<--prefix=strain1> will give sequence
IDs such as ">strain1@Cluster...".
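
When several strains are processed, the options described in this section can
be combined in a loop. The shell sketch below prints one command per report and
uses the name of the directory containing the report as prefix. It assumes one
report directory per strain; the function name C<extraction_commands> and the
output file naming are hypothetical. The commands are only printed (dry run),
and paths containing spaces are not handled.

```shell
# Print one extract_bgc_sequences.pl command per report path given as argument.
# The parent directory name of each report is used as the sequence ID prefix.
extraction_commands() {
    for report in "$@"; do
        strain=$(basename "$(dirname "$report")")
        printf 'extract_bgc_sequences.pl --report-file=%s --prefix=%s --outfile=%s_bgc_sequences.fasta\n' \
            "$report" "$strain" "$strain"
    done
}
```

For instance, C<extraction_commands reports/*/regions.js | sh> would run the
printed commands once they have been checked.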


=head3 F<generate_bgc_report.pl> - (2) PDF/Word reporting

(This script uses Palantir functionality 2)

To present antiSMASH reports in a more readable format,
F<generate_bgc_report.pl> offers users PDF/Word docx reporting.

The PDF/Word report contains one BGC per page and summarizes basic information
(type, coordinates, size and the BGC map). In the case of NRPS/PKS BGCs, the
list of domains and product monomers is also given.

Here is an example of basic command line use:

	generate_bgc_report.pl --report-file=antismash4_report/biosynML.xml \
	--filetype=pdf

N.B.: this script does not work for antiSMASH 5. 

The C<--filetype> option allows the user to choose between PDF and Word docx
output (values: pdf or docx, default: pdf).

The C<--types> and C<--outfile> options are available and
work as explained in the F<extract_bgc_sequences.pl> section.


=head3 F<export_bgc_sql_tables.pl> - (3) SQL tables generation

(This script uses Palantir functionalities 3, 4, 5 and 6)

Whether for data visualization or statistics, an SQL database is useful for
the analysis of large-scale and hierarchically organized data.
F<export_bgc_sql_tables.pl> exports the BGC information into SQL tables (you can
then choose the SQL database engine) and sets up an sqlite3 database.


=for todo
	TODO: Add the SQL schema
	Palantir annotation (as well as antiSMASH) are exported, see the SQL schema:
	L<todo>.
 

Basic command line example:

	export_bgc_sql_tables.pl --infiles antismash_report1/regions.js \
	antismash_report2/regions.js antismash_report3/regions.js

The C<--infiles> option allows the user to specify multiple reports at once.

Several options are available:

=over

=item - C<--module-delineation>: cutting mode, either condensation or
substrate-selection (default: substrate-selection).

=item - C<--db-name>: database name (default: bgc_db).

=item - C<--cpu>: number of CPUs to use (default: 1).

=back

To supply many reports at once, C<--file-table> allows the user to provide
a text file with the list of antiSMASH report paths.

For example:

F<reports.list>:

	antismash5_reports/strain1/regions.js
	antismash5_reports/strain2/regions.js
	antismash4_reports/strain3/biosynML.xml
	antismash3_reports/strain4/biosynML.xml
	...
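
Such a list can be written by hand or generated. As a minimal shell sketch,
assuming the reports are stored under a common directory, the hypothetical
C<list_reports> function below prints the path of every F<regions.js> and
F<biosynML.xml> file it finds, one per line.

```shell
# Print the paths of all antiSMASH report files found under the directory
# given as first argument, one per line, in sorted order.
list_reports() {
    find "$1" -type f \( -name 'regions.js' -o -name 'biosynML.xml' \) | sort
}
```

For instance, C<< list_reports antismash_reports > reports.list >> would write
the list to a file (the directory name is hypothetical).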

Additionally, C<--new-db> can be used to erase a pre-existing results directory.

Here is an advanced command line example: 

	export_bgc_sql_tables.pl --file-table=reports.list --types=nrps t1pks \
	--db-name=strain1_db --cpu=2

Also, some advanced options can tweak the way Palantir annotates NRPS/PKS BGCs:

C<--gap-filling>: when enabled, tries to find domains in the gaps (>=250 aa)
left by the antiSMASH BGC annotation by using a second detection run
(default: 1).

C<--undef-recov>: some domains from antiSMASH reports have no defined function
value (such as 'C',...) and are therefore uninformative. This option tries to
recover this value, in order to complete the antiSMASH annotation, by running a
detection run on the domain sequences (default: 0). This is done by default for
the Palantir annotation.

C<--undef-cleaning>: when the domain function value is left undefined by
antiSMASH and cannot be recovered, this option removes these domains from the
Palantir BGC annotation (default: 1).

By default, we enable these three options as we think this helps achieve a more
complete BGC annotation.


=head3 F<generate_bgc_dnz_table.pl> - generate a denormalized table of BGC data

(This script uses Palantir functionalities 5 and 6)

To support manual data extraction with Excel or downstream analyses with a
programming language such as R or Python, F<generate_bgc_dnz_table.pl> provides
a denormalized TSV (Tab-Separated Values) table. This denormalized table
consists of rows that iteratively contain all the data from the different BGC
scales (cluster, gene, domain).

Basic command line usage: 

	generate_bgc_dnz_table.pl --report-file=antismash5_report/regions.js

Several options are available: 

=over

=item - C<--types>: BGC types to filter, as explained in the
F<extract_bgc_sequences.pl> section.

=item - C<--outfile>: output filename.

=item - C<--id>: ID to be used as the first column of the table (e.g., the
organism name), which is useful for pasting tables together.

=item - C<--annotation>: annotation version to use (palantir or antismash).

=back
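
Tables generated for several organisms with different C<--id> values can then
be stacked into a single file. The shell sketch below shows one way to do so,
under the assumption (not confirmed here) that every table starts with the same
single header line. The function name C<merge_tsv> is hypothetical.

```shell
# Concatenate TSV files, keeping only the header line of the first file.
# Assumes that each input file starts with one identical header line.
merge_tsv() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

For instance, C<< merge_tsv strain1.tsv strain2.tsv > all_strains.tsv >> would
produce one table for both organisms (the file names are hypothetical).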


=head2 Binaries for the refinement of the annotation of NRPS/PKS BGCs


=head3 F<draw_bgc_maps.pl> - draw BGC maps

(This script uses Palantir functionalities 4, 5, 6 and 7)

As NRPS and PKS BGCs are made up of different layers (genes, modules and
domains), visualizing the maps of these BGCs is an easy way to compare different
annotations. F<draw_bgc_maps.pl> offers the mapping of three annotation
versions: antiSMASH, Palantir and Palantir's exploratory mode.

Here is a basic usage example of this script:

	draw_bgc_maps.pl --report-file=examples/antismash5_report/regions.js \
	--mode=all --label=symbol

C<--mode>: BGC annotation to draw; you can choose between all, palantir,
exploratory and antismash (default: all).

C<--label>: domain label to use on the map; three are available: function,
symbol or subtype. 'symbol' corresponds to the letter used to represent a
domain function (e.g., 'C' for condensation or 'KS' for ketosynthase domain),
while 'function' uses the complete domain name provided by the protein
signature. Finally, 'subtype' adds the subtype information for domains
responsible for substrate activation and condensation activity (e.g.,
LCL C domain or Val A domain).

In Palantir annotations, the label contains the prediction E-value.

More options:

=over

=item - C<--module-delineation>: cutting mode, either condensation or
substrate-selection (default: substrate-selection).

=item - C<--verbose>: prints additional information concerning domains
(function, coordinates and sequences).

=item - C<--outdir>: name of the directory where PNG files will be generated
(default: ./png/).

=item - C<--prefix>: string to use for prefixing PNG files.

=back


=head3 F<explore_bgc_domains.pl> - Reports all detected NRPS/PKS domains without
architecture consensus

(This script uses Palantir functionalities 5, 6, and 8)

The NRPS and PKS domains predicted with antiSMASH rely on detection rules (i.e.,
E-value and % of the domain signature covered during HMMER analyses) that
have been improved over time by the antiSMASH authors for separating true from
spurious predictions. However, true domains (that may have a divergent sequence
or be truncated) are sometimes discarded even though they could add more
insight into the BGC annotation. Moreover, the architecture of NRPS/PKS BGCs
given by antiSMASH is often the result of a consensus architecture among
overlapping detected domain signatures.

Although the antiSMASH rules for validating the presence of a domain or
determining the BGC architecture are usually reliable, they may sometimes lead
to incomplete or incoherent BGC architectures. To address this,
F<explore_bgc_domains.pl> provides a view of the BGC composition that is not
biased by a pre-existing architecture consensus. To this end, the script exports
the data from all detected domain signatures in TSV and JSON format.

This script, unlike the others, uses a FASTA file of NRPS/PKS data as input
(this file can be created from an antiSMASH report with
F<extract_bgc_sequences.pl>).

Here is a command line example: 

	explore_bgc_domains.pl --fasta-file=strain1_bgc_sequences.fasta \
	--outfile=strain1_exploratory_domains

which will produce two output files: F<strain1_exploratory_domains.tsv> and
F<strain1_exploratory_domains.json>.

Apart from the input file, this script takes no option other than C<--outfile>.