Yet Another CPAN Grep

Bio-Grid-Run-SGE/lib/Bio/Grid/Run/SGE/Config.pod

=pod

=encoding utf8

=head1 Configuration

Bio::Grid::Run::SGE uses various configuration settings to run a job. All
configuration is stored in the YAML format. The configuration can be stored at
B<two> places:

=over 4

=item 1. In a global configuration file located at F<~/.bio-grid-run-sge.conf>.

=item 2. In a per-job configuration file supplied as argument to the cluster script.

=back

=head2 Creating a global config file

The global config file can contain e.g. settings that are used for job
notifications and paths to executables.

=head3 Job notification

Bio::Grid::Run::SGE can notify you if a job finishes by email or jabber
message. You can also use a custom script with the C<script> option. An
example configuration would be:

    ---
    notify:
      mail:
        dest: person.in.charge@example.com
        smtp_server: smtp.example.com
      jabber:
        jid: grid-report@jabber.example.com/grid_report
        password: ...
        dest: person-in-charge@jabber.example.com
      script: /path/to/log/script.pl

Custom scripts will get a json encoded structure passed via stdin. The structure has the form:


    {
      "subject": "the subject",
      "message": "the main log message",
      "from":    "user@the.cluster.org"
    }

=head3 Other global configuration

You can add other configuration settings. If you start a lot of L<R|http://www.r-project.org> scripts you might want to add the Rscript bin as global configuration:

  ---
  notify:
    ....
  r_script_bin: /usr/bin/Rscript

This configuration setting is accessible in the cluster script via the
supplied configuration of the task function as
C<$c->{extra}{r_script_bin}>.


=head2 Creating a job-specific config file


  job_name: NAME
  mode: Consecutive/AvsB/AllvsAll/AllvsAllNoRep

  args: [ '-a', 10, '-b','no' ]
  test: 2
  no_prompt: 1

  num_parts: 3000
  # or
  combinations_per_task 300

  result_dir: result_gff
  working_dir:
  stderr_dir:
  stdout_dir:

  log_dir: dir
  tmp_dir: dir
  idx_dir: dir

  prefix_output_dirs: 

=head3 path specifictation in the config file

If the config file contains relative paths, the following policy is used:

=over 4

=item 1. The C<working_dir> config entry is used as "root".

=item 2. If no C<working_dir> config entry is specified, the directory of the config file is set to the working/root dir.

=item 3. If no config file is specified (yes, this is possible, but not recommended), the current dir is used as working/root dir.

=back

The working directory needs to exist.

=head3 The input section

With the input section it is possible to specify the type of input data and
how the index should be created.

The basic layout is:

  ---
  input:
  - ... # details of index 1
  - ... # details of index 2

Each index element shows up as argument in the C<task> function, 

  run_job(...
    task => sub {
      my ( $c, $result_prefix, $element_index_1, $element_index_2, ... ) = @_;
    }
    ...
  )

The number of indices you can use is determined by the L<mode|/Running mode>.
The most basic mode is C<Consecutive> and it takes one index and iterates
through every element.

=head4 Index types

=over 4

=item L<General|Bio::Grid::Run::SGE::Index::General>

=item L<List|Bio::Grid::Run::SGE::Index::List>

=item L<FileList|Bio::Grid::Run::SGE::Index::FileList>

=item L<Range|Bio::Grid::Run::SGE::Index::Range>

=item L<ListFromFile|Bio::Grid::Run::SGE::Index::ListFromFile>

=item L<Dummy|Bio::Grid::Run::SGE::Index::Dummy>

=back

=head3 Running mode

Bio::Grid::Run::SGE can run in different iteration modes

=over 4

=item L<Consecutive|Bio::Grid::Run::SGE::Iterator::Consecutive>

=item L<AvsB|Bio::Grid::Run::SGE::Iterator::AvsB>

=item L<AllvsAll|Bio::Grid::Run::SGE::Iterator::AllvsAll>
  
=item L<AllvsAllNoRep|Bio::Grid::Run::SGE::Iterator::AllvsAllNoRep>

=back


=

  ---
  input:
  - format: General
    #files, list and elements are synonyms
    files:
    - ../03_clean_evidence/result/merged.fa.clean
    chunk_size: 30
    sep: ^>
    sep_remove: 1
    sep_pos: '^'/'$'
    ignore_first_sep: 1

  - format: List
    list: [ 'a', 'b', 'c' ]
    
  - format: FileList
    files: [ 'filea', 'fileb', 'filec' ]

  - format: Range
    list: [ 'from', 'to' ]

=head3 RESEVED CONFIGURATION OPTIONS

Example configuration:

  'stdout_dir' => '/WORKING_DIR/xml_munge1.tmp/out',
  'test'       => '1',
  'no_prompt'  => undef,
  'input'      => [
    {
      'elements' => [
        '../../2013-10-13_string_b2g_blast/cafa_b2g_blastSTRING_9606_protein.sequences.result/cafa_b2g_blastSTRING_*_protein.sequences.*.blast.gz'
      ],
      'format'   => 'FileList',
      'idx_file' => '/WORKING_DIR/idx/xml_munge1.0.idx'
    }
  ],
  'mode'          => 'Consecutive',
  'range'         => [ '1', '1' ],
  'submit_bin'    => 'qsub',
  'submit_params' => [],
  'args'          => [],
  'working_dir'   => '/WORKING_DIR/test',
  'num_comb'      => 564,
  'log_dir'       => '/WORKING_DIR/xml_munge1.tmp/log',
  'stderr_dir'    => '/WORKING_DIR/xml_munge1.tmp/err',
  'tmp_dir'       => '/WORKING_DIR/xml_munge1.tmp',
  'smtp_server'   => 'net.wur.nl',
  'job_name'      => 'xml_munge1',
  'extra'         => { 'map' => '../split_test.map.json.gz' },
  'mail'          => 'joachim.bargsten@wur.nl',
  'script_dir'    => '/WORKING_DIR/bin',
  'idx_dir'       => '/WORKING_DIR/idx',
  'job_cmd' =>
    'qsub -t 1-1 -S perl -N xml_munge1 -e /WORKING_DIR/xml_munge1.tmp/err -o /WORKING_DIR/xml_munge1.tmp/out /WORKING_DIR/xml_munge1.tmp/env.xml_munge1.pl WORKING_DIR/bin/cl_xml_munge.pl --worker /WORKING_DIR/xml_munge1.tmp/xml_munge1.config.dat',
  'job_id' => '325541.1',
  'cmd'    => [ '/WORKING_DIR/bin/cl_xml_munge.pl' ],
  'worker_config_file' =>
    '/WORKING_DIR/xml_munge1.tmp/xml_munge1.config.dat',
  'prefix_output_dirs' => '1',
  'perl_bin'           => '/home/cafa/perl5/perlbrew/perls/perl-5.16.3/bin/perl',
  'result_dir'         => '/WORKING_DIR/xml_munge1.result',
  'part_size'          => 1,
  'num_parts'          => 564


Here is a list of reserved configuration options:


  $c = {
    cmd => ...,
    script_dir => ...
    no_post_task => ...,
    tmp_dir => ...,
    stderr_dir => ...,
    stdout_dir => ...,
    result_dir => ...,
    log_dir => ...,
    idx_dir => ...,
    test => ...,
    mail => ...,
    smtp_server => ...,
    no_prompt => ...,
    lib => ...,
    input => ...,
    extra => ...,
    num_parts => ...,
    combinations_per_task => ...,
    job_name => ...,
    job_id => ...,
    mode => ...,
    worker_config_file => ...,
    worker_env_script => ...,
    submit_bin => ...,
    submit_params => ...,
    perl_bin => ...,
    working_dir => ...,
    iterator => ...,

    args => ...,
  };

=head3 input section

  ---
  input:
  - format: General
    #files, list and elements are synonyms
    files:
    - ../03_clean_evidence/result/merged.fa.clean
    chunk_size: 30
    sep: ^>
    sep_remove: 1
    sep_pos: '^'/'$'
    ignore_first_sep: 1

  - format: List
    list: [ 'a', 'b', 'c' ]
    
  - format: FileList
    files: [ 'filea', 'fileb', 'filec' ]

  - format: Range
    list: [ 'from', 'to' ]

  job_name: NAME
  mode: Consecutive/AvsB/AllvsAll/AllvsAllNoRep

  args: [ '-a', 10, '-b','no' ]
  test: 2
  no_prompt: 1

  num_parts: 3000
  # or
  combinations_per_task 300

  result_dir: result_gff
  working_dir:
  stderr_dir:
  stdout_dir:

  log_dir: dir
  tmp_dir: dir
  idx_dir: dir

  prefix_output_dirs: 

The attribute C<args> is special, normally the main executable is hard-coded in the cl_* script,
but the arguments are changing per configuration. Therefore
L<Bio::Grid::Run::SGE::Master> provides the convenience attribute C<< $c->{args} >>
Maintained by Kenichi Ishigaki <ishigaki@cpan.org>. If you find anything, submit it on GitHub.