Catmandu-Breaker/lib/Catmandu/Breaker.pm

package Catmandu::Breaker;

our $VERSION = '0.141';

use Moo;
use Carp;
use Catmandu;
use Catmandu::Util;
use Data::Dumper;

has verbose  => ( is => 'ro', default => sub {0} );
has maxscan  => ( is => 'ro', default => sub {-1} );
has tags     => ( is => 'ro' );
has _counter => ( is => 'ro', default => sub {0} );

sub counter {
    my ($self) = @_;
    $self->{_counter} = $self->{_counter} + 1;
    $self->{_counter};
}

sub to_breaker {
    my ( $self, $identifier, $tag, $value ) = @_;

    $value //= '';

    croak "usage: to_breaker(identifier,tag,value)"
        unless defined($identifier) && defined($tag) && defined($value);

    $value =~ s{\n}{\\n}mg;

    sprintf "%s\t%s\t%s\n", $identifier, $tag, $value;
}

sub from_breaker {
    my ( $self, $line ) = @_;

    my ( $id, $tag, $value ) = split( /\s+/, $line, 3 );

    croak "error line not in breaker format : $line"
        unless defined($id) && defined($tag) && defined($value);

    return +{ identifier => $id, tag => $tag, value => $value };
}

sub parse {
    my ( $self, $file, $format ) = @_;

    my $tags = $self->tags // $self->scan_tags($file);
    $format = $format // 'Table';

    my $importer = Catmandu->importer( 'Text', file => $file );
    my $exporter = Catmandu->exporter( 'Stat', fields => $tags, as => $format );

    my $rec     = {};
    my $prev_id = undef;

    my $it = $importer;

    if ( $self->verbose ) {
        $it = $importer->benchmark();
    }

    $it->each(
        sub {
            my $line = $_[0]->{text};

            my $brk   = $self->from_breaker($line);
            my $id    = $brk->{identifier};
            my $tag   = $brk->{tag};
            my $value = $brk->{value};

            if ( defined($prev_id) && $prev_id ne $id ) {
                $exporter->add($rec);
                $rec = {};
            }

            $rec->{_id} = $id;

            if ( exists $rec->{$tag} ) {
                my $prev
                    = ref( $rec->{$tag} ) eq 'ARRAY'
                    ? $rec->{$tag}
                    : [ $rec->{$tag} ];
                $rec->{$tag} = [ @$prev, $value ];
            }
            else {
                $rec->{$tag} = $value;
            }

            $prev_id = $id;
        }
    );
    $exporter->add($rec);

    $exporter->commit;
}

sub scan_tags {
    my ( $self, $file ) = @_;

    my $tags = {};
    my $io   = Catmandu::Util::io($file);

    print STDERR "Scanning:\n" if $self->verbose;
    my $n = 0;
    while ( my $line = $io->getline ) {
        $n++;
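        # chomp removes only a trailing newline; chop would eat a data
        # character when the last line has no newline
        chomp($line);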

        print STDERR "..$n\n" if ( $self->verbose && $n % 1000 == 0 );

        my $brk = $self->from_breaker($line);
        my $tag = $brk->{tag};
        $tags->{$tag} = 1;

        # stop once maxscan lines have been scanned
        last if ( $self->maxscan > 0 && $n >= $self->maxscan );
    }

    $io->close;

    return join( ",", sort keys %$tags );
}

1;

__END__

=pod

=head1 NAME

Catmandu::Breaker - Package that exports data in a Breaker format

=head1 SYNOPSIS

  # From the command line

  # Using the default breaker
  $ catmandu convert JSON to Breaker < data.json

  # Break an OAI-PMH harvest
  $ catmandu convert OAI --url http://biblio.ugent.be/oai to Breaker

  # Using a MARC breaker
  $ catmandu convert MARC to Breaker --handler marc < data.mrc

  # Using an XML breaker and also writing a list of unique record fields
  $ catmandu convert XML --path book to Breaker --handler xml --fields data.fields < t/book.xml > data.breaker

  # Find the usage statistics of fields in the XML file above
  $ catmandu breaker data.breaker

  # Use the list of unique fields in the report
  $ catmandu breaker --fields data.fields data.breaker

  # verbose output
  $ catmandu breaker -v data.breaker

  # The breaker command needs to know the unique fields in the dataset to build statistics.
  # By default it will scan the whole file for fields. This can be a very
  # time consuming process. With --maxscan one can limit the number of lines
  # in the breaker file that can be scanned for unique fields
  $ catmandu breaker -v --maxscan 1000000 data.breaker

  # Alternatively the fields option can be used to specify the unique fields
  $ catmandu breaker -v --fields 245a,022a data.breaker

  $ cat data.breaker | cut -f 2 | sort -u > data.fields
  $ catmandu breaker -v --fields data.fields data.breaker

  # Export statistics as CSV. See Catmandu::Exporter::Stat for supported formats.
  $ catmandu breaker --as CSV data.breaker


=head1 DESCRIPTION

Inspired by the article "Metadata Analysis at the Command-Line" by Mark Phillips
(L<http://journal.code4lib.org/articles/7818>), this exporter breaks metadata records
into the Breaker format, which can be analyzed further with command-line tools.

=head1 BREAKER FORMAT

When breaking an input using 'catmandu convert {format} to Breaker', each metadata
field gets transformed into a 'breaker' format:

   <record-identifier><tab><metadata-field><tab><metadata-value><tab><metadata-value>...

For the default JSON breaker the input format is broken down into JSON-like paths. E.g.
when given this YAML input:

    ---
    name: John
    colors:
       - black
       - yellow
       - red
    institution:
       name: Acme
       years:
          - 1949
          - 1950
          - 1951
          - 1952

the breaker command 'catmandu convert YAML to Breaker < file.yml' will generate:

    1 colors[]  black
    1 colors[]  yellow
    1 colors[]  red
    1 institution.name  Acme
    1 institution.years[] 1949
    1 institution.years[] 1950
    1 institution.years[] 1951
    1 institution.years[] 1952
    1 name  John

The first column is a counter for each record (or the content of the _id field when present).
The second column provides a JSON path to the data (with the array-paths translated to []).
The third column is the field value.
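
As a rough illustration of the three columns, the following core-Perl sketch
regroups breaker lines into one hash per record. It assumes tab-separated
columns and input sorted by identifier. C<read_breaker> is a hypothetical
helper written for this example; it is not part of this module's API.

    use strict;
    use warnings;

    # Hypothetical helper: turn breaker lines into a list of record hashes.
    sub read_breaker {
        my (@lines) = @_;
        my ( @records, $rec, $prev_id );
        for my $line (@lines) {
            chomp $line;

            # At most three columns, so tabs inside the value survive
            my ( $id, $tag, $value ) = split /\t/, $line, 3;
            next unless defined $id && defined $tag;
            $value //= '';

            # A new identifier starts a new record
            if ( !defined $prev_id || $prev_id ne $id ) {
                $rec = { _id => $id };
                push @records, $rec;
            }

            # Every tag maps to an array, so repeated tags keep all values
            push @{ $rec->{$tag} }, $value;
            $prev_id = $id;
        }
        return @records;
    }

    my @records = read_breaker(
        "1\tcolors[]\tblack\n",
        "1\tcolors[]\tyellow\n",
        "1\tname\tJohn\n",
    );
    printf "%s has %d colors\n", $_->{name}[0], scalar @{ $_->{colors} }
        for @records;

Unlike the C<parse> method of this module, the sketch always stores values in
an array, which keeps the calling code free of single-value special cases.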

One can use this output in combination with Unix tools like C<grep>, C<sort>, C<cut>, etc. to
inspect the breaker output:

    $ catmandu convert YAML to Breaker < file.yml | grep 'institution.years'

For some input formats, like MARC, the JSON-path format says little about
which fields are present in the record, because the field names are part of the data.
In that case a special C<handler> can be used to create a more informative breaker
output.

For instance, without a special handler:

    $ catmandu convert MARC to Breaker < t/camel.usmarc
    fol05731351   record[][]  LDR
    fol05731351   record[][]  _
    fol05731351   record[][]  00755cam  22002414a 4500
    fol05731351   record[][]  001
    fol05731351   record[][]  _
    fol05731351   record[][]  fol05731351
    fol05731351   record[][]  082
    fol05731351   record[][]  0
    fol05731351   record[][]  0
    fol05731351   record[][]  a

With the special L<marc handler|Catmandu::Exporter::Breaker::Parser::marc>:

    $ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc

    fol05731351   LDR 00755cam  22002414a 4500
    fol05731351   001 fol05731351
    fol05731351   003 IMchF
    fol05731351   005 20000613133448.0
    fol05731351   008 000107s2000    nyua          001 0 eng
    fol05731351   010a     00020737
    fol05731351   020a  0471383147 (paper/cd-rom : alk. paper)
    fol05731351   040a  DLC
    fol05731351   040c  DLC
    fol05731351   040d  DLC

For the L<Catmandu::PICA> tools a L<pica handler|Catmandu::Exporter::Breaker::Parser::pica> is available.

For the L<Catmandu::MAB2> tools a L<mab handler|Catmandu::Exporter::Breaker::Parser::mab> is available.

For the L<Catmandu::XML> tools an L<xml handler|Catmandu::Exporter::Breaker::Parser::xml> is available:

    $ catmandu convert XML --path book to Breaker --handler xml < t/book.xml

=head1 BREAKER STATISTICS

Statistical information can be calculated from a breaker output using the
'catmandu breaker' command:

    $ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc > data.breaker
    $ catmandu breaker data.breaker

    | name | count | zeros | zeros% | min | max | mean | median | mode   | variance | stdev | uniq%| entropy |
    |------|-------|-------|--------|-----|-----|------|--------|--------|----------|-------|------|---------|
    | 001  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 100  | 3.3/3.3 |
    | 003  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 0.0/3.3 |
    | 005  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 100  | 3.3/3.3 |
    | 008  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 100  | 3.3/3.3 |
    | 010a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 100  | 3.3/3.3 |
    | 020a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 90   | 3.3/3.3 |
    | 040a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 0.0/3.3 |
    | 040c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 0.0/3.3 |
    | 040d | 5     | 5     | 50.0   | 0   | 1   | 0.5  | 0.5    | [0, 1] | 0.25     | 0.5   | 10   | 1.0/3.3 |

The output table provides statistical information on the usage of fields in the
original format. We see that the C<001> field was counted 10 times in the data set,
but the C<040d> field is present only 5 times. The C<020a> field is absent from 10% (zeros%)
of the records. The C<001> field has only unique values (entropy is maximum), but all C<040c>
fields contain the same information (entropy is minimum).
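
The entropy column can be read as observed/maximum. The figures in the table
are consistent with Shannon entropy in bits, where a missing field counts as one
more value. The core-Perl sketch below makes that assumption explicit; it is
not taken from L<Catmandu::Exporter::Stat>, and C<shannon_entropy> is a
hypothetical helper.

    use strict;
    use warnings;

    # Hypothetical helper: Shannon entropy (in bits) of a list of values.
    sub shannon_entropy {
        my (@values) = @_;
        return 0 unless @values;

        # Count how often each distinct value occurs
        my %freq;
        $freq{$_}++ for @values;

        my $h = 0;
        for my $n ( values %freq ) {
            my $p = $n / @values;
            $h -= $p * log($p) / log(2);
        }
        return $h;
    }

    # Ten distinct values, as for field 001: the maximum, log2(10)
    printf "%.1f\n", shannon_entropy( 1 .. 10 );

    # Ten identical values, as for field 040c: the minimum
    printf "%.1f\n", shannon_entropy( ('DLC') x 10 );

    # Five values present and five missing, as for field 040d
    printf "%.1f\n", shannon_entropy( ('DLC') x 5, ('') x 5 );

Under these assumptions the three calls print 3.3, 0.0 and 1.0, in line with
the entropy values shown for C<001>, C<040c> and C<040d>.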

See L<Catmandu::Exporter::Stat> for more information about the statistical fields
and supported output formats.
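
The same report can also be produced from Perl through the C<parse> method of
this module. This is a minimal sketch, assuming that L<Catmandu::Exporter::Stat>
is installed and that a file named C<data.breaker> exists in the current
directory; the option values are only examples.

    use strict;
    use warnings;
    use Catmandu::Breaker;

    my $breaker = Catmandu::Breaker->new(
        verbose => 1,          # progress messages on STDERR
        maxscan => 1000000,    # limit the lines scanned for unique fields
    );

    # The second argument is the output format; it defaults to 'Table'.
    # The call is skipped when the assumed input file is not there.
    $breaker->parse( 'data.breaker', 'CSV' ) if -e 'data.breaker';

Passing C<< tags => '245a,022a' >> to the constructor supplies the field list
as a comma-separated string, in which case C<parse> does not scan the file
for unique fields first.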

=head1 MODULES

=over

=item * L<Catmandu::Exporter::Breaker>

=item * L<Catmandu::Cmd::breaker>

=back

=head1 SEE ALSO

L<Catmandu>, L<Catmandu::MARC>, L<Catmandu::XML>, L<Catmandu::Stat>

=head1 AUTHOR

Patrick Hochstenbach, C<< <patrick.hochstenbach at ugent.be> >>

=head1 CONTRIBUTORS

Jakob Voss, C<< nichtich at cpan.org >>

Johann Rolschewski, C<< jorol at cpan.org >>

=cut