Data-Validate-CSV/lib/Data/Validate/CSV.pm

use v5.12;
use strict;
use warnings;

package Data::Validate::CSV;

our $AUTHORITY = 'cpan:TOBYINK';
our $VERSION   = '0.003';

use Data::Validate::CSV::Cell;
use Data::Validate::CSV::Column;
use Data::Validate::CSV::MultiValueCell;
use Data::Validate::CSV::Note;
use Data::Validate::CSV::Row;
use Data::Validate::CSV::Schema;
use Data::Validate::CSV::SingleValueCell;
use Data::Validate::CSV::Table;
use Data::Validate::CSV::Types;

1;

__END__

=pod

=encoding utf-8

=head1 NAME

Data::Validate::CSV - read and validate CSV

=head1 SYNOPSIS

CSV Schema (JSON):

  {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "tableSchema": {
      "columns": [{
        "name": "country",
        "datatype": { "base": "string", "length": 2 }
      },{
        "name": "country group",
        "datatype": "string"
      },{
        "name": "name (en)",
        "datatype": "string"
      },{
        "name": "name (fr)",
        "datatype": "string"
      },{
        "name": "name (de)",
        "datatype": "string"
      },{
        "name": "latitude",
        "datatype": { "base": "number", "maximum": 90, "minimum": -90 }
      },{
        "name": "longitude",
        "datatype": { "base": "number", "maximum": 180, "minimum": -180 }
      }]
    }
  }

CSV Data:

  "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
  "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
  "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

Perl:

  use Path::Tiny qw(path);
  use Data::Validate::CSV;
  
  my $table = Data::Validate::CSV::Table->new(
    schema     => path('countries.csv-metadata.json'),
    input      => path('countries.csv'),
    has_header => !!0,
  );
  
  while (my $row = $table->get_row) {
    for my $e (@{$row->errors}) {
      warn $e;
    }
    printf(
      "%s is at latitude %f, longitude %f.\n",
      $row->get("name (en)")->value,
      $row->get("latitude")->value,
      $row->get("longitude")->value,
    );
  }

=head1 DESCRIPTION

There's not really a lot of documentation right now.

Mostly there are three interfaces you need to know about: tables, rows,
and cells. (There are also columns, schemas, and notes, but for most
day-to-day usage, those can be considered internal implementation
details.)

=head2 Table interface

The table is constructed with the following attributes:

=over

=item C<< schema >>

A schema for the table. Can be a hashref, a JSON string, a scalar ref to
a JSON string, or a L<Path::Tiny> path to a file containing the schema.

=item C<< input >>

The CSV data for the table. Can be a filehandle, a scalar ref to a string
of data, or a L<Path::Tiny> path to a file.

=item C<< has_header >>

A boolean indicating whether the CSV contains a header row. If it does,
the header row is used to supply any column names missing from the schema,
and is not returned by C<get_row>.

=item C<< reader >>

A coderef which, given a filehandle, returns a parsed line of CSV.
The default is roughly equivalent to:

  sub { Text::CSV_XS->new->getline($_[0]) }

That's probably sufficient for most cases, but you may need to supply your
own reader for handling tab-delimited files.

=item C<< skip_rows >>

An integer giving the number of additional rows to skip I<before> the header.
Some CSV files contain a title or credit line. Defaults to 0.

=item C<< skip_rows_after_header >>

An integer giving the number of additional rows to skip I<after> the header.
Defaults to 0.

=back
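
As a minimal sketch of supplying a custom C<reader>, the following reads
tab-delimited data that starts with a credit line. The sample data and
column names are hypothetical, and the schema hashref is assumed to mirror
the JSON structure shown in the synopsis. C<sep_char>, C<binary>, and
C<auto_diag> are standard L<Text::CSV_XS> options.

  use strict;
  use warnings;
  use Text::CSV_XS;
  use Data::Validate::CSV;

  # Hypothetical tab-delimited data: a credit line, a header, two rows.
  my $data = join "\n",
    "Source: example data",
    "country\tlatitude",
    "at\t47.6965545",
    "be\t50.501045",
    "";

  # Reuse one parser, configured for tabs instead of commas.
  my $tsv = Text::CSV_XS->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

  my $table = Data::Validate::CSV::Table->new(
    schema     => { tableSchema => { columns => [
      { name => 'country',  datatype => 'string' },
      { name => 'latitude', datatype => 'number' },
    ] } },
    input      => \$data,                        # scalar ref to a string
    has_header => !!1,
    skip_rows  => 1,                             # skip the credit line
    reader     => sub { $tsv->getline($_[0]) },
  );

  while (my $row = $table->get_row) {
    printf "%s: %s\n",
      $row->get('country')->value,
      $row->get('latitude')->value;
  }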

The table provides the following methods:

=over

=item C<< get_row >>

Returns a row object for the next row of the table.

=item C<< all_rows >>

Gets all the rows as a list.

=item C<< row_count >>

The number of non-skipped, non-header lines read so far.

=back
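
The table methods can be combined to read a whole file and summarise it.
This sketch reuses the files from the synopsis. It assumes that
C<all_rows> reads every remaining row, so that C<row_count> afterwards
reflects the full table.

  use strict;
  use warnings;
  use Path::Tiny qw(path);
  use Data::Validate::CSV;

  my $table = Data::Validate::CSV::Table->new(
    schema     => path('countries.csv-metadata.json'),
    input      => path('countries.csv'),
    has_header => !!0,
  );

  # Read everything, then pick out the rows that failed validation.
  my @rows    = $table->all_rows;
  my @invalid = grep { @{ $_->errors } } @rows;

  printf "Read %d rows, %d with errors.\n",
    $table->row_count, scalar @invalid;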

=head2 Row interface

The rows returned by C<get_row> and C<all_rows> are blessed objects.
They provide the following methods:

=over

=item C<< raw_values >>

The values returned by L<Text::CSV_XS> without any further processing.

=item C<< values >>

The values returned by L<Text::CSV_XS>, processed by datatype. Date and
time datatypes will be reformatted from any CLDR-based format to ISO 8601.
Booleans using non-standard representations will be changed to "1" and "0".
Fields that have a separator defined will be split into an arrayref.
Numbers given as percentages will be divided by 100. And so forth.

=item C<< cells >>

Returns the same values as C<values> but wrapped in cell objects. The
following are equivalent:

  $row->values->[0];
  $row->cells->[0]->value;
  $row->[0];  # $row overloads @{}

Why fetch a cell instead of directly fetching the value? The cell object
offers a few other useful methods.

=item C<< get($name) >>

Gets a single cell from the row by its name. Names are defined in the
schema, or the header row if missing from the schema.

  $row->get("country")->value;

=item C<< row_number >>

The row number for this row in the table. Rows are numbered starting at
1. Headers and skipped rows are not counted.

=item C<< key_string >>

For tables that have a primary key, this returns a string formed by joining
together the primary key columns. It ought to be a unique identifier for this
row within the table, and if it is not, this will be raised as an error.

=item C<< errors >>

An arrayref of error strings associated with this row. This includes
data validation problems.

=back
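
A short sketch of the row methods, again using the files from the
synopsis. It assumes C<raw_values> returns an arrayref in the same way
that C<values> does; the column positions follow the example schema.

  use strict;
  use warnings;
  use Path::Tiny qw(path);
  use Data::Validate::CSV;

  my $table = Data::Validate::CSV::Table->new(
    schema     => path('countries.csv-metadata.json'),
    input      => path('countries.csv'),
    has_header => !!0,
  );

  while (my $row = $table->get_row) {
    my $n = $row->row_number;

    # Report validation problems together with the row they belong to.
    warn "row $n: $_\n" for @{ $row->errors };

    # Fetch one cell by column name and one value by position.
    my $name    = $row->get('name (en)')->value;
    my $country = $row->[0];                 # via the @{} overload

    # Column 6 (index 5) is latitude; compare raw and processed forms.
    printf "row %d: %s = %s (latitude raw '%s', processed '%s')\n",
      $n, $country, $name,
      $row->raw_values->[5], $row->values->[5];
  }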

=head2 Cell interface

You can bypass the cell interface and access cell values directly from
the rows, but if you do work with cell objects, they provide the following
methods:

=over

=item C<< raw_value >>

The value returned by L<Text::CSV_XS> without any further processing.

=item C<< value >>

The value returned by L<Text::CSV_XS>, processed by datatype.

=item C<< inflated_value >>

Like C<value> but inflates some values to blessed objects. Date and time
related datatypes will be returned as L<DateTime>, L<DateTime::Incomplete>,
or L<DateTime::Duration> objects. Booleans will be returned as
L<JSON::PP::Boolean> objects.

=item C<< row_number >>

The row number for the cell's parent row in the table. Rows are numbered
starting at 1. Headers and skipped rows are not counted.

=item C<< col_number >>

The column number of this cell within the parent row. Columns are numbered
starting at 1.

=item C<< datatype >>

The datatype for this cell as a hashref.

=back
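
The cell methods can be explored with a sketch like the following, which
inspects the C<latitude> cell of the first row from the synopsis data.
The exact keys of the C<datatype> hashref are not documented here, so it
is simply dumped with L<Data::Dumper> rather than relying on particular
keys.

  use strict;
  use warnings;
  use Data::Dumper;
  use Path::Tiny qw(path);
  use Data::Validate::CSV;

  my $table = Data::Validate::CSV::Table->new(
    schema     => path('countries.csv-metadata.json'),
    input      => path('countries.csv'),
    has_header => !!0,
  );

  if (my $row = $table->get_row) {
    my $cell = $row->get('latitude');

    # Where the cell sits in the table (both numbers start at 1).
    printf "row %d, column %d\n", $cell->row_number, $cell->col_number;

    # The field as parsed from the CSV, and after datatype processing.
    printf "raw: %s, processed: %s\n", $cell->raw_value, $cell->value;

    # Show whatever datatype information the cell carries.
    local $Data::Dumper::Sortkeys = 1;
    print Dumper($cell->datatype);
  }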

=head1 BUGS

Please report any bugs to
L<http://rt.cpan.org/Dist/Display.html?Queue=Data-Validate-CSV>.

=head1 SEE ALSO

L<https://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/>.

=head1 AUTHOR

Toby Inkster E<lt>tobyink@cpan.orgE<gt>.

=head1 COPYRIGHT AND LICENCE

This software is copyright (c) 2019 by Toby Inkster.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=head1 DISCLAIMER OF WARRANTIES

THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.