Yet Another CPAN Grep

Lingua-Orthon/lib/Lingua/Orthon.pm

package Lingua::Orthon;
use 5.006;
use strict;
use warnings FATAL => 'all';
use Carp qw(croak);
use List::AllUtils qw(any);
use Number::Misc qw(is_numeric);
use Statistics::Lite qw(mean);
use String::Util qw(hascontent nocontent);
use Unicode::Collate;

$Lingua::Orthon::VERSION = '0.03';

=pod

=encoding CP-1252

=head1 NAME

Lingua-Orthon - Orthographic similarity of string to one or more others by Coltheart's N and related measures

=head1 VERSION

This is documentation for B<Version 0.03> of Lingua::Orthon.

=head1 SYNOPSIS

 use Lingua::Orthon 0.03;
 my $orthon = Lingua::Orthon->new();
 my $bool = $orthon->are_orthons('BANG', 'BARN'); # 0
 $bool = $orthon->are_orthons('BANG', 'BONG'); # 1
 my $idx = $orthon->index_diff('BANK', 'BARK'); # 2
 my $count = $orthon->index_identical('BANG', 'BARN'); # 2
 my (@diff) = $orthon->char_diff('BANG', 'BONG'); # (qw/A O/)
 $count = $orthon->onc(
    test => 'BANG',
    sample => [qw/BAND COCO BING RANG BONG SONG/]); # 4
 my $aref = $orthon->list_orthons(
    test => 'BANG',
    sample => [qw/BAND COCO BING RANG BONG SONG/]); # BAND, BING, RANG, BONG
 $count = $orthon->levenshtein('BANG', 'BARN'); # 2
 my $float = $orthon->old(
    test => 'BANG',
    sample => [qw/BAND COCO BING RANG BONG SONG/]); # ~= 1.67

=head1 DESCRIPTION

Lingua-Orthon provides measures of similarity of character strings based on their orthographic identity, as relevant to psycholinguistic research. Case- and mark-sensitivity for determining character equality can be controlled. Wraps to Levenshtein Distance methods, extended to the OLD-20 metric, are provided for convenience of comparison. No methods are explicitly exported; all methods are called in the object-oriented way.

=head1 SUBROUTINES/METHODS

=head2 new

 my $ortho = Lingua::Orthon->new();

Constructs/returns class object for accessing other methods.

Optionally, set the argument B<match_level> to an integer value ranging from 0 to 3 to control case- and mark-sensitivity. See L<set_eq|Lingua::Orthon/set_eq>.

=cut

sub new {
    my ( $class, %args ) = @_;
    my $self = {};
    bless $self, $class;
    $self->set_eq( match_level => $args{'match_level'} );
    return $self;
}

=head2 are_orthons

 $bool = $orthon->are_orthons('String1', 'String2');

Returns 0 or 1 (Coltheart's Boolean) if two given strings are orthographic neighbours by a 1-mismatch I<substitution>: i.e., the strings are of the same size (are equal in character count) and there is only one discrepancy between them by a single substitution of a character in the same ordinal position (no additions, deletions or transpositions). So I<BANG> and I<BAND> are orthons by this measure (they differ only in the final letter), but I<BANG> and I<BRANG> are not (the letter I<R> is an I<addition> to I<BANG> via I<BRANG>, or a I<deletion> from I<BRANG> to I<BANG>).

I<Identical strings>: If two identical letter strings are given (I<BANG>, I<BANG>), they are defined as I<not> being orthons: the number of index identical characters must be at least one less than the length of the string(s).

I<Case-sensitivity>: By default, identity is defined case-insensitively; e.g., I<Bang> and I<bang>, and I<BaNG> and I<bAnd> are orthons. However, if B<match_level> has been set (in L<new|Lingua::Orthon/new> or L<set_eq|Lingua::Orthon/set_eq>) to a higher level than 1 (or as undef or 0), then case is respected; e.g., I<Bang> and I<bang> are orthons, but I<Bang> and I<bing> are NOT orthons (they involve substituting both the I<B>s and the second letters (I<a> and I<i>) ... but I<BaNG> and I<BiNG>, or I<BaNG> and I<BING>, are orthons. (This usefully applies to putting L<Coltheart's I<N>|Lingua::Orthon/onc, coltheart_n> (the sum of single-substitution orthons a string has within a lexicon) to questions of the featural versus lexical basis of neighbourhood effects).

See Coltheart et al. (1977) (in L<REFERENCES|Lingua::Orthon/REFERENCES>). The measure is computationally simple and economical, relative to other measures, such as based on a wider array of edit-types (e.g., Levenshtein Distance), that, while having greater explanatory power (Yarkoni et al., 2008), can tax resources on the order of days to effectively compute for a single string relative to a humanly memorable corpus.

=cut

sub are_orthons {
    my ( $self, $w1, $w2 ) = @_;
    return _are_orthons( $w1, $w2, $self->{'_EQ'} );
}

=head2 index_identical

 $count = $orthon->index_identical('String1', 'String2');

Returns a count: the number of letters that are identical and in the same serial position among two given letter-strings.

For example, given I<BANG> and I<BARN>, 2 is returned for the two index-identical letters, I<B> and I<A>; I<N> is in both strings, but it is ignored as it is the third letter in I<BANG> but the fourth letter in I<BARN>, and so not in the same serial position across the two words.

=cut

sub index_identical {
    my ( $self, $w1, $w2 ) = @_;
    return _index_identical( $w1, $w2, $self->{'_EQ'} );
}

=head2 index_diff

 $posint = $orthon->index_diff('String1', 'String2');

Assuming the two strings are single-substitution orthons, returns the single index (anchored at zero) at which their letters differ.  So if the two strings are "bring" and "being", the returned value is 1.

=cut

sub index_diff {
    my ( $self, $w1, $w2 ) = @_;
    my $idx = 0;
    for my $i ( 0 .. _smallest_len( $w1, $w2 ) - 1 ) {
        if ( not $self->{'_EQ'}->( map { substr $_, $i, 1 } ( $w1, $w2 ) ) ) {
            $idx = $i;
            last;
        }
    }
    return $idx;
}

=head2 char_diff

 @ari = $orthon->char_diff('String1', 'String2');

Returns a list of the first two characters (letters) that, reading from left to right, differ between two given strings. If the strings are single-substitution orthons, these are the characters that make them so. So if the two strings are "bring" and "being", the returned list is ('r', 'e') - the order of these characters in the returned list respecting the order of the given strings. The search across the strings terminates as soon there is a mismatch; otherwise, it continues only for as long as the length of the shortest string.

The identity match (or mismatch) is sensitive to the setting of the equality function per case and marks; see L<set_eq|Lingua::Orthon/set_eq>.

=cut

sub char_diff {
    my ( $self, $w1, $w2 ) = @_;
    my @ds = ();
    for my $i ( 0 .. _smallest_len( $w1, $w2 ) - 1 ) {
        my @tmp = map { substr $_, $i, 1 } ( $w1, $w2 );
        if ( not $self->{'_EQ'}->(@tmp) ) {
            @ds = @tmp;
            last;
        }
    }
    return @ds;
}

=head2 onc, coltheart_n

 $int = $orthon->onc(test => CHARSTR, sample => AREF); 

Returns the I<orthographic neighbourhood count> (ONC), a.k.a. Coltheart's I<N>: the number of single-letter substitution orthons a particular string has with respect to a list of strings (or "lexicon") (Coltheart et al., 1977). So I<bat> has two orthons (I<bad> and I<cat>) in the list (I<bad>, I<bed>, I<cat> and I<day>).

=cut

sub onc {
    my ( $self, %args ) = @_;
    my $test_str =
      hascontent( $args{'test'} )
      ? $args{'test'}
      : croak 'Need a single character string to test for orthons';
    my $sample_aref =
      ref $args{'sample'}
      ? $args{'sample'}
      : croak
      'Need a list (aref) of character-strings to sample for orthon listing';
    my $count = 0;
    for ( 0 .. ( scalar @{$sample_aref} - 1 ) ) {
        if ( _are_orthons( $test_str, $sample_aref->[$_], $self->{'_EQ'} ) ) {
            $count++;
        }
    }
    return $count;
}
*coltheart_n = \&index_indentical;

=head2 list_orthons

 $aref = $orthon->list_orthons(test => CHARSTR, sample => AREF);
 
Returns a reference to an array of single-substitution orthographic neighbours of a given B<test> character-string that are among a given list of B<sample> character-strings. The referenced is to an empty array if no orthons are found. The order of items in the returned array follows that in which they appear in the B<sample>.

=cut

sub list_orthons {
    my ( $self, %args ) = @_;
    my $test_str =
      hascontent( $args{'test'} )
      ? $args{'test'}
      : croak 'Need a single character string to test for orthons';
    my $sample_aref =
      ref $args{'sample'}
      ? $args{'sample'}
      : croak
      'Need a list (aref) of character-strings to sample for orthon listing';
    my @orthon_list = ();
    for ( 0 .. ( scalar @{$sample_aref} - 1 ) ) {
        if ( _are_orthons( $test_str, $sample_aref->[$_], $self->{'_EQ'} ) ) {
            push @orthon_list, $sample_aref->[$_];
        }
    }
    return \@orthon_list;
}

=head2 ldist, levenshtein

 $count = $orthon->ldist('String1', 'String2'); # minimal, strings will be lower-cased

Returns the Levenshtein Distance between two given letter strings, wrapping to various Perl module's that more or less implement the Levenshtein algorithm for efficiency and case-sensitivity. Specifically, if the match level has been set at 1 (to ignore case and diacritics), the method uses L<Text::Levenshtein::distance|Text::Levenshtein/distance> (which offers "ignoring diacritics"); otherwise, it uses L<Text::Levenshtein::XS::distance|Text::Levenshtein::XS/distance> to ignore case but not marks (given present limitations of this module). The required case- and mark-sensitivity are set in the L<new|Lingua::Orthon/new> or L<set_eq|Lingua::Orthon/set_eq> methods. By default, the match is made case- and mark-I<in>sensitively (by canned Perl L<eq|perlfunc/eq>).

=cut

sub ldist {
    my ( $self, $w1, $w2 ) = @_;
    my $ldist;
    if ( $self->{'_MATCH_LEVEL'} == 1 ) {
        require Text::Levenshtein;
        $ldist =
          Text::Levenshtein::distance( $w1, $w2, { ignore_diacritics => 1 } )
          ;    # also ignores case
    }
    else {
        require Text::Levenshtein::XS;
        if ( $self->{'_MATCH_LEVEL'} == 2 ) {
            ( $w1, $w2 ) = map { lc } ( $w1, $w2 );  # ignore case but not marks
        }
        $ldist = Text::Levenshtein::XS::distance( $w1, $w2 )
          ;    # ignores nothing on its own
    }
    return $ldist;
}

=head2 old

 $mean = $orthon->old(test => CHARSTR, sample => AREF, lim => INT);

Returns the mean orthographic Levenshtein distance (OLD) of the smallest B<lim> such edit distances for a given B<test> string to all the strings in a B<sample> list. Based on Yarkoni et al. (2008), where, with the value of B<lim> is set to 20, the measure substantially contributes to prediction of performance in word recognition tasks. Here, if B<lim> is not defined, not numeric, or greater than the size of the B<sample>, then it is set by default to the size of the sample.

Levenshtein distance is calculated per the method L<ldist|Lingua::Orthon/ldist>, wrapping to external modules with respect to the conditions of string equality set in L<new|Lingua::Orthon/new> or L<set_eq|Lingua::Orthon/set_eq>. Different settings lead to different speed of calculation. The slowest calculation (by far) occurs if B<match_level> => 1 so that case- and mark-insensitive matching occurs; this relies on the pure Perl implementation in Text::Levenshtein with its argument B<ignore_diacritics> => 1. The fastest calculation (the default) occurs by setting B<match_level> => 3, when exact characters are matched, e.g., I<B> in the test-string and I<b> in a sample-string at the same index across them are taken as unequal and so will count as a substitution. This relies on the C-implementation in Text::Levenshtein::XS. Ignore case but not marks with B<match_level> => 2.

=cut

sub old {
    my ( $self, %args ) = @_;
    my $test_str =
      hascontent( $args{'test'} )
      ? $args{'test'}
      : croak 'Need a single character string to calculate OLD';
    my $sample_aref =
      ref $args{'sample'}
      ? $args{'sample'}
      : croak 'Need a list (aref) of character-strings to calculate OLD';
    my @ldists = ();
    for ( 0 .. ( scalar @{$sample_aref} - 1 ) ) {
        push @ldists, $self->ldist( $test_str, $sample_aref->[$_] );
    }
    my $lim =
      ( is_numeric( $args{'lim'} ) and $args{'lim'} <= scalar @ldists )
      ? $args{'lim'}
      : scalar @ldists;
    return mean( ( sort { $a <=> $b } @ldists )[ 0 .. int $lim - 1 ] )
      ;    # mean of first/smallest $lim-th values
}

=head2 set_eq

 $orthon->set_eq(match_level => INT); # undef, 0, 1, 2 or 3

Sets the string-matching level used in the above methods. This is called implicitly in L<new|Lingua::Orthon/new> when given a B<match_level>, or with the default value of 0. This is adopted and slightly adapted from how L<Text::Levenshtein|Text::Levenshtein> controls for case/diacritic-sensitive matching.

=over 4

=item match_level = undef, 0

Match with respect to case and diacritics: same as B<3> but simply by Perl's eq. So, e.g., I<�clair> and I<eclair> would be taken as non-identical, just as would I<Eclair> and I<eclair>.

This is the fastest option. The higher levels, as follow, use the C<eq>() function in L<Unicode::Collate|Unicode::Collate>.

=item match_level = 1

Match ignoring case and diacritics: I<ber> to I<B�Z> involves 1 edit (from I<r> to I<Z> only)
 
=item match_level = 2

Match ignoring case but respect diacritics: "ber" to "B�Z" involves 2 edits (the "er" to "�Z") 

=item match_level = 3

Match with respect to case and diacritics: "ber" to "B�Z" involves 3 edits (of all its letters)

=back

So, for example, if the test string is "abb�", it could be picked up as having the single-substitution orthographic neighbour "able" if the match level is 1, but not if it is 0, 2 or 3.

=cut

sub set_eq {
    my ( $self, %args ) = @_;
    my $match_level_arg = $args{'match_level'};
    if ( nocontent($match_level_arg) or $match_level_arg == 0 ) {
        $self->{'_MATCH_LEVEL'} = 0;
        $self->{'_EQ'} = sub { return $_[0] eq $_[1] };
    }
    elsif ( any { $match_level_arg == $_ } ( 1 .. 3 ) ) {
        $self->{'_MATCH_LEVEL'} = $match_level_arg;
        my $collator = Unicode::Collate->new(
            normalization => undef,
            level         => $match_level_arg
        );
        $self->{'_EQ'} = sub {
            return $collator->eq(@_);
        };
    }
    else {
        croak "Invalid value '$match_level_arg' given as a match level";
    }
    return;
}

# private methods

sub _smallest_len {
    my @strs = @_;
    return ( sort { $a <=> $b } map { length } @strs )[0];
}

sub _are_orthons {
    my ( $w1, $w2, $eq_fn ) = @_;
    return 0 if length $w1 != length $w2;
    return _index_identical( $w1, $w2, $eq_fn ) == ( length $w1 ) - 1;
}

sub _index_identical {
    my ( $w1, $w2, $eq_fn ) = @_;
    my $count = 0;
    for my $i ( 0 .. _smallest_len( $w1, $w2 ) - 1 ) {
        if ( $eq_fn->( map { substr $_, $i, 1 } ( $w1, $w2 ) ) ) {
            $count++;
        }
    }
    return $count;
}

=head1 DIAGNOSTICS

=over 4

=item Invalid value '...' given as a match level

Argument B<match_level> in new() or set_eq() needs to be an integer in range 0 .. 3, or undefined.

=item Need a single character string to test for orthons

Argument B<test> for calculating ONC and OLD, and listing orthons, needs to be defined and not empty.

=item Need a single character string to test for orthons

Argument B<sample> should reference an array of character-strings when calculating ONC and OLD, and listing orthons.

=back

=head1 REFERENCES

Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), I<Attention and performance> (Vol. 6, pp. 535-555). London, UK: Academic.

Yarkoni, T., Balota, D. A., & Yap, M. (2008). Moving beyond Coltheart's I<N>: A new measure of orthographic similarity. I<Psychonomic Bulletin and Review>, I<15>, 971-979. doi: L<10.3758/PBR.15.5.971|http://dx.doi.org/10.3758/PBR.15.5.971>.

=head1 DEPENDENCIES

L<List::AllUtils|List::AllUtils>

L<Number::Misc|Number::Misc>

L<Statistics::Lite|Statistics::Lite>

L<String::Util|String::Util>

L<Text::Levenshtein|Text::Levenshtein>

L<Text::Levenshtein::XS|Text::Levenshtein::XS>

L<Unicode::Collate|Unicode::Collate>

=head1 AUTHOR

Roderick Garton, C<< <rgarton at cpan.org> >>

=head1 SEE ALSO

L<String::LCSS_XS|String::LCSS_XS>

L<String::Similarity|String::Similarity>

L<Text::Abbrev|Text::Abbrev>

=head1 BUGS AND LIMITATIONS

Please report any bugs or feature requests to C<bug-Lingua-Orthon-0.03 at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Orthon-0.03>.  I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Lingua::Orthon

You can also look for information at:

=over 4

=item * RT: CPAN's request tracker (report bugs here)

L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Orthon-0.03>

=item * AnnoCPAN: Annotated CPAN documentation

L<http://annocpan.org/dist/Lingua-Orthon-0.03>

=item * CPAN Ratings

L<http://cpanratings.perl.org/d/Lingua-Orthon-0.03>

=item * Search CPAN

L<http://search.cpan.org/dist/Lingua-Orthon-0.03/>

=back

=head1 LICENSE AND COPYRIGHT

Copyright 2011-2018 Roderick Garton.

This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.

See L<http://dev.perl.org/licenses/> for more information.

=cut

1;    # End of Lingua::Orthon