=encoding UTF-8

=head1 NAME

WWW::LinkRot - check web page link rot

=head1 SYNOPSIS

    use WWW::LinkRot;

=head1 VERSION

This documents version 0.02 of WWW-LinkRot
corresponding to L<git commit e07a0ffb766775fc053e9820edf1f874ee40b78c|https://github.com/benkasminbullock/www-linkrot/commit/e07a0ffb766775fc053e9820edf1f874ee40b78c> released on Fri Apr 23 08:30:11 2021 +0900.

=head1 DESCRIPTION

Scan HTML files for links, try to access the links, and make a report.

The HTML files need to be in UTF-8 encoding.

This module is intended for people who run web sites and want to
make, for example, periodic checks over a large number of HTML files:
find all of the external links in those files, then test each link in
that list to make sure that it is still valid.

The reading function is L</get_links> which works on a list containing
file names such as might be created by a module like L<Trav::Dir> or
L<File::Find>. It looks for any C<https?://> links in the files and
makes a list.

The list of links may then be checked for validity using
L</check_links>, which runs the C<get> method of L<LWP::UserAgent> on
each of them and stores the status. It writes a JSON file containing
each link, its status, the location, and the files which contain the
link.

The function L</html_report> generates an HTML representation of the
JSON file.

The function L</replace> is a batch editing function which takes a
list of links and a list of files, then replaces the redirected links
(the ones with status 301 or 302) with their new locations.

=head1 FUNCTIONS

=head2 check_links

    check_links ($links);

Check the links returned by L</get_links> and write to a JSON file
specified by the C<out> option.

    check_links ($links, out => "link-statuses.json");

Usually one would filter the links returned by L</get_links> to remove
things like internal links.
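
The filtering step might look like the following minimal sketch. The
hash here is a hand-written stand-in for the return value of
L</get_links>, and the host name C<www.example.com> is an assumption
standing for your own site.

    use strict;
    use warnings;

    # Hypothetical stand-in for the hash reference returned by
    # get_links: keys are links, values are array references
    # containing the files which contain the link.
    my $links = {
        'https://www.example.com/about.html' => ['index.html'],
        'https://www.example.org/page' => ['index.html', 'news.html'],
        'http://www.example.net/old' => ['news.html'],
    };

    # Assumed pattern for links to our own site, treated as internal.
    my $own = qr!^https?://www\.example\.com(?:[:/]|$)!;

    # Keep only the links which do not point at our own site.
    my %external;
    for my $link (keys %$links) {
        next if $link =~ $own;
        $external{$link} = $links->{$link};
    }

    print "$_\n" for sort keys %external;

The filtered hash reference C<\%external> can then be passed to
C<check_links> in place of C<$links>.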

=head3 Options

=over

=item nook

If this is set to a true value, before running the link checks,
C<check_links> reads in a previous copy of the file specified by the
C<out> option. If the status of a link in that file is C<200>, it
does not try to access the link again but assumes that it is still OK.

This option is useful for the case when one has recently run the job
and then done work on fixing the dead links or moved links, then wants
to check whether the errors were fixed, without checking that all of
the pages are still OK.

=item out

Specify the file to write. This option is required; without it,
C<check_links> fails.

=item verbose

Print messages about what is to be done. Since checking the links
might take a long time, this is sometimes reassuring.

=back

=head3 The user agent

The user agent used by WWW::LinkRot is L<LWP::UserAgent> with the
C<timeout> option set to 5 seconds and the number of redirects set to
zero. If a timeout is not used, C<check_links> may take a very long
time to run. However, some links, such as archive.org links, may take
more than five seconds to respond.

The user agent string sent to the server is C<WWW::LinkRot>.
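
For illustration, a user agent configured as described above could be
built as follows. This is a sketch of the settings only, not the
module's own code, and the link is a hypothetical example.

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Five second timeout, no redirects followed, and the
    # agent string described above.
    my $ua = LWP::UserAgent->new (
        agent => 'WWW::LinkRot',
        timeout => 5,
        max_redirect => 0,
    );

    # Hypothetical link to check.
    my $response = $ua->get ('https://www.example.com/');
    print $response->code (), "\n";

    # Because redirects are not followed, a 301 or 302 is reported
    # as it is, and the new address is in the Location header.
    my $location = $response->header ('Location');
    print "Moved to $location\n" if defined $location;

Setting C<max_redirect> to zero is what makes moved links visible as
301 or 302 statuses rather than as the status of the final page.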

=head2 get_links

    my $links = get_links (\@files);

Given a list of HTML files in C<@files>, extract all the links from
them. The return value C<$links> is a hash reference whose keys are
the links and whose values are array references containing a list of
all the files of C<@files> which contain the link.

This looks for anything of the form C<href="*"> in the files and adds
what is between the quotes to the list of links.
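
The following sketch builds the file list with L<File::Find> and
prints each link with the files which contain it. The directory name
is a hypothetical example. The function is called by its fully
qualified name so that the sketch does not depend on what the module
exports.

    use strict;
    use warnings;
    use File::Find;
    use WWW::LinkRot;

    # Hypothetical directory containing the site's HTML files.
    my $dir = '/home/users/jason/website';

    # Collect the full path of every .htm or .html file under $dir.
    my @files;
    find (sub {
        push @files, $File::Find::name if -f && /\.html?\z/;
    }, $dir);

    my $links = WWW::LinkRot::get_links (\@files);

    # Each key is a link, each value the list of files containing it.
    for my $link (sort keys %$links) {
        print "$link: @{$links->{$link}}\n";
    }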

=head2 html_report

    html_report (in => 'link-statuses.json', out => 'report.html');

Write an HTML report using the JSON output by L</check_links>. The
report consists of header HTML generated by L<HTML::Make::Page>
followed by a table in which each row contains a link, followed by
its status, followed by the pages where it is used.

=head3 Options

=over

=item in

The input JSON file.

=item nofiles

If set to a true value, don't add the final "files" column. For
example this may be used if only checking a single file for dead
links.

=item out

The output HTML file.

=item strip

Part of the file name which needs to be stripped from the file names
to make a URL, like "/home/users/jason/website".

=item url

Part of the URL which needs to be added to the file names to make a
URL, like "https://www.example.com/site".

=back
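
As a sketch of how C<strip> and C<url> might be combined, the
following call uses the example values given above. The file names
C<link-statuses.json> and C<report.html> are the ones used elsewhere
in this document, and the input file is assumed to exist already.

    use strict;
    use warnings;
    use WWW::LinkRot;

    # A file name like /home/users/jason/website/news.html should
    # then be shown in the report under
    # https://www.example.com/site/news.html.
    WWW::LinkRot::html_report (
        in => 'link-statuses.json',
        out => 'report.html',
        strip => '/home/users/jason/website',
        url => 'https://www.example.com/site',
    );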

=head3 The output HTML file

Moved links are coloured pink, and dead links are coloured yellow.

Links are cut down to a maximum length of 100 characters.

=head2 replace

    replace (\%links, \@files, %options);

Make a regex of the links which have C<30*> (redirect) statuses such
as 301 and 302 and which also have a valid C<location>, then go
through C<@files> and replace the links with the new locations.

Options are

=over

=item verbose

Print messages about the links and the files being edited.

=back
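
The general technique of replacing many links with one regex can be
sketched in plain Perl as follows. This is an illustration under
stated assumptions, not the code used by C<replace>, which builds its
regex with L<Convert::Moji>. The map C<%moved> and the HTML string
are hypothetical.

    use strict;
    use warnings;

    # Hypothetical map from redirected links to their new locations.
    my %moved = (
        'http://www.example.com/a' => 'https://www.example.com/a',
        'http://www.example.com/a/b' => 'https://www.example.com/b',
    );

    # Put the longest links first so that a link which is the start
    # of a longer link does not match too early.
    my $alternatives = join '|',
        map { quotemeta ($_) }
        sort { length ($b) <=> length ($a) } keys %moved;
    my $re = qr/($alternatives)/;

    my $html = '<a href="http://www.example.com/a/b">b</a> '
        . '<a href="http://www.example.com/a">a</a>';

    # Replace every matched link with its new location.
    (my $fixed = $html) =~ s/$re/$moved{$1}/g;
    print "$fixed\n";

Sorting by length matters here because the first link is the start of
the second one.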

=head1 DEPENDENCIES

=over

=item L<Convert::Moji>

This is used to make the regex used by L</replace>.

=item L<File::Slurper>

This is used for reading and writing files.

=item L<HTML::Make>

This is used to make the HTML report about the links.

=item L<HTML::Make::Page>

This is used to make the HTML report about the links.

=item L<JSON::Create>

This is used to make the report file about the links.

=item L<JSON::Parse>

This is used to read back the JSON report.

=item L<LWP::UserAgent>

This is used to check the links.

=back

=head1 SEE ALSO

=head2 CPAN

=over

=item L<HTML::LinkExtor>

=item L<HTTP::SimpleLinkChecker>

=item L<WebFetch>

=item L<W3C::LinkChecker>

=item L<WWW::LinkChecker::Internal>

=back

=head2 Other

=over

=item L<Xenu's link sleuth|https://en.wikipedia.org/wiki/Xenu%27s_Link_Sleuth>

We used this more than ten years ago, and it seemed to work very
well. It hasn't been updated in ten years, though.

=item L<W3C Link Checker|https://validator.w3.org/checklink>

A web site which checks the links on your website.

=back



=head1 AUTHOR

Ben Bullock, <bkb@cpan.org>

=head1 COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2021 Ben Bullock.

You can use, copy, modify and redistribute this package and associated
files under the Perl Artistic Licence or the GNU General Public
Licence.