Perl6-Pugs/docs/notes/precompilation_cache.pod
=head1 NAME
Precompilation cache in Pugs
=head1 DESCRIPTION
Rather than parse every .pl or .pm file pugs sees over and over again as
they are needed, pugs stores the results of compilation in a cache. This
gives the benefit of speed without the awkwardness of an opaque object
file.
=head1 PHASE 1 - GET IT WORKING
=head2 Objectives
In the first version of the design, we aim for simplicity, reasonable
forward compatibility, but not kitchen sink flexibility. For now, this
is a Pugs mechanism. One cache should support several versions of Pugs,
sharing objects where possible, and allowing easy maintenance. If it gets
screwed up, the admin should be able to delete the cache directory and
not suffer from anything but subsequent temporary cold cache slowdowns.
=head2 What goes into the cache
Objects stored in the cache are compilation units after parsing. Perl
6 has separate compilation, and presumably every compilation unit has
one canonical abstract representation per version of Pugs. Where this is
not the case, e.g. if C<BEGIN> blocks intends to change the compilation
outcome of this unit, the author of the code should mark the unit as not
cachable. XXX: notation for this. XXX: Also need to add deps for used
modules that export macros. Especially macros in the prelude. Including
the dependencies hashes in the hash of the source code should suffice.
We now describe how cache objects are keyed in the cache, and the bytecode
format of a cached object.
=head2 Keying scheme
Example:
~joe/.pugscache/1/14/148071fa07847bc0de8df7d75cd03072f27239c2/9600-9658
# < $HOME .pugscache $H1 $H2 $SHA1 $ReleaseRev-$ParserRev >.catdir
The fast path for cache usage is successful lookup of a valid precompiled
unit.
The cache is (in the first insance) a filesystem directory under the
user's C<$HOME>. A global cache would have been nice for disk and CPU
performance on multiuser machines, but since original source code is
easily deducible from the compiled version, this is a security issue we
prefer to avoid. If we have pluggable backends we could potentially just
allow a C<memcached> backend as well.
Every compilation unit entering the cache is stored in a directoy of
its own. The entry location is a function of (a hash of) the source
code. Inside that directory there is a file named with Pugs version, and
Pugs parser version. This prevents us from loading a precompiled object
for a unit that has changed, or for a parser that would have emitted
a different structure than the one present in the cache. Using the
hash as a directory name, without the parser version/etc lets us check
in one stat if we have a cached entry or not, and then readdir for the
compatible version info, instead of always having to readdir the entire
cache level. (To keep file count in a given directory managable, the
file may be hashed into a directory inside the cache. Optimal hashing
depth may vary among systems and should be a user setting. On "modern"
systems no extra dirs might actually be fastest.)
For now the current keying scheme implies that the source file needs
to be read from storage even if a precompiled version of it exists in
the cache, because we require its hash. There may be ways to optimize
that part away, but they can be added in the future. I<Not> having the
compilation unit name as part of the hash key is nice, because it means
we don't have to worry about name canonicalization or source files with
similar names but different locations. It also means we can cache objects
that aren't files at all, such as C<eval $sting> results (as long as
C<$string> is the same across runs), or units received over the network
(as long as we can get their hashes before we attempt to compile them).
=head2 Bytecode format
The cached object is a compressed YAML document containing a serialized
Haskell Pugs structure.
=head3 Compression
We use gzip currently as the de-facto compression mechanism. This is the
easiest to deploy with Pugs: GHC bundles zlib, and Data.FastPackedString
which we use anyway for file IO contains enough bindings to read gzipped
files. Compression is desireable because serialized YAML for precompiled
units are large: for example, the 22 KB C<Prelude.pm> takes on the order
of 1.2 MB to serialize, but 47 KB in gzipped form. There are better
compression algorithms out there; for example, BZip2 compresses the
same file to only 20 KB. If you wish to work on a patch for changing the
compression scheme, it should not lose portability or deployability to
the current setup (i.e., must bundle or implement whatever scheme you
desire), and provide a transparent readFile function that identifies
the actual object by magic number.
=head3 Precompiled data
C<pugs -CParse-YAML File.pm> or C<Pugs::Internals::emit_yaml> output a
serialized form of the following structure. C<Pugs::Internals::eval_p6c>
or the module loader load it:
data CompUnit = MkCompUnit
{ ver :: Int -- currently 1
, desc :: String -- e.g., the name of the contained module
, glob :: (TVar Pad) -- pad for unit Env
, ast :: Exp -- AST of unit
}
The C<ver> field is currenly set to 1. It is always the first element in
the CompUnit structure, for forward compatibility. The C<desc> field is
for diagnostics. C<glob> is the global pad for this unit, and finally,
C<ast> is the parsed tree.
=head2 Module load
try {
my $hash = $HASHFUNC($source);
my $dir = cachedir($hash);
die "no precompiled version found" unless -d $dir;
for $dir.readdir.sort:{numerically} -> $fn {
my ($pugsrev, $parserrev) = $fn ~~ /(\d+)-(\d+)/ err next;
next if $XXX_handwaving($pugsrev, $parserrev); # against %?CONFIG<pugsrev> etc.
load_precompiled($fn) err {
$fn.rm;
die "error loading cached version: $!";
};
return; # success
}
die "no precompiled version found";
CATCH {
my $compunit = $source.parse;
$compunit.load;
return if $compunit.nocache;
cache_cleanups;
write_cache($compunit, $dir);
}
}
=head2 Maintenance
There are a few cache maint tools that can be added bit by bit. They
probably deserve a section of their own but for now:
=over 4
=item
Delete objects for Pugses older than X
=item
Clean least recently accessed objects (should happen as a matter of
course during regular cache B<write> usage). Proposal: use Pseudo-LRU
L<http://en.wikipedia.org/wiki/Pseudo-LRU>
=item
Warm up cache for an entire set of modules
=item
Change cache size limit
=back