Yet Another CPAN Grep

MarpaX-ESLIF/lib/MarpaX/ESLIF/Introduction.pod

# PODNAME: MarpaX::ESLIF::Introduction

# ABSTRACT: MarpaX::ESLIF's Introduction

__END__

=pod

=encoding UTF-8

=head1 NAME

MarpaX::ESLIF::Introduction - MarpaX::ESLIF's Introduction

=head1 VERSION

version 6.0.35.1

=head1 DESCRIPTION

  /*
   * Example of a calulator with ESLIF BNF:
   *
   * Automatic discard of whitespaces
   * Correct association for expressions
   * Embedded action using anonymous lua functions
   *
  */
  :discard ::= /[\s]+/

  exp ::=
      /[\d]+/
      |    "("  exp ")"    assoc => group action => ::copy[1]
     || exp (- '**' -) exp assoc => right action => ::luac->function(x,y) return x^y end
     || exp (-  '*' -) exp                action => ::luac->function(x,y) return x*y end
      | exp (-  '/' -) exp                action => ::luac->function(x,y) return x/y end
     || exp (-  '+' -) exp                action => ::luac->function(x,y) return x+y end
      | exp (-  '-' -) exp                action => ::luac->function(x,y) return x-y end

MarpaX::ESLIF is a Scanless Interface expressed in a BNF format.
That is, it uses L<marpaWrapper|https://github.com/jddurand/c-marpaWrapper>, which itself is a thin interface on top of the
L<libmarpa|https://jeffreykegler.github.io/Marpa-web-site/libmarpa.html> parser.

The L<MarpaX::ESLIF::BNF> is inspired by L<Marpa::R2's DSL|https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/DSL.pod>, though with some incompatible changes and add-ons:

=over

=item Native regular expression support, including regex callbacks into user-space and substitution

=item Syntactic exception

=item Unlimited number of sub-grammars

=item Streaming compatible architecture

=item Zero-length symbols

=item Embedded lua language into the grammar

=item Bindings to java, lua and perl languages

=item JSON (strict and extended) grammars for decoding/encoding

=item Declarative lua action directly into the grammar

=item Parameterized rules

=item Dynamic sub-grammars

=item Lookahead

=back

The following sections present the architecture of MarpaX::ESLIF architecture,
its features and things generally helpful to know.

=head1 EXAMPLE

Please look at L<MarpaX::ESLIF::Tutorial::Calculator>.

=head1 ARCHITECTURE

=head2 Grammars

The ESLIF is nothing else but a I<sparse array> of grammars, identified by an index called I<level> and starting with value 0, or a description:

  [------------------------------------------------------------------]
  | Indice | Level 0 | N/A | Level 2 | Level 3 | N/A | Level 5 | ... |
  | Name   | nameof0 | N/A | nameof2 | nameof3 | N/A | nameof5 | ... |
  [------------------------------------------------------------------]

There B<must> be a grammar at level index 0. Then any grammar can access any symbol of any other grammar:

  [-------------------------------------------------------------------]
  | Indice: | Level 0 | N/A | Level 2 | Level 3 | N/A | Level 5 | ... |
  | Name  : | nameof0 | N/A | nameof2 | nameof3 | N/A | nameof5 | ... |
  [-------------------------------------------------------------------]
  | Symbol: | +>X             +>X       +>Xx                          |
  | Symbol: | |  Y            |  Y      |  Yy                         |
  | Symbol: | |  |            |  |      |  |             +>Zzz        |
  | |  |            |  |      |  |             |   |        |         |
  | |  |____________|  |______|  |_____________|   |        |         |
  | |______________________________________________|        |         |
  [-------------------------------------------------------------------]

Let S[i] be our notation for I<the symbol S of grammar level i>.
Then the schema above says that Y[0] is a reference to X[2], that Y[2] is a reference to Xx[3], that Yy[3] is a reference to Zzz[5], and that Zzz[5] is a reference to Y[0]. Any symbol of any grammar that is accessed I<via a reference> is considered as part of a lexing phase, and the user will have no control until this phase is over, this symbol being recognized or not.

This is why it is strongly encouraged that there be a grammar at level 0 exist.
The author believes that it is common practice for the I<top level> grammar
to be at level 0. However is not enforced --
it is possible to specify a grammar
other than the one at level 0 as the starting grammar.

=head2 Recognizer

=head3 Top recognizer

Parsing from the point of an ESLIF user, consists of, for a given I<location> in the top-level grammar,

=over

=item assigning

a set of symbols (these are called I<alternatives>)

=item commiting

them (we say that we I<complete> the set of alternatives), and

=item moving

on to the next location.

=back

We call this interface "scanless" because you do not have to write your own lexer.
The grammars define what is expected, and recognizer determines all the alternatives for you,
commits them, and moves on in the input stream.
We say input stream: this is another dimension (we suppose from now on that the top-level grammar is at level 0):

  #
  # Note
  #
  #    ::= is an alias for grammar level 0
  #      ~ is an alias for grammar level 1
  # :[n]:= is the generic form for grammar level n
  #
          [---------------------------------------------------] STREAM MANAGEMENT
          | Rule is X ::= x y                                 |
          [---------------------------------------------------] STEP 0
          | Location is start of rule X[0]:                   |
          | X ::= . x y                                       |
          | Suppose that expected "terminals" are T1 and T2:  |
          [---------------------------------------------------] STEP 1
          | Try to match T1                                   |
          |   Nothing yet in the stream ?                     |<-----> Stream reader callback
          |   T1 may match but we are not sure                |<-----> Stream reader callback
          |   Repeat until T1 matches for sure or not         |
          [---------------------------------------------------] STEP 2
          | Try to match T2                                   |
          |   T2 may match but we are not sure                |<-----> Stream reader callback
          |   Repeat until T2 matches for sure or not         |
          [---------------------------------------------------]
          | No match ? End of scanning                        | STEP 3
          | Match ? Commit T1 and T2 and continue             |
          [---------------------------------------------------]

=head3 Sub-recognizers

The stream management mentioned above applies to all grammars:
As soon as "terminal" is in reality a referenced symbol,
a sub-recognizer is instantiated and it shares the stream with its parent:

                    TOP RECOGNIZER ON GRAMMAR LEVEL 0

          [---------------------------------------------------]           STREAM MANAGEMENT
          | Rule is X ::= x y                                 |
          [---------------------------------------------------] STEP 0.0
          | Location is start of rule X[0]:                   |
          | X ::= . x y                                       |
          | Suppose that expected "terminals" are T1 and T2:  |
          [---------------------------------------------------] STEP 0.1
          | Try to match T1                                   |
          |   Nothing yet in the stream ?                     |<-----> Stream reader callback
          |   T1 may match but we are not sure                |<-----> Stream reader callback
          |   Repeat until T1 matches for sure or not         |
          [---------------------------------------------------] STEP 0.2
          | Try to match T2                                   |
          |   T2 is a referenced symbol in grammar n          |
          [---------------------------------------------------]

                    SUB-RECOGNIZER ON GRAMMAR LEVEL n

            [-------------------------------------------------]
            | Rule is T2 :[n]:= a b                           |
            [-------------------------------------------------] STEP 1.0
            | Location is start of rule T2[n]:                |
            | T2 :[n]:= . a b                                 |
            | Suppose that expected "terminals" are U1 and U2:|
            [-------------------------------------------------] STEP 1.1
            | Try to match U1                                 |
            |   Nothing yet in the stream ?                   |<-----> Stream reader callback
            |   U1 may match but we are not sure              |<-----> Stream reader callback
            |   Repeat until U1 matches for sure or not       |
            [-------------------------------------------------] STEP 1.2
            | Try to match U2                                 |
            |   U2 may match but we are not sure              |<-----> Stream reader callback
            |   Repeat until U2 matches for sure or not       |
            [-------------------------------------------------]
            | No match ? End of scanning for T2[n]            | STEP 1.3
            | Match ? Commit U1 and/or U2 and continue        |
            [-------------------------------------------------]
            | Do internal valuation                           | STEP 1.4
            [-------------------------------------------------]

                 BACK TO TOP RECOGNIZER ON GRAMMAR LEVEL 0

          [---------------------------------------------------]
          | No match ? End of scanning                        | STEP 0.3
          | Match ? Commit T1 and/or T2 and continue          |
          | If T2 matches it is a parse tree value            |
          [---------------------------------------------------]

This is recursive: there will as many sub-recognizers instantiated as there are sub-grammars involved.
For instance, if terminal C<U2> above is a referenced symbol at grammar level C<l>,
a second sub-recognizer will be instantiated by the first sub-recognizer.
Every child recognizer shares all information about stream management.
The main difference between the top recognizer
and any child recognizer is that a child recognizer
is always doing an internal valuation to retreive the span in the input stream for,
and give that back to its parent.

The internal valuation is a forced mode that concatenates all matched bytes in the input stream.

=head3 Discard and sub-recognizers

You might ask, "Why explicitly do an internal valuation?"
After all, the match is where sub-recognizer started and where it ended.
But an explicit internal valuation is needed because any grammar can have it own I<discard> mechanism.
This mean that what a sub-recognizer matched may be shorter
than the number of bytes effectively consumed from the input stream.

This raises the issue of the I<discard> mechanism.
A discard symbol is an ordinary symbol of a grammar, but with a special semantics.
In the BNF, the name of this special semantics is C<:discard>.
For example:

  :discard   ::= whitespace
  whitespace   ~ /[\s]+/

This means that the grammar at level 0 tries to match the C<whitespace> symbol
when it fails to match any of the expected terminals.

As soon as there is no match, and if C<:discard> rule exist, any recognizer is always trying to get a match on it using a sub-recognizer, exactly like when it is executing a sub-recognizer for a terminal referencing a symbol in another grammar.
Furthermore nothing distinguishes the special symbol C<:discard> from the others: it can also reference any symbol in any other sub-grammar. Though there is one major difference between discard sub-recognizers and terminal sub-recognizers: a discard sub-recognizer will never instantiate another discard sub-sub-recognizer.
This means that in the following:

  :discard    ::= whitespace
  :discard      ~ somethingelse
  whitespace    ~ /[\s]+/
  somethingelse ~ 'a string'

if a discard tentative is instantiated on grammar at level 0 using the symbol C<whitespace> of level 1, and if C<whitespace> of level 1 does not match, there will be no tentative for try to discard in level 1, even it is has a C<:discard> rule that is defined to be C<somethingelse>.

=head1 STREAMING, CHARACTER AND BINARY MODES

Everytime any recognizer need more data, a callback to userspace is triggered. It is legal to not give the encoding when it is a character stream, then the engine will guess (the user should give enough data so that the guess is correct, though).

Internally, all chunks of characters are converted to UTF-8. This guarantees three things:

=over

=item Validation of well-formed characters

=item uniform internal processing

=item native compatibility with the regular expression engine

=back

A recognizer always owns a stream, the later is shared in two cases:

=over

=item Lexeme search

A sub-recognizer is started, and it shares the stream with its parent. Nevertheless parent stream is guaranteed to never crunch any data until this sub-recognizer finishes. At most, new data may be appended. When this sub-recognizer finishes, it updates the parent position in the stream if the lexeme it was looking for is found. The end-user never has control on such sub-recognizer.

=item Shared top-level recognizer

The end-user can create a new top-level recognizer that shares the stream with another top-level recognizer. Then, it is guaranteed that everytime one of them updates its stream, the other's stream changes the same way.

=back

=head1 TERMINALS AND REGULAR EXPRESSIONS

As mentionned above, regular expression are totally handled using L<PCRE2|http://www.pcre.org/>. Therefore the syntax of regular expression is the PCRE2 syntax. It is obvious that a regular expression define an internal "terminal", and there are three ways to define such a terminal, all of them being converted to a regular expression:

=over

=item String

=item Character class

=item Regular expression

=back

Each of these three terminal types support eventual modifiers. The most central modifier is the need or not of having the notion of "valid characters", especially outside of the ASCII range. This is called the PCRE2_UTF flag, and is mentionned thoroughly in the next sections.

=over

=item String

A string is delimited expression in the grammar, where allowed start/and delimiters are C<''> and C<"">. When a string is recognized in the grammar, escaping is allowed using the backslash C<\> character, and only the start delimited or backslash itself can be escaped. Absolutely any other character is taken C<as is>, eventually internally escaped by MarpaX::ESLIF to remove its signification in PCRE2, when there is one. For example:

=over

=item 'Example'

is translated to the UTF-8 pattern C<Example>

=item '{Example}'

is translated to the UTF-8 pattern C<\{Example>

=item "{Example}"

is translated to the UTF-8 pattern C<\{Example>

=item '{Example[]\}'

will trigger an error because only C<'> or C<\> itself can be backslashed.

=item '{Example[]\\}'

is translated to the UTF-8 pattern C<\{Example\[]\\}>

=item 'Black Heart Suite Character: ♥'

is translated to the UTF-8 pattern C<Black Heart Suite Character: \x{2665}>

=back

A string is always scanned character per character by MarpaX::ESLIF, and an ASCII compatible pattern is generated, using \x{...} codepoint notation whenever this is an ASCII special character or a character outside of original ASCII 7-bits character set. So MarpaX::ESLIF know if there is need for unicode support or not in PCRE2 terminology (which is: any code point greater than 255, else byte matching is enough). This is important because PCRE2 yells if a pattern is using a large codepoint and if this internal PCRE2_UTF flag is not set accordingly.

The presence of this flag has an important consequence: if at least one string in the grammar implied the PCRE2_UTF flag, then the whole remaining chunk of data is translated and validated as an UTF-8 sequence of bytes. In such a case, either the user input reader informed that this is stream of characters, then MarpaX::ESLIF prepared in advance the conversion/validation to UTF-8, either this is done lazily as soon as a match is attempted using a string requiring the PCRE2_UTF flag.

=item String modifiers

String modifiers must be appended directly after the end delimiter of the string. They are restricted to C<:i>, meaning that the match is caseless sensitive:

=over

=item 'Black Heart Suite Character: ♥':i

A dump of it in terms of PCRE2 (c.f. the API specification for dump facility) would show the C<PCRE2_CASELESS> flag:

  #      Pattern: Black Heart Suite Character: \x{2665}
  #        Flags: PCRE2_ANCHORED|PCRE2_CASELESS|PCRE2_UTF

You notice the presence of:

=over

=item C<PCRE2_ANCHORED>

Strings are always anchored at the point where match is attempted.

=item C<PCRE2_UTF>

This flag is automatically set when the scanning of the string that is in the grammar, done internally by MarpaX::ESLIF, reveal the need for it.

=back

=item 'Example':i

would give the following dump:

  #      Pattern: Example
  #        Flags: PCRE2_ANCHORED|PCRE2_CASELESS

=back

=item Character class

A character class is very closed to a regular expression (see later), except that it looks like a string, with start/end delimiters being C<[]>, and that the pattern is NOT scanned. MarpaX::ESLIF will lets PCRE2 eventually yell if there is a use of codepoints and if the internal PCRE2_UTF flag is not set.

MarpaX::ESLIF will try to guess the need for PCRE2_UTF flag by scanning the UTF-8 bytes composing the character class, but will do I<no modification>. For example:

=over

=item [a-z]

will be dumped as:

  #      Pattern:
  #     0x000000: 5b 61 2d 7a 5d                                  [a-z]
  #        Flags: PCRE2_ANCHORED

=item [a-z♥]

is dumped as:

  #      Pattern:
  #     0x000000: 5b 61 2d 7a e2 99 a5 5d                         [a-z...]
  #        Flags: PCRE2_ANCHORED|PCRE2_UTF

You notice that the sequence C<e299a5> that is the UTF-8 representation of the Black Heart Suite Character. MarpaX::ESLIF detected it C<as an explicit character>, so it was able to put the PCRE2_UTF flag automatically. But this will not work if you are using codepoints:

=item [a-z\x{2665}]

will yield automatically the following error, and this will come from the PCRE2 engine itself:

  /[a-z\x{2665}]/: pcre2_compile failure at offset 11: character code point value in \x{} or \o{} is too large.

So there is a need for a modifier. Please see the section on "Character class and Regular expression modifiers". For instance, here, one would say:

=item [a-z\x{2665}]:u

leaving to the following dump:

  #     0x000000: 5b 61 2d 7a 5c 78 7b 32 36 36 35 7d 5d          [a-z\x{2665}]
  #        Flags: PCRE2_ANCHORED|PCRE2_UTF

=back

=item Regular expression

Nothing really distinguished regular expression and character classes in the grammar, except that I<sequence modifiers> can be embedded directly in a regular expression, so that they are managed by PCRE2 instead of MarpaX::ESLIF, i.e:

=over

=item /[a-z]/

is stricly equivalent to the character C<[a-z]>.

=item /[a-z]+/

really mean that the sequence is embedded in the regular expression. The dump of the later would say:

  #      Pattern:
  #     0x000000: 5b 61 2d 7a 5d 2b                               [a-z]+

=back

In conclusion determining the need of the PCRE2_UTF will always be exact: either MarpaX::ESLIF will detect it correctly, either PCRE2 will yell, and you will have to explicitely set it using modifiers. Since character class is nothing else but a regular expression limited to a range of character, they both share the same possible set of modifiers.

=item Character class and Regular expression modifiers

The only difference between the twos is how modifiers are expressed: for a character class they must be preceeded by the C<:> character, while for a regular expression they can be set directly after the C</> end delimiter (as in the Perl language).

The explicit regular expression, being sent directly as-is to PCRE2, support de-facto all of the native PCRE2 pattern language, i.e. one can set regular expression options that have no single option equivalent when using a regular expression, for example:

=over

=item /(*LIMIT_MATCH=15)[a-z]+/

is setting an internal PCRE2 match limit to 15. The dump does not show that as an explicit flag:

  #      Pattern:
  #     0x000000: 28 2a 4c 49 4d 49 54 5f 4d 41 54 43 48 3d 31 35 (*LIMIT_MATCH=15
  #     0x000010: 29 5b 61 2d 7a 5d 2b                            )[a-z]+
  #        Flags: PCRE2_ANCHORED

=back

It is highly recommended to read the L<pcre2pattern|http://www.pcre.org/current/doc/html/pcre2pattern.html> documentation to know all the possible settings that can be I<embedded> into the regular expression. Explicit modifiers are insipired by the L<jpcre2|https://github.com/jpcre2/jpcre2> and L<Perl language|http://www.perl.org> (most of the descriptions below are copy/pasted from jpcre2). Please refer to L<MarpaX::ESLIF::BNF> for the exhaustive list of modifiers.

=back

=head1 GRAMMAR

Please refer to L<MarpaX::ESLIF::BNF> for the ESLIF BNF expressed with itself. Fundamentals of the ESLIF grammar are the followings (incompatible change v.s. L<Marpa::R2's DSL|https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/DSL.pod> are highlighted):

=over

=item Grammar levels

The number of levels is only limited by memory for your program -; Any symbol that have an impact on grammar at level C<n> must be defined with such level explicitely:

=over

=item C<::=> is an alias for level 0

=item C<~> is an alias for level 1

=item C<:[n]:=> is the general form for level n

=back

As a consequence the definition of a C<:discard> symbol is incompatible with L<Marpa::R2's DSL|https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/DSL.pod>, in which a discard rule affecting level 0 have the alias C<~>, for ESLIF it is C<::=>.

=item built-in actions and adverb lists

Any L<Marpa::R2|https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/DSL.pod>'s action or adverb list that require the Perl langage has been removed, for example C<::array>, C<::bless>

=item LATM is true by default

LATM (Longest Acceptable Token Match) is preventing the scanner to push alternatives of lower length than the longest of the alternatives.

=item pausing is allowed with discard events

=item C<:default> statement is unique per level

... instead of being lexically scoped with L<Marpa::R2's DSL|https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/DSL.pod>.

=item syntactic exception is supported

... and it is managed at I<valuation> phase.

=item native regular expression support (via L<PCRE2|http://www.pcre.org/>)

=item comments are extended to C<C++>/C<C> styles

=back

=head1 VERSIONING

MarpaX::ESLIF follows the L<Semantic Versioning 2.0.0|https://semver.org/spec/v2.0.0.html>, i:e:

=over

=item MAJOR for incompatible API changes

=item MINOR for added functionality in a backwards-compatible manner

=item PATCH for backwards-compatible bug fixes

=back

=head1 BUILD

You must use a L<CMake|http://cmake.org> version 3 at least. Recommended pattern is:

  cmake -Wno-dev -DCMAKE_BUILD_TYPE=RelWithDebInfo -DTCONV_USE_ICU=NO .
  make
  make check

If you use valgrind, the recommended C<cmake> arguments are:

  cmake -Wno-dev -DCMAKE_BUILD_TYPE=RelWithDebInfo -DTCONV_USE_ICU=NO -DPCRE2_SUPPORT_VALGRIND=1 .

=head1 SEE ALSO

L<MarpaX::ESLIF::BNF>, L<MarpaX::ESLIF::Tutorial::Calculator>, L<PCRE2|http://www.pcre.org/>, L<jpcre2|https://github.com/jpcre2/jpcre2>.

=head1 NOTES

The perl interface is using an I<all-in-one> version of the underlying L<marpaESLIF C library|https://github.com/jddurand/c-marpaESLIF>. This mean that character conversion relies on L<libiconv|https://www.gnu.org/software/libiconv> to do character translation if needed.

=head1 AUTHOR

Jean-Damien Durand <jeandamiendurand@free.fr>

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2017 by Jean-Damien Durand.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut