Group
Extension

C-Tokenize/lib/C/Tokenize.pod





=encoding UTF-8

=head1 NAME

C::Tokenize - reduce a C file to a series of tokens

=head1 SYNOPSIS

    
    # Remove all C preprocessor instructions from a C program:
    my $c = <<EOF;
    #define X Y
    #ifdef X
    int X;
    #endif
    EOF
    use C::Tokenize '$cpp_re';
    $c =~ s/$cpp_re//g;
    print "$c\n";
    


produces output

    int X;
    


(This example is included as L<F<synopsis-cpp.pl>|https://fastapi.metacpan.org/source/BKB/C-Tokenize-0.19/examples/synopsis-cpp.pl> in the distribution.)  

    
    # Print all the comments in a C program:
    my $c = <<EOF;
    /* This is the main program. */
    int main ()
    {
        int i;
        /* Increment i by 1. */
        i++;
        // Now exit with zero status.
        return 0;
    }
    EOF
    use C::Tokenize '$comment_re';
    while ($c =~ /($comment_re)/g) {
        print "$1\n";
    }
    


produces output

    /* This is the main program. */
    /* Increment i by 1. */
    // Now exit with zero status.
    


(This example is included as L<F<synopsis-comment.pl>|https://fastapi.metacpan.org/source/BKB/C-Tokenize-0.19/examples/synopsis-comment.pl> in the distribution.)  

=head1 VERSION

This documents version 0.19 of C::Tokenize corresponding
to git commit L<98c04e11199496e4e99e4c6e3eb97b4efaf8636b|https://github.com/benkasminbullock/C-Tokenize/commit/98c04e11199496e4e99e4c6e3eb97b4efaf8636b> released on Tue Aug 5 18:14:49 2025 +0900.

=head1 DESCRIPTION

This module provides a tokenizer, L</tokenize>, which breaks C source
code into its smallest meaningful components, and the regexes which
match each of these components. For example, L</$comment_re> matches a
C comment.

As well as components of C, it supplies regexes for local include
statements, L</$include_local>, and C variables, L</$cvar_re>, as well
as extra functions, like L</decomment> to remove the comment syntax of
traditional C comments, and L</strip_comments>, which removes all
comments from a C program.

=head1 REGULAR EXPRESSIONS

The following regular expressions can be imported from this module
using, for example,

    use C::Tokenize '$cpp_re'

to import C<$cpp_re>.

The following regular expressions do not capture, except where
noted. To capture, add your own parentheses around the regular
expression.

=head2 Comments

=over

=item $trad_comment_re

Match C</* */> comments.

=item $cxx_comment_re

Match C<//> comments.

=item $comment_re

Match both C</* */> and C<//> comments.

=back

See also L</decomment> for converting a comment to a string, and
L</strip_comments> for removing all comments from a program.

=head2 Preprocessor instructions

=over

=item $cpp_re

Match all C preprocessor instructions, such as #define, #include,
#endif, and so on. A multiline preprocessor instruction is matched as
#one piece.

=item $include_local

Match an include statement which uses double quotes, like C<#include "some.c">.

This captures the entire statement in C<$1> and the file name in C<$2>.

This was added in version 0.10 of C::Tokenize.

=item $include

Match any include statement, like C<< #include <stdio.h> >>.

This captures the entire statement in C<$1> and the file name in C<$2>.

    
    use C::Tokenize '$include';
    my $c = <<EOF;
    #include <this.h>
    #include "that.h"
    EOF
    while ($c =~ /$include/g) {
        print "Include statement $1 includes file $2.\n";
    }


produces output

    Include statement #include <this.h> includes file this.h.
    Include statement #include "that.h" includes file that.h.


(This example is included as L<F<includes.pl>|https://fastapi.metacpan.org/source/BKB/C-Tokenize-0.19/examples/includes.pl> in the distribution.)  

This was added in version 0.12 of C::Tokenize.

=back

=head2 Values

=over

=item $octal_re

Match an octal number, which is a number consisting of the digits 0 to
7 only which begins with a leading zero.

=item $hex_re

Match a hexadecimal number, a number with digits 0 to 9 and letters A
to F, case insensitive, with a leading 0x or 0X and an optional
trailing L or l for long.

=item $decimal_re

Match a decimal number, either integer or floating point.

=item $number_re

Match any number, either integer, floating point, hexadecimal, or
octal.

=item $char_const_re

Match a character constant, such as C<'a'> or C<'\-'>.

=item $single_string_re

Match a single C string constant such as C<"this">.

=item $string_re

Match a full-blown C string constant, including compound strings
C<"like" "this">.

=back

=head2 Operators, variables, and reserved words

=over

=item $operator_re

Match an operator such as C<+> or C<-->.

=item $word_re

Match a word, such as a function or variable name or a keyword of the
language. Use L</$reserved_re> to match only reserved words.

=item $grammar_re

Match non-operator syntactic characters such as C<{> or C<[>.

=item $reserved_re

Match a C reserved word like C<auto> or C<goto>. Use L</$word_re> to
match non-reserved words.

=item $cvar_re

Match a C variable, for example anything which may be an lvalue or a
function argument. It does not capture the result.

    
    use C::Tokenize '$cvar_re';
    my $c = 'func (x->y, & z, ** a, & q);';
    while ($c =~ /[,\(]\s*($cvar_re)/g) {
        print "$1 is a C variable.\n";
    }


produces output

    x->y is a C variable.
    & z is a C variable.
    ** a is a C variable.
    & q is a C variable.


(This example is included as L<F<cvar.pl>|https://fastapi.metacpan.org/source/BKB/C-Tokenize-0.19/examples/cvar.pl> in the distribution.)  

Because in theory this can contain very complex things, this regex is
somewhat heuristic and there are edge cases where it is known to
fail. See F<t/cvar_re.t> in the distribution for examples.

This was added in version 0.11 of C::Tokenize.

=back

=head1 VARIABLES

=head2 @fields

The exported variable @fields contains a list of all the fields which
are extracted by L</tokenize>.

=head1 FUNCTIONS

=head2 decomment

    my $out = decomment ('/* comment */');
    # $out = " comment ";

Remove the traditional C comment marks C</*> and C<*/> from the
beginning and end of a string, leaving only the comment contents. The
string has to begin and end with comment marks.

=head2 tokenize

    my $tokens = tokenize ($text);

Convert the C program code C<$text> into a series of tokens. The
return value is an array reference which contains hash
references. Each hash reference corresponds to one token in the C
text. Each token contains the following keys:

=over

=item leading

Any whitespace which comes before the token (called "leading
whitespace").

=item type

The type of the token, which may be 

=over

=item comment

A comment, like 

    /* This */

or

    // this.

=item cpp

A C preprocessor instruction like

    #define THIS 1

or

    #include "That.h".

=item char_const

A character constant, like C<'\0'> or C<'a'>.

=item grammar

A piece of C "grammar", like C<{> or C<]> or C<< -> >>.

=item number

A number such as C<42>,

=item word

A word, which may be a variable name or a function.

=item string

A string, like C<"this">, or even C<"like" "this">.

=item reserved

A C reserved word, like C<auto> or C<goto>.

=back

All of the fields which may be captured are available in the variable
L</@fields> which can be exported from the module:

    use C::Tokenize '@fields';

=item $name

The value of the type. For example, if C<< $token->{name} >> equals
'comment', then the value of the type is in C<< $token->{comment} >>.

    if ($token->{name} eq 'string') {
        my $c_string = $token->{string};
    }

=item line

The line number of the C file where the token occured. For a
multi-line comment or preprocessor instruction, the line number refers
to the final line.

=back

=head2 strip_comments

    my $no_comment = strip_comments ($c);

This removes all comments from its input while preserving line breaks.

    
    use C::Tokenize 'strip_comments';
    my $c = <<EOF;
    char * not_comment = "/* This is not a comment */";
    int/* The X coordinate. */x;
    /* The Y coordinate.
       See https://en.wikipedia.org/wiki/Cartesian_coordinates. */
    int y;
    // The Z coordinate.
    int z;
    EOF
    print strip_comments ($c);


produces output

    char * not_comment = "/* This is not a comment */";
    int x;
     
    
    int y;
    
    int z;


(This example is included as L<F<strip-comments.pl>|https://fastapi.metacpan.org/source/BKB/C-Tokenize-0.19/examples/strip-comments.pl> in the distribution.)  

This function was moved to this module from L<XS::Check> in version
0.14.

This function can also be used to strip C-style comments from JSON
without removing string contents:

    
    use C::Tokenize 'strip_comments';
    my $json =<<EOF;
    {
    /* Comment comment comment */
    "/* not comment */":"/* not comment */",
    "value":["//not comment"] // Comment
    }
    EOF
    print strip_comments ($json);


produces output

    {
     
    "/* not comment */":"/* not comment */",
    "value":["//not comment"]  
    }


(This example is included as L<F<strip-json.pl>|https://fastapi.metacpan.org/source/BKB/C-Tokenize-0.19/examples/strip-json.pl> in the distribution.)  

=head1 EXPORTS

Nothing is exported by default.

    use C::Tokenize ':all';

exports all the regular expressions and functions from the module, and
also L</@fields>.

=head1 SEE ALSO

=over

=item 

The regular expressions contained in this module are L<shown at this
web page|http://www.lemoda.net/c/c-regex/index.html>.

=item

L<This example of use of this
module|http://www.lemoda.net/perl/remove-unnecessary-c-headers/>
demonstrates using C::Tokenize (version 0.12) to remove unnecessary
header inclusions from C files.

=item

There is a C to HTML converter in the F<examples> subdirectory of the
distribution called F<c2html.pl>.

=back

=head1 BUGS

=over

=item No trigraphs

No handling of trigraphs.

This is L<issue 4|https://github.com/benkasminbullock/C-Tokenize/issues/4>.

=item Requires Perl 5.10

This module uses named captures in regular expressions, so it requires
Perl 5.10 or more.

=item No line directives

The line numbers provided by L</tokenize> do not respect C line
directives.

This is L<issue 11|https://github.com/benkasminbullock/C-Tokenize/issues/11>.

=item Insufficient tests

The module has been used somewhat, but the included tests do not
exercise many of the features of C.

=item $include and $include_local assume the included file end with .h

The C language does not impose a requirement that included file names
end in .h. You can include a file with any name. However, the regexes
L</$include> and L</$include_local> insist on a final .h.

=item $cvar_re misses some cases

See the discussion under L</$cvar_re>.

=back



=head1 AUTHOR

Ben Bullock, <benkasminbullock@gmail.com>

=head1 COPYRIGHT & LICENCE

This package and associated files are copyright (C) 
2012-2025
Ben Bullock.

You can use, copy, modify and redistribute this package and associated
files under the Perl Artistic Licence or the GNU General Public
Licence.





Powered by Groonga
Maintained by Kenichi Ishigaki <ishigaki@cpan.org>. If you find anything, submit it on GitHub.