package Sah; # just to make PodWeaver happy
# AUTHORITY
# DATE
our $DIST = 'Sah'; # DIST
# VERSION
1;
# ABSTRACT: Schema for data structures (specification)
__END__
=pod
=encoding UTF-8
=head1 NAME
Sah - Schema for data structures (specification)
=head1 SPECIFICATION VERSION
0.9
=head1 VERSION
This document describes version 0.9.51 of Sah (from Perl distribution Sah), released on 2022-10-21.
=head1 STATUS
In the 0.9.0 series, there will probably still be incompatible syntax changes
between revisions before the spec stabilizes into the 1.0 series.
=head1 ABOUT
This document specifies Sah, a schema language for validating data structures.
In this document, schemas and data structures are mostly written in pseudo-JSON
(JSON with comments C<// ...>, ellipsis C<...>, or some JavaScript).
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in RFC 2119.
=head1 SCHEMA
A schema is essentially B<a type definition>, stating a set of valid values for
data.
Sah schemas are regular data structures, specifically arrays:
[TYPE_NAME, CLAUSE_SET]
C<TYPE_NAME> is a string and C<CLAUSE_SET> is a hash of clauses.
Some examples:
["int", {"min": 0, "max": 100}]
// a definition of "uint" (non-negative integer)
["int", {"min": 0}]
// this schema "even_uint" (non-negative even integer) is defined from (or
// based on) another schema "uint" (defined above)
["uint", {"div_by": 2}]
A shortcut string form containing only the type name is allowed when there are
no clauses. It will be normalized into the array form:
"int"
// normalized form of the above
["int", {}]
The type name can have a C<*> suffix as a shortcut for the C<< "req": 1 >>
clause. This shortcut exists because stating that something is required is very
common.
"int*"
// equivalent to
["int", {"req": 1}]
["int*", {"min": 0}]
// equivalent to
["int", {"req": 1, "min": 0}]
A flattened array form is also supported. It will be normalized into the
non-flattened form. This shortcut exists to save a couple of keystrokes and to
reduce the number of nested structures, which can get a bit unwieldy for
complex schemas.
["int", "min", 1, "max", 10]
// is equivalent to
["int", {"min": 1, "max": 10}]
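The three shorthand forms described above (string form, C<*> suffix, flattened
array) can be sketched as a single normalization step. This is only an
illustrative sketch, not the code of any Sah implementation: the function name
C<normalize_schema> is hypothetical, and so are the choices to accept a
one-element array and to let the C<*> suffix simply set C<req> to 1.

```python
def normalize_schema(schema):
    """Return the [TYPE_NAME, CLAUSE_SET] form of a schema (illustrative sketch)."""
    if isinstance(schema, str):
        # shortcut string form: "int" -> ["int", {}]
        type_name, clause_set = schema, {}
    elif isinstance(schema, list) and schema and isinstance(schema[0], str):
        type_name, rest = schema[0], schema[1:]
        if not rest:
            # assumption: a one-element array is treated like the string form
            clause_set = {}
        elif len(rest) == 1 and isinstance(rest[0], dict):
            # regular form: copy so the caller's hash is not modified
            clause_set = dict(rest[0])
        elif len(rest) % 2 == 0:
            # flattened form: ["int", "min", 1, "max", 10]
            clause_set = {}
            for i in range(0, len(rest), 2):
                if not isinstance(rest[i], str):
                    raise ValueError("clause names must be strings")
                clause_set[rest[i]] = rest[i + 1]
        else:
            raise ValueError("flattened form needs name/value pairs")
    else:
        raise ValueError("schema must be a string or an array")

    if type_name.endswith("*"):
        # "*" suffix is a shortcut for the "req": 1 clause
        type_name = type_name[:-1]
        clause_set["req"] = 1  # assumption: overrides any explicit "req"
    return [type_name, clause_set]


print(normalize_schema(["int*", "min", 0]))
```

Copying the clause set before modifying it keeps the caller's schema untouched,
which matters when the same schema object is normalized more than once.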
=head1 TYPE
A type classifies data and specifies its possible values.
Sah defines several standard types like C<bool>, C<int>, C<float>, C<str>,
C<array>, C<hash>, and a few others. Please see L<Sah::Type> for the complete
list.
A type name must match this regular expression:
\A[A-Za-z_][A-Za-z0-9_]+(::[A-Za-z_][A-Za-z0-9_]+)*\z
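As a quick illustration, the pattern above can be applied as follows. The helper
name C<is_valid_type_name> is hypothetical; the pattern body is copied verbatim
from the line above, with C<fullmatch()> standing in for the C<\A> and C<\z>
anchors.

```python
import re

# Pattern body copied from the specification text. fullmatch() requires the
# whole string to match, which is what the \A ... \z anchors express in Perl.
TYPE_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]+(::[A-Za-z_][A-Za-z0-9_]+)*")


def is_valid_type_name(name):
    """Return True if name matches the type-name pattern (illustrative sketch)."""
    return TYPE_NAME_RE.fullmatch(name) is not None


for candidate in ["int", "Foo::Bar", "1int", "foo-bar", "foo::"]:
    print(candidate, is_valid_type_name(candidate))
```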
A type can have B<clauses>. Most clauses declare constraints (thus, B<constraint
clauses>). B<Constraint clauses> are like functions: they accept an argument,
are evaluated against data, and return a value. The returned value need not
strictly be boolean, but for the clause to succeed, the return value must
evaluate to true. The notion of true/false follows Perl's notion: the undefined
value, the empty string (C<"">), the string C<"0">, and the number 0 are
considered false. Everything else is true.
For the schema to succeed, all constraint clauses must evaluate to true.
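For readers coming from languages whose truthiness rules differ from Perl's, the
rule above can be sketched like this. The function names are hypothetical, and
mapping Python's C<None> to the undefined value and C<False> to false are
assumptions of this sketch. Note that an empty array counts as true here,
unlike in Python itself.

```python
def is_true(value):
    """Perl-style truthiness as described in the text (illustrative sketch)."""
    if value is None:               # stands in for the undefined value
        return False
    if isinstance(value, str):
        return value not in ("", "0")
    if isinstance(value, bool):     # assumption: host-language booleans map directly
        return value
    if isinstance(value, (int, float)):
        return value != 0
    return True                     # everything else, including [] and {}


def schema_succeeds(clause_results):
    """A schema succeeds when every constraint clause result is true."""
    return all(is_true(result) for result in clause_results)


print(is_true("0"), is_true("0.0"), is_true([]))
```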
Aside from declaring constraints, clauses can also declare other things. There
is the C<default> clause, which specifies a default value. There are B<metadata
clauses> which specify metadata, e.g. the C<summary>, C<description>, C<tags>
clauses.
Aside from clauses, a type can also have B<type properties>. Properties differ
from clauses in the following ways: 1) they are used to find out
something about the data, not to test/validate data; 2) they are allowed to not
accept any argument. A type can have a property and a clause with the same name;
for example, the C<str> type has a C<len> clause to test its length against an
integer, as well as a C<len> property which returns its length. Properties are
differentiated from clauses so that compilers to human text can generate a
description like "string where its length is at least 1".
Type properties can be validated against a schema using the C<prop> or C<if>
clause.
B<Base schema>. You can define a schema, declare it as a new type, and then
write subsequent schemas against that type, along with additional clauses. This
is very much like subtyping. See L</"BASE SCHEMA"> for more information.
=head1 BASE SCHEMA
As mentioned before, you can define a schema as a type and then write other
schemas against that type. For example:
// defined as "uint" type
["int", {"min": 0}]
and later:
// a non-negative integer that is divisible by 5
["uint", {"div_by": 5}]
During data validation, each base schema will be replaced by its original
definition, and all the clause sets will be evaluated. This is illustrated below
with the plus sign:
["int", {"min": 0} + {"div_by": 5}]
Another, more involved, example:
// definition for the "single_dice_throw" schema
["int", {"in": [1, 2, 3, 4, 5, 6]}]
// definition for the "sdt" schema (short notation):
"single_dice_throw"
// definition for the "dice_pair_throw" schema
["array", {"len": 2, "elems": ["sdt", "sdt"]}]
// definition for the "dpt" schema (short notation)
"dice_pair_throw"
// definition for the "throw" schema
["any", {"of": ["sdt", "dpt"]}]
// definition for the "throws" schema
["array", {"of": "throw"}]
The above schema describes a list of dice throws (C<throws>). Each C<throw> can
be a single dice throw (C<sdt>), which is a number between 1 and 6, or a throw of
two dice (C<dpt>), which is a 2-element array (where each element is a number
between 1 and 6).
Examples of valid data for the "throws" schema:
[1, [1,3], 6, 4, 2, [3,5]]
Examples of invalid data:
1 // not an array
[1, [2, 3], 0] // the third throw is invalid (not between 1-6)
[1, [2, 0, 4], 4] // the second throw is invalid (not sdt nor dpt)
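To make the replacement of base schemas concrete, here is a minimal sketch that
resolves the named schemas from the dice example and checks data against them.
It is not how L<Data::Sah> works internally: the C<SCHEMAS> registry and the
function names are hypothetical, only the handful of clauses used in the example
are supported, and the C<req> clause and undefined values are ignored.

```python
# Hypothetical registry of named schemas, mirroring the dice example.
SCHEMAS = {
    "single_dice_throw": ["int", {"in": [1, 2, 3, 4, 5, 6]}],
    "sdt": "single_dice_throw",
    "dice_pair_throw": ["array", {"len": 2, "elems": ["sdt", "sdt"]}],
    "dpt": "dice_pair_throw",
    "throw": ["any", {"of": ["sdt", "dpt"]}],
    "throws": ["array", {"of": "throw"}],
}


def resolve(schema):
    """Return (builtin_type, [clause_set, ...]) with base schemas expanded."""
    if isinstance(schema, str):
        schema = [schema, {}]
    type_name, clause_set = schema
    if type_name in SCHEMAS:
        base_type, clause_sets = resolve(SCHEMAS[type_name])
        return base_type, clause_sets + [clause_set]
    return type_name, [clause_set]


def check_clause(base_type, name, arg, data):
    """Evaluate one clause; only the clauses used in the example are handled."""
    if name == "in":
        return data in arg
    if name == "len":
        return len(data) == arg
    if name == "elems":
        return all(is_valid(item, sub) for item, sub in zip(data, arg))
    if name == "of" and base_type == "array":
        return all(is_valid(item, arg) for item in data)
    if name == "of" and base_type == "any":
        return any(is_valid(data, sub) for sub in arg)
    raise ValueError(f"unsupported clause: {name}")


def is_valid(data, schema):
    base_type, clause_sets = resolve(schema)
    if base_type == "int":
        if isinstance(data, bool) or not isinstance(data, int):
            return False
    elif base_type == "array":
        if not isinstance(data, list):
            return False
    elif base_type != "any":
        raise ValueError(f"unsupported type: {base_type}")
    # every clause in every clause set must succeed
    return all(
        check_clause(base_type, name, arg, data)
        for clause_set in clause_sets
        for name, arg in clause_set.items()
    )


print(is_valid([1, [1, 3], 6, 4, 2, [3, 5]], "throws"))
print(is_valid([1, [2, 0, 4], 4], "throws"))
```

Resolving C<"sdt"> yields the builtin type C<int> plus a list of clause sets,
one per level of the chain, which is the "plus sign" picture from earlier in
this section.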
It is up to the implementation where the base schemas should be stored. The
L<Data::Sah> implementation looks up base schemas in C<Sah::Schema::*> modules,
in the C<$schema> package variables. Older versions of the Sah specification
allowed base schemas to be put in the schema itself in the so-called "extras"
part, but for simplicity this has been removed in later versions.
=head1 CLAUSE AND CLAUSE SET
A clause set is a defhash (see L<DefHash>) that maps clause names to clause
values and clause attribute names to clause attribute values. Defhash
properties map to Sah clauses, while defhash property attributes map to Sah
clause attributes.
{
"CLAUSENAME1": CLAUSEVALUE,
"CLAUSENAME1.ATTRNAME1": ATTRVALUE1,
"CLAUSENAME1.ATTRNAME2": ATTRVALUE2,
"CLAUSENAME1.ATTRNAME1.SUBATTR1": ...,
...
"_IGNORED": ...,
"CLAUSENAME1._IGNORED": ...
}
For convenience, there are also some shortcuts:
=over
=item * C<&> suffix (multiple clause values, all must succeed)
"CLAUSENAME&": [VAL, ...]
is equivalent to:
"CLAUSENAME": [VAL, ...],
"CLAUSENAME.op": "and"
=item * C<|> suffix (multiple clause values, only one must succeed)
"CLAUSENAME|": [VAL, ...]
is equivalent to:
"CLAUSENAME": [VAL, ...],
"CLAUSENAME.op": "or"
=item * C<!> prefix (negation)
"!CLAUSENAME": VAL
is a shortcut for this:
"CLAUSENAME": VAL,
"CLAUSENAME.op": "not"
=item * C<=> suffix (expression)
"CLAUSENAME=": EXPR
"CLAUSENAME.ATTRNAME1=": EXPR
are respectively equivalent to:
"CLAUSENAME.is_expr": 1
"CLAUSENAME": EXPR
"CLAUSENAME.ATTRNAME1.is_expr": 1
"CLAUSENAME.ATTRNAME1": EXPR
=item * C<(LANG)> suffix (value for alternate languages)
"CLAUSENAME(LANG)": VAL
"CLAUSENAME.ATTRNAME1(LANG)": VAL
are respectively equivalent to:
"CLAUSENAME.alt.lang.LANG": VAL
"CLAUSENAME.ATTRNAME1.alt.lang.LANG": VAL
Examples:
"name(id_ID)": "bilangan bulat positif"
"name(en_US)": ["positive integer", "positive integers"]
are equivalent to:
"name.alt.lang.id_ID": "bilangan bulat positif"
"name.alt.lang.en_US": ["positive integer", "positive integers"]
=back
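The shortcuts listed above can be expanded mechanically. The following is a
minimal sketch under stated assumptions: the function name C<expand_shortcuts>
is hypothetical, language codes are assumed to consist of letters and
underscores only, and combinations of several shortcuts on one key are not
handled.

```python
import re

# "NAME(LANG)" suffix; the allowed characters for LANG are an assumption.
_LANG_RE = re.compile(r"\A(.+)\(([A-Za-z_]+)\)\Z")


def expand_shortcuts(clause_set):
    """Expand the &, |, !, = and (LANG) shortcuts in clause-set keys (sketch)."""
    expanded = {}
    for key, value in clause_set.items():
        lang = _LANG_RE.match(key)
        if lang:
            expanded[f"{lang.group(1)}.alt.lang.{lang.group(2)}"] = value
        elif key.endswith("&"):
            expanded[key[:-1]] = value
            expanded[key[:-1] + ".op"] = "and"
        elif key.endswith("|"):
            expanded[key[:-1]] = value
            expanded[key[:-1] + ".op"] = "or"
        elif key.startswith("!"):
            expanded[key[1:]] = value
            expanded[key[1:] + ".op"] = "not"
        elif key.endswith("="):
            expanded[key[:-1]] = value
            expanded[key[:-1] + ".is_expr"] = 1
        else:
            expanded[key] = value
    return expanded


print(expand_shortcuts({"!in": ["root", "admin"], "min_len=": "2*2"}))
```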
Every clause has a B<priority> between 0 and 100 that determines the order of
evaluation (the lower the number, the higher the priority and the earlier the
clause is evaluated). Most constraint clauses are at priority 50 (normal), so
their relative order does not matter, but some clauses are early (like
C<default> and C<prefilters>) and some are late (like C<postfilters>). Variables
mentioned in expressions also determine ordering, for example:
["int", {"min=": "0.5*$clause:max", "max": 10}]
In the above example, although C<max> and C<min> are both at priority 50, C<max>
needs to be evaluated first because C<min> refers to it (XXX syntax of variable
not yet finalized).
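Priority-based ordering can be sketched as a stable sort. Only the value 50 for
normal clauses comes from the text above; the other numbers in the C<PRIORITY>
table are invented for illustration, and the sketch does not model the
additional ordering imposed by variables referenced in expressions.

```python
# Hypothetical priority table: the text only states that "default" and
# "prefilters" are early and "postfilters" is late; the numbers are made up.
PRIORITY = {"default": 1, "prefilters": 10, "postfilters": 90}
NORMAL_PRIORITY = 50  # from the text: most constraint clauses are at 50


def evaluation_order(clause_names):
    """Order clauses by priority, lowest number first (illustrative sketch)."""
    # sorted() is stable, so clauses with equal priority keep their given order
    return sorted(clause_names, key=lambda name: PRIORITY.get(name, NORMAL_PRIORITY))


print(evaluation_order(["postfilters", "max", "min", "default"]))
```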
=head2 Clause name
This specification comes from DefHash: Clause names must begin with a
letter/underscore and contain letters/numbers/underscores only. All clauses
which begin with an C<_> (underscore) are ignored. You can use this to embed
extra data for other purposes.
=head2 Clause attribute
This specification comes from DefHash: Attribute names must also only contain
letters/numbers/underscores, but they can be a dot-separated series of parts,
e.g. C<alt.lang.id_ID>. As with clauses, clause attributes which begin with C<_>
(underscore) are ignored. You can use this to embed extra data.
Currently known general attributes:
=over
=item * prio
Int. Change the clause's priority for this clause set. Note that this only works
for clauses which have equal priorities. Otherwise, priority value from clause
definition takes precedence.
Example:
// both "min" and "max" clauses have priority of 50, but we want to make sure
// that "min" is evaluated first
["int*", {"min=": "some expr", "min.prio": 1, "max": 10}]
=item * op
Str. Specify operator for (multiple) clause values. Possible values for this
attribute include: C<and>, C<or>, C<none>, C<not>. Except for C<not>, the
presence of C<op> signifies that clause contains multiple values instead of a
single one.
There are shortcuts for C<and>, C<or>, and C<not>; see L</"CLAUSE AND CLAUSE
SET">.
C<and> specifies that all clause values must succeed for the clause to succeed.
Example:
["str", {"clause": [["min_len", 8], ["match", "\\W"]], "clause.op": "and"}]
The above schema requires a string to be at least 8 characters long I<and> to
contain a non-word character. Strings that would validate include C<$abcdefg>.
Strings that would not validate include C<abcd> (fails both the C<min_len> and
C<match> clauses), C<abcdefgh> (fails the C<match> clause), and C<$> (fails the
C<min_len> clause).
C<or> specifies that any one of clause values must succeed for the clause to
succeed. Example:
["str", {"match": [RE1, RE2, RE3], "match.op": "or"}]
The above schema specifies that string can match any of the regexes RE1/RE2/RE3.
C<none> specifies that all clause values must fail for the clause to succeed.
For example:
["str", {"match": [RE1, RE2, RE3], "match.op": "none"}]
The above schema specifies that string must not match any of the regexes
RE1/RE2/RE3.
C<not> inverts the success status of the clause (in other words, the
clause must fail for validation to succeed). Example:
["str", {"match": RE, "match.op": "not"}]
The above schema specifies that the string must not match regex RE.
=item * is_expr
Bool. Signify that clause contains expression (see L</"EXPRESSION">) instead of
literal value. Example:
// a string, minimum 4 characters
["str", {"min_len": 4}]
// same thing, albeit a bit fancier
["str", {"min_len.is_expr": 1, "min_len": "2*2"}]
// same thing, shortcut notation
["str", {"min_len=": "2*2"}]
// for default, we pick a random number between 1 and 10
["int", {"default=": "int(10*rand())+1"}]
Expressions are useful for more complex schemas, when a clause/attribute value
needs to be calculated in terms of other values, and/or using functions.
Note that an implementation might not support expressions in some clauses or
attributes, especially clauses that accept an argument containing schemas,
because dynamically generated schemas need the compiler to embed an interpreter
or compiler in the generated code.
When C<is_expr> attribute is true, and C<op> is also one that requires multiple
clause values (like C<and>, C<or>, C<none>), then the expression is expected to
return an array of values. Otherwise, the clause will fail. Example:
// number which must be divisible by 2, 3, 5
["int", {"div_by.is_expr": 1, "div_by.op": "and", "div_by": "[2, 3, 5]"}]
// string must not match any of the blacklist
["str", {
"contains.is_expr": 1,
"contains.op": "none",
"contains": "get_blacklist()"
}]
=item * err_level
Str, default C<error>. Valid values: C<fatal>, C<error>, C<warn>. Normally, when
clause checking fails, an error is generated and it causes validation of the
whole schema to fail. If C<err_level> is set to C<warn>, however, this only
generates a warning and does not cause the validation to fail.
// password
["str*", {"clset&": [
{"min_len": 4},
{"min_len": 8,
"min_len.err_level": "warn",
"min_len.err_msg": "Although a password shorter than 8 letters is " +
"valid, it is highly recommended that a password be " +
"at least 8 letters long, for security reasons"}
]}]
In the above example, C<err_level> and C<err_msg> are attributes for the
C<min_len> clause. The second clause set basically adds an optional restriction
for the password: when the C<min_len> clause is not satisfied, instead of making
the data fail validation, only a warning is issued.
C<fatal> is the same as C<error> but will make validation exit early, without
collecting further errors. This only takes effect when validation collects full
errors instead of just stopping after the first error is found.
=item * err_msg[.alt.lang.LANGCODE]
From DefHash. This tells the compiler that instead of the default error message
from the type handler, a custom error message is supplied. You can add
translations by adding more attributes with language code suffixes. For example:
["str", {"match": "[^A-Za-z0-9_-]",
"match.err_msg": "Must not contain naughty characters",
"match.err_msg.alt.lang.id_ID": "Tidak boleh mengandung karakter aneh-aneh"
}]
Another example:
["str", {"!in": ["root", "admin"],
"in.err_msg": "Sorry, username is reserved",
"in.err_msg.alt.lang.id_ID": "Maaf, nama user dilarang digunakan"
}]
=item * human[.alt.lang.LANGCODE]
From DefHash. This is also ignored when validating data, but will be used by the
human compiler to supply description. You can add translations by adding more
attributes.
["str", {"match": "[^A-Za-z0-9_-]",
"match.human": "Must not contain naughty characters",
"match.human.alt.lang.id_ID": "Tidak boleh mengandung karakter aneh-aneh"
}]
=item * alt
This comes from DefHash, mainly used to store translations for C<name>,
C<summary>, C<description>.
=item * result_var
EXPERIMENTAL. Str. Specify variable name to store results in.
Aside from pass/failure, a clause or clause set can also produce some value.
This attribute specifies where to put the results in. The value can then be
used by referring to the variable in expression. Example:
["any", {
"of": [
["str*", {"min_len": 1, "max_len": 10}], // 1
["str*", {"min_len": 11}], // 2
["array*", {}], // 3
["hash*", {}] // 4
],
"of.result_var": "a"
}]
Aside from passing/failing the validation, the C<of> clause above also produces
an index to the schema in the list which matches. So if you validate an array,
C<$a> in the schema will be set to 3. If you validate a string with length 12,
C<$a> will be set to 2. If you pass an empty string (which does not pass the
C<of> clause), C<$a> will not be set.
Refer to each clause's documentation to find out what value the clause returns.
=item * c.COMPILER
This is a namespace for specifying compiler options. Each compiler will have its
specific options; see documentation on respective compiler to see available
options. For example:
// skip clauses which are not implemented in JavaScript. we'll check on the
// server-side anyway.
["str", {
"soundex": "E460",
"c.js.ignore_missing_clause_handler": true
}]
=item * x.WHATEVER
This comes from DefHash and is an alternative to underscore prefix for putting
extra data in a schema. The difference is that some processing tool might strip
the underscore clause/attribute.
=back
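The semantics of the C<op> attribute described in the list above can be
summarized in a few lines. This is a sketch rather than any implementation's
code: C<combine> and C<check_password> are hypothetical names, and each clause
value is assumed to have already been evaluated to a boolean result.

```python
import re


def combine(op, results):
    """Combine the per-value results of one clause according to "op" (sketch)."""
    results = list(results)
    if op == "and":
        return all(results)        # all clause values must succeed
    if op == "or":
        return any(results)        # any one clause value must succeed
    if op == "none":
        return not any(results)    # all clause values must fail
    if op == "not":
        if len(results) != 1:
            raise ValueError('"not" applies to a single clause value')
        return not results[0]      # the clause must fail
    raise ValueError(f"unknown op: {op}")


def check_password(value):
    """At least 8 characters long AND contains a non-word character."""
    results = [len(value) >= 8, re.search(r"\W", value) is not None]
    return combine("and", results)


for candidate in ["$abcdefg", "abcdefgh", "$", "abcd"]:
    print(candidate, check_password(candidate))
```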
Aside from the above general attributes, each clause might recognize its own
specific attributes. See documentation of respective clauses.
=head2 Clause set merging
Clause set merging happens when a schema is based on another schema and the
child schema's clause set contains merge prefixes (explained later) in its keys.
For example:
// schema1
[TYPE1, CLSET1]
// schema2, based on schema1
[schema1, CLSET2]
// schema3, based on schema2
[schema2, CLSET3]
When compiling/evaluating C<schema2>, Sah will check against C<TYPE1> and
C<CLSET1> and then C<CLSET2>. However, when C<CLSET2> contains a merge prefix
(marked with an asterisk here for illustration), then Sah will check against
C<TYPE1> and C<< merge(CLSET1, *CLSET2) >>.
When compiling/evaluating C<schema3>, Sah will check against C<TYPE1> and
C<CLSET1> and then C<CLSET2> and then C<CLSET3>. However, when C<CLSET2>
contains a merge prefix, then Sah will check against C<TYPE1>, C<< merge(CLSET1,
*CLSET2) >>, and then C<CLSET3>. When C<CLSET2> and C<CLSET3> both contain merge
prefixes, Sah will check against C<TYPE1> and C<< merge(CLSET1, *CLSET2,
*CLSET3) >>. So merging will be done from left to right.
The base schema's clause set must not contain any merge prefixes.
Merging is done using L<Data::ModeMerge>, with merge prefixes changed to
'merge.add.', 'merge.delete.' and so on. In merging, Data::ModeMerge allows keys
on the right side hash not only to replace but also add, subtract, remove keys
from the left side. This is powerful because it allows schema definition to not
only add clauses (restrict types even more), but also replace clauses (change
type restriction) as well as delete clauses (relax type restriction). For more
information, refer to the Data::ModeMerge documentation.
Illustration:
int + {"div_by": 2} + {"div_by": 3} // must be divisible by 2 & 3
int + {"div_by": 2} + {"merge.normal.div_by": 3} // will be merged and become:
int + {"div_by": 3} // must be divisible by 3 ONLY
int + {"div_by": 2} + {"merge.delete.div_by": 0} // will be merged and become:
int + {} // need not be divisible
int + {"in": [1,2,3,4,5]} + {"in": [6]} // impossible to satisfy
int + {"in": [1,2,3,4,5]} + {"merge.add.in": [6]} // will be merged and become:
int + {"in": [1,2,3,4,5,6]}
int + {"in": [1,2,3,4,5]} + {"merge.subtract.in": [4]} // will become:
int + {"in": [1,2,3, 5]}
Merging is performed before schema is normalized.
Merging is not recursive.
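The illustration above can be reproduced with a small sketch. This is not
L<Data::ModeMerge> and does not claim to match its full behavior: the function
names are hypothetical, only the four prefixes shown above are recognized, keys
without a prefix are assumed to behave like C<merge.normal.>, and the add and
subtract modes are only implemented for arrays.

```python
from functools import reduce

PREFIXES = ("merge.normal.", "merge.add.", "merge.subtract.", "merge.delete.")


def merge_clause_sets(left, right):
    """Merge the right clause set into the left one (non-recursive sketch)."""
    result = dict(left)
    for key, value in right.items():
        mode, name = "normal", key   # assumption: no prefix means normal mode
        for prefix in PREFIXES:
            if key.startswith(prefix):
                mode, name = prefix.split(".")[1], key[len(prefix):]
                break
        if mode == "normal":
            result[name] = value                       # replace
        elif mode == "add":
            result[name] = list(result.get(name, [])) + list(value)
        elif mode == "subtract":
            result[name] = [v for v in result.get(name, []) if v not in value]
        elif mode == "delete":
            result.pop(name, None)                     # the value is ignored
    return result


def merge_all(*clause_sets):
    """Merge several clause sets from left to right."""
    return reduce(merge_clause_sets, clause_sets)


print(merge_clause_sets({"in": [1, 2, 3, 4, 5]}, {"merge.add.in": [6]}))
```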
=head1 EXPRESSION
XXX: Syntax of variables not yet finalized.
Sah supports expressions, using the L<Language::Expr> minilanguage. See
L<Language::Expr::Manual::Syntax> for details on the syntax. You can specify an
expression in the C<check> clause, e.g.:
["int", {"check": "$_ >= 4"}]
Alternatively, an expression can also be specified for any clause or clause
attribute:
["int", {"min=": "floor(4.9)"}]
The above two schemas are equivalent to:
["int", {"min": 4}]
Expressions can refer to elements of data and the (normalized) schema, and can
call functions, enabling more complex schemas to be defined, for example:
["array*", {"len": 2, "elems": [
["str*", {"match": "^\\w+$"}],
["str*", {"match=": "${../../0/clause_sets/0/match}",
"min_len=": "2*length(${data:../0})"}]
]}]
The above schema requires data to be a two-element array containing strings,
where the length of the second string has to be at least twice the length of the
first. Both strings have to comply with the same regex, C<^\w+$> (which is
declared on the first string's clause and referred to in the second string's
clause).
=head1 FUNCTION
Functions can be used in expressions. The syntax for calling a function is:
func()
func(ARG, ...)
Functions in Sah can sometimes accept several types of arguments, e.g.
len(ARRAY) will return the number of elements in the ARRAY, while len(STR) will
return the number of characters in the string. However, when an inappropriate
argument is given, an exception will be thrown.
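The behavior described for C<len> can be sketched as a function that dispatches
on the argument type. The name C<sah_len> is hypothetical, and using
C<TypeError> for the thrown exception is an assumption of this sketch; the
specification does not say which kind of exception is raised.

```python
def sah_len(arg):
    """len(ARRAY) counts elements, len(STR) counts characters (sketch)."""
    if isinstance(arg, (list, str)):
        return len(arg)
    # assumption: an inappropriate argument is reported as a TypeError
    raise TypeError(f"len() does not accept {type(arg).__name__}")


print(sah_len([1, 2, 3]), sah_len("abcd"))
```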
=head1 HOMEPAGE
Please visit the project's homepage at L<https://metacpan.org/release/Sah>.
=head1 SOURCE
Source repository is at L<https://github.com/perlancar/perl-Sah>.
=head1 SEE ALSO
L<DefHash>
L<Sah::Type>, L<Sah::FAQ>
=head1 HISTORY
=head2 2012-07-21
Split specification to Sah.
=head2 2011-11-23
L<Data::Sah>.
=head2 2009-03-30
Data::Schema (first CPAN release). Previous incarnation as Schema-Nested
(internal).
=head1 AUTHOR
perlancar <perlancar@cpan.org>
=head1 CONTRIBUTING
To contribute, you can send patches by email/via RT, or send pull requests on
GitHub.
Most of the time, you don't need to build the distribution yourself. You can
simply modify the code, then test via:
% prove -l
If you want to build the distribution (e.g. to try to install it locally on your
system), you can install L<Dist::Zilla>,
L<Dist::Zilla::PluginBundle::Author::PERLANCAR>,
L<Pod::Weaver::PluginBundle::Author::PERLANCAR>, and sometimes one or two other
Dist::Zilla- and/or Pod::Weaver plugins. Any additional steps required beyond
that are considered a bug and can be reported to me.
=head1 COPYRIGHT AND LICENSE
This software is copyright (c) 2022, 2020, 2019, 2017, 2016, 2015, 2014, 2013, 2012 by perlancar <perlancar@cpan.org>.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.
=head1 BUGS
Please report any bugs or feature requests on the bugtracker website L<https://rt.cpan.org/Public/Dist/Display.html?Name=Sah>.
When submitting a bug or request, please include a test-file or a
patch to an existing test-file that illustrates the bug or desired
feature.
=cut