=encoding utf8
=pod
=head1 NAME
DTA::CAB::WebServiceHowto - User documentation for DTA::CAB web-service
=cut
##========================================================================
## DESCRIPTION
=pod
=head1 DESCRIPTION
This document describes the use of the L<DTA::CAB|http://odo.dwds.de/~jurish/software/dta-cab/> web-service accessible
at L<http://www.deutschestextarchiv.de/public/cab/> and its aliases
L<http://www.deutschestextarchiv.de/demo/cab/> and L<http://www.deutschestextarchiv.de/cab/>.
The CAB web-service provides
error-tolerant linguistic analysis for historical German text,
including normalization of historical orthographic variants
to "canonical" modern forms using the method described in
L<Jurish (2012)|http://./#jurish2012>.
Due to legal restrictions on some of the underlying resources,
not all available analysis layers can be returned
by the publicly accessible web-service, but it is hoped that the available
layers (linguistically salient TEI-XML serialization,
sentence- and word-level tokenization,
orthographic normalization,
part-of-speech tags, and (normalized) lemmata) should suffice for most purposes.
=cut
##========================================================================
## Interface Elements
=pod
=head2 Interface Elements
Upon accessing the top-level web-service URL ( L<http://www.deutschestextarchiv.de/cab/> )
in a web browser, the user is presented with a graphical interface in which CAB queries can
be constructed and submitted to the underlying server. This section describes the various
elements of that interface.
=begin html
<a href="cab-screenshot-annotated.png"><img style="display:block;margin-left:auto;margin-right:auto;max-width:50%;" src="cab-screenshot-annotated.png"/></a>
=end html
=over 4
=item Query Form
At the top of the web-service interface is a query form on a gray background including input fields for the CAB
query parameters ("Query", "Analyzer", "Format", etc.).
Each query form input element should display a tooltip briefly describing its function if you hover your mouse
pointer over it in a browser.
See L<DTA::CAB::HttpProtocol/query Requests> for more details on supported query request parameters.
=item Status Line
Immediately beneath the L<query form|/Query Form> is a status display line ("URL line") on a white background with no border,
which contains a link to the raw response data for the current query, if any. In the case of
singleton (1-word) queries, the status line also contains
a simple heuristic "traffic-light" indicator of the query word's morphological security status, where
green indicates a "safe" known modern form, red indicates an unknown (presumably historical) form,
and yellow indicates a known modern form which is judged unsafe for identity canonicalization (typically
a proper name).
=item Response Data
Immediately below the L<URL line|/Status Line> is the response data area ("data area") on a white background with a gray border, which
displays the results for the current query, if any.
=item Link Buttons
Below the L<data area|/Response Data> are a number of static link buttons for
the file upload ("File Upload") or the live user-input interface ("Live Query"),
the list of analyzers supported by the underlying CAB instance ("Analyzers"),
the list of I/O formats supported by the underlying CAB instance ("Formats"),
administrative data for the CAB server instance ("Status"),
and the CAB documentation ("Documentation").
=item Footer
Below the L<link button area|/Link Buttons> is a short footer on a gray background containing
administrative information about the underlying CAB server.
=back
=cut
##========================================================================
## Basic Usage
=pod
=head2 Basic Usage
This section briefly describes basic usage of the DTA::CAB
web-service by means of some simple examples.
=cut
##========================================================================
## Basic Usage: Simple Query
=pod
=head3 A Simple Query
Most CAB parameters in the query form are initialized with sensible default values, with the exception
of the "Query" parameter itself, which should contain the text string to be analyzed.
Say we wish to analyze the text string "C<Elephanten>":
entering this string (or copying and pasting it) into the text input box associated with the "Query" parameter,
and then pressing the I<Enter> key or clicking the "submit" button, will cause the query to be
submitted to the underlying CAB server and the response data to be displayed in the L<data area|/Response Data>:
L<http://www.deutschestextarchiv.de/cab/?q=Elephanten>
The results are displayed by default in CAB's native "L<Text|/Text>" format,
in which the first line contains the input surface form (C<Elephanten>), and the remaining lines are the CAB
attributes for the query word,
where each attribute line is indicated by an initial TAB character, a plus ("+") sign,
and the L<attribute label|/Analyzers and Attributes> enclosed in square brackets ("[...]"), followed by the attribute value.
Useful attributes include "moot/word" (canonical modern form), "moot/tag" (part-of-speech tag), and
"moot/lemma" (modern lemma).
The example query for instance should produce a response such as:
 Elephanten
 	+[moot/word] Elefanten
 	+[moot/tag] NN
 	+[moot/lemma] Elefant
... indicating that the query was correctly normalized to the canonical modern form "Elefanten",
tagged as a common noun ("NN"), and assigned the correct canonical lemma "Elefant".
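For scripted access, the query URL can be assembled programmatically. The following Python sketch is only an illustration: it uses the C<q>, C<a>, and C<tokenize> parameters that appear in the example URLs of this document, and the helper names C<build_query_url> and C<fetch> are made up for the example. Whether the top-level URL or the C<.../cab/query> path linked from the file-query example is the better target for raw data is left open here; C<fetch> accepts either.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Top-level service URL as quoted in this document.
CAB_URL = "http://www.deutschestextarchiv.de/cab/"


def build_query_url(query, analyzer=None, tokenize=None, base=CAB_URL):
    """Return a CAB query URL built from the q, a, and tokenize parameters."""
    params = {"q": query}
    if analyzer is not None:
        params["a"] = analyzer
    if tokenize is not None:
        params["tokenize"] = int(bool(tokenize))
    # urlencode() percent-encodes UTF-8 bytes and turns spaces into "+"
    return base + "?" + urlencode(params)


def fetch(url, timeout=30):
    """Fetch a response body as text (assumes the reply is UTF-8 encoded)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_query_url("Elephanten"))
```

Building the URL with C<urlencode> rather than by string concatenation takes care of escaping non-ASCII characters such as "ä" in historical spellings.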
=cut
##========================================================================
## Basic Usage: Sentence Query
=pod
=head3 A Sentence Query
Suppose we wish to take advantage of context information while normalizing
a whole sentence of historical text, as described in
L<Jurish (2012), Chapter 4|http://./#jurish2012>.
Simply enter the entire text of the sentence to be analyzed in the "Query"
input, and ensure that the checkbox for the "tokenize" flag is checked,
and submit the query, e.g.
EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
The output now contains multiple tokens (words), where the analysis for
each token begins with a line containing only its surface form (no leading whitespace).
Token attribute lines (introduced by a leading TAB character) refer to the token
most recently introduced.
The output for the example query can be directly accessed
L<here|http://www.deutschestextarchiv.de/cab/?q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.>,
and should look something like the following:
EJn
+[moot/word] Ein
+[moot/tag] ART
+[moot/lemma] eine
zamer
+[moot/word] zahmer
+[moot/tag] ADJA
+[moot/lemma] zahm
Elephant
+[moot/word] Elefant
+[moot/tag] NN
+[moot/lemma] Elefant
gillt
+[moot/word] gilt
+[moot/tag] VVFIN
+[moot/lemma] gelten
ohngefähr
+[moot/word] ungefähr
+[moot/tag] ADJD
+[moot/lemma] ungefähr
zweyhundert
+[moot/word] zweihundert
+[moot/tag] CARD
+[moot/lemma] zweihundert
Thaler
+[moot/word] Taler
+[moot/tag] NN
+[moot/lemma] Taler
.
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
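The line-oriented layout described above (a surface form on a line of its own, followed by TAB-indented attribute lines) is easy to read back into a data structure. The following Python sketch shows one way to do so; the function name C<parse_tokens> is hypothetical, and the parser covers only the subset of the format described in this document.

```python
import re

# An attribute line: TAB, "+", the label in square brackets, then the value.
ATTR_LINE = re.compile(r"^\t\+\[([^\]]+)\]\s*(.*)$")


def parse_tokens(text):
    """Parse Text-format output into a list of (surface, attrs) pairs.

    attrs maps each attribute label to the list of values seen for it,
    because a label may occur more than once per token.
    """
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("%%"):
            continue  # not a token or attribute line
        match = ATTR_LINE.match(line)
        if match:
            if not tokens:
                raise ValueError("attribute line before any token: %r" % line)
            label, value = match.groups()
            # attribute lines refer to the most recently introduced token
            tokens[-1][1].setdefault(label, []).append(value)
        else:
            tokens.append((line, {}))
    return tokens


sample = (
    "EJn\n\t+[moot/word] Ein\n\t+[moot/tag] ART\n\t+[moot/lemma] eine\n"
    "zamer\n\t+[moot/word] zahmer\n\t+[moot/tag] ADJA\n\t+[moot/lemma] zahm\n"
)
tokens = parse_tokens(sample)
normalized = " ".join(attrs["moot/word"][0] for _, attrs in tokens)
print(normalized)  # Ein zahmer
```

Joining the C<moot/word> values in token order yields a normalized rendering of the input, here restricted to the first two tokens of the example sentence.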
=cut
##========================================================================
## Basic Usage: Multi-Sentence Query
=pod
=head3 A Multi-Sentence Query
The CAB web service can analyze multiple sentences as well, for example:
EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
Ceterum censeo Carthaginem esse delendam.
The corresponding output can be viewed L<here|http://www.deutschestextarchiv.de/cab?q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>, which should look something like:
%% $s:lang=de
EJn
+[moot/word] Ein
+[moot/tag] ART
+[moot/lemma] eine
...
.
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .

%% $s:lang=la
Ceterum
+[moot/word] Ceterum
+[moot/tag] FM.la
+[moot/lemma] ceterum
censeo
+[moot/word] censeo
+[moot/tag] FM.la
+[moot/lemma] censeo
Carthaginem
+[moot/word] Carthaginem
+[moot/tag] FM.la
+[moot/lemma] carthaginem
esse
+[moot/word] esse
+[moot/tag] FM.la
+[moot/lemma] esse
delendam
+[moot/word] delendam
+[moot/tag] FM.la
+[moot/lemma] delendam
.
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
Here, blank lines indicate sentence boundaries, and comments (non-tokens) are lines
introduced by two percent signs ("%%"). The special comments immediately preceding each sentence
of the form "C<%% $s:lang=>I<LANG>" indicate the result of CAB's language-guessing module
L<DTA::CAB::Analyzer::LangId::Simple|DTA::CAB::Analyzer::LangId::Simple>.
The blank line between the final "." of the first sentence and the first word of the second
sentence indicates that the sentence boundary was correctly detected,
and the "C<%% $s:lang=>I<LANG>" comments indicate that the source language of both
sentences was correctly guessed ("de" indicating German in the former case, and "la" indicating
Latin in the latter). Due to the language-guesser's assignment for the second sentence,
all words in that sentence are tagged as foreign material ("FM"), with the suffix ".la" indicating
the language-guesser's output. Otherwise, no analysis (normalization or lemmatization) is performed
for sentences recognized as non-German.
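The rule this document gives for the sentence-level C<lang> attribute (sum the token text lengths for each token-level language label, pick the label with the largest total, and fall back to "de") can be pictured with a short Python sketch. It is an illustration of the stated rule only, not the code of the language-guessing module; the token-level labels in the sample are invented, and the alphabetical tie-break is an assumption, since the document does not say how ties are resolved.

```python
from collections import Counter


def guess_sentence_lang(tokens, fallback="de"):
    """Pick a sentence language by a character-count vote.

    tokens is an iterable of (text, langs) pairs, where langs is the list
    of language labels attached to that token (possibly empty).
    """
    votes = Counter()
    for text, langs in tokens:
        for lang in set(langs):
            votes[lang] += len(text)
    if not votes:
        return fallback
    best = max(votes.values())
    # Assumption: ties are broken alphabetically.
    return min(lang for lang, count in votes.items() if count == best)


# Invented token-level labels for the second example sentence.
latin = [
    ("Ceterum", ["la"]),
    ("censeo", ["la"]),
    ("Carthaginem", ["la"]),
    ("esse", ["de", "la"]),
    ("delendam", ["la"]),
    (".", []),
]
print(guess_sentence_lang(latin))        # la
print(guess_sentence_lang([(".", [])]))  # de
```

With these labels, "la" collects 36 characters and "de" only 4, so the sentence is labelled Latin; a sentence without any labelled token receives the fallback value.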
=cut
##========================================================================
## Basic Usage: Document Query
=pod
=head3 A File Query
In addition to the default "live" query interface, the CAB web-service
also lets users upload an entire document file for analysis
and save the analysis results to their local machine.
The CAB file interface is accessible via the "File Upload"
button in the L<link area|/Link Buttons>, which resolves to L<http://www.deutschestextarchiv.de/cab/file|http://www.deutschestextarchiv.de/cab/file>.
Suppose we have a simple plain-text file L<F<elephant.raw>|http://./elephant.raw> containing the document to be analyzed:
EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
Ceterum censeo Carthaginem esse delendam.
First, save the input file L<F<elephant.raw>|http://./elephant.raw> to your local computer.
Then, in the CAB L<file input form|http://www.deutschestextarchiv.de/cab/file>,
click on the "Choose File" button and select F<elephant.raw> from wherever you saved it.
Clicking on the "submit" button will cause the contents of the selected file to be sent
to the CAB server and analyzed; you will then be prompted for a location to which the analysis results
should be saved (by default F<elephant.raw.txt>).
Assuming the default options were active, you should have a result file
resembling
L<this|http://www.deutschestextarchiv.de/cab/query?q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>,
identical to the results displayed in the L<data area|/Response Data> for
L<the multi-sentence query example|/A Multi-Sentence Query>.
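A saved result file can be split back into sentences by following the conventions of the multi-sentence example: blank lines separate sentences, and a "C<%% $s:lang=>I<LANG>" comment carries the language guess for the sentence that follows it. The Python sketch below is a minimal reader under those assumptions; C<split_sentences> is a hypothetical helper name, and as a precaution it also starts a new sentence whenever a language comment is encountered.

```python
import re

LANG_COMMENT = re.compile(r"^%%\s*\$s:lang=(\S+)")


def split_sentences(text):
    """Group Text-format lines into {"lang": ..., "lines": [...]} records."""
    sentences = []
    current = {"lang": None, "lines": []}
    for line in text.splitlines():
        if not line.strip():
            # a blank line closes the sentence collected so far
            if current["lines"]:
                sentences.append(current)
                current = {"lang": None, "lines": []}
            continue
        match = LANG_COMMENT.match(line)
        if match:
            # a language comment also starts a new sentence if needed
            if current["lines"]:
                sentences.append(current)
                current = {"lang": None, "lines": []}
            current["lang"] = match.group(1)
        elif line.startswith("%%"):
            continue  # other comments are ignored
        else:
            current["lines"].append(line)
    if current["lines"]:
        sentences.append(current)
    return sentences


sample = (
    "%% $s:lang=de\nEJn\n\t+[moot/word] Ein\n.\n\t+[moot/tag] $.\n"
    "\n"
    "%% $s:lang=la\nCeterum\n\t+[moot/tag] FM.la\n"
)
sentences = split_sentences(sample)
for sentence in sentences:
    surface = [l for l in sentence["lines"] if not l.startswith("\t")]
    print(sentence["lang"], surface)
```

To apply this to the downloaded file, read it as UTF-8 text (for example with C<open("elephant.raw.txt", encoding="utf-8")>) and pass the contents to C<split_sentences>.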
=head4 Caveats
Due to bandwidth limitations, the CAB server currently only accepts input files of size B<E<lt>= 1 MB>.
If you need to analyze a large amount of data, you will first need to split your input files into
chunks of no more than 1 MB each, sending each chunk to the server individually. In this case,
please refrain from "hammering" the CAB server with an uninterrupted stream of requests: wait at least
3-5 seconds between requests to avoid blocking the server for other users. Alternatively,
if you need to analyze a large corpus, you can L<contact the I<Deutsches Textarchiv>|http://deutschestextarchiv.de/doku/impressum#kontakt>.
=cut
##========================================================================
## Analysis Chains
=pod
=head2 Analysis Chains
The CAB server supports a number of different analysis modes, corresponding to
different sorts of input data and/or different user tasks. The various analysis
modes are implemented in terms of different analysis chains (a.k.a. "analyzers" or just "chains")
supported by the underlying analysis dispatcher class, L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA>.
The analysis mode to be used for a particular CAB request is specified by the
"analyzer" or "a" parameter, which is initially set to use the "default" analysis chain
(which is itself just an alias for the "norm" chain).
This section briefly describes some alternative analysis chains and situations in which
they might be useful.
For a full list of available analysis chains, see the list returned by the
"L<Analyzers|http://www.deutschestextarchiv.de/cab/analyzers>" button in the L<link area|/Link Buttons>,
and see L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA>
for a list of the available atomic analyzers and aliases for complex analysis chains.
For details on individual atomic analyzers, see the appropriate L<DTA::CAB::Analyzer|DTA::CAB::Analyzer>
subclass documentation.
=cut
##========================================================================
## Analysis Chains: Type-wise
=pod
=head3 Type-wise Analysis
As noted L<above|/"A Sentence Query">, the default "norm" analysis chain uses
sentential context to improve the precision of the normalization process
as described in
L<Jurish (2012), Chapter 4|http://./#jurish2012>.
This behavior is not always desirable, however. In particular, if your data
is not arranged into linguistically meaningful sentence-like units -- e.g.
a simple flat list of surface types -- then no real context information is available,
and the "sentential" context CAB would use would more likely hinder the normalization
than help it. For such cases, the "norm1" analysis chain can be employed instead
of the default "norm" chain. The "norm1" chain uses only unigram-based probabilities
during normalization, and is therefore less likely to be "confused" by non-sentence-like inputs.
Consider for example the input:
Fliegende Fliegen Fliegen Fliegenden Fliegen nach.
Passing this list to the "norm1" chain inhibits context-dependent processing
and results in L<the following|http://www.deutschestextarchiv.de/cab?a=norm1&q=Fliegende+Fliegen+Fliegen+Fliegenden+Fliegen+nach.>:
Fliegende
+[moot/word] Fliegende
+[moot/tag] ADJA
+[moot/lemma] fliegend
Fliegen
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
Fliegen
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
Fliegenden
+[moot/word] Fliegenden
+[moot/tag] NN
+[moot/lemma] Fliegende
Fliegen
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
nach
+[moot/word] nach
+[moot/tag] APPR
+[moot/lemma] nach
.
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
Contrast this with L<the output of the default "norm" chain|http://www.deutschestextarchiv.de/cab?a=norm&q=Fliegende+Fliegen+Fliegen+Fliegenden+Fliegen+nach.>:
Fliegende
+[moot/word] Fliegende
+[moot/tag] ADJA
+[moot/lemma] fliegend
Fliegen
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
Fliegen
+[moot/word] Fliegen
+[moot/tag] VVFIN
+[moot/lemma] fliegen
Fliegenden
+[moot/word] Fliegenden
+[moot/tag] ADJA
+[moot/lemma] fliegend
Fliegen
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
nach
+[moot/word] nach
+[moot/tag] PTKVZ
+[moot/lemma] nach
.
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
In the above example, all instances of the surface form "Fliegen" are analyzed as common nouns (NN) with lemma "Fliege"
by the unigram-based analyzer "norm1". If sentential context is considered, the second instance of "Fliegen" is correctly
analyzed as a finite verb form (VVFIN) of the lemma "fliegen". Similarly, the unigram-based analyzer mis-tags "Fliegenden"
as a noun rather than an attributive adjective (NN vs. ADJA) and assigns a corresponding (incorrect) lemma "Fliegende".
The final particle "nach" is mis-tagged as a preposition (APPR vs. PTKVZ) by the unigram-based model, but this has
no effect on the lemma assigned.
Although use of the "norm1" analyzer does not alter any canonical modern forms in this example,
such cases are possible.
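When deciding between two chains for a given data set, it can help to list exactly where their outputs differ. The Python sketch below compares two already-parsed analyses token by token; the function name C<diff_analyses> and the tuple layout are choices made for this example, and the sample data is an excerpt (the second "Fliegen" and "Fliegenden") of the two outputs shown above.

```python
def diff_analyses(left, right, labels=("moot/word", "moot/tag", "moot/lemma")):
    """Yield (index, surface, label, left_value, right_value) per mismatch.

    left and right are lists of (surface, attrs) pairs for the same input.
    """
    if len(left) != len(right):
        raise ValueError("token counts differ")
    for index, ((surface, a), (other, b)) in enumerate(zip(left, right)):
        if surface != other:
            raise ValueError("surface forms differ at token %d" % index)
        for label in labels:
            if a.get(label) != b.get(label):
                yield (index, surface, label, a.get(label), b.get(label))


norm1 = [
    ("Fliegen", {"moot/word": "Fliegen", "moot/tag": "NN", "moot/lemma": "Fliege"}),
    ("Fliegenden", {"moot/word": "Fliegenden", "moot/tag": "NN", "moot/lemma": "Fliegende"}),
]
norm = [
    ("Fliegen", {"moot/word": "Fliegen", "moot/tag": "VVFIN", "moot/lemma": "fliegen"}),
    ("Fliegenden", {"moot/word": "Fliegenden", "moot/tag": "ADJA", "moot/lemma": "fliegend"}),
]
for difference in diff_analyses(norm1, norm):
    print(difference)
```

For this excerpt, the differences are confined to the tag and lemma attributes; the C<moot/word> values agree.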
=cut
##========================================================================
## Analysis Chains: Expansion
=pod
=head3 Term Expansion
It is sometimes useful to have a list of all known orthographic variants of a given input form, e.g.
for runtime queries of a database which indexes only surface forms. For such tasks, the analysis
chain "expand" can be used. To
see all the variants of the surface form "Elephant" in the
L<I<Deutsches Textarchiv>|http://www.deutschestextarchiv.de> corpus for example, one could query
L<http://deutschestextarchiv.de/cab?a=expand&q=Elephant&tokenize=0>,
and expect a response something like:
Elephant
+[moot/word] Elefant
+[moot/tag] NN
+[moot/lemma] Elefant
+[eqpho] Elephant <0>
+[eqpho] Elefant <14>
+[eqpho] elephant <17>
+[eqpho] elevant <17>
+[eqpho] Elephand <18>
+[eqpho] Elevant <18>
+[eqpho] elefant <18>
+[eqpho] Elephandt <19>
+[eqpho] Elephanth <19>
+[eqrw] Elefant <0>
+[eqrw] Elephant <0>
+[eqrw] Elephandt <8.44527626037598>
+[eqrw] elefant <8.44683265686035>
+[eqrw] Elephanth <8.70806312561035>
+[eqrw] elephant <9.01417255401611>
+[eqrw] Elephand <18.6624526977539>
+[eqrw] Eliphant <18.7045001983643>
+[eqrw] Elephants <21.1982593536377>
+[eqrw] elevant <21.3945064544678>
+[eqrw] Elphant <23.2134704589844>
+[eqrw] Elesant <27.7278366088867>
+[eqrw] Elephanta <30.2710800170898>
+[eqlemma] Elefannten <0>
+[eqlemma] Elefant <0>
+[eqlemma] Elefanten <0>
+[eqlemma] Elefantin <0>
+[eqlemma] Elefantine <0>
+[eqlemma] Elephandten <0>
+[eqlemma] Elephant <0>
+[eqlemma] Elephanten <0>
+[eqlemma] Elesant <0>
+[eqlemma] elefant <0>
+[eqlemma] elephanten <0>
Here, the "eqpho" attribute contains all surface forms recognized as phonetic variants of the query term,
"eqrw" contains those surface forms recognized as variants by the heuristic rewrite cascade,
and "eqlemma" contains the surface forms most likely to be mapped to the same modern lemma as the query term.
This online expansion strategy
is used by the L<DTA Query Lizard|http://kaskade.dwds.de/dstar/dta/lizard?q=Elephant>,
and was also used by an earlier version of the L<DTA corpus index|http://kaskade.dwds.de/dstar/dta/>
as described in
L<Jurish et al. (2014)|http://./#jtw2014>,
but has since been replaced there by an online lemmatization query using the "lemma" expander,
in conjunction with a direct query of the underlying corpus $Lemma index.
The request includes the C<tokenize=0> option,
which informs the CAB server that the query does not need to be tokenized, effectively forcing use of the
L<C<qd> parameter|DTA::CAB::HttpProtocol/Query Parameters> to the low-level service. This is generally
a good idea when using single-token queries or pre-tokenized documents, since it speeds up processing.
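To use an expansion in a downstream query, the "I<FORM> <I<COST>>" values of the C<eqpho>, C<eqrw>, or C<eqlemma> attributes have to be separated into form and number. The Python sketch below does this for a few C<eqrw> values copied from the example above. The helper names are hypothetical, and treating smaller numbers as closer matches is an assumption about the ranking rather than something this document states.

```python
import re

# "FORM <COST>": everything up to the final angle-bracketed number is the form.
CANDIDATE = re.compile(r"^(?P<form>.*\S)\s*<(?P<cost>[^<>]+)>$")


def parse_candidate(value):
    """Split an expansion value such as 'Elephandt <8.4>' into (form, cost)."""
    match = CANDIDATE.match(value.strip())
    if match is None:
        raise ValueError("not an expansion value: %r" % value)
    return match.group("form"), float(match.group("cost"))


def variants(values, max_cost=None):
    """Return the distinct forms, ordered by ascending cost and then by form,
    optionally dropping forms whose cost exceeds max_cost."""
    best = {}
    for value in values:
        form, cost = parse_candidate(value)
        if max_cost is not None and cost > max_cost:
            continue
        if form not in best or cost < best[form]:
            best[form] = cost
    return sorted(best, key=lambda form: (best[form], form))


eqrw = [
    "Elefant <0>",
    "Elephant <0>",
    "Elephandt <8.44527626037598>",
    "elefant <8.44683265686035>",
    "Elephanta <30.2710800170898>",
]
print(variants(eqrw, max_cost=10))
```

A cost threshold such as the one used here trades recall for precision; suitable values depend on the attribute, since the three expanders use different scales.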
=cut
##========================================================================
## Analysis Chains: Date-optimized
=pod
=head3 Date-optimized Analysis
As of DTA::CAB v1.78, the L<DTA Dispatcher|DTA::CAB::Chain::DTA>
includes specialized L<rewrite models|DTA::CAB::Chain::DTA/rw>
(C<rw.1600-1700>, C<rw.1700-1800>, C<rw.1800-1900>),
and provides a number of L<high-level convenience chains|DTA::CAB::Chain::DTA/setupChains>
(C<norm.1600-1700>, C<norm1.1600-1700>, etc.) using these models
instead of the default "generic" rewrite cascade (C<rw>)
to provide canonicalization hypotheses for unknown words.
The weights for the specialized rewrite models
were trained on a modest number of manually assigned canonicalization pairs
from the period in question
extracted from the L<CabErrorDb|http://kaskade.dwds.de/caberr/>
error database (4,000-8,000 pair types per model), and may provide a slight
improvement in canonicalization accuracy with respect to the generic model,
provided that you specify the appropriate analysis chain ("Analyzer") in your request.
Compare for example the outputs of the various chains for the input forms
I<avf>, I<Auffichten>, and I<Büberchens>:
=over 4
=item *
L<generic|http://www.deutschestextarchiv.de/cab/?a=default1&q=avf%20Auffichten%20B%C3%BCberchens>
=item *
L<1600-1700|http://www.deutschestextarchiv.de/cab/?a=default1.1600-1700&q=avf%20Auffichten%20B%C3%BCberchens>
=item *
L<1700-1800|http://www.deutschestextarchiv.de/cab/?a=default1.1700-1800&q=avf%20Auffichten%20B%C3%BCberchens>
=item *
L<1800-1900|http://www.deutschestextarchiv.de/cab/?a=default1.1800-1900&q=avf%20Auffichten%20B%C3%BCberchens>
=back
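If the publication year of a text is known, the chain name can be derived from it. The following Python sketch assumes that the convenience chains follow the naming pattern of C<norm.1600-1700> and C<norm1.1600-1700> for all three periods, which the "etc." above suggests but does not spell out. Assigning boundary years to the later period and falling back to the generic chain outside 1600-1900 are choices made for this example only; C<chain_for_year> is a hypothetical helper.

```python
# Periods covered by the specialized rewrite models named above.
PERIODS = ((1600, 1700), (1700, 1800), (1800, 1900))


def chain_for_year(year, base="norm"):
    """Return a date-specific chain name for year, or base if none applies.

    Assumption: a boundary year such as 1700 is assigned to the later period.
    """
    for start, end in PERIODS:
        if start <= year < end:
            return "%s.%d-%d" % (base, start, end)
    return base


print(chain_for_year(1655))           # norm.1600-1700
print(chain_for_year(1700, "norm1"))  # norm1.1700-1800
print(chain_for_year(1950))           # norm
```

The resulting name would be passed as the "Analyzer" (C<a>) parameter of the request; check the "Analyzers" list of the server to confirm that a given chain name is actually available.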
=cut
##========================================================================
## Analysis Chains: Format Conversion
=pod
=head3 Format Conversion
The CAB server can be used to convert between various
supported L<IE<sol>O Formats|/IE<sol>O Formats>. In this mode,
no analysis is performed on the input data
(with the exception of tokenization for raw untokenized input),
but the input document is parsed and re-formatted according to
the selected output format. The analysis chain "null" can be
selected for such tasks. To tokenize a simple text
string for instance, you can select the "null" analyzer and the
"text" format, and expect output such as
L<this|http://www.deutschestextarchiv.de/cab?a=null&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>.
This mode of operation is mostly useful in conjunction with
L<file upload queries|/A File Query> to convert analyzed files.
If you only need to tokenize raw text files, consider using
the more efficient L<WASTE tokenizer web-service|http://www.dwds.de/waste/>
directly,
or the CLARIN-D L<WebLicht|http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/> tool-chainer,
which offers a number of different tokenizer components.
=cut
##========================================================================
## Analyzers and Attributes
=pod
=head2 Analyzers and Attributes
Each L<analysis chain|/Analysis Chains> is (as the name suggests) implemented as a
finite list of atomic L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> objects, the
"component analyzers". The component analyzers in a chain are run in strict serial order
(one after the other), and later-running components can and do refer to the results of
analysis performed by components which have already been run.
Each L<component analyzer|/Analysis Components>
populates zero or more
L<token-|DTA::CAB::Token>,
L<sentence-|DTA::CAB::Sentence>, and/or
L<document-|DTA::CAB::Document>-level attributes; typically, the component C<X>
will write the results of its analysis to the token-attribute C<X>
(e.g. the C<morph> component populates the attribute C<$w.morph>).
In the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> configuration,
each individual L<component analyzer|/Analysis Components> C<X> can be addressed explicitly by means of the analysis chain C<sub.X>,
and the default chain of components up to and including C<X> can be invoked by the analysis chain C<default.X>,
e.g. the morphological analysis component L<C<morph>|/morph> can be invoked directly by selecting the chain
C<sub.morph>, and together with its prerequisites by selecting the chain C<default.morph>.
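The relationship between a chain, its components, and the C<sub.X> and C<default.X> aliases can be illustrated with a toy model in Python. Nothing below reflects the actual DTA::CAB implementation: the two components are trivial stand-ins (replacing long s, lowercasing), and the names C<select> and C<run> are invented for the sketch. It only shows serial execution, with each component writing an attribute named after itself and a later component reading an earlier result.

```python
def toy_xlit(token):
    # Stand-in for transliteration: map long s to round s.
    token["xlit"] = token["text"].replace("ſ", "s")


def toy_morph(token):
    # Uses the earlier component's result if present, else the raw text.
    form = token.get("xlit", token["text"])
    token["morph"] = [form.lower()]


# Components in strict serial order.
CHAIN = [("xlit", toy_xlit), ("morph", toy_morph)]


def select(chain_name):
    """Resolve 'default', 'sub.X', or 'default.X' against the toy CHAIN."""
    if chain_name == "default":
        return list(CHAIN)
    kind, _, name = chain_name.partition(".")
    names = [n for n, _ in CHAIN]
    if kind not in ("sub", "default") or name not in names:
        raise KeyError(chain_name)
    index = names.index(name)
    # sub.X: only X; default.X: everything up to and including X
    return CHAIN[index:index + 1] if kind == "sub" else CHAIN[:index + 1]


def run(chain_name, text):
    token = {"text": text}
    for _, component in select(chain_name):
        component(token)
    return token


print(run("default.morph", "Waſſer"))
print(run("sub.morph", "Waſſer"))
```

In the toy model, C<default.morph> runs the transliterator first and so analyzes "Wasser", whereas C<sub.morph> sees only the raw token text.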
=head3 System Architecture
The following diagram is a simplified sketch of the dataflow relationships between the major L</Analysis Components>
for the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> analysis chain operating on
a "raw" L<TEI-XML|/TEI> input document.
=begin html
<a href="sysarch.png"><img id="sysarch" style="display:block;margin-left:auto;margin-right:auto;max-width:50%;" src="sysarch.png"/></a>
=end html
=cut
##------------------------------------------------------
## Analysis Components
=pod
=head3 Analysis Components
This section describes the atomic analysis components
provided by the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> configuration.
=over 4
=item L<static|DTA::CAB::Analyzer::Cache::Static>
Static type-wise analysis cache for
the L<attributes|/Analysis Attributes>
C<eqphox>, C<errid>, C<exlex>, C<f>, C<lts>, C<mlatin>, C<morph>, C<msafe>, C<rw>, C<xlit>, C<lang>, and C<pnd>
based on the most recent release of the L<I<Deutsches Textarchiv>|http://www.deutschestextarchiv.de> corpus,
typically less than one week old.
=item L<exlex|DTA::CAB::Analyzer::ExLex>
Type-wise exception lexicon extracted from the
L<DTA EvalCorpus|http://odo.dwds.de/~jurish/software/dtaec/>
and the DTA::CAB error database (demo L<here|http://kaskade.dwds.de/caberr/>),
typically updated weekly.
=item L<tokpp|DTA::CAB::Analyzer::TokPP>
Type-wise heuristic token preprocessor used to identify punctuation, numbers, quotes, etc.
=item L<xlit|DTA::CAB::Analyzer::Unicruft>
Deterministic type-wise character transliterator based on L<libunicruft|http://odo.dwds.de/~moocow/software/unicruft/>,
mostly useful for handling extinct characters and diacritics.
=item L<lts|DTA::CAB::Analyzer::LTS>
Deterministic type-wise phonetization ("letter-to-sound" mapping)
via L<Gfsm|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/> transducer
as described in L<Jurish (2012), Ch. 1|http://./#jurish2012>.
=item L<morph|DTA::CAB::Analyzer::Morph>
Type-wise morphological analysis of the (transliterated) surface form
via L<Gfsm|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/> transducer.
The default DTA analysis chain uses a modified version of the
L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>
morphology FST.
=item L<mlatin|DTA::CAB::Analyzer::Morph::Latin>
Type-wise Latin pseudo-morphology for (transliterated) surface forms
based on the finite word-list distributed with
the
"L<William Whitaker's Words|https://sourceforge.net/projects/wwwords/>"
Latin dictionary.
=item L<msafe|DTA::CAB::Analyzer::MorphSafe>
Heuristics for detecting "suspicious" analyses supplied
by the L</morph> component (L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>),
as described in L<Jurish (2012), App. A.4|http://./#jurish2012>.
=item L<langid|DTA::CAB::Analyzer::LangId::Simple>
Simple sentence-wise language guesser based on stopword lists
extracted from the python L<NLTK project|http://www.nltk.org/>.
Also supports the pseudo-language C<XY>, which is typically assigned
for mathematical notation, abbreviations, or other extra-lexical material.
=item L<rw|DTA::CAB::Analyzer::Rewrite>
Type-wise I<k>-best weighted finite-state rewrite cascade conflator ("nearest neighbors")
via L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade,
as described in L<Jurish (2012), Ch. 2|http://./#jurish2012>.
=item L<eqphox|DTA::CAB::Analyzer::EqPhoX>
Type-wise phonetic equivalence conflator using a
L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade;
requires prior L</lts> analysis.
Unlike the presentation in L<Jurish (2012), Ch. 3|http://./#jurish2012>,
the current implementation uses a I<k>-best search strategy
over an infinite target language derived from the
L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>
morphology for improved recall.
=item L<dmoot|DTA::CAB::Analyzer::Moot::Boltzmann>
Sentence-wise conflation candidate disambiguator as described in
L<Jurish (2012), Ch. 4|http://./#jurish2012>. Attempts to determine
the "best" modern form from the candidate conflations provided by
the L</exlex>, L</xlit>, L</eqphox>, and L</rw> components,
after consideration of the properties provided by the
L</morph>, L</msafe>, L</mlatin>, and L</langid> components
(e.g. sentences already identified as consisting primarily of
foreign-language material will B<not> be "forced" onto contemporary
German).
=item L<dmootsub|DTA::CAB::Analyzer::MootSub>
Sentence-wise post-processing for the L</dmoot> HMM.
Mostly useful for performing L<morphological analysis|/morph> on
non-trivial canonicalizations supplied by L</dmoot>.
=item L<moot|DTA::CAB::Analyzer::Moot>
Sentence-wise part-of-speech (PoS) tagging using
the L<moot|http://kaskade.dwds.de/~moocow/mirror/projects/moot/> tagger
on the observations (word forms) provided by L</dmoot> or the raw
input token text and the morphological ambiguity classes supplied
by L</dmootsub> or L</morph>.
=item L<mootsub|DTA::CAB::Analyzer::MootSub>
Sentence-wise post-processing for the L</moot> tagger.
Mostly useful for determining the "best" lemma for the canonical
word form (L</dmoot> or token text) and PoS-tag selected
by L</moot> from the set of canonical morphological analyses
(L</dmootsub> or L</morph>).
=back
=cut
##------------------------------------------------------
## Analysis Attributes
=pod
=head3 Analysis Attributes
This section describes the most common analysis attributes
used by the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> configuration.
Each attribute is described by a template such as:
perl: $OBJ->{ATTR} = CODE
text: +[LABEL] TEXT
hidden: HIDDEN
where:
=over 4
=item *
C<$OBJ-E<gt>{ATTR} = CODE> is Perl notation for the underlying data-structure of the attribute.
C<$OBJ> is one of C<$w>, C<$s>, or C<$doc> to indicate
a
L<token-|DTA::CAB::Token>,
L<sentence-|DTA::CAB::Sentence>,
or L<document-|DTA::CAB::Document>-level attribute, respectively.
If unspecified, C<ATTR> is identical to the attribute name itself,
and C<CODE> is a simple string containing the atomic attribute value.
=item *
C<+[LABEL] TEXT> is the L</Text>-Format notation for the corresponding attribute.
If unspecified, C<LABEL> is identical to the attribute name itself,
and C<TEXT> is a simple string containing the atomic attribute value.
C<+[LABEL] ...> indicates that the label C<LABEL> can occur more than once per
object -- this typically means that the corresponding attribute is list-valued.
=item *
C<HIDDEN> indicates whether the attribute's value is suppressed in the
output of the publicly accessible web-service.
If unspecified, the attribute is not hidden.
=back
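The correspondence between the two notations can be made concrete with a small Python sketch that renders a token record as Text-format lines. It mirrors the Perl structures as Python dicts and lists, covers only the C<moot> and C<rw> attributes as described in this document, and uses invented cost values; C<render_token> is a hypothetical helper, not part of DTA::CAB.

```python
def render_token(token):
    """Render a token dict as Text-format lines (toy subset of attributes)."""
    lines = [token["text"]]
    moot = token.get("moot", {})
    for key in ("word", "tag", "lemma"):
        if key in moot:
            # hash-valued attribute: one line per sub-attribute
            lines.append("\t+[moot/%s] %s" % (key, moot[key]))
    for analysis in token.get("rw", []):
        # list-valued attribute: the label repeats once per element
        lines.append("\t+[rw] %s <%s>" % (analysis["hi"], analysis["w"]))
    return "\n".join(lines)


token = {
    "text": "Elephant",
    "moot": {"word": "Elefant", "tag": "NN", "lemma": "Elefant"},
    # costs below are made up for illustration
    "rw": [{"w": 0, "hi": "Elefant"}, {"w": 8.5, "hi": "Elephandt"}],
}
print(render_token(token))
```

The repeated C<+[rw]> label in the output corresponds to the "C<+[LABEL] ...>" convention used in the templates for list-valued attributes.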
The following attributes are used by the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> chain:
=over 4
=item dmoot
perl: $w->{dmoot} = { tag=>$CANON, analyses=>[ { tag=>$CANDIDATE, prob=>$COST }, ... ] }
text: +[dmoot/tag] CANON
+[dmoot/analysis] CANDIDATE <COST>
+[dmoot/analysis] ...
HMM conflation candidate disambiguation supplied by the L</dmoot> component.
C<$CANON> represents the "best" modern form for the input token L</text>, and the candidate
conflations are represented by elements of the C<analyses> array.
=item eqphox
perl: $w->{eqphox} = [ { w=>$COST, hi=>$CANDIDATE }, ... ]
text: +[eqphox] CANDIDATE <COST>
+[eqphox] ...
Phonetic conflation candidate(s) supplied by the L</eqphox> component.
=item exlex
L<Exception lexicon|/exlex> entry (preferred modern form) for the input type (L</text>), if any.
=item errid
Error-ID from the DTA::CAB error database giving rise to the
L<exception lexicon|/exlex> entry for the input type (L</text>), if any.
The C<errid> attribute value may also be the designated string "C<ec>", indicating that the exception lexicon entry
was automatically generated from the L<DTA EvalCorpus|http://odo.dwds.de/~jurish/software/dtaec/>.
=item f
Frequency of the surface type in the DTA corpus, if available.
=item hasmorph
Boolean integer indicating whether or not at least one L</morph>
analysis was present for the current token.
Unlike the L<morph|/morph> attribute, the C<hasmorph> attribute
is B<not> suppressed in the output from the publicly accessible
web-service.
=item lang
perl: $w->{lang} = [ $LANG, ... ]
text: +[lang] LANG
+[lang] ...
Language(s) in which the current token is known to occur, as determined by the
L</morph>, L</mlatin>, and/or L</lang> components.
perl: $s->{lang} = $LANG
text: %% $s:lang=LANG
As a sentence-attribute, C<lang> indicates the single best guess source language for the sentence.
The "best guess" is determined for each sentence by counting the total number of token L</text> characters
for each C<lang> attribute value associated with any token in that sentence,
and choosing the C<lang> attribute value
with the largest character count; the default fallback value for the
DTA chain is "de" (i.e. German).
=item lts
perl: $w->{lts} = { w=>$COST, hi=>$PHO }
text: +[lts] PHO <COST>
Approximated phonetic form for the input type (L</text>), as determined by the L</lts> component.
=item mlatin
perl: $w->{mlatin} = [ { w=>0, hi=>"[_FM][lat]" } ]
text: +[morph/lat] [_FM][lat] <0>
Pseudo-morphological analysis returned by the L</mlatin> component.
=item moot
perl: $w->{moot} = { word=>$WORD, tag=>$TAG, lemma=>$LEMMA, details=>\%DETAILS, analyses=>\@ANALYSES }
text: +[moot/word] WORD
+[moot/tag] TAG
+[moot/lemma] LEMMA
+[moot/details] ...
+[moot/analysis] ...
Represents the part-of-speech tag (C<TAG>) supplied by the L</moot> component and
post-tagging lemmatization (C<LEMMA>) provided by the L</mootsub> component.
For convenience, the sub-attribute C<WORD> indicates the final "canonical" surface form
which was ultimately used as input to the part-of-speech tagger, and thus CAB's bottom line
best-guess regarding the "proper" extant equivalent word form for the input token L</text>.
C<DETAILS> and C<ANALYSES> are suppressed in the output of the publicly accessible web-service.
=item morph
perl: $w->{morph} = [ { w=>$COST, hi=>"$DEEP\[_$TAG]$FEATURES"}, ... ]
text: +[morph] DEEP [_TAG]FEATURES
+[morph] ...
hidden: yes
Morphological analysis returned by the L</morph> component.
C<$DEEP> is the "deep-form" of the corresponding lexical item, C<$TAG> is a part-of-speech tag
supplied by the morphological component, and C<$FEATURES> are additional morphosyntactic features (if any).
=item msafe
perl: $w->{msafe} = $SAFE
text: +[morph/safe] SAFE
Morphological security heuristics returned by the L</msafe> component.
C<$SAFE> is a Boolean integer, where 0 (zero) indicates a "suspicious" type
and 1 (one) indicates a "trustworthy" type.
=item pnd
If present, indicates that the surface form occurs in an
L<Integrated Authority File (GND)|https://www.dnb.de/EN/Standardisierung/GND/gnd_node.html>
record for at least one author whose work is included in the I<Deutsches Textarchiv> core corpus.
Probably not really very useful in general.
=item rw
perl: $w->{rw} = [ { w=>$COST, hi=>$CANDIDATE }, ... ]
text: +[rw] CANDIDATE <COST>
+[rw] ...
Rewrite conflation candidate(s) supplied by the L</rw> component.
=item text
Raw UTF-8 surface token text as appearing in the input stream.
If the input was untokenized and the source token was split over multiple input lines,
then the C<text> attribute will B<not> contain the hyphen or line-break.
=item xlit
perl: $w->{xlit} = { isLatin1=>$IS_LATIN_1, isLatinExt=>$IS_LATIN_EXT, latin1Text=>$LATIN1_TEXT }
text: +[xlit] l1=IS_LATIN_1 lx=IS_LATIN_EXT l1s=LATIN1_TEXT
Deterministic transliteration of input token L</text> and associated properties provided by the L</xlit> component.
C<$IS_LATIN_1> is 1 (one) iff the input token L</text> consists entirely of characters from the latin-1 subset, and otherwise 0 (zero).
Similarly,
C<$IS_LATIN_EXT> is 1 (one) iff the input token L</text> consists entirely of characters from the latin-1 subset and/or latin extensions,
and otherwise 0 (zero).
C<$LATIN1_TEXT> is a Latin-1 transliteration of the token's L</text> attribute.
=back
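
The sentence-level C<lang> heuristic described under L</lang> can be approximated as follows.
This is a minimal Python sketch, not CAB's own implementation: tokens are assumed to be plain
dictionaries mirroring C<< $w->{text} >> and C<< $w->{lang} >>, the function name
C<guess_sentence_lang> and the language label "la" are purely illustrative, and the tie-breaking
behavior (the first label encountered wins) is an assumption which is not specified above.

```python
from collections import Counter


def guess_sentence_lang(tokens, fallback="de"):
    """Pick the lang label which covers the most token-text characters.

    tokens   : list of dicts with a "text" string and an optional "lang" list
               (hypothetical structure mirroring the token attributes above)
    fallback : label returned if no token carries a lang attribute
    """
    counts = Counter()
    for tok in tokens:
        for lang in tok.get("lang", []):
            # each label is credited with the full character length of the token
            counts[lang] += len(tok["text"])
    if not counts:
        return fallback
    # ties are resolved in favor of the first label seen (an assumption)
    return max(counts, key=counts.get)


# illustrative input: "la" is a placeholder label, not necessarily the one CAB emits
sentence = [
    {"text": "Ceterum", "lang": ["la"]},
    {"text": "esse", "lang": ["la", "de"]},
    {"text": "."},
]
best = guess_sentence_lang(sentence)
```

Tokens without any C<lang> attribute (such as the final punctuation mark in the example) simply
do not contribute to any count.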
=cut
##========================================================================
## Formats
=pod
=head2 I/O Formats
The CAB web-service supports a number of different input- and output-formats for
document data. This section presents a brief outline of some of the more
popular formats. See L<DTA::CAB::Format/SUBCLASSES> for a list of currently
implemented format subclasses, and see
the "L<Formats|http://www.deutschestextarchiv.de/cab/formats>" link
in the CAB web-service interface L<link area|/Link Buttons> for a list of format aliases supported
by the server. Formatted input documents are passed to the low-level service
using the L<C<qd> query parameter|DTA::CAB::HttpProtocol/Query Parameters>,
the use of which is controlled by the C<tokenize> option in the L<Query Form|/Query Form>
section of the graphical interface.
=cut
##========================================================================
## Formats: Text-based
=pod
=head3 Text-based Formats
CAB supports various text-based formats for human consumption and/or further processing.
While typically not as flexible or efficient as the "pure" data-oriented formats
described L<below|/Data-oriented Formats>, CAB's native text-based formats offer
a reasonable compromise between human- and machine-readability.
All text-based CAB formats expect and return data encoded in L<UTF-8|https://en.wikipedia.org/wiki/UTF-8>,
B<without> a L<Byte-order mark|https://en.wikipedia.org/wiki/Byte_order_mark>.
=cut
##========================================================================
## Formats: Text-based: Raw
=pod
=over 4
=item L<Raw|DTA::CAB::Format::Raw>
[L<example input|http://./elephant.raw>,
L<example output|http://www.deutschestextarchiv.de/cab?fmt=raw&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Format for "raw" unstructured, untokenized UTF-8 text, implicitly used for query strings
passed in via the L<C<q> parameter|DTA::CAB::HttpProtocol/Query Parameters>.
Input will be tokenized with a language-specific L<WASTE|http://kaskade.dwds.de/waste/about.perl>
model characteristic for the underlying C<DTA::CAB> server
(typically L<C<de-dta-tiger>|http://kaskade.dwds.de/waste/downloads.perl>).
As an output format,
L<DTA::CAB::Format::Raw> writes all and only the canonical ("modern", "normalized") surface form
of each token to the output stream. Individual output tokens are separated by a single space
character (ASCII 0x20, C<" ">) and individual output sentences are separated by a single newline
character (ASCII 0x0A, C<"\n">).
If you are interested only in modern lemmata (rather than surface forms),
consider using the TAB-separated L</CSV> format and projecting the 5th output
column (LEMMA).
If you are working from XML input such as L</TEI> or L</TEI-ling>, consider
using the default (information-rich) output format and post-processing it
with the L</ling2plain.xsl> stylesheet.
=cut
##========================================================================
## Formats: Text-based: Text
=pod
=item L<Text|DTA::CAB::Format::Text>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=text&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Simple human-readable text format as described under "L<Basic Usage|/Basic Usage>", above.
Blank lines indicate sentence boundaries,
comments are lines beginning with "%%",
a line with no leading whitespace contains the surface text of a new token (word),
and subsequent token-lines are attribute values beginning with a TAB character
and a plus sign ("+"), followed by the attribute label enclosed in square brackets
"[...]" and the attribute value as a text-string.
B<Caveat>: CAB's "Text" format is B<not> "plain text" in the usual sense of an unstructured serial
stream of character content: use the L</Raw> format if you need to tokenize and analyze (UTF-8 encoded)
plain text.
Primarily useful for direct inspection and debugging.
=cut
##========================================================================
## Formats: Text-based: CONLLU
=pod
=item L<CONLLU|DTA::CAB::Format::CONLLU>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=conllu&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
"Vertical" text format conforming to the
CONLL-U format conventions as described under
L<https://universaldependencies.org/format.html>.
Supports various additional annotations in the CONLL-U "MISC" field,
including full dump of CAB-internal token structure as for the L</TJ> format.
This may be a good choice for data interchange with other NLP tools
which also support the CONLL-U format, but note that additional
postprocessing will be required if you want CAB's canonical modern
forms to appear in the CONLL-U "FORM" field.
=cut
##========================================================================
## Formats: Text-based: CSV
=pod
=item L<CSV|DTA::CAB::Format::CSV>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=csv&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Simple "vertical" text format with a fixed column layout, containing only selected
attribute values.
Each line is either a
comment introduced by "%%", an empty line indicating a sentence boundary,
or a TAB-separated token line. Token lines are of the form
SURFACE_TEXT XLIT_TEXT CANON_TEXT POS_TAG LEMMA ?DETAILS
where I<SURFACE_TEXT> is the surface form of the token,
I<XLIT_TEXT> is the result of a simple deterministic transliteration
using L<unicruft|http://odo.dwds.de/~jurish/software/unicruft/>,
I<CANON_TEXT> is the automatically determined canonical modern form
for the token, I<POS_TAG> is the part-of-speech tag assigned
by the L<moot|http://kaskade.dwds.de/~jurish/projects/moot/> tagger,
I<LEMMA> is the modern lemma form determined for I<CANON_TEXT> and I<POS_TAG>
by the L<TAGH morphological analyzer|http://www.tagh.de/>,
and I<DETAILS> if present are additional details.
This is the most compact of the text-based formats supported by CAB,
but lacks flexibility.
=cut
##========================================================================
## Formats: Text-based: TT
=pod
=item L<TT|DTA::CAB::Format::TT>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=tt&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Simple machine-readable "vertical" text format similar to that used by
L<Corpus WorkBench|http://cwb.sourceforge.net/>. Each line is either a
comment introduced by "%%", an empty line indicating a sentence boundary,
or a TAB-separated token line. Each token line's initial column is the token
surface text, and subsequent columns
are the token's attribute values, where each attribute value column
begins with the attribute label enclosed in square brackets
"[...]" and is followed by the attribute value as a text string,
as for the L</Text> format without the leading "+".
Useful for further quick and dirty script-based processing.
=cut
##========================================================================
## Formats: Text-based: TJ
=pod
=item L<TJ|DTA::CAB::Format::TJ>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=tj&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Simple machine-readable "vertical" text format based on the L<TT|/TT>
format but using L<JSON|http://json.org/> to encode sentence- and token-level attributes
rather than an explicit attribute labelling scheme.
Each line is either a
comment introduced by "%%", an empty line indicating a sentence boundary,
a document attribute line,
a sentence attribute line,
or a TAB-separated token line.
Document-attribute lines are comments of the form
"C<%%$TJ:DOC=>I<JSON>", where I<JSON> is a JSON object representing
auxiliary document attributes.
Sentence-attribute lines are analogous comments of the form
"C<%%$TJ:SENT=>I<JSON>".
Token lines consist of the token surface text,
followed by a TAB character,
followed by a JSON object representing the internal token structure.
Useful for further script-based processing.
=cut
##========================================================================
## Formats: Text-based: XList
=pod
=item L<XList|DTA::CAB::Format::ExpandList>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=xlist&a=expand.eqlemma&q=Elephant>]
Format used for online term expansion by
DDC L<Expand CAB|http://odo.dwds.de/~moocow/software/ddc/ddc_opt.html#Cab>.
Parses pre-tokenized input using the L</TT> format, and formats output as a flat list
of expanded types separated by TABs, newlines, and/or carriage returns.
Primarily useful for online L</Term Expansion> in conjunction with
L</Analysis Chains> such as C<expand>, C<expand.eqlemma>, C<expand.eqpho>, or C<expand.eqrw>.
=back
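
The line-oriented structure of the L</TJ> format described above lends itself to simple script-based
parsing. The following Python sketch is one possible reading of that description, not a reference
implementation: the function name C<parse_tj> and the returned dictionary layout are hypothetical,
and sentence-attribute lines are simply attached to whichever sentence is currently open, since their
exact position relative to the token lines is not specified here.

```python
import json

DOC_PREFIX = "%%$TJ:DOC="
SENT_PREFIX = "%%$TJ:SENT="


def parse_tj(lines):
    """Parse an iterable of TJ lines into (doc_attrs, sentences).

    Each sentence is a dict {"attrs": {...}, "tokens": [...]}; each token is the
    decoded JSON object, with the surface text stored under "text" if absent.
    """
    doc_attrs = {}
    sentences = []
    current = {"attrs": {}, "tokens": []}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # empty line: sentence boundary
            if current["tokens"]:
                sentences.append(current)
                current = {"attrs": {}, "tokens": []}
        elif line.startswith(DOC_PREFIX):
            doc_attrs.update(json.loads(line[len(DOC_PREFIX):]))
        elif line.startswith(SENT_PREFIX):
            current["attrs"].update(json.loads(line[len(SENT_PREFIX):]))
        elif line.startswith("%%"):
            # ordinary comment: ignored
            continue
        else:
            # token line: surface text, TAB, JSON object
            text, _, payload = line.partition("\t")
            token = json.loads(payload) if payload else {}
            token.setdefault("text", text)
            current["tokens"].append(token)
    if current["tokens"]:
        sentences.append(current)
    return doc_attrs, sentences
```

Given the parsed result, the canonical modern form of a token can then be read from its
C<moot> attribute, e.g. C<token["moot"]["word"]>, provided that attribute is present.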
=cut
##========================================================================
## Formats: XML-based
=pod
=head3 XML-based Formats
The CAB web-service supports a number of XML-based formats for data exchange.
XML data formats are in general less efficient to parse and/or generate
than L<text-based|/Text-based Formats> or L<data-oriented|/Data-oriented Formats> formats,
but they do retain some degree of human-readability and the easy availability
of XML processing software packages such as L<libxml|http://www.xmlsoft.org/>
or L<XMLStarlet|http://xmlstar.sourceforge.net/>
makes such formats reasonable candidates for archiving and cross-platform data sharing.
=cut
##========================================================================
## Formats: XML-based: XmlTokWrap
=pod
=over 4
=item L<XmlTokWrap|DTA::CAB::Format::XmlTokWrap>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=twxml&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Simple serial XML-based format as used by the L<DTA::TokWrap|http://odo.dwds.de/~jurish/software/dta-tokwrap/>
module. Supports arbitrary token attribute substructure, but fairly slow.
=cut
##========================================================================
## Formats: XML-based: XmlTokWrapFast
=pod
=item L<XmlTokWrapFast|DTA::CAB::Format::XmlTokWrapFast>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=ftwxml&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Simple serial XML-based format as used by the L<DTA::TokWrap|http://odo.dwds.de/~jurish/software/dta-tokwrap/>
module. Faster than the L<XmlTokWrap|/XmlTokWrap> formatter, but doesn't support all attributes.
=cut
##========================================================================
## Formats: XML-based: XmlTokWrapLing
=pod
=item L<XmlLing|DTA::CAB::Format::XmlLing>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=ltwxml&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Flat XML-based format similar to L</XmlTokWrapFast> but using only TEI L<att.linguistic|http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>
attributes to represent token information. Faster than the L<XmlTokWrap|/XmlTokWrap> formatter, but doesn't support all attributes.
=cut
##========================================================================
## Formats: XML-based: TEI
=pod
=item L<TEI|DTA::CAB::Format::TEI>
[L<example input|http://./elephant.tei-xml>,
L<example output|http://www.deutschestextarchiv.de/cab/query?fmt=tei&raw=1&qd=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%0A%3CTEI%3E%0A+%3Ctext%3E%0A+++%3C%21--+running+headers+are+ignored+by+the+tokenizer+--%3E%0A+++%3Cfw%3ERunning+header%3C%2Ffw%3E%0A+++%3Cp%3EEJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler%3C%2Fp%3E%0A+++%3C%21--+paragraph+boundaries+imply+sentence+boundaries%2C+even+without+punctuation+--%3E%0A+++%3Cp%3ECeterum+censeo+Carthaginem+esse+delendam%2E%3C%2Fp%3E%0A+%3C%2Ftext%3E%0A%3C%2FTEI%3E%0A>]
Parses raw un-tokenized TEI-like input (with or without //c elements) using
L<DTA::TokWrap|http://odo.dwds.de/~jurish/software/dta-tokwrap/> to reserialize and tokenize the
source text, and splices analysis results into the resulting XML document
using the L<XmlTokWrap|/XmlTokWrap> format.
Any C<E<lt>sE<gt>> or C<E<lt>wE<gt>> elements in the input will be ignored and the input will be (re-)tokenized.
Output data is itself parseable by the L<TEIws|/TEIws> formatter.
B<Be warned> that output sentence- and token-nodes (E<lt>sE<gt> and E<lt>wE<gt> elements, respectively)
may be I<fragmented> in the final output file. A "fragmented" node
in this sense is a logical unit (sentence or token) realized in the output TEI-XML file as multiple
elements. Fragmented nodes are encoded using the TEI "linking" attributes
L<@prev|http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-att.global.linking.html#tei_att.prev>
and
L<@next|http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-att.global.linking.html#tei_att.next>,
and only the first element of a fragmented node should contain the CAB attribute substructure for
that node.
Input to this class need not strictly conform to the L<TEI Guidelines|http://www.tei-c.org/>;
in fact, the only structural requirement is that at least one C<E<lt>textE<gt>> element be present --
any input outside of the scope of a C<E<lt>textE<gt>> element is ignored. Input files must however
be encoded in L<UTF-8|https://en.wikipedia.org/wiki/UTF-8>. In particular, XML documents
conforming to the L<DTABf Guidelines|http://www.deutschestextarchiv.de/doku/basisformat>
should be handled gracefully.
Primarily useful for analyzing native TEI-like XML corpus data without losing
structural information encoded in the source XML itself.
=cut
##========================================================================
## Formats: XML-based: TEI-fast
=pod
=item L<TEI-fast|DTA::CAB::Format::TEI>
[L<example input|http://./elephant.tei-xml>,
L<example output|http://www.deutschestextarchiv.de/cab/query?fmt=tei-fast&raw=1&qd=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%0A%3CTEI%3E%0A+%3Ctext%3E%0A+++%3C%21--+running+headers+are+ignored+by+the+tokenizer+--%3E%0A+++%3Cfw%3ERunning+header%3C%2Ffw%3E%0A+++%3Cp%3EEJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler%2C%3C%2Fp%3E%0A+++%3C%21--+paragraph+boundaries+imply+sentence+boundaries%2C+even+without+punctuation+--%3E%0A+++%3Cp%3ECeterum+censeo+Carthaginem+esse+delendam%2E%3C%2Fp%3E%0A+%3C%2Ftext%3E%0A%3C%2FTEI%3E%0A>]
Wrapper for the L</TEI> parser/formatter class using L</XmlTokWrapFast> to format the low-level token data.
See L</TEI> and L</XmlTokWrapFast> for caveats.
=cut
##========================================================================
## Formats: XML-based: TEI-ling
=pod
=item L<TEI-ling|DTA::CAB::Format::TEI>
[L<example input|http://./elephant.tei-xml>,
L<example output|http://www.deutschestextarchiv.de/cab/query?fmt=tei-ling&raw=1&qd=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%0A%3CTEI%3E%0A+%3Ctext%3E%0A+++%3C%21--+running+headers+are+ignored+by+the+tokenizer+--%3E%0A+++%3Cfw%3ERunning+header%3C%2Ffw%3E%0A+++%3Cp%3EEJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler%2C%3C%2Fp%3E%0A+++%3C%21--+paragraph+boundaries+imply+sentence+boundaries%2C+even+without+punctuation+--%3E%0A+++%3Cp%3ECeterum+censeo+Carthaginem+esse+delendam%2E%3C%2Fp%3E%0A+%3C%2Ftext%3E%0A%3C%2FTEI%3E%0A>]
Wrapper for the L</TEI> parser/formatter class using L</XmlLing> to format the low-level token data
using only TEI L<att.linguistic|http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>
attributes. See L</TEI> and L</XmlLing> for caveats.
=cut
##========================================================================
## Formats: XML-based: TEIws
=pod
=item L<TEIws|DTA::CAB::Format::TEIws>
[L<example input|http://./elephant.teiws-xml>,
L<example output|http://www.deutschestextarchiv.de/cab/query?fmt=teiws&raw=1&qd=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%0A%3CTEI%3E%0A++%3Ctext%3E%0A++++%3C%21--+untokenized+material+is+ignored+--%3E%0A++++%3Cfw%3ERunning+header+%28untokenized%29%3C%2Ffw%3E%0A++++%3Cp%3E%0A++++++%3C%21--+every+sentence+%2F%2Fs+and+token+%2F%2Fw+must+have+an+%40id+or+%40xml%3Aid+attribute+--%3E%0A++++++%3Cs+id%3D%22s1%22%3E%0A++++++++%3Cw+id%3D%22w1%22%3EEJn%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w2%22%3Ezamer%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w3%22%3EElephant%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w4%22%3Egillt%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w5%22%3Eohngef%C3%A4hr%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w6%22%3Ezweyhundert%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w7%22%3EThaler%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w8%22%3E.%3C%2Fw%3E%0A++++++%3C%2Fs%3E%0A++++++%3Cs+id%3D%22s2%22%3E%0A++++++++%3Cw+id%3D%22w9%22%3ECeterum%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w10%22%3Ecenseo%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w11%22%3ECarthaginem%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w12%22%3Eesse%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w13%22%3Edelendam%3C%2Fw%3E%0A++++++++%3Cw+id%3D%22w14%22%3E.%3C%2Fw%3E%0A++++++%3C%2Fs%3E%0A++++%3C%2Fp%3E%0A++%3C%2Ftext%3E%0A%3C%2FTEI%3E%0A>]
High-level parser/formatter class for pre-tokenized (and possibly fragmented) TEI-like XML
as output by the L<TEI|/TEI> formatter.
Input files should be encoded in L<UTF-8|https://en.wikipedia.org/wiki/UTF-8>,
and every input sentence C<//s> and token C<//w> must have an C<@id> or C<@xml:id> attribute.
Potentially useful for analyzing pre-tokenized TEI-like XML data,
but primarily used for converting to other, script-friendlier formats
such as L<CSV|/CSV>.
=cut
##========================================================================
## Formats: XML-based: TEIws+ling
=pod
=item L<TEIws-ling|DTA::CAB::Format::TEIws>
[L<example|http://./elephant.teiws-ling-xml>]
Wrapper for the L</TEIws> parser/formatter class using L</XmlLing> to format the low-level token data
using only TEI L<att.linguistic|http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>
attributes, which are also parsed from the input document if present. See L</TEIws> and L</XmlLing> for caveats.
=cut
##========================================================================
## Formats: XML-based: TCF
=pod
=item L<TCF|DTA::CAB::Format::TCF>
[L<example input (untokenized)|http://./elephant.raw.tcf>,
L<example input (pre-tokenized)|http://./elephant.tok.tcf>,
L<example output|http://www.deutschestextarchiv.de/cab?fmt=tcf&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Monolithic stand-off XML format used by CLARIN-D, in particular by the
CLARIN-D L<WebLicht|http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/> tool-chainer.
See L<the TCF format documentation|http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format>
for details on the TCF format.
CAB currently handles only the C<text>, C<tokens>, C<sentences>, C<POStags>, C<lemmas>, and C<orthography>
TCF layers.
=cut
##========================================================================
## Formats: XML-based: XML-RPC
=pod
=item L<XML-RPC|DTA::CAB::Format::XmlRpc>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=xmlrpc&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Flexible but obscenely inefficient format used for data transfer
by the L<XML-RPC Protocol|https://en.wikipedia.org/wiki/XML-RPC>.
Avoid it if you can.
=back
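
If your data is already tokenized, a minimal input document for the L</TEIws> format can be
generated programmatically. The following Python sketch assumes that sentences are available as
lists of token strings, and mirrors the element layout of the L</TEIws> example input linked above
(C<TEI>, C<text>, C<p>, C<s>, C<w>); the function name C<build_teiws> and the
C<s1>, C<w1>, ... numbering scheme are illustrative choices, not requirements of the format beyond
the rule that every C<//s> and C<//w> must carry an C<@id> or C<@xml:id> attribute.

```python
import xml.etree.ElementTree as ET


def build_teiws(sentences):
    """Serialize lists of token strings as a pre-tokenized TEI-like document.

    sentences : list of sentences, each a list of token strings
    returns   : UTF-8 encoded XML as bytes
    """
    tei = ET.Element("TEI")
    text = ET.SubElement(tei, "text")
    par = ET.SubElement(text, "p")
    w_count = 0
    for s_index, tokens in enumerate(sentences, start=1):
        # every sentence gets its own id
        s = ET.SubElement(par, "s", {"id": "s%d" % s_index})
        for tok in tokens:
            # token ids are numbered consecutively over the whole document
            w_count += 1
            w = ET.SubElement(s, "w", {"id": "w%d" % w_count})
            w.text = tok
    return ET.tostring(tei, encoding="utf-8", xml_declaration=True)


xml_bytes = build_teiws([
    ["EJn", "zamer", "Elephant", "."],
    ["Ceterum", "censeo", "."],
])
```

The resulting bytes can be written to a file and submitted via the C<qd> parameter together
with an appropriate C<fmt> value.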
=cut
##========================================================================
## Formats: Data-oriented
=pod
=head3 Data-oriented Formats
The following formats provide direct dumps of the
underlying L<DTA::CAB::Document|DTA::CAB/Data Model> used internally by CAB itself.
They are efficient to parse and to produce, but may not be suitable for direct
human consumption.
=cut
##========================================================================
## Formats: Data-oriented: JSON
=pod
=over 4
=item L<JSON|DTA::CAB::Format::JSON>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=json&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Direct L<JSON|http://json.org/> dump of the underlying
L<DTA::CAB::Document|DTA::CAB/Data Model> structure
using the Perl L<JSON::XS|https://metacpan.org/release/JSON-XS> module.
Very fast and flexible, suitable for further automated processing.
=cut
##========================================================================
## Formats: Data-oriented: YAML
=pod
=item L<YAML|DTA::CAB::Format::YAML>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=yaml&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Direct dump of the underlying
L<DTA::CAB::Document|DTA::CAB/Data Model> structure
as L<YAML|https://en.wikipedia.org/wiki/YAML> markup.
Fast, flexible, and supports shared substructures, unlike the L<JSON|/JSON> formatter.
=cut
##========================================================================
## Formats: Data-oriented: Perl
=pod
=item L<Perl|DTA::CAB::Format::Perl>
[L<example|http://www.deutschestextarchiv.de/cab?fmt=perl&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Direct dump of the underlying
L<DTA::CAB::Document|DTA::CAB/Data Model> structure
using the Perl L<Data::Dumper|https://metacpan.org/release/Data-Dumper> module.
Mainly useful for further automated processing with Perl
while retaining some degree of human readability.
=cut
##========================================================================
## Formats: Data-oriented: Storable
=pod
=item L<Storable|DTA::CAB::Format::Storable>
[L<example|http://www.deutschestextarchiv.de/cab/query?fmt=bin&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>]
Direct binary dump of the underlying
L<DTA::CAB::Document|DTA::CAB/Data Model> structure
using the Perl L<Storable|https://metacpan.org/release/Storable> module.
This is currently the fastest I/O class for both in- and output,
mainly useful for further automated processing with Perl.
=back
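
As an illustration of further automated processing, the following Python sketch extracts
surface forms, canonical forms, part-of-speech tags, and lemmata from a L</JSON> response.
It rests on assumptions you should check against actual server output: the top-level
document is assumed to hold its sentences in a list under the key C<body>, and each sentence its
tokens under the key C<tokens>. The token-level keys C<text> and C<moot> (with C<word>, C<tag>,
C<lemma>) follow the attribute descriptions given earlier in this document; the function name
C<canonical_rows> is hypothetical.

```python
import json


def canonical_rows(json_text):
    """Return a list of (text, word, tag, lemma) tuples, one per token.

    The keys "body" and "tokens" are assumptions about the document layout;
    missing attributes are reported as None.
    """
    doc = json.loads(json_text)
    rows = []
    for sentence in doc.get("body", []):
        for token in sentence.get("tokens", []):
            moot = token.get("moot") or {}
            rows.append((
                token.get("text"),
                moot.get("word"),
                moot.get("tag"),
                moot.get("lemma"),
            ))
    return rows
```

Tokens lacking a C<moot> attribute yield C<None> in the last three positions instead of
raising an error, which makes the function tolerant of analysis chains that skip tagging.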
=cut
##========================================================================
## Formats: Unsupported Formats
=pod
=head3 Unsupported Formats
If you have a source document you wish to pass to the CAB web-service which is not already
encoded in a format supported directly by the web-service, you will first have to convert
it to such a format before you can analyze it.
The L<OxGarage|https://oxgarage-paderborn.tei-c.org/> conversion suite is particularly
useful for such tasks, and is capable of producing
L<TEI-XML|/TEI> or L<plain text|/Raw> from a number of popular document formats.
Similarly, if you find that none of the
supported output formats suits your needs, you may need to perform additional conversion
of the analysis data returned by the web-service: the L<CSV|/CSV> format can for example
be imported into most spreadsheet programs, or you can use some third-party conversion
software such as
L<OxGarage|https://oxgarage-paderborn.tei-c.org/>
or L<LibreOffice|https://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/>.
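
Where a spreadsheet program does not import CAB's TAB-separated L</CSV> output cleanly
(for instance because of the "%%" comment lines or the empty lines marking sentence boundaries),
a small conversion script may help. The following Python sketch rewrites such output as
comma-separated values with a header row and an explicit sentence counter. The column names and
the function name C<cab_csv_to_spreadsheet> are hypothetical; only the column order
(surface text, transliteration, canonical form, tag, lemma, optional details) is taken from the
L</CSV> format description.

```python
import csv
import io

# illustrative column names following the order given for the CSV format
COLUMNS = ["surface", "xlit", "canon", "pos", "lemma", "details"]


def cab_csv_to_spreadsheet(lines):
    """Convert CAB CSV-format lines into comma-separated text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sentence"] + COLUMNS)
    sentence_id = 1
    seen_token = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("%%"):
            # comment line: skipped
            continue
        if line == "":
            # empty line: sentence boundary
            if seen_token:
                sentence_id += 1
                seen_token = False
            continue
        fields = line.split("\t")
        # pad the optional DETAILS column if it is absent
        fields += [""] * (len(COLUMNS) - len(fields))
        writer.writerow([sentence_id] + fields[:len(COLUMNS)])
        seen_token = True
    return out.getvalue()
```

The C<csv> module takes care of quoting any field which itself contains a comma or a
double quote, so punctuation tokens survive the conversion intact.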
=cut
##========================================================================
## Advanced Usage
=pod
=head2 Advanced Usage
The CAB web-service is a request-oriented service:
it accepts a user request as a set of I<parameter>=I<value> pairs
and returns the analyzed data as a L<DTA::CAB::Document|DTA::CAB::Document> object
encoded according to the L<output format|/IE<sol>O Formats> specified by the C<ofmt> parameter.
Parameters are passed to the CAB web-service
L<RESTfully|https://en.wikipedia.org/wiki/Representational_state_transfer>
via the L<URL query string|http://en.wikipedia.org/wiki/Query_string>
or L<HTTP POST request|http://en.wikipedia.org/wiki/POST_(HTTP)#Use_for_submitting_web_forms>
as for a standard L<web form|http://en.wikipedia.org/wiki/Form_(HTML)>.
The URL for the low-level request including all user parameters
is displayed in the web front-end in the L<status line|/Status Line>.
See L<DTA::CAB::HttpProtocol/query Requests> for more details on the
RESTful CAB request protocol and a list of supported parameters.
Since CAB requests are really nothing more than standard HTTP form requests,
a large variety of existing software packages can be used to generate and dispatch CAB
requests, e.g.
L<LWP::UserAgent|https://metacpan.org/pod/LWP::UserAgent>,
L<curl|http://curl.haxx.se/>,
or L<wget|https://www.gnu.org/software/wget/>.
When automating CAB requests,
please respect the L<caveats|/Caveats> mentioned above in the L<file query example|/A File Query>.
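
Because a CAB request is an ordinary set of I<parameter>=I<value> pairs, the request URL can be
assembled with any standard URL-encoding routine. The following Python sketch only builds
the URL and does not contact the server. The parameter names C<q>, C<fmt>, and C<a> are those
appearing in the example links of this document; the function name C<cab_query_url>, the choice
of base URL, and the default format are illustrative assumptions.

```python
from urllib.parse import urlencode

# base URL as used in the curl examples of this document
BASE_URL = "http://www.deutschestextarchiv.de/public/cab/query"


def cab_query_url(query, fmt="csv", base_url=BASE_URL, **extra):
    """Build a request URL for a raw text query.

    query : untokenized UTF-8 query string (the "q" parameter)
    fmt   : format alias (the "fmt" parameter)
    extra : any further parameter=value pairs, e.g. a="expand.eqlemma"
    """
    params = {"q": query, "fmt": fmt}
    params.update(extra)
    # urlencode() percent-encodes UTF-8 bytes and maps spaces to "+"
    return base_url + "?" + urlencode(params)


url = cab_query_url("EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.")
```

The resulting string can be handed to any HTTP client; see
L<DTA::CAB::HttpProtocol/query Requests> for the authoritative list of parameters.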
=cut
##========================================================================
## Advanced Usage: curl
=pod
=head3 Querying CAB with curl
To analyze a L<TEI-like|/TEI> XML file F<FILE.tei.xml> using L<curl|http://curl.haxx.se/>
and save the analysis results to a "spliced" TEI file F<FILE.teiws.xml>,
the following command-line ought to suffice:
curl -X POST -sSLF "qd=@FILE.tei.xml" -o "FILE.teiws.xml" "http://www.deutschestextarchiv.de/public/cab/query?fmt=tei"
Alternative L<IO formats|/IE<sol>O Formats> and L<request parameters|DTA::CAB::HttpProtocol/query-Requests>
can be accommodated by inserting them into the L<URL query string|http://en.wikipedia.org/wiki/Query_string>
passed to L<C<curl>|http://curl.haxx.se/docs/manpage.html>.
You can also make use of the
inline POSTDATA mechanism (a.k.a. "xpost") described in L<the DTA::CAB::HttpProtocol manpage|DTA::CAB::HttpProtocol/query-Requests>
in order to save yourself and the CAB server the effort of encoding/decoding the document data.
In this case, you need to specify an appropriate "C<Content-Type>" header, e.g:
curl -X POST -sSLH "Content-Type: text/tei+xml; charset=utf8" --data-binary "@FILE.tei.xml" -o "FILE.teiws.xml" "http://www.deutschestextarchiv.de/public/cab/query?fmt=tei"
You might be interested in the L<cab-curl-post.sh|http://./cab-curl-post.sh> and/or L<cab-curl-xpost.sh|http://./cab-curl-xpost.sh>
wrapper scripts, which encapsulate some of the common curl command-line options.
The preceding two example C<curl> calls should be equivalent to:
bash cab-curl-post.sh "?fmt=tei" "FILE.tei.xml" -o "FILE.teiws.xml"
bash cab-curl-xpost.sh "?fmt=tei" "FILE.tei.xml" -o "FILE.teiws.xml"
Note that when accessing the CAB web-service API directly via HTTP in this fashion,
auto-detection of L<input file format|/IE<sol>O Formats> is not supported,
so you must specify at least the L<"fmt"|DTA::CAB::HttpProtocol/fmt-fmt1> parameter
in the URL query string if your files are not in the global default format
(usually L<TT|/TT>).
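
For users who prefer a scripting language over the command-line, the inline POSTDATA ("xpost")
request shown above can be approximated with the Python standard library. This is a
sketch under the assumption that the server treats the request exactly as it treats the
corresponding C<curl> call; the function names C<build_xpost_request> and C<analyze_file> are
hypothetical, while the URL and the C<Content-Type> value are copied from the C<curl> example.

```python
import urllib.request

# URL and Content-Type as in the curl "xpost" example above
CAB_URL = "http://www.deutschestextarchiv.de/public/cab/query?fmt=tei"
TEI_CONTENT_TYPE = "text/tei+xml; charset=utf8"


def build_xpost_request(document_bytes, url=CAB_URL, content_type=TEI_CONTENT_TYPE):
    """Create (but do not send) a POST request carrying the raw document as its body."""
    return urllib.request.Request(
        url,
        data=document_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def analyze_file(in_path, out_path, url=CAB_URL):
    """Send the file at in_path to the server and save the response to out_path."""
    with open(in_path, "rb") as infile:
        request = build_xpost_request(infile.read(), url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as outfile:
        outfile.write(response.read())
```

A call such as C<analyze_file("FILE.tei.xml", "FILE.teiws.xml")> would then play the role of
the second C<curl> command; as with C<curl>, the C<fmt> parameter must be adjusted in the URL
if the input is not TEI-like XML.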
=cut
##========================================================================
## Advanced Usage: TEI
=pod
=head3 Post-processing TEI XML
The following L<XSL|https://en.wikipedia.org/wiki/XSL> scripts are provided for post-processing the
L<"spliced" TEI-like output format|/TEI>.
=over 4
=item L<spliced2ling.xsl|http://./spliced2ling.xsl>
Removes native CAB-markup from TEI-like XML files, encoding the remaining token analysis information using
the TEI L<C<att.linguistic>|http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>
inventory. All tokens remain encoded as C<E<lt>wE<gt>> elements (rather than C<E<lt>pcE<gt>> elements),
and only the C<att.linguistic> attributes C<@lemma>, C<@pos>, C<@norm>, and C<@join> are inserted, so that e.g.
<w id="w2" t="EJn"><moot word="Ein" lemma="ein" tag="ART"/>EJn</w>
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/>Elephant</w><w>!</w>
becomes:
<w id="w2" lemma="ein" pos="ART" norm="Ein">EJn</w>
<w id="w3" join="right" lemma="Elefant" pos="NN" norm="Elefant">Elephant</w>
<w join="left">!</w>
Note that the C<@join> attribute will be correctly generated only if the relevant C<w> elements are truly immediately adjacent
in the input file: any intervening newlines or other whitespace will prohibit insertion of a C<@join> attribute.
=item L<spliced2norm.xsl|http://./spliced2norm.xsl>
Removes C<//s> and C<//w> elements, replacing the surface text of each token
with its CAB-normalized form, so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes simply the text node
Elefant
=item L<spliced2orig+reg.xsl|http://./spliced2orig+reg.xsl>
Removes C<//s> and C<//w> elements, replacing non-identity normalizations
with a C<choice> element containing a daughter C<orig> with the original surface form
and a C<reg[@resp="#cab"]> daughter containing the CAB-normalized form, so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes
<choice><orig>Elephant</orig><reg resp="#cab">Elefant</reg></choice>
=item L<spliced2cab.xsl|http://./spliced2cab.xsl>
Resolves fragmented nodes and converts a "spliced" TEI-like XML file into
a serial format as output by the L<XmlTokWrap|/XmlTokWrap> formatter class.
=item L<spliced2clean.xsl|http://./spliced2clean.xsl>
Removes some extraneous CAB-markup from TEI-like XML files.
Probably not really useful for files returned by the CAB web-service.
=item L<spliced2cleaner.xsl|http://./spliced2cleaner.xsl>
Removes most CAB-markup from TEI-like XML files, so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/>Elephant</w>
=item L<spliced2clean+cabns.xsl|http://./spliced2clean+cabns.xsl>
Removes most CAB-markup from TEI-like XML files and assigns CAB-internal
attributes to the L<XML namespace|https://en.wikipedia.org/wiki/XML_namespace> "cab" (C<http://deutschestextarchiv.de/ns/cab/1.0>),
so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes
<w xml:id="w3" cab:t="Elephant" cab:word="Elefant" cab:tag="NN" cab:lemma="Elefant">Elephant</w>
=back
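As a rough illustration of what L<spliced2norm.xsl|http://./spliced2norm.xsl> and L<spliced2orig+reg.xsl|http://./spliced2orig+reg.xsl>
do to a single token, the following Python sketch reproduces the two examples above using only the standard library.
It is not the stylesheets' actual implementation: the function names C<normalized_form> and C<to_orig_reg> are hypothetical,
the input is assumed to carry no XML namespace, and the handling of identity normalizations (returning the bare surface text)
is an assumption based on the description above.

```python
import xml.etree.ElementTree as ET

def normalized_form(w):
    """Return the CAB-normalized form (moot/@word) of a <w> element.

    Falls back to the surface text if no <moot> child is present.
    """
    surface = "".join(w.itertext())
    moot = w.find("moot")
    if moot is None:
        return surface
    return moot.get("word", surface)

def to_orig_reg(w):
    """Return a serialized <choice> for non-identity normalizations.

    Identity normalizations are returned as plain surface text
    (an assumption of this sketch).
    """
    surface = "".join(w.itertext())
    norm = normalized_form(w)
    if norm == surface:
        return surface
    choice = ET.Element("choice")
    ET.SubElement(choice, "orig").text = surface
    ET.SubElement(choice, "reg", {"resp": "#cab"}).text = norm
    return ET.tostring(choice, encoding="unicode")

# Token taken from the examples above
w = ET.fromstring(
    '<w id="w3" t="Elephant">'
    '<moot word="Elefant" lemma="Elefant" tag="NN"/>'
    '<xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>'
    'Elephant</w>'
)
print(normalized_form(w))
print(to_orig_reg(w))
```

For real documents, applying the provided stylesheets with an XSLT processor is likely the better choice, since they also handle
fragmented nodes and surrounding markup which this sketch ignores.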
=cut
##========================================================================
## Advanced Usage: TEI-ling
=pod
=head3 Post-processing TEI-ling XML
The following L<XSL|https://en.wikipedia.org/wiki/XSL> scripts are provided for post-processing the
L<TEI-ling output format|/TEI-ling>.
=over 4
=item L<ling2norm.xsl|http://./ling2norm.xsl>
Variant of L<spliced2norm.xsl|/spliced2norm.xsl> for use with L</TEI-ling> input.
=item L<ling2plain.xsl|http://./ling2plain.xsl>
Variant of L<ling2norm.xsl|/ling2norm.xsl> which outputs plain (normalized) text
rather than XML. Note that the output text is B<ALWAYS> in strict TEI-document
serial order, since this script does not respect any
serialization hints encoded by TEI "linking" attributes or
provided by L<DTA::TokWrap|http://odo.dwds.de/~jurish/software/dta-tokwrap/>.
If you want to ensure a linguistically plausible serial order,
you should prefer a "flat" serial document format such as L</XmlTokWrapFast>
or L</XmlLing>.
=back
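The strict document-order behavior described for L<ling2plain.xsl|http://./ling2plain.xsl> can be approximated in a few lines of Python.
This is only a sketch under stated assumptions: the function name C<plain_normalized_text> is hypothetical, the input is assumed
to be un-namespaced, and the use of C<@join> to decide where spaces are inserted is a choice made for this example rather than
documented behavior of the stylesheet. Like the stylesheet, it visits C<w> elements in serial document order and ignores any
serialization hints.

```python
import xml.etree.ElementTree as ET

def plain_normalized_text(xml_string):
    """Collect normalized token text in strict document order.

    Uses @norm where present, otherwise the token's surface text.
    Spacing between tokens follows @join (assumption of this sketch).
    """
    root = ET.fromstring(xml_string)
    pieces = []
    glue_next = False  # True if the previous token joins to its right
    for w in root.iter("w"):
        form = w.get("norm") or "".join(w.itertext())
        join = w.get("join", "")
        if pieces and not glue_next and join not in ("left", "both"):
            pieces.append(" ")
        pieces.append(form)
        glue_next = join in ("right", "both")
    return "".join(pieces)

# TEI-ling style tokens as in the spliced2ling example, wrapped in an <s>
sample = """<s>
<w id="w2" lemma="ein" pos="ART" norm="Ein">Ein</w>
<w id="w3" join="right" lemma="Elefant" pos="NN" norm="Elefant">Elephant</w>
<w join="left">!</w>
</s>"""
print(plain_normalized_text(sample))
```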
=cut
##========================================================================
## Advanced Usage: TCF
=pod
=head3 Post-processing TCF XML
=over 4
=item L<tcf-orthswap.xsl|http://./tcf-orthswap.xsl>
Swaps the text content of 1:1-corresponding C<//tokens/token> and C<//orthography/correction[@operation="replace"]> elements,
so that e.g.
<tokens>
<token ID="w1">Ein</token>
<token ID="w2">Elephant</token>
</tokens>
...
<orthography>
<correction tokenIDs="w2" operation="replace">Elefant</correction>
</orthography>
becomes
<tokens>
<token ID="w1">Ein</token>
<token ID="w2">Elefant</token>
</tokens>
...
<orthography>
<correction tokenIDs="w2" operation="replace">Elephant</correction>
</orthography>
Potentially useful for preparing CAB-annotated TCF data for submission to other text-sensitive TCF processors
which themselves do not respect the C<//orthography> layer.
=back
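The swap performed by L<tcf-orthswap.xsl|http://./tcf-orthswap.xsl> can be sketched in Python as follows.
This is an illustration rather than the stylesheet's implementation: the function name C<swap_orthography> is hypothetical,
and the sample omits the XML namespaces and surrounding layers that a real TCF document would carry, so the element lookups
would need namespace-qualified names in practice. Corrections that reference more than one token, or an unknown token, are left untouched here.

```python
import xml.etree.ElementTree as ET

def swap_orthography(root):
    """Swap text between tokens and their 1:1 'replace' corrections."""
    tokens = {tok.get("ID"): tok for tok in root.iter("token")}
    for corr in root.iter("correction"):
        if corr.get("operation") != "replace":
            continue
        ids = (corr.get("tokenIDs") or "").split()
        if len(ids) != 1 or ids[0] not in tokens:
            continue  # only 1:1 correspondences are swapped
        tok = tokens[ids[0]]
        tok.text, corr.text = corr.text, tok.text
    return root

# Simplified, un-namespaced version of the example above
sample = """<TextCorpus>
<tokens>
<token ID="w1">Ein</token>
<token ID="w2">Elephant</token>
</tokens>
<orthography>
<correction tokenIDs="w2" operation="replace">Elefant</correction>
</orthography>
</TextCorpus>"""
root = swap_orthography(ET.fromstring(sample))
print([t.text for t in root.iter("token")])
print([c.text for c in root.iter("correction")])
```

Applying the function a second time restores the original token text, which may be convenient for round-tripping data
through a processor that ignores the C<//orthography> layer.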
=cut
##========================================================================
## SOURCE CODE
=pod
=head1 SOURCE CODE
The C<DTA::CAB> source code distribution is available on L<CPAN|https://metacpan.org/release/DTA-CAB>.
=head2 Batteries Not Included
You should be aware that the L<source code distribution|https://metacpan.org/release/DTA-CAB> alone is B<not>
sufficient to set up and run a complete C<DTA::CAB> analysis pipeline at your local site. To do that,
you would also need various language models and additional resources which are not themselves part
of C<DTA::CAB> (which aspires to be language-agnostic), and are therefore not included in the source code distribution.
See L<Jurish (2012)|http://./#jurish2012> and the source code L<documentation|https://metacpan.org/release/DTA-CAB> for
more details.
=cut
##========================================================================
## REFERENCES
=pod
=head1 REFERENCES
The author would appreciate CAB users citing its use in any related publications.
As a general CAB-related reference,
and for analysis and canonicalization of historical text to modern forms in particular, you can cite:
=begin html
<div class="bib">
=end html
L<Jurish, B.|name:jurish2012> I<Finite-state Canonicalization Techniques for Historical German.>
PhD thesis, Universität Potsdam, 2012 (defended 2011).
URN urn:nbn:de:kobv:517-opus-55789,
[L<epub|http://opus.kobv.de/ubp/volltexte/2012/5578/>,
L<PDF|http://kaskade.dwds.de/~jurish/pubs/jurish2012diss.pdf>,
L<BibTeX|http://kaskade.dwds.de/~jurish/pubs/jurish2012diss.bib>]
=begin html
</div>
=end html
For the concrete architecture of the CAB system as used by the
L<I<Deutsches Textarchiv> (DTA)|http://www.deutschestextarchiv.de>
project, you can cite:
=begin html
<div class="bib">
=end html
L<Jurish, B.|name:jurish2013> "Canonicalizing the Deutsches Textarchiv." In
L<Proceedings of I<Perspektiven einer corpusbasierten historischen Linguistik und Philologie>|https://edoc.bbaw.de/solrsearch/index/search/searchtype/collection/id/16386>
(Berlin, Germany, 12th-13th December 2011), volume 4 of
L<Thesaurus Linguae Aegyptiae|http://aaew.bbaw.de/tla/>,
Berlin-Brandenburgische Akademie der Wissenschaften, 2013.
[L<epub|http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:b4-opus-24433>,
L<PDF|http://edoc.bbaw.de/volltexte/2013/2443/pdf/Jurish.pdf>,
L<BibTeX|http://kaskade.dwds.de/~jurish/pubs/jurish2013canonicalizing.bib>]
=begin html
</div>
=end html
For online term expansion with the "L<expand|/Term Expansion>" analysis chain,
you can cite:
=begin html
<div class="bib">
=end html
L<Jurish, B., C. Thomas, & F. Wiegand.|name:jtw2014> "Querying the Deutsches Textarchiv."
In U. Kruschwitz, F. Hopfgartner, & C. Gurrin (editors),
Proceedings of the Workshop I<L<MindTheGap 2014: Beyond Single-Shot Text Queries: Bridging the Gap(s) between Research Communities|http://ceur-ws.org/Vol-1131/>>,
Berlin, Germany, 4th March 2014, pages 25-30.
[L<PDF|http://ceur-ws.org/Vol-1131/mindthegap14_7.pdf>,
L<BibTeX|http://kaskade.dwds.de/~jurish/pubs/jtw2014querying.bib>]
=begin html
</div>
=end html
=cut
##========================================================================
## SEE ALSO
=pod
=head1 SEE ALSO
=over 4
=item *
The L<CAB software page|http://odo.dwds.de/~jurish/software/dta-cab/>
is the top-level repository for CAB documentation, news, etc.
=item *
The C<DTA::CAB> source code distribution lives on L<CPAN|https://metacpan.org/release/DTA-CAB>.
=item *
The L<DTA::CAB|DTA::CAB> manual page contains a basic introduction
to the CAB architecture.
=item *
The L<DTA::CAB::Format|DTA::CAB::Format> manual page describes the
abstract CAB I/O Format API, and includes a list of supported
L<format classes|DTA::CAB::Format/SUBCLASSES>.
=item *
The L<DTA::CAB::HttpProtocol|DTA::CAB::HttpProtocol> manual page describes
the conventions used by the CAB web-service API.
=item *
The L<DTA 'Base Format' Guidelines (DTABf)|http://www.deutschestextarchiv.de/doku/basisformat>
describe the subset of the L<TEI|http://www.tei-c.org/> encoding guidelines which can reasonably be
expected to be handled gracefully by the CAB L<TEI|/TEI> and/or L<TEIws|/TEIws> formatters.
=item *
L<Jurish (2012)|http://./#jurish2012>
describes the abstract method used by CAB for canonicalization of historical text to modern forms.
=item *
L<Jurish (2013)|http://./#jurish2013>
describes the concrete architecture of the CAB system as used by the
L<I<Deutsches Textarchiv>|http://www.deutschestextarchiv.de> project.
=item *
L<Jurish et al. (2014)|http://./#jtw2014>
describes the use of CAB's online L<term expansion chain|/Term Expansion>
for runtime evaluation of database queries.
=back
=cut
##======================================================================
## Footer
##======================================================================
=pod
=head1 AUTHOR
Bryan Jurish E<lt>L<jurish@bbaw.de|mailto:jurish@bbaw.de>E<gt>
=cut