=encoding UTF-8
=head1 NAME
Lingua::JA::Moji - Handle many kinds of Japanese characters
=head1 SYNOPSIS
Convert romanised Japanese to and from kana:
use utf8;
use Lingua::JA::Moji qw/kana2romaji romaji2kana/;
my $romaji = kana2romaji ('あいうえお');
print "$romaji\n";
my $kana = romaji2kana ($romaji);
print "$kana\n";
produces output
aiueo
アイウエオ
(This example is included as L<F<synopsis.pl>|https://fastapi.metacpan.org/source/BKB/Lingua-JA-Moji-0.60/examples/synopsis.pl> in the distribution.)
Convert between different forms of kana:
use utf8;
use Lingua::JA::Moji ':all';
my $h = 'あいうえおがっぷぴょん';
print kata2hira ($h), "\n";
print hira2kata (kata2hira ($h)), "\n";
print kana2hw ($h), "\n";
print kata2hira (hw2katakana (kana2hw ($h))), "\n";
# Silly circled kana
print kana2circled ($h), "\n";
produces output
あいうえおがっぷぴょん
アイウエオガップピョン
ｱｲｳｴｵｶﾞｯﾌﾟﾋﾟｮﾝ
あいうえおがっぷぴょん
㋐㋑㋒㋓㋔㋕゛ッ㋫゜㋪゜ョン
(This example is included as L<F<syn-kana.pl>|https://fastapi.metacpan.org/source/BKB/Lingua-JA-Moji-0.60/examples/syn-kana.pl> in the distribution.)
=head1 VERSION
This document describes Lingua::JA::Moji version 0.60
corresponding to git commit L<9ad3d6b5308d54f0c1eae61dc5bf7119c2670074|https://github.com/benkasminbullock/Lingua-JA-Moji/commit/9ad3d6b5308d54f0c1eae61dc5bf7119c2670074> made on Wed Feb 14 15:11:13 2024 +0900.
=head1 DESCRIPTION
This module provides methods to convert different written forms of
Japanese into one another. It enables conversion between romanized
Japanese, hiragana, and katakana. It also includes a number of unusual
encodings such as Japanese braille and morse code, as well as
conversions between Japanese and Cyrillic and Hangul. It also handles
conversion between the Chinese characters (kanji) used before and
after the character reforms of 1949, as well as the various bracketed
and circled forms of kana and kanji.
All the functions in this module assume the use of Unicode
encoding. All input and output strings must be encoded using Perl's
"UTF-8" format.
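A minimal sketch of what this requirement means in practice, assuming that "Perl's UTF-8 format" refers to decoded character strings rather than raw bytes. Literals in the source need C<use utf8>, bytes from outside need decoding (here with the core L<Encode> module), and output handles need an encoding layer. The variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                            # string literals are characters, not bytes
use Encode qw(decode);
use Lingua::JA::Moji qw(kana2romaji);

binmode STDOUT, ':encoding(UTF-8)';  # encode characters again on output

# UTF-8 bytes for あいうえお, as they might arrive from a file or socket.
my $bytes = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86\xE3\x81\x88\xE3\x81\x8A";

# Decode to a character string before calling any function in the module.
my $chars  = decode ('UTF-8', $bytes);
my $romaji = kana2romaji ($chars);
print "$romaji\n";                   # "aiueo", as in the SYNOPSIS
```

Decoding once at the boundary and encoding once on output keeps every intermediate string in the form the module expects.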
The module loads its data conversion files on demand, so the more
obscure conversions should not use memory unless they are actually
called.
This module does not handle the conversion of kanji words into kana,
or kana into kanji.
=head1 ROMANIZATION
These functions convert Japanese letters to and from romanized forms.
=head2 is_romaji
use Lingua::JA::Moji 'is_romaji';
# The following line returns "undef"
is_romaji ("abcdefg");
# The following line returns a defined value
is_romaji ('loyehye');
# The following line returns a defined value
is_romaji ("atarimae");
This detects whether a string of alphabetical characters, which may
also include characters with macrons or circumflexes, "looks like"
romanized Japanese. If the test is successful, it returns a true
value, and if the test is unsuccessful, it returns a false value. If
the string is empty, it returns a false value. Hyphens are not allowed
as the first character.
This works by converting the string to kana via L</romaji2kana> and
seeing if it converts cleanly or not.
The "true" value returned is the output of the round-trip conversion,
converted into wapuro format.
=head2 is_romaji_semistrict
use Lingua::JA::Moji 'is_romaji_semistrict';
# The following line returns "undef"
is_romaji_semistrict ("abcdefg");
# The following line returns "undef"
is_romaji_semistrict ('loyehye');
# The following line returns a defined value
is_romaji_semistrict ("atarimae");
# The following line returns a defined value
is_romaji_semistrict ("pinku no dorufin");
Halfway between L</is_romaji> and L</is_romaji_strict>, this allows
some formations like "pinku no dorufin" but not the really unlikely
stuff which "is_romaji" allows.
=head2 is_romaji_strict
use Lingua::JA::Moji 'is_romaji_strict';
# The following line returns "undef"
is_romaji_strict ("abcdefg");
# The following line returns "undef"
is_romaji_strict ('loyehye');
# The following line returns a defined value
is_romaji_strict ("atarimae");
This detects whether a string of alphabetical characters, which may
also include characters with macrons or circumflexes, "looks like"
romanized Japanese. If the test is successful, it returns a true
value, and if the test is unsuccessful, it returns a false value. If
the string is empty, it returns a false value.
This test is much stricter than L</is_romaji>. It insists that the
word does not contain constructions which may be valid as inputs to an
IME, but which do not look like Japanese words.
The "true" value returned is the output of the round-trip conversion,
converted into wapuro format.
This was added to the module in version L</0.27>.
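The three validators can be compared side by side. The following sketch runs the example inputs from the sections above through each of them and records only whether the result is true or false. The hash C<%accepted> and its keys are illustrative names, not part of the module.

```perl
use strict;
use warnings;
use Lingua::JA::Moji qw(is_romaji is_romaji_semistrict is_romaji_strict);

# Inputs taken from the examples in the three sections above.
my @inputs = ('abcdefg', 'loyehye', 'atarimae');

# Record a 1 or 0 for each validator, from most to least permissive.
my %accepted;
for my $input (@inputs) {
    $accepted{$input} = {
        loose      => (is_romaji ($input) ? 1 : 0),
        semistrict => (is_romaji_semistrict ($input) ? 1 : 0),
        strict     => (is_romaji_strict ($input) ? 1 : 0),
    };
}

for my $input (@inputs) {
    printf "%-10s loose=%d semistrict=%d strict=%d\n",
        $input, @{$accepted{$input}}{qw(loose semistrict strict)};
}
```

According to the examples above, "loyehye" should pass only the loosest test, while "atarimae" should pass all three.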
=head2 is_voiced
use Lingua::JA::Moji 'is_voiced';
if (is_voiced ('が')) {
print "が is voiced.\n";
}
Given a kana or romaji input, C<is_voiced> returns a true value if the
sound is a voiced sound like I<a>, I<za>, I<ga>, etc. and the
undefined value if not.
=head2 kana2romaji
Convert kana to romaji.
use Lingua::JA::Moji 'kana2romaji';
$romaji = kana2romaji ("うれしいこども");
# $romaji = 'uresîkodomo'
Convert kana to a romanized form.
An optional second argument, a hash reference, controls the style of
conversion.
use utf8;
$romaji = kana2romaji ("しんぶん", {style => "hepburn"});
# $romaji = "shimbun"
The options are
=over
=item style
The style of romanization. The default style of romanization is
"Nippon-shiki". The user can set the conversion style to "hepburn" or
"passport" or "kunrei" or "common". If Hepburn is selected, then the
following option C<use_m> is set to "true", and the C<ve_type> is set
to "macron". The "common" style is the same as the Hepburn style, but
it does things like changing "ジェット" to "jetto" rather than
ignoring the small vowel.
Possible styles are as follows:
=over
=item none/empty
Without a style, the L<Nippon-shiki
romanization|https://www.sljfaq.org/afaq/nippon-shiki.html> is
used. This is the only romanisation style which allows round trips from kana to romanised and back.
=item common
This is a modification of the Hepburn system which also changes
combinations of large kana + small vowel kana into the commonest
romanized form. For example "ジェット" becomes "jetto" and "ウェ"
becomes "we".
=item hepburn
This gives L<Hepburn
romanization|https://www.sljfaq.org/afaq/hepburn.html>. This is
strictly defined to be the actual Hepburn system, so you may prefer to
use L</common> if your kana contains things like ファ which you want
to turn into "fa".
=item kunrei
This gives L<Kunrei-shiki romanisation|https://www.sljfaq.org/afaq/kunrei-shiki.html>, the form
of romanisation used in children's education. This is similar to Nippon-shiki except for a few consonant-vowel combinations.
=item passport
This gives "passport romaji" where long "o" vowels get turned into
"oh" and other long vowels are deleted. In this system "おおの" turns
into "ohno" and "ゆうすけ" turns into "yusuke".
=back
=item use_m
If this is true, L</syllabic n>s (ん) which come before "b" or "p"
sounds, such as the first "n" in "shinbun" (しんぶん, newspaper) will
be converted into "m" rather than "n".
It is automatically set to a true value if you choose L</hepburn> or
L</passport> styles of romanisation, but you can override that by
setting it to a false, but not undefined, value, something like this:
my $romaji = kana2romaji ($hiragana,
{style => 'hepburn',
ve_type => 'wapuro',
use_m => 0,});
I apologise for the convoluted interface. See L</HISTORY> for more on
the haphazard design of the module.
=item ve_type
The C<ve_type> option controls how long vowels are written. The
default is to use circumflexes to represent long vowels. If C<style>
is set to C<hepburn> or C<common>, the default is set to use
macrons. If C<style> is set to C<passport>, the value of C<ve_type> is
also set to C<passport>. The choices are:
=over
=item undef
A circumflex is used.
=item macron
A macron is used.
=item passport
"Oh" is used to write long "o" vowels, and other long vowels are
ignored.
=item none
Long vowels are not indicated.
=item wapuro
The L</chouon> marks become hyphens, and おう becomes ou.
=back
=item wo
kana2romaji ("ちりぬるを", { wo => 1 });
If "wo" is set to a true value, "を" becomes "wo", otherwise it
becomes "o".
=back
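The options above can be combined in one call. This is a small sketch using only the inputs already mentioned in this section; the variable names are illustrative, and the comments repeat what the text above says rather than adding new guarantees.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(kana2romaji);

binmode STDOUT, ':encoding(UTF-8)';

my $kana = 'しんぶん';

# Hepburn switches use_m on, so ん before "b" is written "m".
my $hepburn = kana2romaji ($kana, {style => 'hepburn'});

# The same style, with use_m overridden by a false but defined value.
my $no_m = kana2romaji ($kana, {style => 'hepburn', use_m => 0});

# Passport style, using the input from the "passport" item above.
my $passport = kana2romaji ('おおの', {style => 'passport'});

# "wo" only affects how を is written.
my $wo = kana2romaji ('を', {wo => 1});

print "$_\n" for $hepburn, $no_m, $passport, $wo;
```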
=head2 kana_consonant
use Lingua::JA::Moji 'kana_consonant';
$consonant = kana_consonant ('ざる');
# $consonant = 's'
Given a kana input, return the "dictionary order" consonant of the
first kana. If the first kana is any of あいうえお, it returns an
empty string. If the kana is an unvoiced kana, it returns the
corresponding consonant of the first kana in the Nippon-shiki
romanisation. If the kana is a voiced kana, it returns the
corresponding consonant of the unvoiced version of the first kana in
the Nippon-shiki romanisation.
This enables Japanese words to be sorted into the order used in
Japanese dictionaries, where the voiced/unvoiced distinction between,
for example, za and sa, or ta and da, is ignored.
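As a rough illustration of the dictionary-order idea, the following sketch groups words under the value returned by C<kana_consonant>. The word list and the C<%index> hash are made up for the example.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(kana_consonant);

binmode STDOUT, ':encoding(UTF-8)';

my @words = ('ざる', 'さる', 'あめ', 'たこ', 'だし');

# Bucket each word under the consonant reported for its first kana.
# Voiced and unvoiced kana share a bucket; vowels use the empty string.
my %index;
for my $word (@words) {
    my $key = kana_consonant ($word);
    push @{$index{$key}}, $word;
}

for my $key (sort keys %index) {
    my $label = length ($key) ? $key : '(vowel)';
    print "$label: @{$index{$key}}\n";
}
```

Here ざる and さる land in the same group, as do たこ and だし, which is the behaviour the paragraph above describes.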
=head2 normalize_romaji
use Lingua::JA::Moji 'normalize_romaji';
$normalized = normalize_romaji ('tsumuji');
C<normalize_romaji> converts romanized Japanese to a canonical form,
which is based on the Nippon-shiki romanization, but without
representing long vowels using a circumflex. In the canonical form,
sokuon (っ) characters are converted into the string "xtu". If there
is kana in the input string, this will also be converted to romaji.
C<normalize_romaji> is for comparing two Japanese words which may be
represented in different ways, for example in different romanization
systems, to see if they refer to the same word despite the difference
in writing. It does not provide a standardized or
officially-sanctioned form of romanization.
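A sketch of the comparison use case described above. The helper C<same_word> is hypothetical and not part of the module; it only compares the normalized forms of its two arguments.

```perl
use strict;
use warnings;
use Lingua::JA::Moji qw(normalize_romaji);

# Hypothetical helper: true if both inputs normalize to the same string.
sub same_word
{
    my ($first, $second) = @_;
    return normalize_romaji ($first) eq normalize_romaji ($second);
}

# A Hepburn-style and a Nippon-shiki-style spelling of one word.
my $same = same_word ('tsumuji', 'tumuzi');
print $same ? "same word\n" : "different words\n";
```

Because the normalized form is only a comparison key, it is probably best not to display it to users.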
=head2 romaji2hiragana
Convert romaji to hiragana.
use Lingua::JA::Moji 'romaji2hiragana';
$hiragana = romaji2hiragana ('babubo');
# $hiragana = 'ばぶぼ'
Convert romanized Japanese into hiragana. This takes the same options
as L</romaji2kana>. It also switches on the "wapuro" option, which
writes long vowels with the equivalent kana rather than a L</chouon>.
=head2 romaji2kana
Convert romaji to kana.
use Lingua::JA::Moji 'romaji2kana';
$kana = romaji2kana ('yamaguti');
# $kana = 'ヤマグチ'
Convert romanized Japanese to katakana. The romanization is highly
liberal and will attempt to convert any romanization it sees into
katakana. The rules of romanization are based on the behaviour of the
Microsoft IME (input method editor). To convert romanized Japanese
into hiragana, use L</romaji2hiragana>.
An optional second argument to the function contains options in the
form of a hash reference,
$kana = romaji2kana ($romaji, {wapuro => 1});
Use an option C<< wapuro => 1 >> to convert long vowels into the
equivalent kana rather than L</chouon>.
$kana = romaji2kana ($romaji, {ime => 1});
Use the C<< ime => 1 >> option to approximate the behaviour of an
IME. For example, input "gumma" becomes グッマ and input "onnna"
becomes オンナ. Passport romaji ("Ohshimizu") is disallowed if this
option is switched on.
See also L</is_romaji>, L</is_romaji_strict>, and
L</is_romaji_semistrict> for validation of romanised Japanese inputs.
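The following sketch puts the katakana and hiragana converters next to each other and shows the C<ime> option, using only inputs quoted in this documentation. Variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(romaji2kana romaji2hiragana);

binmode STDOUT, ':encoding(UTF-8)';

# Default conversion produces katakana.
my $katakana = romaji2kana ('yamaguti');

# The same kind of input converted to hiragana instead.
my $hiragana = romaji2hiragana ('babubo');

# With ime => 1 the doubled "m" is treated as an IME would treat it.
my $ime = romaji2kana ('gumma', {ime => 1});

print "$katakana $hiragana $ime\n";
```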
=head2 romaji_styles
use Lingua::JA::Moji 'romaji_styles';
my @styles = romaji_styles ();
# Returns a true value
romaji_styles ("hepburn");
# Returns the undefined value
romaji_styles ("frogs");
Given an argument, this returns a true value if it is a known style of
romanization.
Without an argument, it returns a list of possible styles, as an array
of hash references, with each hash reference containing the short name
under the key "abbrev" and the full name under the key "full_name".
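One possible use is checking a style name supplied by a user before passing it to L</kana2romaji>. In this sketch the wrapper C<romanize_with_style> is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(kana2romaji romaji_styles);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical wrapper: romanize only if the style name is known.
sub romanize_with_style
{
    my ($kana, $style) = @_;
    if (! romaji_styles ($style)) {
        warn "Unknown romanization style '$style'\n";
        return;
    }
    return kana2romaji ($kana, {style => $style});
}

my $ok  = romanize_with_style ('しんぶん', 'hepburn');
my $bad = romanize_with_style ('しんぶん', 'frogs');

# Without an argument, list the known styles by short and full name.
for my $style (romaji_styles ()) {
    print "$style->{abbrev}: $style->{full_name}\n";
}
```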
=head2 romaji_vowel_styles
use Lingua::JA::Moji 'romaji_vowel_styles';
Returns a list of valid styles of romaji vowels.
=head1 KANA
These functions convert one form of kana into another.
=head2 cleanup_kana
use Lingua::JA::Moji 'cleanup_kana';
This function converts any of hiragana, halfwidth katakana, or romaji
input into katakana. It also converts various confusable kanji
characters into kana. For example the "one" kanji 一 is converted into
a L</chouon>, ー, and the "mouth" kanji 口 is converted into the
katakana ロ (ro).
This is used as the "front end" function for L<this katakana to
English web application|https://www.sljfaq.org/cgi/k2e.cgi>.
This was added to the module in version L</0.46>.
=head2 hira2kata
Convert hiragana to katakana.
use Lingua::JA::Moji 'hira2kata';
$katakana = hira2kata ('ひらがな');
# $katakana = 'ヒラガナ'
C<hira2kata> converts hiragana into katakana. The input may be a
single string or a list of strings. If the input is a list, it
converts each element of the list, and in list context it returns a
list of the converted inputs. In scalar context it returns a
concatenation of the strings.
my @katakana = hira2kata (@hiragana);
This does not convert L</chouon> signs.
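The difference between list and scalar context described above can be seen in a short sketch. The input words are arbitrary.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(hira2kata);

binmode STDOUT, ':encoding(UTF-8)';

my @hiragana = ('ひらがな', 'かな');

# List context: one converted string per input string.
my @katakana = hira2kata (@hiragana);

# Scalar context: the converted strings are concatenated.
my $joined = hira2kata (@hiragana);

print "@katakana\n";
print "$joined\n";
```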
=head2 hw2katakana
Convert halfwidth katakana to katakana.
use Lingua::JA::Moji 'hw2katakana';
$full_width = hw2katakana ('ｱｲｳｶｷｷﾞｮｳ｡');
# $full_width = 'アイウカキギョウ。'
C<hw2katakana> converts L</halfwidth katakana> and halfwidth Japanese
punctuation to fullwidth katakana and fullwidth punctuation. Its
function is similar to the Emacs command
C<japanese-zenkaku-region>. For the opposite function, see L</kana2hw>.
=head2 InHankakuKatakana
use Lingua::JA::Moji 'InHankakuKatakana';
use utf8;
if ('ｱ' =~ /\p{InHankakuKatakana}/) {
print "ｱ is half-width katakana\n";
}
C<InHankakuKatakana> is a character class for use in regular
expressions with C<\p> which can validate L</halfwidth katakana>.
=head2 InKana
use Lingua::JA::Moji 'InKana';
$is_kana = ('アイウエオ' =~ /^\p{InKana}+$/);
# $is_kana = '1'
A character class for use in regular expressions which matches all
kana characters. This class catches meaningful combinations of
hiragana, katakana, halfwidth katakana, circled katakana, and katakana
combined words. It does not match the hentaigana characters of
Unicode.
This is a combination of the existing Perl character classes
C<Katakana>, C<InKatakana>, and C<InHiragana>, minus unassigned
characters, plus the "halfwidth katakana prolonged sound mark"
(U+FF70) <ｰ> (chouon), the "halfwidth katakana voiced sound mark"
(U+FF9E) <ﾞ> (L</dakuten>) and the "halfwidth katakana semivoiced
sound mark" (U+FF9F) <ﾟ> (L</handakuten>), minus '・', Unicode 30FB,
"KATAKANA MIDDLE DOT". It is somewhat like the following:
qr/\p{Katakana}|\p{InKatakana}|\p{InHiragana}|ｰ|ﾞ|ﾟ/
except that the unassigned points which are matched by C<\p{Katakana}>
are not matched and KATAKANA MIDDLE DOT is not matched.
=head2 is_hiragana
use Lingua::JA::Moji 'is_hiragana';
This function returns a true value if its argument is a string of
hiragana, and an undefined value if not. The entire string from
beginning to end must all be kana for this to return true. The kana
cannot include punctuation marks or L</chouon>.
=head2 is_kana
use Lingua::JA::Moji 'is_kana';
This function returns a true value if its argument is a string of
kana, or an undefined value if not. The input cannot contain
punctuation or L</chouon>.
=head2 is_katakana
use Lingua::JA::Moji 'is_katakana';
Returns a true value if the string is katakana. At present this does
not handle halfwidth katakana or squared katakana symbols.
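The three validators L</is_hiragana>, L</is_kana>, and L</is_katakana> have no examples above, so here is a small sketch of how they might be used together. The C<%result> hash is an illustrative name.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(is_hiragana is_katakana is_kana);

binmode STDOUT, ':encoding(UTF-8)';

# Record a 1 or 0 for each test on a hiragana and a katakana word.
my %result;
for my $text ('ひらがな', 'カタカナ') {
    $result{$text} = {
        hiragana => (is_hiragana ($text) ? 1 : 0),
        katakana => (is_katakana ($text) ? 1 : 0),
        kana     => (is_kana ($text) ? 1 : 0),
    };
    printf "%s: hiragana=%d katakana=%d kana=%d\n",
        $text, @{$result{$text}}{qw(hiragana katakana kana)};
}
```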
=head2 is_small
use Lingua::JA::Moji 'is_small';
$is_small = is_small ('ぁ');
Returns a true value for small kana, kana which have a bigger version
as well, such as ぁ and あ.
=head2 join_sound_marks
use Lingua::JA::Moji 'join_sound_marks';
$joined = join_sound_marks ('か゛は゜つ゛');
# $joined = 'がぱづ'
Join L</dakuten> and L</handakuten> (Unicode U+3099-U+309C) to kana
where possible. Where they cannot be joined, strip them out. This only
works on full width kana. The return value is the joined text.
This was added to the module in version L</0.53>.
=head2 kana2hw
Convert kana to halfwidth katakana.
use Lingua::JA::Moji 'kana2hw';
$half_width = kana2hw ('あいウカキぎょう。');
# $half_width = 'ｱｲｳｶｷｷﾞｮｳ｡'
C<kana2hw> converts hiragana, katakana, and fullwidth Japanese
punctuation to L</halfwidth katakana> and halfwidth punctuation. Its
function is similar to the Emacs command C<japanese-hankaku-region>.
For the opposite function, see L</hw2katakana>. See also
L</katakana2hw> for a function which only converts katakana.
=head2 kana2katakana
Convert kana to katakana.
use Lingua::JA::Moji 'kana2katakana';
This converts any of katakana, L</halfwidth katakana>, circled
katakana and hiragana to full width katakana. It also joins
L</dakuten> and L</handakuten> marks to kana where possible, or
removes them, using L</join_sound_marks>.
=head2 kana_to_large
use Lingua::JA::Moji 'kana_to_large';
$large = kana_to_large ('ぁあぃい');
# $large = 'ああいい'
Convert small-sized kana such as 「ぁ」 into full-sized kana such as
「あ」.
=head2 kata2hira
Convert katakana to hiragana.
use Lingua::JA::Moji 'kata2hira';
$hiragana = kata2hira ('カキクケコ');
# $hiragana = 'かきくけこ'
C<kata2hira> converts full-width katakana into hiragana. If the input
is a list, it converts each element of the list, and in list context,
returns a list of the converted inputs, otherwise it returns a
concatenation of the strings.
my @hiragana = kata2hira (@katakana);
This function does not convert L</chouon> signs into long vowels. It
also does not convert half-width katakana into hiragana.
=head2 katakana2hw
Convert katakana to halfwidth katakana.
use Lingua::JA::Moji 'katakana2hw';
$hw = katakana2hw ("あいうえおアイウエオ");
# $hw = 'あいうえおｱｲｳｴｵ'
This converts katakana to L</halfwidth katakana>, leaving hiragana
unchanged. See also L</kana2hw>.
=head2 katakana2square
use Lingua::JA::Moji 'katakana2square';
$sq = katakana2square ('カロリーアイウエオウォン');
# $sq = '㌍アイウエオ㌆'
Convert katakana words into the squared katakana characters of Unicode, such as ㌍ for カロリー, where they exist.
=head2 katakana2syllable
use Lingua::JA::Moji 'katakana2syllable';
$syllables = katakana2syllable ('ソーシャルブックマークサービス');
This breaks the given string into syllables. If the string is broken
up character by character, it becomes 'ソ', 'ー', 'シ', 'ャ', 'ル'.
However, by themselves, 'ー' and 'ャ' can't be spoken.
This breaks the string up into pronounceable syllables, so that
C<$syllables> becomes 'ソー', 'シャ', 'ル'. A L</syllabic n> is
attached to the preceding sequence, so for example フラナガン is
broken up into four syllables, フ, ラ, ナ, ガン.
This routine is used as the basis of this L<Change your name to kanji
web application|https://www.sljfaq.org/cgi/name-kanji.cgi>. The name
is converted from English to kana, then this function is used to break
the kana name into pieces to which a kanji may be attached. It's also
used in L<this Katakana to English
converter|https://www.sljfaq.org/cgi/k2e.cgi> for the case that no
words can be matched, and suggestions are made for how to split the
word into possible components.
This was added to the module in version L</0.24>.
=head2 nigori_first
use Lingua::JA::Moji 'nigori_first';
my @list = (qw/カン スウ ハツ オオ/);
nigori_first (\@list);
# Now @list = (qw/カン スウ ハツ オオ ガン ズウ バツ パツ/);
Given a list of kana, add all the possible versions of the words with
the first kana with either a L</dakuten> or a L</handakuten> added.
This was intended for a search for a particular kanji in a
dictionary. It is not actually in use anywhere at the moment.
This was added to the module in version L</0.36>.
=head2 smallize_kana
use Lingua::JA::Moji 'smallize_kana';
$smallize = smallize_kana ('オキヤクサマガカツタ');
# $smallize = 'オキャクサマガカッタ'
Given katakana input, convert possible "old-style" kana usage with
large kana used for L</youon> or L</sokuon> into smaller kana. If the
conversion succeeds, return the converted value, otherwise return the
undefined value. (I found the undefined value works better as a return
value on failure than returning the text itself, since it saves the
user from having to check whether the text has changed.)
The conversion is not intelligent; it just attempts
to do as much as possible, so although it will work to convert
"shiyotsuchiyuu" ("シヨツチユウ") into "shotchuu" ("ショッチュウ"), it
will also do stupid things like converting "chiyoda" (ちよだ) into
"choda" (ちょだ).
This was added to the module in version L</0.46>.
=head2 split_sound_marks
use Lingua::JA::Moji 'split_sound_marks';
$split = split_sound_marks ('ガパヅ');
# $split = 'カ゛ハ゜ツ゛'
Split L</dakuten> and L</handakuten> from kana where possible. U+309B
and U+309C are chosen rather than U+3099 and U+309A. (This choice was
somewhat arbitrary. I'm not sure which of the pairs should be used. I
chose these because they were the ones already in use internally in
the module in L</kana2braille> and L</kana2morse>.) This only
works on full width kana. The return value is the split text.
This was added to the module in version L</0.53>.
=head2 square2katakana
use Lingua::JA::Moji 'square2katakana';
$kata = square2katakana ('㌆');
# $kata = 'ウォン'
Convert a square katakana box into its components.
=head2 strip_sound_marks
use Lingua::JA::Moji 'strip_sound_marks';
Strip sound marks from kana, so that for example パン (katakana pan)
becomes ハン (katakana han).
This was added to the module in version L</0.59>.
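The three sound mark functions, L</split_sound_marks>, L</join_sound_marks>, and L</strip_sound_marks>, can be tried together. This sketch reuses the inputs from the sections above; the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(split_sound_marks join_sound_marks strip_sound_marks);

binmode STDOUT, ':encoding(UTF-8)';

my $word = 'ガパヅ';

# The marks become separate characters (U+309B and U+309C).
my $split = split_sound_marks ($word);

# Joining attaches the marks to the preceding kana again.
my $joined = join_sound_marks ($split);

# Stripping removes the marks altogether.
my $stripped = strip_sound_marks ('パン');

print "$split $joined $stripped\n";
```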
=head1 HENTAIGANA
Variant kana forms. Hentaigana are new in Unicode 10.0 (June 2017).
=head2 hentai2kana
use Lingua::JA::Moji 'hentai2kana';
Convert hentaigana into hiragana. Hentaigana with multiple
interpretations are converted into a list of kana separated by a
middle dot character.
This was added to the module in version L</0.43>.
=head2 hentai2kanji
use Lingua::JA::Moji 'hentai2kanji';
$kanji = hentai2kanji ('𛀢');
# $kanji = '家'
Convert hentaigana into their equivalent kanji.
This was added to the module in version L</0.43>.
=head2 kana2hentai
use Lingua::JA::Moji 'kana2hentai';
$hentai = kana2hentai ('ケンブ');
# $hentai = '𛀢・𛀲・𛀳・𛀴・𛀵・𛀶・𛀷𛄝・𛄞𛂰・𛂱・𛂲゛'
Convert kana to equivalent hentaigana. If more than one hentaigana
exists, they are returned joined with a middle dot. The L</dakuten>
and L</handakuten> are split out of the kana using
L</split_sound_marks> before the conversion.
This was added to the module in version L</0.43>.
=head2 kanji2hentai
use Lingua::JA::Moji 'kanji2hentai';
$kanji = kanji2hentai ('家');
# $kanji = '𛀢'
Convert kanji to equivalent hentaigana, where they exist.
This was added to the module in version L</0.43>.
=head1 WIDE ASCII FUNCTIONS
Functions for handling L</wide ASCII>.
=head2 ascii2wide
Convert printable ASCII characters to wide ASCII characters.
use Lingua::JA::Moji 'ascii2wide';
$wide = ascii2wide ('abCE019');
# $wide = 'ａｂＣＥ０１９'
Convert ASCII into L</wide ASCII>. It also converts the ASCII space,
ASCII C<0x20> into a fullwidth space, C<U+3000>.
=head2 InWideAscii
use Lingua::JA::Moji 'InWideAscii';
use utf8;
if ('Ａ' =~ /\p{InWideAscii}/) {
print "Ａ is wide ascii\n";
}
This is a character class for use with \p which matches L</wide
ASCII>. It also matches the fullwidth space, C<U+3000>.
=head2 wide2ascii
Convert wide ASCII characters to printable ASCII characters.
use Lingua::JA::Moji 'wide2ascii';
$ascii = wide2ascii ('ａｂＣＥ０１９');
# $ascii = 'abCE019'
Convert L</wide ASCII> into ASCII. It also converts the fullwidth
space, C<U+3000>, into an ASCII space, ASCII C<0x20>.
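A typical use might be cleaning up form input typed with a Japanese input method. This sketch assumes a made-up input string containing wide ASCII and a fullwidth space; the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(ascii2wide wide2ascii InWideAscii);

binmode STDOUT, ':encoding(UTF-8)';

# Wide ASCII letters and digits with a fullwidth space (U+3000).
my $input = 'ａｂＣ　０１９';

# Only convert when the text actually contains wide ASCII.
my $ascii = $input;
if ($input =~ /\p{InWideAscii}/) {
    $ascii = wide2ascii ($input);
}

# Converting back should reproduce the original wide form.
my $wide = ascii2wide ($ascii);

print "$ascii\n$wide\n";
```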
=head1 OTHER TYPES OF LETTERING
=head2 braille2kana
Convert Japanese braille to kana.
use Lingua::JA::Moji 'braille2kana';
Converts Japanese braille (I<tenji>) into the equivalent katakana.
=head2 circled2kana
Convert circled katakana to kana.
use Lingua::JA::Moji 'circled2kana';
$kana = circled2kana ('㋐㋑㋒㋓㋔');
# $kana = 'アイウエオ'
This function converts the "circled katakana" of Unicode into
full-width katakana. See also L</kana2circled>.
=head2 kana2braille
Convert kana to Japanese braille.
use Lingua::JA::Moji 'kana2braille';
This converts kana into the equivalent Japanese braille (I<tenji>)
forms.
=head3 Bugs
This is not an adequate Japanese braille converter. Creating Japanese
braille requires breaking Japanese sentences up into individual words,
but this does not attempt to do that. People who are interested in
building a Perl braille converter could start here.
=head2 kana2circled
Convert kana to circled katakana.
use Lingua::JA::Moji 'kana2circled';
$circled = kana2circled ('アイウエオガン');
# $circled = '㋐㋑㋒㋓㋔㋕゛ン'
This function converts kana into the "circled katakana" of Unicode,
which have code points from 32D0 to 32FE. See also L</circled2kana>.
There is no circled form of the ン kana, L</syllabic n>, so this is
left untouched. The L</dakuten> and L</handakuten> are split from the
kana using L</split_sound_marks>.
Circled katakana appear as Unicode code points U+32D0 to U+32FE.
=head2 kana2morse
Convert kana to Japanese morse code (wabun code).
use Lingua::JA::Moji 'kana2morse';
$morse = kana2morse ('ショッチュウ');
# $morse = '--.-. -- .--. ..-. -..-- ..-'
Convert Japanese kana into Morse code. Japanese morse code does not
have any way of representing small kana characters, so converting to
and then from morse code will result in ショッチュウ becoming シヨツチユウ.
The function L</smallize_kana> may work to fix these outputs in some cases.
=head2 morse2kana
Convert Japanese morse code (wabun code) to kana.
use Lingua::JA::Moji 'morse2kana';
$kana = morse2kana ('--.-. -- .--. ..-. -..-- ..-');
# $kana = 'シヨツチユウ'
Convert Japanese Morse code into kana. Each Morse code element must be separated by whitespace from the next one.
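The round trip described under L</kana2morse> can be sketched as follows, with L</smallize_kana> used to try to restore the small kana afterwards. The fallback to the decoded text is an assumption about how a caller might handle an undefined return value.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(kana2morse morse2kana smallize_kana);

binmode STDOUT, ':encoding(UTF-8)';

my $original = 'ショッチュウ';

# Encode to Morse code, then decode; small kana come back full-sized.
my $morse   = kana2morse ($original);
my $decoded = morse2kana ($morse);

# Try to restore the small kana, keeping the decoded text on failure.
my $restored = smallize_kana ($decoded) // $decoded;

print "$morse\n$decoded\n$restored\n";
```

As noted under L</smallize_kana>, the restoration is not intelligent, so it should not be applied blindly to arbitrary words.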
=head1 KANJI
=head2 bad_kanji
use Lingua::JA::Moji 'bad_kanji';
my @bad_kanji = bad_kanji ();
Returns a list of kanji with negative meanings. See also
L<https://www.lemoda.net/japanese/offensive-kanji/index.html>.
This was added to the module in version L</0.47>.
=head2 bracketed2kanji
use Lingua::JA::Moji 'bracketed2kanji';
$kanji = bracketed2kanji ('㈱');
# $kanji = '株'
Convert bracketed form of kanji into unbracketed form.
=head2 circled2kanji
use Lingua::JA::Moji 'circled2kanji';
$kanji = circled2kanji ('㊯');
# $kanji = '協'
Convert the circled forms of kanji into their uncircled equivalents.
=head2 kanji2bracketed
use Lingua::JA::Moji 'kanji2bracketed';
$kanji = kanji2bracketed ('株');
# $kanji = '㈱'
Convert an unbracketed form of kanji into bracketed form, if it
exists, otherwise do nothing with it.
=head2 kanji2circled
use Lingua::JA::Moji 'kanji2circled';
$kanji = kanji2circled ('協嬉');
# $kanji = '㊯嬉'
Convert the usual forms of kanji into circled equivalents, if they
exist. Note that only a limited number of kanji have circled forms.
=head2 new2old_kanji
Convert Modern kanji to Pre-1949 kanji.
use Lingua::JA::Moji 'new2old_kanji';
$old = new2old_kanji ('三国 連太郎');
# $old = '三國 連太郎'
Convert new-style (post-1949) kanji (Chinese characters) into old-style (pre-1949) kanji.
=head3 Bugs
The list of characters in this converter may not contain every pair of
old/new kanji.
It will not correctly convert 弁 since this has three different
equivalents in the old system.
=head2 old2new_kanji
Convert Pre-1949 kanji to Modern kanji.
use Lingua::JA::Moji 'old2new_kanji';
$new = old2new_kanji ('櫻井');
# $new = '桜井'
Convert old-style (pre-1949) kanji (Chinese characters) into new-style
(post-1949) kanji.
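One way to use these two functions is to compare names written with different kanji forms. In this sketch the helper C<same_name> is hypothetical and not part of the module; it converts both sides to new-style kanji before comparing.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw(new2old_kanji old2new_kanji);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: compare two names after converting both to new-style kanji.
sub same_name
{
    my ($first, $second) = @_;
    return old2new_kanji ($first) eq old2new_kanji ($second);
}

my $old   = new2old_kanji ('三国 連太郎');
my $new   = old2new_kanji ('櫻井');
my $match = same_name ('櫻井', '桜井');

print "$old\n$new\n";
print $match ? "same name\n" : "different names\n";
```

Comparing in the new-style direction avoids the ambiguity noted above for characters such as 弁, which have more than one old-style equivalent.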
=head2 yurei_moji
use Lingua::JA::Moji 'yurei_moji';
my @yurei = yurei_moji ();
Returns a list of the yurei moji (幽霊文字), kanji which don't
actually exist but were mistakenly included in a computer
standard. See L<https://www.sljfaq.org/afaq/yuureimoji.html> for more
information.
This was added to the module in version L</0.47>.
=head1 CYRILLIZATION
This is an experimental cyrillization of kana based on the information
in a Wikipedia article,
L<http://en.wikipedia.org/wiki/Cyrillization_of_Japanese>. The module
author does not know anything about cyrillization of kana, so any
assistance in correcting this is very welcome.
=head2 cyrillic2katakana
Convert the Cyrillic (Russian) alphabet to katakana.
use Lingua::JA::Moji 'cyrillic2katakana';
$kana = cyrillic2katakana ('симбун');
# $kana = 'シンブン'
=head2 kana2cyrillic
Convert kana to the Cyrillic (Russian) alphabet.
use Lingua::JA::Moji 'kana2cyrillic';
$cyril = kana2cyrillic ('シンブン');
# $cyril = 'симбун'
=head1 HANGUL (KOREAN LETTERS)
=head2 kana2hangul
use Lingua::JA::Moji 'kana2hangul';
$hangul = kana2hangul ('すごわざ');
# $hangul = '스고와자'
=head3 Bugs
=over
=item May be incorrect
This is based on lists found on the internet at
L<http://kajiritate-no-hangul.com/kana.html> and
L<http://lesson-hangeul.com/50itiranhyo.html>. There is currently no
proof of correctness.
=item No reverse conversion
There is currently no hangul to kana conversion.
=back
=head1 SEE ALSO
Other Perl modules on CPAN include
=head2 Japanese kana/romanization
=over
=item L<Data::Validate::Japanese>
This contains four validators for kanji and kana, C<is_hiragana>,
corresponding to L</is_hiragana> in this module, and three more,
C<is_kanji>, C<is_katakana>, and C<is_h_katakana>, for half-width
katakana.
=item L<Lingua::JA::Fold>
Full/half width conversion, collation of Japanese text, including
handling of line breaks.
=item L<Lingua::JA::Hepburn::Passport>
Passport romanization, which means converting long vowels into
"OH". This corresponds to L</kana2romaji> in the current module using
the C<passport> style, for example
$romaji = kana2romaji ("かとう", {style => 'passport'});
=item L<Lingua::JA::Jtruncate>
Handle character boundaries over bytes in the old Japanese encodings
EUC, JIS, and Shift-JIS, for people who don't like converting to
Unicode.
Until about 2008, I used to use CP932 (Microsoft variant of Shift-JIS)
in Perl programs, until I had the bad experience of tracking down a
very strange bug caused by the "kanji space", U+3000, containing an @
mark when written in CP932, and being interpreted by Perl as an array.
=item L<Lingua::JA::Kana>
This contains convertors for hiragana, half width and full width
katakana, and romaji. As of version 0.07 [Aug 06, 2012], the romaji
conversion is less complete than this module.
=item L<Lingua::JA::NormalizeText>
A huge collection of normalization functions for Japanese text. If
Lingua::JA::Moji does not have it, Lingua::JA::NormalizeText may do.
=item L<Lingua::JA::Onbiki>
Convert a Japanese tilde character into the appropriate vowel. To
achieve this with Lingua::JA::Moji, see the following example:
use utf8;
use Lingua::JA::Moji ':all';
for (qw/あったか〜い つめた〜い ん〜 アッタカ〜イ/) {
my $word = $_;
while ($word =~ /(\p{InKana})〜/ && $1 ne 'ん') {
my $kana = $1;
my $romaji = kana2romaji ($kana);
$romaji =~ s/[^aiueo]//g;
my $vowel = romaji2kana ($romaji);
if ($kana =~ /\p{InHiragana}/) {
$vowel = kata2hira ($vowel);
}
$word =~ s/$kana〜/$kana$vowel/g;
}
print "$_ -> $word\n";
}
produces output
あったか〜い -> あったかあい
つめた〜い -> つめたあい
ん〜 -> ん〜
アッタカ〜イ -> アッタカアイ
(This example is included as L<F<onbiki.pl>|https://fastapi.metacpan.org/source/BKB/Lingua-JA-Moji-0.60/examples/onbiki.pl> in the distribution.)
=item L<Lingua::JA::Regular::Unicode>
This includes hiragana to katakana, full width / half width, and wide
ascii conversion. The strange name is due to its being an extension of
L<Lingua::JA::Regular> using Unicode-encoded strings.
=item L<Lingua::JA::Romaji>
Romaji to kana/kana to romaji conversion.
=item L<Lingua::JA::Romaji::Valid>
Validate romanized Japanese. This module does something similar to
L</is_romaji>, L</is_romaji_strict>, and L</is_romaji_semistrict> in
Lingua::JA::Moji, but it has some extra options as well.
=item L<Lingua::JA::Romanize::Japanese>
Romanization of Japanese. The module also includes romanization of
kanji via the kakasi kanji to romaji convertor, and other functions.
=back
=head2 Kana/kanji conversion
=over
=item L<Lingua::JA::Romanize::Japanese>
Romanization of Japanese language via kakasi.
=item L<Lingua::JA::Romanize::MeCab>
Romanization of Japanese language with MeCab
=item L<Text::MeCab>
=back
=head2 Related modules
=over
=item L<Data::HanConvert>
🐉 "The data for converting between traditional and simplified Chinese
languages"
=item L<Encode::CNMap>
🐉 "enhanced Chinese encodings with Simplified-Traditional auto-mapping"
=item L<Encode::HanConvert>
🐉 "Traditional and Simplified Chinese mappings"
=item L<Lingua::KO::Munja>
This is similar to the present module for Korean.
=item L<Lingua::ZH::HanConvert>
🐉 "Convert between Traditional and Simplified Chinese characters"
=item L<Regexp::Chinese::TradSimp>
🐉 "Take a string containing Chinese text, and turn it into a
traditional-simplified-insensitive regexp."
=back
=head2 Books
Parts of this module are covered in the book "Perl CPAN Module Guide"
by Naoki Tomita (in Japanese), ISBN 978-4862671080, published by
WEB+DB PRESS plus, April 2011.
=head1 NOTES
This section explains some of the Japanese-language-specific
terminology used elsewhere in the documentation. The headers in this
section are in lower case for the benefit of internal documentation
links. The explanatory links here go to the "sci.lang.japan Frequently
Asked Questions", a Usenet FAQ about the Japanese language.
=over
=item chouon
The long vowel marker, "ー", or I<chōon>, is used in Japanese
katakana to indicate a lengthened vowel. See L<What is the long line symbol used in katakana?|http://www.sljfaq.org/afaq/chouon.html>.
=item dakuten
=item handakuten
Dakuten, 濁点, literally "voicing mark", and handakuten, 半濁点,
literally "half voicing mark", are diacritic marks which appear on
some kana to change their consonant. Dakuten marks a voiced consonant,
as in が (ga) from か (ka), and handakuten marks a "p" sound, as in
ぱ (pa) from は (ha). In modern Japanese encodings, these marks are
usually encoded as part of the kana, but in L</halfwidth katakana>
they are encoded as separate characters following the kana to
reduce the number of characters which need to be encoded.
This module offers L</split_sound_marks> to dissociate the marks from
kana and L</join_sound_marks> to associate them with kana again. These
may be used, for example, for Morse code, braille, or halfwidth kana
conversion. It also offers L</strip_sound_marks>, which removes all
dakuten and handakuten from text.
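A minimal sketch of how these three functions might be called. It assumes only that each takes a string and returns the converted string, as the other functions in this module do. The sample text and variable names are illustrative, and no particular output is asserted.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw/split_sound_marks join_sound_marks strip_sound_marks/;

binmode STDOUT, ':encoding(UTF-8)';

# Illustrative input: katakana carrying dakuten and handakuten.
my $kana = 'ガパヅ';

# Dissociate the marks from the kana they modify.
my $split = split_sound_marks ($kana);
# Associate the separated marks with the preceding kana again.
my $joined = join_sound_marks ($split);
# Remove the marks altogether.
my $stripped = strip_sound_marks ($kana);

print "split: $split\n";
print "joined: $joined\n";
print "stripped: $stripped\n";
```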
=item halfwidth katakana
Halfwidth katakana, I<hankaku katakana> (半角かたかな), is a legacy
encoding of katakana based on an eight-bit encoding. See
L<What is half-width katakana?|http://www.sljfaq.org/afaq/half-width-katakana.html>
for full details.
=item sokuon
Sokuon, 促音, is the use of a small kana tsu to indicate a doubled
consonant. This smaller letter was not used in some kinds of older
encoding such as Morse code.
=item syllabic n
In this document, "syllabic n" means the kana ん or ン. See L<What is syllabic n?|http://www.sljfaq.org/afaq/syllabic-n.html> for full details.
=item wide ASCII
Wide ASCII, fullwidth ASCII, or I<zenkaku eisūji> (全角英数字) is a
legacy of bitmapped fonts which has survived into the present
day. "Wide ASCII" characters were originally special bitmapped font
characters created to be the same size as one kanji or kana
character. The name for normal ASCII characters in Japanese is
I<hankaku eisūji> (半角英数字), literally "half width English letters
and numerals". See L<What is "wide ASCII"?|http://www.sljfaq.org/afaq/wide-ascii.html> for full details.
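In Unicode, the 94 printable ASCII characters U+0021 to U+007E have fullwidth counterparts at U+FF01 to U+FF5E, in the same order. The following sketch uses only core Perl to illustrate that correspondence. The two helper functions are hypothetical names invented for this illustration, they are not part of Lingua::JA::Moji, and they do not show how this module performs the conversion. The ideographic space (U+3000) is deliberately left out.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper, not part of Lingua::JA::Moji: map the fullwidth
# forms U+FF01..U+FF5E onto the ASCII characters U+0021..U+007E.
sub fullwidth_to_ascii
{
    my ($text) = @_;
    $text =~ tr/\x{FF01}-\x{FF5E}/\x{21}-\x{7E}/;
    return $text;
}

# Hypothetical inverse of the above.
sub ascii_to_fullwidth
{
    my ($text) = @_;
    $text =~ tr/\x{21}-\x{7E}/\x{FF01}-\x{FF5E}/;
    return $text;
}

my $wide = 'Ｐｅｒｌ５';
my $narrow = fullwidth_to_ascii ($wide);
print "$narrow\n";
print ascii_to_fullwidth ($narrow), "\n";
```

Because both ranges have the same length and order, one C<tr> operation per direction is enough.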
=item youon
Youon (拗音) means the use of kana ending in "i" with a small ya, yu,
or yo kana, such as しゃ (sha) or きょ (kyo). These are called
"glides" by linguists.
=back
=head1 EXPORT
This module exports its functions only on request. To export all the
functions in the module,
use Lingua::JA::Moji ':all';
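To import only some of the functions, name them in the import list. This sketch reuses the two functions and the sample text from the L</SYNOPSIS>.

```perl
use strict;
use warnings;
use utf8;
# Request only the two functions needed, instead of ':all'.
use Lingua::JA::Moji qw/kana2romaji romaji2kana/;

binmode STDOUT, ':encoding(UTF-8)';

my $romaji = kana2romaji ('あいうえお');
print "$romaji\n";
my $kana = romaji2kana ($romaji);
print "$kana\n";
```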
=head1 DEPENDENCIES
=over
=item L<Carp>
Carp is used to report errors.
=item L<Convert::Moji>
This is used for most of the work of the module.
=item L<JSON::Parse>
This is used to read in some of the data.
=back
=head1 ACKNOWLEDGEMENTS
Thanks to Naoki Tomita, David Steinbrunner, and Neil Bowers for fixes.
=head1 HISTORY
"Moji" (文字) means "letters" in Japanese. I started Lingua::JA::Moji
out of a need for more comprehensive handling of Japanese text than
was offered by any of the existing modules on CPAN. There were a lot
of modules offering piecemeal romaji/kana conversions or
hiragana/katakana conversions, but nothing comprehensive or
robust. Lingua::JA::Moji was originally a private module. Most of the
functions in the module are things I needed for my own projects.
The design using L<Convert::Moji> was part of an abandoned plan to
make a cross-language module which could produce, say, a JavaScript
converter doing the same things as this Perl one, using the same text
sources.
I wasn't really sure whether to release it, but eventually I released
it to CPAN as a result of requests for the source code of an online
romaji/kana converter by website users. The module interface, in
particular the hash reference options to L</kana2romaji> and
L</romaji2kana>, is rather messy, and some of the defaults are rather
strange, but since it was described in Naoki Tomita's book, and some
people may be using it as is, I'm not very keen to change it in
incompatible ways.
=over
=item 0.24
This version added L</katakana2syllable>.
=item 0.27
This version added L</is_romaji_strict>.
=item 0.36
This version added the L</nigori_first> function.
=item 0.37
This version added L</is_romaji_semistrict>.
=item 0.43
This version added support for
L<hentaigana|http://www.sljfaq.org/afaq/hentaigana.html>. This is
based on text copied and pasted from the Unicode 10.0 standard draft
documents. See the directory L<F<data>|https://github.com/benkasminbullock/Lingua-JA-Moji/tree/master/data> in the GitHub repository for the files used to
generate this data.
=item 0.46
This version disallowed hyphens as the first character of a romaji
string and added L</smallize_kana> and L</cleanup_kana>.
=item 0.47
This version added a list of the "Yūrei moji" (幽霊文字), false kanji, and
changed romanisation somewhat.
=item 0.48
This version changed L</kana2romaji> to be consistent with the
documentation for the long vowel options C<wapuro> and C<none>.
=item 0.53
This version added L</join_sound_marks> and L</split_sound_marks> to
the module.
=item 0.54
This version removed a function C<kana_order> from the module. It
improved the behaviour of L</is_romaji_strict> after comparing its
negatives and positives with a large number of English and nonsense
words. It improved the behaviour of L</smallize_kana> with regard to
the "tsu" kana. L</cleanup_kana> was improved to deal with stray
dakuten and handakuten and some other odd kanji/kana confusions.
=item 0.58
This version added L</kana_consonant>.
=item 0.59
This version added L</strip_sound_marks>.
=back
=head1 AUTHOR
Ben Bullock, <bkb@cpan.org>
=head1 COPYRIGHT & LICENCE
This package and associated files are copyright (C)
2008-2024
Ben Bullock.
You can use, copy, modify and redistribute this package and associated
files under the Perl Artistic Licence or the GNU General Public
Licence.