transliterator (version 0.1)
http://www.alanlittle.org/projects/transliterator/transliterator.py
Transliterate texts between Unicode and standard transliteration schemes.
Transliterate texts between non-Latin scripts and commonly used Latin
transliteration schemes. Uses standard Unicode character blocks --
e.g. DEVANAGARI U+0900 ... U+097F -- and transliteration schemes --
e.g. the IAST convention for transliteration of Sanskrit to Latin-with-dots.
The following character blocks and transliteration schemes are included:
    DEVANAGARI
        IAST
        ITRANS -- http://www.aczoom.com/itrans/#itransencoding (Sanskrit only)
        Harvard Kyoto
    CYRILLIC
        ISO 9:1995 (Russian only)
New character blocks and transliteration schemes can be added by creating
new CharacterBlock and TransliterationScheme objects.
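
To make the idea of a character block concrete, here is a minimal standalone sketch that does not use the transliterator module. It treats a block as a contiguous range of code points and reports which block a character falls in. The `BLOCKS` table and the `block_of` helper are hypothetical names introduced for illustration; the ranges are the DEVANAGARI and CYRILLIC block boundaries from the Unicode code charts.

```python
import unicodedata

# Block boundaries as published in the Unicode code charts.
BLOCKS = {
    "DEVANAGARI": range(0x0900, 0x0980),  # U+0900 ... U+097F inclusive
    "CYRILLIC": range(0x0400, 0x0500),    # U+0400 ... U+04FF inclusive
}


def block_of(char):
    """Return the name of the block containing `char`, or None (hypothetical helper)."""
    code_point = ord(char)
    for name, code_points in BLOCKS.items():
        if code_point in code_points:
            return name
    return None


# One Devanagari letter, one Cyrillic letter, one Latin letter.
for char in "\u0915\u0436a":
    print("U+%04X" % ord(char), unicodedata.name(char), block_of(char))
```

A transliteration scheme can then be thought of as a table from Latin tokens to code points inside one such block, which is the shape of the tables listed under Data.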
COMMAND LINE USAGE
----------------------------
    python transliterator.py text inputFormat outputFormat

Writes the transliterated text to stdout.

    text         -- the text to be transliterated, OR the name of a file
                    containing the text
    inputFormat  -- the name of the character block or transliteration scheme
                    that the text is to be transliterated FROM, e.g. 'CYRILLIC',
                    'IAST'. Not case-sensitive.
    outputFormat -- the name of the character block or transliteration scheme
                    that the text is to be transliterated TO, e.g. 'CYRILLIC',
                    'IAST'. Not case-sensitive.
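
The argument conventions above (text or file name, case-insensitive format names) could be handled along the following lines. This is a sketch under stated assumptions, not the module's actual code: `read_text_argument` and `normalise_format_name` are hypothetical helpers, the "existing file wins" rule is a guess at how the text/file ambiguity is resolved, and upper-casing is assumed only because the block names shown in this documentation are upper-case. The default encoding mirrors the 'inputEncoding' value listed in the module's options.

```python
import os


def read_text_argument(arg, encoding="utf-8"):
    """Return the contents of the file named `arg` if it exists, else `arg` itself.

    Hypothetical helper: the real module may resolve the ambiguity differently.
    """
    if os.path.isfile(arg):
        with open(arg, encoding=encoding) as handle:
            return handle.read()
    return arg


def normalise_format_name(name):
    """Fold a format name to upper case so 'iast' and 'IAST' compare equal (assumed convention)."""
    return name.strip().upper()


print(normalise_format_name("cyrillic"))
print(read_text_argument("yogazcittavRttinirodhaH"))
```

One consequence of the "existing file wins" rule is that a literal text which happens to match a file name in the working directory would be read as a file, so passing longer texts through a file is the less ambiguous route.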
USAGE
--------
Transliterate a text:
>>> import transliterator
>>> transliterator.transliterate('yogazcittavRttinirodhaH', 'harvardkyoto',
... 'devanagari', {'outputASCIIEncoded' : True})
'योगश्चित्तवृत्तिनिरोधः'
Create a new CharacterBlock and TransliterationScheme:
>>> import transliterator
>>> cb = transliterator.CharacterBlock('NEWBLOCK', range(0x901, 0x9FF))
>>> scheme = transliterator.TransliterationScheme(cb.name, 'NEWSCHEME',
... {'ab': 0x901, 'cd': 0x902})
>>> transliterator.transliterate('abcd', scheme, cb, {'outputASCIIEncoded' : True})
'ँं'
COPYRIGHT AND DISCLAIMER
------------------------------------
Transliterator is:
version 0.1 software - use at your own risk.
The IAST, ITRANS and Harvard-Kyoto transliteration schemes have been
tested for classical Sanskrit, not for any other language.
The Cyrillic alphabet and ISO 9:1995 transliteration (for Russian only)
are included but have been even more lightly tested than Devanagari.
Copyright (c) 2005 by Alan Little
By obtaining, using, and/or copying this software and/or its
associated documentation, you agree that you have read, understood,
and will comply with the following terms and conditions:
Permission to use, copy, modify, and distribute this software and
its associated documentation for any purpose and without fee is
hereby granted, provided that the above copyright notice appears in
all copies, and that both that copyright notice and this permission
notice appear in supporting documentation, and that the name of
the author not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior permission.
THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR
CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Data
HARVARDKYOTO = {"'": 2365, 'A': 2310, 'D': 2337, 'Dh': 2338, 'G': 2329, 'H': 2307, 'I': 2312, 'J': 2334, 'M': 2306, 'N': 2339, ...}
IAST = {"'": 2365, '.': 2404, '..': 2405, '0': 2406, '1': 2407, '2': 2408, '3': 2409, '4': 2410, '5': 2411, '6': 2412, ...}
ITRANS = {'.': 2404, '..': 2405, '.a': 2365, '.h': 2307, '.m': 2306, '.n': 2306, '0': 2406, '1': 2407, '2': 2408, '3': 2409, ...}
UNRECOGNISED_ECHO = 1
UNRECOGNISED_FAIL = 0
UNRECOGNISED_SUBSTITUTE = 2
__version__ = '0.1'
characterBlocks = {'CYRILLIC': {u'\u0400': <transliterator.TLCharacter object a... <transliterator.TLCharacter object at 0x2a98f0>}, 'DEVANAGARI': {u'\u0902': <transliterator.DevanagariCharacter ...iterator.DevanagariCharacter object at 0x233870>}}
options = {'handleUnrecognised': 0, 'inputEncoding': 'utf-8', 'outputASCIIEncoded': False, 'outputEncoding': 'utf-8', 'substituteChar': '?'}
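
The scheme tables in the Data listing map Latin tokens to decimal code points, and some tokens are prefixes of others ('D' and 'Dh'). The sketch below shows one common way to read such a table, greedy longest-match lookup, using a few entries copied from HARVARDKYOTO. It is not the module's algorithm: `lookup_tokens` is a hypothetical helper, it ignores Devanagari-specific rules such as inherent vowels and conjuncts, and the behaviour attached to the three UNRECOGNISED_* constants is inferred from their names only.

```python
# Excerpt of the HARVARDKYOTO table from the Data listing (token -> code point).
HK_EXCERPT = {"'": 2365, "A": 2310, "D": 2337, "Dh": 2338, "H": 2307, "M": 2306}

# Values as listed under Data; the meaning given to each mode below is an
# assumption based on the constant names.
UNRECOGNISED_FAIL = 0
UNRECOGNISED_ECHO = 1
UNRECOGNISED_SUBSTITUTE = 2


def lookup_tokens(text, table, handle_unrecognised=UNRECOGNISED_FAIL,
                  substitute_char="?"):
    """Greedy longest-match lookup of `text` against `table` (hypothetical helper)."""
    longest = max(len(token) for token in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that 'Dh' wins over 'D'.
        for size in range(min(longest, len(text) - i), 0, -1):
            token = text[i:i + size]
            if token in table:
                out.append(chr(table[token]))
                i += size
                break
        else:
            # No table entry starts at position i.
            if handle_unrecognised == UNRECOGNISED_ECHO:
                out.append(text[i])            # copy the character through
            elif handle_unrecognised == UNRECOGNISED_SUBSTITUTE:
                out.append(substitute_char)    # replace it with a placeholder
            else:
                raise ValueError(
                    "unrecognised input at position %d: %r" % (i, text[i]))
            i += 1
    return "".join(out)


# 'Dh' is matched as one token (2338), then 'D' (2337).
print([hex(ord(c)) for c in lookup_tokens("DhD", HK_EXCERPT)])
```

Trying the longest token first is what keeps 'Dh' from being split into 'D' followed by an unrecognised 'h'; a shortest-first scan over the same table would never reach the two-character entries.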