Python: module transliterator

transliterator (version 0.1)

http://www.alanlittle.org/projects/transliterator/transliterator.py

Transliterate texts between unicode and standard transliteration schemes.

Transliterate texts between non-latin scripts and commonly-used latin
transliteration schemes. Uses standard Unicode character blocks -- e.g.
DEVANAGARI U+0900 ... U+097F -- and transliteration schemes -- e.g. the IAST
convention for transliteration of Sanskrit to latin-with-dots.

The following character blocks and transliteration schemes are included:

DEVANAGARI
    IAST
    ITRANS -- http://www.aczoom.com/itrans/#itransencoding (Sanskrit only)
    Harvard Kyoto
CYRILLIC
    ISO 9:1995 (Russian only)

New character blocks and transliteration schemes can be added by creating new
CharacterBlock and TransliterationScheme objects.

COMMAND LINE USAGE
----------------------------
python transliterator.py text inputFormat outputFormat

... writes the transliterated text to stdout

text -- the text to be transliterated OR the name of a file containing the text
inputFormat -- the name of the character block or transliteration scheme that
               the text is to be transliterated FROM, e.g. 'CYRILLIC', 'IAST'.
               Not case-sensitive
outputFormat -- the name of the character block or transliteration scheme that
                the text is to be transliterated TO, e.g. 'CYRILLIC', 'IAST'.
                Not case-sensitive

USAGE
--------
Transliterate a text:

    >>> import transliterator
    >>> transliterator.transliterate('yogazcittavRttinirodhaH', 'harvardkyoto',
    ...     'devanagari', {'outputASCIIEncoded' : True})
    'योगश्चित्तवृत्तिनिरोधः'

Create a new CharacterBlock and TransliterationScheme:

    >>> import transliterator
    >>> cb = transliterator.CharacterBlock('NEWBLOCK', range(0x901, 0x9FF))
    >>> scheme = transliterator.TransliterationScheme(cb.name, 'NEWSCHEME',
    ...     {'ab': 0x901, 'cd': 0x902})
    >>> transliterator.transliterate('abcd', scheme, cb, {'outputASCIIEncoded' : True})
    'ँं'

COPYRIGHT AND DISCLAIMER
------------------------------------
Transliterator is:

version 0.1 software - use at your own risk.

The IAST, ITRANS and Harvard-Kyoto transliteration schemes have been tested
for classical Sanskrit, not for any other language.

The Cyrillic alphabet and ISO 9:1995 transliteration (for Russian only) are
included but have been even more lightly tested than Devanagari.

Copyright (c) 2005 by Alan Little

By obtaining, using, and/or copying this software and/or its associated
documentation, you agree that you have read, understood, and will comply with
the following terms and conditions:

Permission to use, copy, modify, and distribute this software and its
associated documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appears in all copies, and that both
that copyright notice and this permission notice appear in supporting
documentation, and that the name of the author not be used in advertising or
publicity pertaining to distribution of the software without specific, written
prior permission.

THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR
BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Modules

getopt
sys
unicodedata

Classes



__builtin__.dict(__builtin__.object)
    CharacterBlock
        DevanagariCharacterBlock(CharacterBlock, _Devanagari)
    TransliterationScheme
        DevanagariTransliterationScheme(TransliterationScheme, _Devanagari)
__builtin__.object
    TLCharacter
        DevanagariCharacter

class CharacterBlock(__builtin__.dict)

    Dictionary-like representation of a set of unicode characters. For our purposes, a character block corresponds to an alphabet/script that we want to be able to transliterate to or from, e.g. Cyrillic, Devanagari. Keys are unicode characters. Values are TLCharacter instances.

Method resolution order:

CharacterBlock

__builtin__.dict

__builtin__.object

Methods defined here:

__init__(self, name, charRange, charClass=<class 'transliterator.TLCharacter'>)
Set up a character block corresponding to a range of code points.

Keyword arguments:
name -- a string containing the name of the character block
        (should normally use a standard Unicode character block name).
charRange -- a list of code points. Reserved code points are ignored.
charClass -- the class to be used to create the characters.
             Should be a subclass of TLCharacter.
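The behaviour described above, where reserved code points are ignored, can be illustrated with a small self-contained Python 3 sketch. It is not the module's actual implementation: the names `SketchCharacter` and `SketchBlock` are hypothetical, and the sketch assumes that "reserved" can be approximated by "has no name in `unicodedata`".

```python
import unicodedata


class SketchCharacter:
    """Hypothetical stand-in for TLCharacter: one named code point."""

    def __init__(self, code_point, block):
        self.unicodeHexValue = code_point
        self.unichr = chr(code_point)
        # unicodedata.name raises ValueError for code points without a name,
        # which includes unassigned (reserved) code points.
        self.name = unicodedata.name(self.unichr)
        self.block = block
        self.equivalents = {}


class SketchBlock(dict):
    """Hypothetical stand-in for CharacterBlock: character -> character object."""

    def __init__(self, name, char_range, char_class=SketchCharacter):
        super().__init__()
        self.name = name
        for code_point in char_range:
            try:
                self[chr(code_point)] = char_class(code_point, self)
            except ValueError:
                # Assumption: an unnamed code point is treated as reserved
                # and silently skipped.
                continue


block = SketchBlock('CYRILLIC', range(0x0400, 0x0500))
print(block['\u0410'].name)
```

Passing the character class as a parameter mirrors the documented `charClass` argument, so a subclass can substitute a more specialised character type without overriding the loop.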

class DevanagariCharacter(TLCharacter)

    Special processing for Devanagari characters.

Method resolution order:

DevanagariCharacter

TLCharacter

__builtin__.object

Methods defined here:

__init__(self, unicodeHexValue, block)
Create an object representing a Devanagari character.

Extends TLCharacter.__init__ to distinguish Devanagari standalone vowels,
dependent vowels and consonants.

Raises:
ValueError -- for characters in the Devanagari dependent vowel range.
              We want these as variants of the corresponding standalone
              vowels, not as separate characters.
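One way to picture the distinction between standalone vowels, dependent vowels and consonants is a classification by code point range. The sketch below is an assumption about how such a check could look, not the module's code: the `classify` function and the range constants are hypothetical, and the ranges are taken from the Unicode Devanagari block.

```python
# Code point ranges within the Unicode Devanagari block (U+0900..U+097F).
INDEPENDENT_VOWELS = range(0x0904, 0x0915)  # LETTER SHORT A .. LETTER AU
CONSONANTS = range(0x0915, 0x093A)          # LETTER KA .. LETTER HA
DEPENDENT_VOWELS = range(0x093E, 0x094D)    # VOWEL SIGN AA .. VOWEL SIGN AU


def classify(code_point):
    """Return a rough category for a Devanagari code point.

    Mirrors the documented behaviour by raising ValueError for dependent
    vowel signs, which are meant to be variants of standalone vowels.
    """
    if code_point in DEPENDENT_VOWELS:
        raise ValueError('dependent vowel sign: U+%04X' % code_point)
    if code_point in INDEPENDENT_VOWELS:
        return 'vowel'
    if code_point in CONSONANTS:
        return 'consonant'
    return 'other'  # signs, digits, punctuation


print(classify(0x0905))  # DEVANAGARI LETTER A
print(classify(0x0915))  # DEVANAGARI LETTER KA
```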

Methods inherited from TLCharacter:

addEquivalent(self, equivName, equivalent)
Add an equivalent for the character.

Arguments:
equivName -- the name of a TransliterationScheme
equivalent -- string/unicode equivalent in the named
              TransliterationScheme for this code point.

class DevanagariCharacterBlock(CharacterBlock, _Devanagari)

    Class representing the Devanagari Unicode character block.

Method resolution order:

DevanagariCharacterBlock

CharacterBlock

__builtin__.dict

_Devanagari

__builtin__.object

Methods defined here:

__init__(self, name, charRange)
Set up the Devanagari character block.

Extends CharacterBlock.__init__ by specifying that the characters created
should be instances of DevanagariCharacter.

class DevanagariTransliterationScheme(TransliterationScheme, _Devanagari)

    Class representing a Devanagari transliteration scheme.

Method resolution order:

DevanagariTransliterationScheme

TransliterationScheme

__builtin__.dict

_Devanagari

__builtin__.object

Methods defined here:

__init__(self, blockName, schemeName, data, swapTable=None)
Set up a Devanagari transliteration scheme.

Extends TransliterationScheme.__init__.

class TLCharacter(__builtin__.object)

    Class representing a Unicode character with its equivalents.

    Public attributes:
    unicodeHexValue -- the numeric value of the Unicode code point.
    unichr -- the character value of the Unicode code point.
    name -- the name of the Unicode code point.
    equivalents -- a dict containing the character's equivalents in
                   various transliteration schemes, in the format:
                   {'Scheme A': 'A', 'Scheme B': 'aah', }
                   where keys are TransliterationScheme names,
                   values are transliterated equivalents of the
                   character.

Methods defined here:

__init__(self, unicodeHexValue, block)
Set up a unicode character.

Arguments:
unicodeHexValue -- an integer that should correspond to a
                   Unicode code point.
block -- the CharacterBlock this character belongs to.

Raises:
ValueError -- if unicodeHexValue is not a valid code point.

addEquivalent(self, equivName, equivalent)
Add an equivalent for the character.

Arguments:
equivName -- the name of a TransliterationScheme
equivalent -- string/unicode equivalent in the named
              TransliterationScheme for this code point.
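The `equivalents` attribute and `addEquivalent` can be pictured with the minimal Python 3 sketch below. The class name `EquivalentsSketch` is hypothetical, the scheme names are the placeholder ones from the class description, and the real class does more (it also stores the name and the owning block).

```python
class EquivalentsSketch:
    """Hypothetical, stripped-down stand-in for TLCharacter."""

    def __init__(self, unicodeHexValue):
        self.unicodeHexValue = unicodeHexValue
        # chr raises ValueError if the integer is outside the Unicode range.
        self.unichr = chr(unicodeHexValue)
        self.equivalents = {}

    def addEquivalent(self, equivName, equivalent):
        # Keys are scheme names, values are the transliterated strings.
        self.equivalents[equivName] = equivalent


char = EquivalentsSketch(2310)  # 2310 == 0x0906, DEVANAGARI LETTER AA
char.addEquivalent('Scheme A', 'A')
char.addEquivalent('Scheme B', 'aah')
print(char.equivalents)
```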

class TransliterationScheme(__builtin__.dict)

    Dictionary-like representation of a transliteration scheme, e.g. the Harvard-Kyoto, IAST or ITRANS schemes for transliterating Devanagari to or from the latin alphabet. Keys are unicode strings representing the letter-equivalents used in the transliteration scheme. Values are TLCharacter instances.

Method resolution order:

TransliterationScheme

__builtin__.dict

__builtin__.object

Methods defined here:

__init__(self, blockName, schemeName, data, swapTable=None)
Set up a transliteration scheme.

Keyword arguments:
blockName -- a string containing the name of the character block this
             transliteration scheme is used for,
             e.g. 'CYRILLIC', 'DEVANAGARI'.
schemeName -- the name of the transliteration scheme.
              Must be unique.
data -- a dict containing the data for the transliteration scheme.
        Keys are transliterated Unicode characters or strings.
        Values are integers corresponding to Unicode code points.
        For examples, see the data for the built-in transliteration
        schemes.
swapTable -- a dict (default None) containing any non-standard
             letter combinations used in the transliteration scheme
             that we want to pre-process away before transliterating.
             See the ITRANS data for examples.

Raises:
KeyError -- unknown block name.
TypeError -- swapTable is not a dict.
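The role of `swapTable` and the two documented exceptions can be sketched as follows. This is a hedged Python 3 illustration only: `SchemeSketch`, `KNOWN_BLOCKS` and `preprocess` are hypothetical names, the example swap entry is invented for the demonstration, and the sketch keeps the integer code points as values rather than converting them to TLCharacter instances.

```python
KNOWN_BLOCKS = {'DEVANAGARI', 'CYRILLIC'}  # hypothetical registry of block names


class SchemeSketch(dict):
    """Hypothetical stand-in for TransliterationScheme."""

    def __init__(self, blockName, schemeName, data, swapTable=None):
        if blockName not in KNOWN_BLOCKS:
            raise KeyError(blockName)
        if swapTable is not None and not isinstance(swapTable, dict):
            raise TypeError('swapTable must be a dict')
        super().__init__(data)
        self.blockName = blockName
        self.name = schemeName
        self.swapTable = swapTable or {}

    def preprocess(self, text):
        """Replace non-standard letter combinations before transliterating."""
        # Longest combinations first, so a short entry cannot split a longer one.
        for old in sorted(self.swapTable, key=len, reverse=True):
            text = text.replace(old, self.swapTable[old])
        return text


# Invented data: 'aa' is treated as an alternative spelling of 'A'.
scheme = SchemeSketch('DEVANAGARI', 'DEMO',
                      {'a': 0x0905, 'A': 0x0906},
                      swapTable={'aa': 'A'})
print(scheme.preprocess('raama'))
```

Normalising variants up front keeps the main lookup table small, since each sound needs only one key.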

Functions


main(argv=None)
Call transliterator from a command line.

python transliterator.py text inputFormat outputFormat

... writes the transliterated text to stdout

text -- the text to be transliterated OR the name of a file containing the text
inputFormat -- the name of the character block or transliteration scheme that
               the text is to be transliterated FROM, e.g. 'CYRILLIC', 'IAST'.
               Not case-sensitive
outputFormat -- the name of the character block or transliteration scheme that
                the text is to be transliterated TO, e.g. 'CYRILLIC', 'IAST'.
                Not case-sensitive
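The "text OR the name of a file" convention for the first argument could be resolved along the lines of this Python 3 sketch. The helper name `load_text` is hypothetical and the module may decide differently; the sketch simply assumes that an argument naming an existing file is read from disk and anything else is used literally.

```python
import os


def load_text(arg, encoding='utf-8'):
    """Return the file's contents if arg names a file, else arg itself."""
    if os.path.isfile(arg):
        with open(arg, encoding=encoding) as handle:
            return handle.read()
    return arg


print(load_text('yogazcittavRttinirodhaH'))
```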

resetOptions()
Reset options to their default values.

transliterate(text, inFormat, outFormat, requestOptions={})
Transliterate a text.

Keyword arguments:
text -- a unicode string containing the text to be transliterated
inFormat -- the "from" CharacterBlock or TransliterationScheme, or its name
outFormat -- the target CharacterBlock or TransliterationScheme, or its name
requestOptions -- optional dict containing option settings that override the
                  defaults for this request.

Returns a unicode object containing the text transliterated into the target
character set.

Raises:
ValueError -- unrecognised input or output format.
KeyError -- a character in text is not a member of inFormat, or has no
            corresponding character defined in outFormat.
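Because scheme keys can be longer than one letter (the Harvard-Kyoto data contains both 'D' and 'Dh'), the input has to be split into scheme tokens before lookup. The Python 3 sketch below shows one plausible approach, greedy longest match; whether the module does exactly this is an assumption, `tokenize` is a hypothetical name, and the sketch stops at code points without handling Devanagari vowel signs or conjuncts.

```python
def tokenize(text, table):
    """Map text to a list of code points using greedy longest match."""
    longest = max(len(key) for key in table)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                result.append(table[chunk])
                i += size
                break
        else:
            # No key matches at this position.
            raise KeyError(text[i])
    return result


# Subset of the HARVARDKYOTO data listed in this module's Data section.
HK_SUBSET = {'D': 2337, 'Dh': 2338, 'A': 2310}
print(tokenize('DhA', HK_SUBSET))
```

Raising KeyError for an unmatched character follows the documented contract of transliterate for characters that are not members of inFormat.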

Data

HARVARDKYOTO = {"'": 2365, 'A': 2310, 'D': 2337, 'Dh': 2338, 'G': 2329, 'H': 2307, 'I': 2312, 'J': 2334, 'M': 2306, 'N': 2339, ...}
IAST = {"'": 2365, '.': 2404, '..': 2405, '0': 2406, '1': 2407, '2': 2408, '3': 2409, '4': 2410, '5': 2411, '6': 2412, ...}
ITRANS = {'.': 2404, '..': 2405, '.a': 2365, '.h': 2307, '.m': 2306, '.n': 2306, '0': 2406, '1': 2407, '2': 2408, '3': 2409, ...}
UNRECOGNISED_ECHO = 1
UNRECOGNISED_FAIL = 0
UNRECOGNISED_SUBSTITUTE = 2
__version__ = '0.1'
characterBlocks = {'CYRILLIC': {u'\u0400': <transliterator.TLCharacter object a... <transliterator.TLCharacter object at 0x2a98f0>}, 'DEVANAGARI': {u'\u0902': <transliterator.DevanagariCharacter ...iterator.DevanagariCharacter object at 0x233870>}}
options = {'handleUnrecognised': 0, 'inputEncoding': 'utf-8', 'outputASCIIEncoded': False, 'outputEncoding': 'utf-8', 'substituteChar': '?'}
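The three UNRECOGNISED_* constants and the `handleUnrecognised` and `substituteChar` options suggest three ways of dealing with an unknown character. The Python 3 sketch below spells out that reading. The semantics are inferred from the names only, and `handle_unrecognised` and `effective_options` are hypothetical helpers, not functions of the module.

```python
UNRECOGNISED_FAIL = 0
UNRECOGNISED_ECHO = 1
UNRECOGNISED_SUBSTITUTE = 2

# Defaults copied from the options dict listed above.
DEFAULT_OPTIONS = {'handleUnrecognised': 0, 'inputEncoding': 'utf-8',
                   'outputASCIIEncoded': False, 'outputEncoding': 'utf-8',
                   'substituteChar': '?'}


def effective_options(request_options=None):
    """Overlay per-request settings on a copy of the defaults."""
    merged = dict(DEFAULT_OPTIONS)
    merged.update(request_options or {})
    return merged


def handle_unrecognised(char, opts):
    """Assumed behaviour: echo the character, substitute it, or fail."""
    mode = opts['handleUnrecognised']
    if mode == UNRECOGNISED_ECHO:
        return char
    if mode == UNRECOGNISED_SUBSTITUTE:
        return opts['substituteChar']
    raise KeyError(char)


opts = effective_options({'handleUnrecognised': UNRECOGNISED_SUBSTITUTE})
print(handle_unrecognised('x', opts))
```

Copying the defaults before updating them keeps a per-request override from leaking into later calls.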
