Text Preprocessing Tools for Tamil Language

What is Text Preprocessing?

Preprocessing is a crucial step in Natural Language Processing (NLP): raw text is transformed into a form that an algorithm can digest. Several preprocessing techniques can be used to achieve this, and the main ones are discussed below.

There are several well-established text preprocessing tools, such as the Natural Language Toolkit (NLTK) and Stanford CoreNLP, but these mainly support widely spoken languages like English and Spanish.

Tools that support Dravidian languages are still scarce, mainly because these languages are low-resourced. In recent years, several Indian research institutes, such as Amrita, TDIL and AU-KBC, have started working on preprocessing tools and resources for the Tamil language. Sadly, at the time of writing this article, none of them make these resources publicly available; TDIL offers its resources only to researchers based in India.

So how can we perform Tamil preprocessing?

One way is to train and build your own preprocessing modules: a limited number of Tamil corpora and resources are available in the open domain that you could make use of, or you could even build your own corpus in your preferred domain, if time permits.

If you are working on a time-critical project, another way is to dig deeper for existing tools. A handful of Tamil preprocessing tools are openly available (open source), but they are not widely known, so some thorough searching is necessary. I have been working on Tamil language research over the past year, and these are some of my findings on Tamil preprocessing tools.

Tokenizing

Tokenization is the process of breaking a stream of text into meaningful elements called tokens. These tokens can be words, terms, symbols, etc. Tokenization generally happens at the word level, but sometimes it is tough to define what is meant by a 'word'. Standard tokenizers use simple heuristics such as the following (a minimal sketch is given after the list):

  • Punctuation marks and whitespace may or may not be returned as tokens.
  • Contiguous strings of alphabetic characters or digits are treated as a single token.
  • Tokens are separated by whitespace or punctuation characters.
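
As a rough illustration of these heuristics (this is only a sketch, not a substitute for a proper library), the toy tokenizer below splits on whitespace and then separates out a small, hand-picked set of punctuation marks:

import re

# A toy tokenizer illustrating the heuristics above: split on whitespace,
# then peel a few common punctuation marks off as separate tokens.
def heuristic_tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(t for t in re.split(r'([.,;:!?()])', chunk) if t)
    return tokens

print(heuristic_tokenize('புத்துணர்ச்சியான சுவாசம், பளபளப்பான பற்கள்!'))
# Expected: ['புத்துணர்ச்சியான', 'சுவாசம்', ',', 'பளபளப்பான', 'பற்கள்', '!']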

Tokenization for Tamil is supported by Indic NLP, a Python-based open-source library. Learn how to set up Indic NLP here.

A code snippet showing how to tokenize a Tamil string using Indic NLP, along with its output, is given below.

from indicnlp import common
from indicnlp import loader
from indicnlp.tokenize import indic_tokenize

# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES = r"resource_path"

# Export the path to the Indic NLP Resources directory programmatically
common.set_resources_path(INDIC_NLP_RESOURCES)

# Initialize the Indic NLP library
loader.load()


# Tokenization
indic_string = 'புத்துணர்ச்சியான சுவாசம் மற்றும் பளபளப்பான பற்கள் தங்களின் தோற்றத்தை நிர்ணயிக்கிறது'
print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string):
    print(t)


[Image: Tokenization output]

Part-of-Speech (POS) Tagging

POS tagging plays a vital role in understanding the meaning of a sentence: it helps to infer information about neighboring words and the syntactic structure surrounding a word. It is also crucial because the accuracy of many downstream NLP tools depends on the quality of the POS tagger.

[Image: Sample POS-tagged Tamil sentence]

Several well-established POS tagging tools exist for languages like English. However, for a low-resourced language like Tamil, only a limited amount of work has been carried out, and many approaches are yet to be tested. A highly inflectional language like Tamil also increases the complexity of the tagger.

POS tagging for Tamil is supported by RDRPOSTagger, a ripple-down rule-based POS tagger that comes with pre-trained tagging models. Note that it only supports Universal POS tags for Tamil. A convenient Python port of this tagger is available as RippleTagger.

A Python code snippet for POS tagging with RippleTagger and its corresponding output are given below.

from rippletagger.tagger import Tagger

query = 'மஞ்சள் வளர்க்க ஏற்ற மண் வகைகள்'

# POS tagging with the pre-trained Tamil model
tagger = Tagger(language="tam")
pos_tags = tagger.tag(query)
print('POS tags:', pos_tags)
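
The tag() call should return a list of (token, tag) pairs, with the tags drawn from the Universal POS tag set mentioned above, as the output below shows.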

[Image: POS tagging output]

Morphological Analysis

Stemming is a computational procedure in which words with the same root are reduced to a common form, generally by stripping each word of its derivational and inflectional suffixes. Most Information Retrieval (IR) systems use stemming to identify root words and improve retrieval performance.

Morphological analysis (MA) produces information about the morphosyntactic properties of a word and is a highly important component in machine translation.

[Image: Stemming vs. Morphological Analysis]

Stemming is a simpler process than MA, but stemming alone will not identify root words reliably when words are heavily inflected. MA has been shown to outperform stemming because it provides additional analysis that stemming does not. Stemming gives good accuracy for languages with few inflections, but MA performs better than algorithmic stemmers for languages with complex morphology. Since Tamil is a morphologically rich language, MA is generally more suitable than a stemmer, but this also depends on your research and what you are trying to achieve. A toy example contrasting the two is given below.
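
To make the contrast concrete, here is a toy suffix-stripping 'stemmer'. The suffix list is a tiny, hand-picked sample of common Tamil inflections chosen purely for illustration; it is not a real stemming algorithm.

# Toy suffix stripping, for illustration only. The suffix list is a small,
# hand-picked sample (plural and dative endings), not a validated set.
SUFFIXES = ['களுக்கு', 'கள்', 'க்கு']

def toy_stem(word):
    # Try the longest suffixes first
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

print(toy_stem('பற்கள்'))
# Prints 'பற்', not the true root 'பல்' ("tooth") - naive suffix stripping
# cannot undo the sound change ல் -> ற், which is exactly the kind of case
# where a morphological analyser helps.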

Morphological analysis for Tamil is provided by the Indic NLP library mentioned above, but this module performs quite poorly. An alternative is Polyglot, another Python-based package. In my opinion, Polyglot gives better accuracy and higher performance than Indic NLP.

A sample MA performed with Polyglot is shown below:

from polyglot.text import Text

# Morphological analysis of a single Tamil word
text = Text('புத்துணர்ச்சியான')
text.language = "ta"
print(text.morphemes)

[Image: Polyglot MA output]
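
Note that Polyglot's morpheme segmentation is based on unsupervised Morfessor models, so what you get back is a list of sub-word segments rather than a full grammatical analysis of each word.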

An example of MA performed using Indic NLP is given below:

from indicnlp import common
from indicnlp import loader
from indicnlp.morph import unsupervised_morph

# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES = r"resource_path_name"

# Export the path to the Indic NLP Resources directory programmatically
common.set_resources_path(INDIC_NLP_RESOURCES)

# Initialize the Indic NLP library
loader.load()

# Morphological analyser
analyzer = unsupervised_morph.UnsupervisedMorphAnalyzer('ta')
indic_string = 'புத்துணர்ச்சியான'

analyzed_tokens = analyzer.morph_analyze_document(indic_string.split(' '))
print(analyzed_tokens)

[Image: Indic NLP MA output]

Stop Word Removal

One of the major preprocessing tasks is filtering out unnecessary data. In NLP, these uninformative terms are referred to as stop words. Stop words are considered irrelevant for search and retrieval purposes since they occur frequently and add little to the meaning of a query. To save both time and space, stop words are dropped during indexing and ignored during searches.

I personally haven't come across any library or package that performs Tamil stop word removal. Instead, I found TamilNLP, a GitHub repository that provides Tamil NLP resources. It contains a list of 125 Tamil stop words, which can be stored in a file and used to filter them out.

The Tamil stop word list provided by TamilNLP can be downloaded from here and stored in a text file named tamil_stopwords.txt. The Python code snippet below reads this file and drops any token that appears in the stop word list.

# Remove Tamil stop words from a list of tokens
def remove_stopwords(tokens):
    # Read the stop word list (one word per line) into a set
    with open('../resources/tamil_stopwords.txt', encoding="utf8") as file:
        stopwords = set(file.read().split())

    # Keep only the tokens that do not appear in the stop word list
    return [word for word in tokens if word not in stopwords]
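
For example, the function can be applied to the token list produced by the tokenization step shown earlier (the exact words removed depend on the contents of tamil_stopwords.txt):

tokens = ['புத்துணர்ச்சியான', 'சுவாசம்', 'மற்றும்', 'பளபளப்பான', 'பற்கள்']
print(remove_stopwords(tokens))
# 'மற்றும்' ("and") is a typical stop word and should be filtered out,
# provided it appears in tamil_stopwords.txt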

Transliteration

Transliteration is defined as 'transcription of one alphabet to another, or replacement of letters/characters with those of another language having the same phonetic sound'. Unlike translation, which gives the meaning of a word in a different language, transliteration maps the characters of one script to those of another and only conveys how the word is pronounced.

Transliteration is an optional module; whether you need it depends on your system and what you are trying to achieve. A possible use case is handling phrases that cannot be translated, which are treated as Out-of-Vocabulary (OOV) terms. These OOV terms are mainly Named Entities (NEs) such as names and locations, which need to be transliterated.

Indic NLP and Polyglot are two libraries that support Tamil-English transliteration. I personally prefer the Indic NLP transliterations (they seem more accurate to me), but you could test both packages and decide what suits you. Note that Indic NLP provides two types of transliteration: one uses ITRANS and the other makes use of the BrahmiNet API.

Examples of both the transliterations provided by Indic NLP are given below.

import requests

from indicnlp import common
from indicnlp import loader
from urllib.parse import quote
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES = r"resource_path"

# Export the path to the Indic NLP Resources directory programmatically
common.set_resources_path(INDIC_NLP_RESOURCES)

# Initialize the Indic NLP library
loader.load()

# Transliteration
input_text = 'புத்துணர்ச்சியான சுவாசம் மற்றும் பளபளப்பான பற்கள் தங்களின் தோற்றத்தை நிர்ணயிக்கிறது'
lang = 'ta'
print('iTrans transliteration: ')
print(ItransTransliterator.to_itrans(input_text, lang))

# Transliteration with the BrahmiNet API
text = quote(input_text)
url = 'http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/transliterate_bulk/ta/en/{}/rule'.format(text)
response = requests.get(url)
print('Brahmi-net transliterations: ')
print(response.json())

[Image: Indic NLP transliteration output]

An example of Polyglot transliteration is shown below, for a shorter version of the Tamil string used above.

from polyglot.text import Text
from polyglot.transliteration import Transliterator

sentence = Text('புத்துணர்ச்சியான சுவாசம் மற்றும் பளபளப்பான பற்கள்')

# Transliterate the whole sentence into Latin script
print(sentence.transliterate('en'))

[Image: Polyglot transliteration output]
