Setting up Indic NLP

I have been on the lookout for a good text-preprocessing library for the Tamil language. There are hardly any resources for Indian languages. Several Indian universities have developed preprocessing modules, but even though these have been made publicly available, some of them require approval from the relevant universities, and the rest barely have any documentation to help you set them up.

This was when I stumbled upon Indic NLP, an open-source, Python-based library that has been actively developed over the last 6 years. Funny how it isn’t more popular already. Indic NLP provides common text processing modules for Indian languages. It supports 18 languages (Indo-Aryan, Dravidian, and English); the list of languages can be viewed in the table above.

How to set up Indic NLP on your Windows machine

Dependency installation

  1. Python (Python 3.x is recommended, as Python 2.x is not actively supported).
  2. The Morfessor 2.0 Python library is required.
    • Download the Morfessor zip/tarball file and unzip it.
    • To build and install the module and scripts to the default path, run the following from the unzipped folder:
      python setup.py install
  3. Indic NLP resources (optional; needed only if you want the morphological analyzer and the transliteration module). Obtain the resources from GitHub with the link provided.
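
Once the steps above are done, a quick way to see whether Python can actually find the dependencies is to look them up without importing them. This is only a minimal sketch: the helper name missing_modules is my own, and the import names morfessor, indicnlp and pandas are assumed to match what you installed (indicnlp will only be found after the configuration step that follows).

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Python 3.x is recommended for the library.
if sys.version_info < (3,):
    print("Warning: Python 2.x is not actively supported.")

# Assumed import names for Morfessor 2.0, Indic NLP and Pandas.
missing = missing_modules(["morfessor", "indicnlp", "pandas"])
if missing:
    print("Not found on the Python path: " + ", ".join(missing))
else:
    print("All dependencies were found.")
```

If indicnlp is reported as missing, the Python path is the first thing to check.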

Configuration Setup

  1. Add the project to the Python path:
    • Open “Edit environment variables for your account” in Windows
    • Add a ‘PYTHONPATH’ variable (if it doesn’t exist)
    • Set its value to the path of the Indic NLP library src folder
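
If you would rather not touch the environment variables, the same effect can be had for a single script by extending sys.path at runtime. The snippet below is a sketch of that alternative, not something the library requires; the helper add_to_python_path and the folder in INDIC_NLP_LIB_SRC are hypothetical placeholders.

```python
import sys
from pathlib import Path


def add_to_python_path(folder):
    """Put folder at the front of sys.path unless it is already listed."""
    resolved = str(Path(folder).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical location: replace it with the src folder of your own copy.
INDIC_NLP_LIB_SRC = r"C:\path\to\indic_nlp_library\src"
add_to_python_path(INDIC_NLP_LIB_SRC)
```

Unlike PYTHONPATH, this only lasts for the current Python process, so it has to run before the first indicnlp import in every script.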

Test it out

If the Pandas data analysis library is not already installed, the sample code will fail with an error.

To install Pandas from PyPI:

python -m pip install --upgrade pandas

Now let’s test whether the library is properly set up. Try out the sample code below. If everything goes well, you should see the tokenized Hindi string.

# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES = r"F:\University\Final_Year\FYP\Resources\indic_nlp_resources-master\indic_nlp_resources-master"

# Export the path to the Indic NLP Resources directory programmatically
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)

# Initialize the Indic NLP library
from indicnlp import loader
loader.load()

# Tokenize a Hindi string
from indicnlp.tokenize import indic_tokenize
indic_string = 'अनूप,अनूप?।फोन'
print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string):
    print(t)

The output would be:

[Screenshot of the output: Capture.PNG]
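
If the sample fails instead, the resources path is a likely culprit, since a GitHub zip download can unpack into a nested folder (note the repeated indic_nlp_resources-master in my path). Here is a rough sketch for inspecting the folder before handing it to common.set_resources_path; the helper describe_resources_path and the example path are my own placeholders, and it only checks that the folder exists and is not empty.

```python
import os


def describe_resources_path(path):
    """Return a short, human-readable summary of what is found at path."""
    if not os.path.isdir(path):
        return "not a directory: " + path
    entries = sorted(os.listdir(path))
    if not entries:
        return "directory is empty: " + path
    preview = ", ".join(entries[:5])
    return "found {} entries, e.g. {}".format(len(entries), preview)


# Hypothetical location: replace it with your own resources folder.
print(describe_resources_path(r"C:\path\to\indic_nlp_resources"))
```

If the summary lists only a single sub-folder with the same name, point the path one level deeper.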

 
