I have been on the lookout for a good text-preprocessing library for the Tamil language. There are hardly any resources for Indian languages. Several Indian universities have developed preprocessing modules and made them publicly available, but some of them require approval from the relevant university, and the rest have barely any documentation to help you set them up.
That was when I stumbled upon Indic NLP, an open-source, Python-based library that has been actively developed over the last 6 years. Funny how it isn’t more popular already. Indic NLP provides common text-processing modules for Indian languages. It supports 18 languages, covering Indo-Aryan and Dravidian languages as well as English; the full list can be viewed in the table above.
How to set up Indic NLP on your Windows machine
Dependency installation
- Python (Python 3.x is recommended, as Python 2.x is not actively supported).
- The Morfessor 2.0 Python library is required:
  - Download the Morfessor zip/tarball file and unzip it.
  - Build and install the module and scripts to the default path by running the following from the unzipped folder:
    python setup.py install
- Indic NLP resources (optional; needed only if you want the morphological analyzer and transliteration modules). Obtain the resources from GitHub using the link provided.
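Once the dependencies are installed, a quick check like the one below can confirm that Python can actually find them. This is only a minimal sketch: the helper name `missing_modules` is made up for illustration, and it assumes Morfessor is importable under the module name `morfessor`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: Morfessor 2.0 installs itself as the importable module "morfessor".
required = ["morfessor"]

missing = missing_modules(required)
if missing:
    print("Not installed yet: " + ", ".join(missing))
else:
    print("All required modules were found.")
```

If a module shows up as missing, it is worth checking that the install command was run with the same Python interpreter you use to run your scripts.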
Configuration Setup
- Add the project to the Python path:
  - Edit the system environment variables for your account.
  - Add ‘PYTHONPATH’ (if it doesn’t exist).
  - Set its value to the path of the Indic NLP library src folder.
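If you would rather not touch the system environment variables, a similar effect can be had for a single script by extending `sys.path` at the top of that script. The sketch below is one way to do it; the folder path and the helper name `add_to_python_path` are placeholders for illustration, not something the library provides.

```python
import os
import sys


def add_to_python_path(src_dir):
    """Append src_dir to sys.path for the current process, avoiding duplicates."""
    src_dir = os.path.abspath(src_dir)
    if src_dir not in sys.path:
        sys.path.append(src_dir)
    return src_dir


# Hypothetical location: replace with the folder you would otherwise
# put in PYTHONPATH.
INDIC_NLP_LIB_HOME = r"C:\path\to\indic_nlp_library\src"

add_to_python_path(INDIC_NLP_LIB_HOME)
```

Unlike the PYTHONPATH variable, this change lasts only for the running script, so it has to appear before the first `indicnlp` import in every script that needs it.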
Test it out
If you do not already have the Pandas data analysis library installed, the library will give you an error.
To install Pandas from PyPI:
python -m pip install --upgrade pandas
Now let’s test whether the library is properly set up. Try out the sample code below; if everything goes well, you should see the tokenized Hindi string.
# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES = r"F:\University\Final_Year\FYP\Resources\indic_nlp_resources-master\indic_nlp_resources-master"

# Export the path to the Indic NLP Resources directory programmatically
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)

# Initialize the Indic NLP library
from indicnlp import loader
loader.load()

# Tokenize a Hindi string
from indicnlp.tokenize import indic_tokenize

indic_string = 'अनूप,अनूप?।फोन'

print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string):
    print(t)
The output would be:
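If the sample fails instead of printing tokens, one thing worth ruling out is a wrong resources path. Note the doubled folder name in the path used in the sample: a downloaded archive often unpacks into a folder nested inside another folder of the same name, so it is easy to point one level too high. The sketch below is a plain sanity check using only the standard library; the helper name `describe_resources_path` and the example path are hypothetical.

```python
import os


def describe_resources_path(path):
    """Return 'missing', 'empty', or 'ok' for a candidate resources folder."""
    if not os.path.isdir(path):
        return "missing"
    if not os.listdir(path):
        return "empty"
    return "ok"


# Hypothetical location: replace with your own copy of the resources folder.
INDIC_NLP_RESOURCES = r"C:\path\to\indic_nlp_resources"

status = describe_resources_path(INDIC_NLP_RESOURCES)
print("Resources folder status: " + status)
```

A status of "missing" or "empty" means the path should be corrected before it is passed to `common.set_resources_path`. A status of "ok" only says the folder exists and has something in it, not that its contents are complete.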
Please guide me through downloading the NLP resources and setting up the path in detail, as I’m working on NLP for the first time. I found your blog very helpful for learning about Indic NLP and have downloaded the library, but I am facing a problem downloading the Indic NLP resources. I would be grateful if you could help me resolve this.
Hi Dakshayini, hope it’s not too late, can you tell me what’s the issue you are facing?