What languages are supported for nltk.word_tokenize and nltk.pos_tag

I need to perform named entity extraction on text in multiple languages: Spanish, Portuguese, Greek, Czech, and Chinese.

Is there a list somewhere of all the languages these two functions support? And is there a way to plug in other corpora so that these languages can be covered?


ANSWERS:


By default, both functions support only English text. This isn't really stated in the documentation, but you can see it by looking at the source code:

  1. The pos_tag() function loads a tagger from this file: 'taggers/maxent_treebank_pos_tagger/english.pickle'. (see here)

  2. The word_tokenize() function uses the Treebank tokenizer which uses regular expressions to tokenize text as in the (English) Penn Treebank Corpus. (see here)


