error using "nltk.word_tokenize()" function

I'm trying to tokenize Twitter text. When I apply nltk.word_tokenize() to each individual tweet, it works perfectly, even for some very ugly ones such as

'\xd8\xb3\xd8\xa3\xd9\x87\xd9\x8e\xd9\x85\xd9\x90\xd8\xb3\xd9\x8f',
'\xd9\x82\xd9\x90\xd8\xb5\xd9\x8e\xd9\x91\xd8\xa9\xd9\x8b', '\xd8\xad\xd8\xaa\xd9\x89'

but when I loop through all the tweets in a file

tokens = []
for row in l_of_l:  # l_of_l is a list of lists of strings
    s = ','.join(row)
    tokens += nltk.word_tokenize(s)

it raises errors such as:

File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

and many more

Any suggestions on how to fix this?


ANSWERS:


The problem isn't in the code you included; it's in the code containing the open() call. The script opens the file fine, but because it reads raw bytes rather than decoded text, you get that traceback as soon as your data is accessed. Open the file with an explicit encoding instead:

import codecs
...
with codecs.open('file.csv','r',encoding='utf8') as f:
    text = f.read()
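A minimal sketch of why this helps (the file name and sample data are hypothetical): reading through codecs.open with encoding='utf8' yields unicode strings, so later tokenization never falls back to the ascii codec that caused the UnicodeDecodeError.

```python
import codecs

# Write a small UTF-8 file to simulate the tweet data (hypothetical sample).
with codecs.open('tweets.csv', 'w', encoding='utf8') as f:
    f.write(u'\u0633\u0623\u0647\u064e\u0645,\u062d\u062a\u0649\n')

# Reading it back with an explicit encoding decodes the bytes up front,
# producing unicode text that nltk.word_tokenize() can handle safely.
with codecs.open('tweets.csv', 'r', encoding='utf8') as f:
    text = f.read()
```

With the file decoded this way, the original loop can run unchanged over unicode strings instead of raw bytes.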

