How can I get my Python to parse the following text?

I have a sample of the text:

"PROTECTING-ħarsien",

I'm trying to parse it with the following code:

import csv, json

with open('./dict.txt') as maltese:
    entries = maltese.readlines()
    for entry in entries:
        tokens = entry.replace('"', '').replace(",", "").replace("\r\n", "").split("-")
        if len(tokens) == 1:
            pass
        else:   
            print tokens[0] + "," + unicode(tokens[1])

But I'm getting this error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

What am I doing wrong?


ANSWERS:


It appears that dict.txt is UTF-8 encoded (ħ is 0xc4 0xa7 in UTF-8).

So you should open the file as UTF-8:

import codecs
with codecs.open('./dict.txt', encoding="utf-8") as maltese:
    # etc.

You will then have Unicode strings instead of bytestrings to work with; you therefore don't need to call unicode() on them, but you may have to re-encode them to the encoding of the terminal you're outputting to.
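As a rough sketch of how the whole loop could look with decoded input: the version below uses io.open (which behaves like codecs.open here and also exists on Python 3), and it writes a small temporary file to stand in for dict.txt. The helper name parse_entries and the temporary file are made up for illustration, and encoding the output as UTF-8 assumes a UTF-8 terminal.

```python
import io
import os
import tempfile


def parse_entries(path):
    """Return (english, maltese) pairs from lines like "PROTECTING-word",."""
    pairs = []
    # io.open decodes while reading, so each line is already a Unicode string.
    with io.open(path, encoding="utf-8") as maltese:
        for entry in maltese:
            tokens = entry.strip().replace('"', "").replace(",", "").split("-")
            if len(tokens) > 1:
                pairs.append((tokens[0], tokens[1]))
    return pairs


# Hypothetical stand-in for dict.txt, holding the sample line from the question.
sample_path = os.path.join(tempfile.mkdtemp(), "dict.txt")
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u'"PROTECTING-\u0127arsien",\n')

pairs = parse_entries(sample_path)

# Re-encode explicitly for output (assumption: the terminal expects UTF-8).
encoded_lines = [(en + u"," + mt).encode("utf-8") for en, mt in pairs]
```

No unicode() call is needed anywhere, because the strings never exist as undecoded bytes inside the loop.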


You have to change your last line to (this has been tested to work on your data):

print tokens[0] + "," + unicode(tokens[1], 'utf8')

Without the 'utf8' argument, unicode() decodes the byte string with Python 2's default ascii codec, which cannot handle the byte 0xc4, hence the error.
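To see the difference in isolation, here is a minimal sketch that decodes the same bytes both ways. It uses the bytes method .decode(), which in Python 2 does the same job as unicode(s, 'utf8') and also runs on Python 3; the variable names are only for illustration.

```python
# UTF-8 bytes for the Maltese word in the sample line (0xc4 0xa7 is the first letter).
raw = b"\xc4\xa7arsien"

# Decoding with the right codec gives a 7-character Unicode string.
word = raw.decode("utf-8")

# Decoding as ASCII fails on the very first byte, as in the traceback.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

Here ascii_error.start is the position of the offending byte, which corresponds to the "position 0" reported in the error message.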


