TypeError: must be unicode, not str in NLTK

I am using Python 2.7, NLTK 3.2.1 and python-crfsuite 0.8.4. I am following the documentation page for the nltk.tag.crf module.

To start with, I just ran this:

from nltk.tag import CRFTagger
ct = CRFTagger()
train_data = [[('dfd','dfd')]]
ct.train(train_data,"abc")

I also tried passing an open file object instead of a file name:

f = open("abc","wb")
ct.train(train_data,f)

but I am getting the following error:

  File "C:\Python27\lib\site-packages\nltk\tag\crf.py", line 129, in <genexpr>
    if all (unicodedata.category(x) in punc_cat for x in token):
TypeError: must be unicode, not str


ANSWERS:


In Python 2, regular quotes '...' or "..." create byte strings. To get Unicode strings, use a u prefix before the string, like u'dfd'.
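As a minimal sketch of that fix applied to the training data from the question: the snippet below only exercises unicodedata.category(), the call that fails in the traceback, and does not import NLTK. The expectation that the same data can then be passed to ct.train() is an assumption, not something verified here.

```python
import unicodedata

# The u prefix makes each literal a Unicode string on Python 2.
# On Python 3.3+ the prefix is accepted and changes nothing.
train_data = [[(u'dfd', u'dfd')]]

token, tag = train_data[0][0]

# This is the call from nltk/tag/crf.py that raised the TypeError.
# On Python 2 it rejects byte-string characters; with Unicode input it works.
categories = [unicodedata.category(ch) for ch in token]
print(categories)

# Assumed follow-up, not run here: ct.train(train_data, "abc")
```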

To read from a file, you'll want to specify an encoding. See Backporting Python 3 open(encoding="utf-8") to Python 2 for options; most straightforwardly, replace open() with io.open().
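A short sketch of the io.open() approach, assuming the text is stored as UTF-8. The file name tokens.txt and its contents are made up for illustration; the file is written first only so that the example is self-contained.

```python
import io
import os
import tempfile

# Hypothetical sample file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "tokens.txt")

# io.open() takes an encoding on Python 2 as well as Python 3.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9 dfd\n")

# Reading through io.open() yields Unicode strings, not byte strings.
with io.open(path, "r", encoding="utf-8") as f:
    words = f.read().split()

print(len(words))
```

Each item in words is then a Unicode string and can be paired with its tag to build training tuples.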

To convert an existing byte string, use the built-in unicode() function; though usually, you'll want to use the string's decode() method and supply an encoding, too.
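For illustration, decoding a byte string with an explicit encoding might look like this. The sample bytes are invented, and UTF-8 is assumed as the source encoding.

```python
# A byte string: plain str on Python 2, bytes on Python 3.
raw = b"caf\xc3\xa9"

# decode() with an explicit encoding turns the bytes into a Unicode string.
text = raw.decode("utf-8")

# The two-byte sequence \xc3\xa9 becomes one character.
print(len(raw), len(text))

# Python 2 only: unicode(raw, "utf-8") gives the same result, while
# unicode(raw) with no encoding falls back to ASCII and fails on these bytes.
```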

For (much) more detail, Ned Batchelder's "Pragmatic Unicode" slides are recommended, if not outright obligatory, reading.


