Python Polish character encoding issues

I'm having some issues with character encoding, in this particular case with Polish characters.

I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to handle Polish characters. How can I replace these characters?

The é, for example, is a windows-1252 character and must stay as it is. But the ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).

I tried this:

import unicodedata

text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))

This prints:

Racawicka Roge

But now the ó and é are also converted to o and e, even though both are valid windows-1252 characters.

How can I get this right?


ANSWERS:


If you want to target windows-1252, that is the codec you should pass to encode and decode:

>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
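Note that the 'ignore' handler silently drops anything windows-1252 cannot represent, which is why the ł disappears above. As a small sketch (the variable names are only for illustration), comparing it with the standard 'replace' handler makes the dropped characters visible:

```python
text = "Racławicka Rógé"

# 'ignore' drops characters that cp1252 cannot encode.
dropped = text.encode("cp1252", "ignore").decode("cp1252")

# 'replace' substitutes a question mark for each such character.
marked = text.encode("cp1252", "replace").decode("cp1252")

print(dropped)
print(marked)
```

Neither handler gives you an "equivalent" letter; for that you need some transliteration step.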

If you are not dealing with large texts, as in your example, you can use the Unidecode library together with the solution provided by jonrsharpe:

from unidecode import unidecode

text = 'Racławicka Rógé'
result = ''

for char in text:
    try:
        # Keep the character if windows-1252 can represent it.
        result += char.encode('1252').decode('1252')
    except UnicodeEncodeError:
        # Otherwise fall back to Unidecode's ASCII approximation.
        result += unidecode(char)

print(result)  # which will be 'Raclawicka Rógé'
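If you would rather not depend on Unidecode, roughly the same idea can be sketched with the standard library alone by registering a custom codec error handler. This is only a sketch under some assumptions: the handler name "cp1252_fallback", the helper to_cp1252, and the small FALLBACKS table are made up for illustration, and the table only covers ł/Ł, which have no Unicode decomposition. Other characters are decomposed with NFKD and whatever windows-1252 still cannot represent is dropped.

```python
import codecs
import unicodedata

# Assumed manual replacements for letters without a Unicode decomposition.
FALLBACKS = {"ł": "l", "Ł": "L"}


def _cp1252_fallback(error):
    """Error handler: approximate characters that cp1252 cannot encode."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        if ch in FALLBACKS:
            pieces.append(FALLBACKS[ch])
        else:
            # Split into base letter + combining marks, then drop
            # whatever cp1252 still cannot represent.
            decomposed = unicodedata.normalize("NFKD", ch)
            pieces.append(
                decomposed.encode("cp1252", "ignore").decode("cp1252")
            )
    # Return the replacement text and the position to resume encoding at.
    return "".join(pieces), error.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_cp1252(text):
    """Return text with non-cp1252 characters approximated or stripped."""
    return text.encode("cp1252", "cp1252_fallback").decode("cp1252")


print(to_cp1252("Racławicka Rógé"))
```

Because the handler only runs for characters that fail to encode, é and ó are left untouched, and the whole string is processed in a single encode call instead of a per-character loop.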

