I am scraping data from a website and I have run into a problem: I cannot create a file with the data in a Polish encoding. The output contains a lot of Unicode escape sequences, but I want the real characters instead. Could anyone help me? Thanks.
Here is part of the output I get:
le\u015bnych, hibiskusa lub brzoskwini 250 g cukru 5 g kwasku cytrynowego 2 \u0142y\u017cki soku z cytryny
Here is the code creating the file:
with codecs.open('recipes.txt', 'w', 'cp1250') as w:
    w.write(string)
On Python 3 it always gives the correct text:
leśnych, hibiskusa lub brzoskwini 250 g cukru 5 g kwasku cytrynowego 2 łyżki soku z cytryny
So it seems you are using Python 2, which has always had problems with Polish encoding (Polish is my native language).
Python 2 treats \u015b as a normal string, not as the Unicode character ś.
You have to encode and decode it again:
text = text.encode().decode('unicode_escape')
You should see the correct text even when you use print() (provided your system can work with CP1250 and has a font with Polish characters).
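One caveat with this round trip on Python 3: if the scraped text mixes literal escapes with characters that are already real (for example a real ł next to a literal \u017c), a plain .encode() produces UTF-8 bytes and unicode_escape then garbles the real characters. A rough sketch of a workaround follows, assuming Python 3; the helper name decode_literal_escapes is made up for illustration, and it will also interpret other backslash sequences such as a literal \n, so treat it as a starting point rather than a definitive fix.

```python
def decode_literal_escapes(text):
    """Turn literal \\uXXXX sequences into real characters (Python 3 sketch).

    'backslashreplace' first rewrites any real non-ASCII character as an
    escape, so the bytes are pure ASCII before 'unicode_escape' decodes
    every escape in one pass.
    """
    ascii_bytes = text.encode('ascii', 'backslashreplace')
    return ascii_bytes.decode('unicode_escape')


# Mixed input: a real 'ł' followed by a literal backslash-u escape for 'ż'.
mixed = 'ł' + r'y\u017cki'
fixed = decode_literal_escapes(mixed)
```

Here fixed should hold the word with both characters decoded, whereas mixed.encode().decode('unicode_escape') would mangle the ł that was already correct.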
Minimal working code
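import codecs

# Raw string keeps the escapes literal, as in the scraped data, on both Python 2 and 3
text = r'le\u015bnych, hibiskusa lub brzoskwini 250 g cukru 5 g kwasku cytrynowego 2 \u0142y\u017cki soku z cytryny'
# Turn the literal \uXXXX sequences into real characters
text = text.encode().decode('unicode_escape')
#print(text)
# Write the text encoded as CP1250
with codecs.open('recipes.txt', 'w', 'cp1250') as w:
    w.write(text)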
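To check that the file really contains CP1250 bytes rather than escape sequences, it can help to read it back. The sketch below assumes Python 3, where the built-in open() accepts an encoding argument, and writes to a temporary directory (an assumption made here so that no real file is touched) instead of recipes.txt.

```python
import os
import tempfile

text = 'leśnych, hibiskusa lub brzoskwini, 2 łyżki soku z cytryny'

# Temporary location so the sketch does not overwrite any real file.
path = os.path.join(tempfile.mkdtemp(), 'recipes.txt')

# Write the text encoded as CP1250.
with open(path, 'w', encoding='cp1250') as w:
    w.write(text)

# Read the raw bytes to inspect what actually landed on disk.
with open(path, 'rb') as r:
    raw = r.read()

# Read the file back as text using the same encoding.
with open(path, encoding='cp1250') as r:
    restored = r.read()

# True when no literal backslash-u escapes were written.
no_escapes_on_disk = b'\\u' not in raw
```

If restored equals text and no_escapes_on_disk is true, the file holds the real Polish characters in CP1250.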
The solution I found useful was to add .prettify('iso-8859-1').decode('utf-8', errors='replace') to every string you need to add. But first, please read @furas's answer and some of his comments.