
How to retrieve the original string from UTF-8 file?

I am doing some web scraping with Python and BeautifulSoup.

body = soup.find("article")
tempvar = body.find()

fuu = open('tempfile', 'w')
tempvar = tempvar.encode('utf-8')
fuu.write(str(tempvar))
fuu.close()

fupa = open('tempfile')
joji = BeautifulSoup(fupa,'html.parser')
fupa.close()

print(joji)

tempvar would contain HTML, sometimes with emojis. I want to use the contents of the file tempfile later in a real HTML file.

print(joji) produces something like this:

<b>mencapai\xc2\xa0batas aksara 140</b>, tapi sudah tentu itu tidak termasuk semua <i>tweet </i>yang tak pernah dihantar kerana pengguna tidak boleh nak luahkan apa yang mereka mahukan. Selepas <b>mengaktifkan aksara 280</b> pada <b>sejumlah kecil akaun </b>yang bertuah, <b>Twitter </b>mengatakan <b>hanya 1%</b> sahaja <b>pengguna yang capai had aksara 280</b>. Tulis panjang\xc2\xb2 nak buat karangan ka. \xf0\x9f\x98\x9c<br/>\n<br/>\nIa juga jarang berlaku bagi pengguna untuk mencapai aksara 280, hanya <b>2%</b> dari <i>tweet </i><b>melebihi aksara 190</b>. <b>Had aksara tweet sebanyak 280 </b>juga <b>mendapat lebih <i>likes </i>dan <i>retweets </i></b>daripada had aksara <i>tweet </i>sebanyak 140. \xf0\x9f\x98\x8a<br/>\n<br/>

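tempvar holds Unicode text, so there is no need to call .encode('utf-8') yourself. That call produces bytes, and str() on a bytes object gives its printable representation (b'...' with escapes such as \xc2\xa0) rather than the characters themselves. To write it correctly to a file: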

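# Text mode with an explicit encoding: the file object does the encoding.
with open('tempfile', 'w', encoding='utf8') as fuu:
    fuu.write(str(tempvar))  # str() is a no-op for a str and renders a bs4 Tag as markup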
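
To see where the escapes in the question's output come from, here is a minimal sketch. The sample string is made up for illustration and only stands in for the scraped markup; it is not taken from the original page.

```python
# Hypothetical sample: a superscript two and an emoji, both non-ASCII.
text = "panjang\u00b2 \U0001F61C"

# Encoding gives a bytes object holding the UTF-8 byte sequence.
encoded = text.encode("utf-8")

# str() on bytes does not decode them; it returns the printable
# representation, i.e. the literal b'...' with \x escapes.
# This is what the question's code ended up writing to the file.
as_written = str(encoded)

print(type(encoded).__name__)  # bytes
print(as_written)
```

Decoding the bytes with `encoded.decode("utf-8")` gives back the original text, which is effectively what a file opened with `encoding='utf8'` does on your behalf.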

Read it back in with:

with open('tempfile', encoding='utf8') as fupa:
    ...
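
Putting the two steps together, the following is a sketch of a full round trip, plus one possible way to rescue a file that was already written the old way. The sample markup and the helper names `round_trip` and `recover_from_bytes_repr` are assumptions made for this example, and the recovery step only works if the file still contains the complete, unmodified `b'...'` text.

```python
import ast
import os
import tempfile

# Hypothetical sample standing in for the scraped markup.
original = "<b>mencapai\u00a0batas aksara 140</b> \U0001F61C<br/>"


def round_trip(text, path):
    """Write text as UTF-8, then read it back as text."""
    with open(path, "w", encoding="utf8") as f:
        f.write(text)
    with open(path, encoding="utf8") as f:
        return f.read()


def recover_from_bytes_repr(damaged):
    """Turn the text "b'...'" back into the string it was made from."""
    raw = ast.literal_eval(damaged)  # safely evaluates the bytes literal
    return raw.decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tempfile")
    restored = round_trip(original, path)

# Simulate what the question's code wrote: str() of the encoded bytes.
damaged = str(original.encode("utf-8"))
recovered = recover_from_bytes_repr(damaged)

print(restored == original, recovered == original)  # True True
```

The string read back this way can be passed to BeautifulSoup or written into another HTML file opened with the same encoding.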

