How to escape “’” and ““” when using gzip.open() in Python
I am completely new to Python. When I use gzip.open() to process a .gz file, I get some codes in the text, for example "It’s one of those great ensemble casts that’s incredibly balanced". How should I handle this? The code I am using:
import gzip

def _review_reader(file_path):
    gz = gzip.open(file_path)
    for l in gz:
        yield eval(l)
The file was compressed from a JSON file and looks like this:
{"reviewerID": "A11N155CW1UV02", "asin": "B000H00VBQ", "reviewerName": "AdrianaM", "helpful": [0, 0], "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn\'t appeal to me at all.", "overall": 2.0, "summary": "A little bit boring for me", "unixReviewTime": 1399075200, "reviewTime": "05 3, 2014"}\n
{"reviewerID": "A3BC8O2KCL29V2", "asin": "B000H00VBQ", "reviewerName": "Carol T", "helpful": [0, 0], "reviewText": "I highly recommend this series. It is a must for anyone who is yearning to watch \\"grown up\\" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.", "overall": 5.0, "summary": "Excellent Grown Up TV", "unixReviewTime": 1346630400, "reviewTime": "09 3, 2012"}\n
....
I want to get the review text, but it contains some codes like ’.
Since you are looking at JSON data, use Python's JSON parser to load it. It automatically handles any embedded escape sequences, such as \n or \".
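As a small illustration of the point above, the sketch below parses a single line of JSON text with json.loads() and shows that the escape sequences come out as real characters. The sample line is made up for this example and is not taken from the actual file.

```python
import json

# One line of JSON text as it might appear in the file (hypothetical sample data).
# The raw string keeps the backslashes exactly as they would be stored on disk.
line = r'{"reviewText": "a must for \"grown up\" television.\nSecond line."}'

review = json.loads(line)
text = review["reviewText"]

# The parser has turned \" into a plain quote and \n into a real newline.
print(text)
```

Using eval() on such a line may appear to work, but it executes the text as Python code, so a JSON parser is the safer choice for data you did not write yourself.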
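When reading from a gzip file, it is important to realize that gzip gives you raw bytes. These bytes must be explicitly converted to text by calling .decode(), and to do that correctly you need to know which text encoding the JSON uses. UTF-8 is a very safe default assumption, but it could be something else, depending on what was chosen when the .gz file was written.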
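The difference between bytes and text can be seen in a minimal sketch like the one below. It builds a tiny gzip payload in memory (made-up sample data, assumed to be UTF-8) and reads it back twice: once in binary mode with an explicit .decode(), and once in text mode, where gzip.open() does the decoding itself.

```python
import gzip
import io

# Build a small gzip payload in memory (hypothetical sample data, UTF-8 encoded).
raw = gzip.compress('{"summary": "It\u2019s great"}\n'.encode("utf-8"))

# Binary mode yields bytes, which must be decoded explicitly.
with gzip.open(io.BytesIO(raw), "rb") as f:
    first_line = f.readline()
decoded = first_line.decode("utf-8")

# Text mode lets gzip.open() decode with the given encoding.
with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as f:
    text_line = f.readline()

print(type(first_line).__name__, type(text_line).__name__)
```

Both approaches should give the same string; text mode simply saves the separate .decode() call.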
After parsing the JSON, you can access the properties by name:
import json
import gzip

def _review_reader(file_path, encoding="utf8"):
    with gzip.open(file_path, "rb") as f:
        # The sample shows one JSON object per line, so decode and parse line by line.
        for line in f:
            yield json.loads(line.decode(encoding))

# file_path is the path to your .gz file
for review in _review_reader(file_path):
    print(review['reviewText'])
If reviewText happens to contain HTML code rather than plain text, another step may be needed: HTML parsing. The lxml module can help:
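from lxml import etree
# ...
for review in _review_reader(file_path):
    text = review['reviewText']
    # Use the lenient HTML parser; the strict XML parser rejects typical HTML.
    tree = etree.fromstring("<html>" + text + "</html>", etree.HTMLParser())
    # itertext() also collects text inside nested elements, not only the leading text.
    print("".join(tree.itertext()))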
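If the review text contains only character references and no real markup, the standard library may be enough. The sketch below uses html.unescape(), which is not part of the answer above; it assumes the codes in question are numeric HTML character references such as &#8217;, and the sample string is invented for illustration.

```python
import html

# Hypothetical review text containing numeric HTML character references.
text = "It&#8217;s one of those &#8220;great&#8221; ensemble casts"

# html.unescape() replaces character references with the characters they stand for.
clean = html.unescape(text)
print(clean)
```

This only decodes references; it does not strip tags, so text with actual HTML elements would still need a parser such as lxml.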