Convert html to json in Python
I am trying to convert some HTML files to JSON. From the beginning: I downloaded a fairly old dataset called SarcasmAmazonReviewsCorpus. It has several txt files, each containing comments, reactions, the name of the product and so on.
I was able to read each txt file and, using the os module, build a list with the content of every file. The code was:
import os

# path is the directory that holds the corpus txt files
files_content = []
for filename in filter(lambda p: p.endswith("txt"), os.listdir(path)):
    filepath = os.path.join(path, filename)
    with open(filepath, mode='r') as f:
        files_content += [f.read()]
Then, I am trying to use BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(files_content[2], 'html5lib')
soup
The output is like:
Is there a way to convert all the items in the files_content list into a JSON file? Thanks for the help!
You might have to change the soup.find calls in the dictionary to get the data you want.
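import json

# soup.find returns a Tag (or None), which json cannot serialize,
# so pull the text out of each tag first.
def text_of(tag):
    return tag.get_text(strip=True) if tag is not None else None

dictionary = {
    "title": text_of(soup.find("title")),
    "date": text_of(soup.find("date")),
}
json_object = json.dumps(dictionary, indent=4)
with open("saveFile.json", "w") as outfile:
    outfile.write(json_object)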
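The snippet above handles a single soup object, while the question asks about every item in files_content. Here is a minimal sketch of one way to loop over the whole list and write a single JSON file. To keep it runnable without extra installs it swaps BeautifulSoup for the standard library's html.parser, which is a substitution and not what the answer uses. The class TagTextCollector, the function content_to_dict, the output name reviews.json, and the sample strings (including their tag names) are all assumptions made for illustration; the real corpus files may be laid out differently.

```python
import json
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside each tag into a flat dict."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the tag we are currently inside

    def handle_starttag(self, tag, attrs):
        self._current = tag  # html.parser passes tag names in lowercase

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            # If a tag yields several text chunks, join them with a space
            previous = self.fields.get(self._current)
            self.fields[self._current] = f"{previous} {text}" if previous else text


def content_to_dict(content):
    """Parse one file's content into a {tag_name: text} dict."""
    parser = TagTextCollector()
    parser.feed(content)
    parser.close()
    return parser.fields


# Hypothetical stand-in for the files_content list from the question
files_content = [
    "<STARS>5.0</STARS>\n<PRODUCT>Widget</PRODUCT>\n<REVIEW>Loved it.</REVIEW>",
    "<STARS>1.0</STARS>\n<PRODUCT>Gadget</PRODUCT>\n<REVIEW>Broke in a day.</REVIEW>",
]

records = [content_to_dict(content) for content in files_content]

# One JSON file containing a list with one object per input file
with open("reviews.json", "w", encoding="utf-8") as outfile:
    json.dump(records, outfile, indent=4, ensure_ascii=False)
```

This flat approach only suits files where each tag appears once and is not nested; for anything richer, sticking with BeautifulSoup and building the dictionary per file is probably the safer route.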
As it looks like you are parsing some very simple HTML data, I think you could simply use the xmltodict package for this:
import json
import xmltodict

# txt_files is the list of paths to the corpus txt files
data = []
for txt in txt_files:
    with open(txt, "r") as file:
        data += [xmltodict.parse(file.read())]
json_str = json.dumps(data)
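The snippet above stops at a JSON string, whereas the question asks for a JSON file. A small sketch of the remaining step, using json.dump to write straight to disk. The data list below is a hypothetical placeholder for whatever the loop above produced (the real keys depend on the tags in the corpus files), and the file name reviews_xmltodict.json is likewise made up.

```python
import json

# Hypothetical stand-in for the `data` list built by the loop above;
# the actual keys and nesting depend on the tags in each txt file.
data = [
    {"review": {"title": "Great", "stars": "5.0"}},
    {"review": {"title": "Awful", "stars": "1.0"}},
]

# Write the whole list as one JSON array, indented for readability
with open("reviews_xmltodict.json", "w", encoding="utf-8") as outfile:
    json.dump(data, outfile, indent=4, ensure_ascii=False)

# Read it back to confirm the file round-trips
with open("reviews_xmltodict.json", encoding="utf-8") as infile:
    loaded = json.load(infile)
```

Passing ensure_ascii=False keeps any non-ASCII characters in the reviews readable in the output file instead of escaping them.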