Convert HTML to JSON in Python

Question

I am trying to convert some HTML files to JSON. To start from the beginning: I downloaded a fairly old dataset called SarcasmAmazonReviewsCorpus. It contains several txt files, each holding comments, reactions, the product name and so on, as shown in the image:

I was able to read each txt file and, using the os module, build a list with every file's content. The code was:

import os

files_content = []

# path is the directory that holds the corpus txt files
for filename in filter(lambda p: p.endswith("txt"), os.listdir(path)):
    filepath = os.path.join(path, filename)
    with open(filepath, mode='r') as f:
        files_content.append(f.read())

Then I tried to use BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(files_content[2], 'html5lib')
soup

The output looks like this:

Is there a way to convert all the items in the files_content list into a JSON file? Thanks for the help!

Answer 1

You might have to change the soup.find calls in the dictionary to get the data you want.

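import json

# soup.find returns a Tag (or None), which json.dumps cannot serialize,
# so extract the text of each tag first.
def tag_text(name):
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag else None

dictionary = {
    "title": tag_text("title"),
    "date": tag_text("date")
}

json_object = json.dumps(dictionary, indent=4)

with open("saveFile.json", "w") as outfile:
    outfile.write(json_object)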
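
The snippet above handles one soup, while the question asks about every item in files_content. A minimal sketch of the loop follows. To keep it runnable without extra installs it swaps BeautifulSoup for the standard library's html.parser, so it is a stand-in rather than the answer's method. The class TagTextCollector, the function review_to_dict, the output name reviews.json and the sample strings are all made up for illustration; the real files may use different tags.

```python
import json
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside each tag, keyed by lowercase tag name."""

    def __init__(self):
        super().__init__()
        self.fields = {}      # tag name -> accumulated text
        self._open_tags = []  # stack of tags we are currently inside

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)
        self.fields.setdefault(tag, "")

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Pop back to the matching tag so unclosed inner tags do not linger.
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Text outside any tag (such as newlines between tags) is ignored.
        if self._open_tags:
            self.fields[self._open_tags[-1]] += data


def review_to_dict(text):
    """Turn one file's content into a plain dict of tag name -> text."""
    parser = TagTextCollector()
    parser.feed(text)
    parser.close()
    return {tag: value.strip() for tag, value in parser.fields.items()}


# Hypothetical stand-ins for the strings held in files_content.
sample_files = [
    "<STARS>5.0</STARS>\n<TITLE>Great stapler</TITLE>\n<DATE>May 1, 2010</DATE>",
    "<STARS>1.0</STARS>\n<TITLE>Just what I needed</TITLE>\n<DATE>June 3, 2011</DATE>",
]

records = [review_to_dict(text) for text in sample_files]

# One JSON file holding a list with one object per input file.
with open("reviews.json", "w", encoding="utf-8") as outfile:
    json.dump(records, outfile, indent=4)
```

Replacing sample_files with files_content should give one JSON object per txt file. Note that a tag repeated within one file would have its text concatenated under a single key in this sketch.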

Answer 2

Since it looks like you are parsing some very simple HTML data, I think you could simply use the xmltodict package for this:

import json
import xmltodict

data = []
# txt_files is a list of paths to the corpus txt files
for txt in txt_files:
    with open(txt, "r") as file:
        data.append(xmltodict.parse(file.read()))

json_str = json.dumps(data)
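
One caveat: an XML parser expects a single root element, so this fails if a file holds several top-level tags side by side. Whether the corpus files look like that is an assumption here, since the sample output is not shown. A minimal sketch of a workaround, using the standard library's xml.etree.ElementTree in place of xmltodict and wrapping each file in a made-up root element (tags_to_dict and the sample string are hypothetical):

```python
import json
import xml.etree.ElementTree as ET


def tags_to_dict(text):
    """Parse sibling tags into a dict of tag name -> text."""
    # Wrap the content in a made-up <root> element so the parser sees one root.
    root = ET.fromstring(f"<root>{text}</root>")
    return {child.tag: (child.text or "").strip() for child in root}


# Hypothetical stand-in for the content of one txt file.
sample = "<STARS>5.0</STARS>\n<TITLE>Great stapler</TITLE>"

data = [tags_to_dict(sample)]
json_str = json.dumps(data)
```

The same wrapping idea should carry over to xmltodict.parse. Either way, strict XML parsing raises an error on unescaped characters such as & in the review text, so a more forgiving HTML parser may be the safer choice for raw reviews.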
