
Convert HTML to JSON in Python

I am trying to convert some HTML files to JSON. To start from the beginning: I downloaded a fairly old dataset called SarcasmAmazonReviewsCorpus. It contains several txt files, each with comments, reactions, the product name and so on, as shown in this image:

[image]

I was able to read each txt file and, using the os module, build a list with the content of every file. The code was:

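import os

path = "."  # assumption: directory that holds the corpus txt files
files_content = []

# Read every .txt file in the directory into one list of strings
for filename in filter(lambda p: p.endswith("txt"), os.listdir(path)):
    filepath = os.path.join(path, filename)
    with open(filepath, mode='r') as f:
        files_content.append(f.read())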

Then I tried to use BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(files_content[2], 'html5lib')
soup

The output looks like this:

[image]

Is there a way to convert all the items in the files_content list into a JSON file? Thanks for the help!

You might have to change the tags looked up with soup.find in the dictionary to get the data you want.

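import json

def tag_text(name):
    # soup.find returns a Tag (or None), which json cannot serialize, so keep only its text
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else None

dictionary = {
    "title": tag_text("title"),
    "date": tag_text("date")
}

json_object = json.dumps(dictionary, indent=4)

with open("saveFile.json", "w") as outfile:
    outfile.write(json_object)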
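
The question asks about every item in files_content, not just one. A minimal sketch of that loop follows. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; the class TagTextCollector, the helper review_to_dict, the sample strings and the output name reviews.json are all hypothetical, and the real files may need different handling.

```python
import json
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside each tag, keyed by tag name.

    HTMLParser reports tag names in lowercase.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}      # tag name -> accumulated text
        self._open_tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if self._open_tags and text:
            tag = self._open_tags[-1]
            self.fields[tag] = (self.fields.get(tag, "") + " " + text).strip()


def review_to_dict(markup):
    """Parse one file's content into a {tag: text} dictionary."""
    parser = TagTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.fields


# Hypothetical stand-in for the files_content list built in the question.
files_content = [
    "<TITLE>Great product</TITLE>\n<DATE>May 1, 2010</DATE>",
    "<TITLE>Not so great</TITLE>\n<DATE>June 2, 2011</DATE>",
]

# One dictionary per file, written out as a single JSON array.
records = [review_to_dict(text) for text in files_content]

with open("reviews.json", "w", encoding="utf-8") as outfile:
    json.dump(records, outfile, indent=4)
```

Writing one JSON array keeps all reviews in a single file; if one file per review is preferred, the json.dump call could move inside a loop instead.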

Since it looks like you are parsing some very simple HTML data, I think you could simply use the xmltodict package for this:

import json
import xmltodict

data = []
for txt in txt_files:  # txt_files: list of paths to the txt files
    with open(txt, "r") as file:
        data.append(xmltodict.parse(file.read()))

json_str = json.dumps(data)
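
One caveat: XML parsers such as the one behind xmltodict.parse expect a single root element. If a file turns out to hold several top-level tags side by side (an assumption, since the file layout is only visible in the image), wrapping the text in a dummy root is one possible workaround. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree rather than xmltodict; fragment_to_dict and the sample string are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def fragment_to_dict(text):
    """Wrap a multi-root fragment in a dummy <root> and map child tags to text."""
    root = ET.fromstring(f"<root>{text}</root>")
    return {child.tag: (child.text or "").strip() for child in root}


# Hypothetical file content with two top-level elements and no common root.
sample = "<title>Works fine</title>\n<date>May 1, 2010</date>"

data = [fragment_to_dict(sample)]
json_str = json.dumps(data)
print(json_str)
```

This only works when the content is well-formed XML: an unescaped & or < inside the review text would raise a ParseError, in which case a lenient HTML parser is the safer choice.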
