Parsing XML files into a CSV file with BeautifulSoup

Question

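I am trying to parse multiple (eventually more than 1000) XML files to extract three pieces of information: persName, @ref and /date. I managed to read all the files, and when I use print() I get all the information I want. However, when I try to write that information to a CSV file, only the last XML file ends up being parsed.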

from bs4 import BeautifulSoup
import csv
import os
path = r'C:\programming1\my-app'

for filename in os.listdir(path):
    if filename.endswith(".xml"):
        fullpath = os.path.join(path, filename)

        f = csv.writer(open("test2.csv", "w"))
        f.writerow(["date", "Name", "pref"])

        soup = BeautifulSoup (open(fullpath, encoding="utf-8"), "lxml")
        # removing unnecessary information to better isolate //date
        for docs in soup.find_all('tei'):
            for pubstmt in soup.find_all("publicationStmt"): 
                pubstmt.decompose()
            for sourdesc in soup.find_all("sourceDesc"):
                sourdesc.decompose()
            for lists in soup.find_all("list"):
                lists.decompose()
            for heads in soup.find_all("head"):
                lists.decompose()
            #finding all dates of Protokolls under /title
            for dates in soup.find_all("date"):
                date = dates.get('when')

            #getting all Names from xml files except for those in /list
            for Names in soup.find_all("persname"):
                nameonly = Names.contents
                nameref = Names.get("ref")
                f.writerow([date, nameonly, nameref])

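If I put writerow inside the Names loop, it only writes the information from the last file; if I put writerow after the Names loop, it only writes the information for a single name.

Can someone tell me what I am doing wrong? I have tried many for loops, but none of them seem to work.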

Answer 1

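You wrote:

However, when I try to write that information to a CSV file, only the last XML file ends up being parsed.

Reading your code, what is actually happening is:

Every XML file is parsed, but only the last one is written to the CSV.

That is because you open test2.csv for writing once for every input XML file. Opening a file in "w" mode creates it or, as in your case, recreates it on every iteration and overwrites its previous contents.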
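The truncation can be reproduced without any XML at all. The following is a minimal sketch, assuming a throwaway temporary file; the helper names write_rows_reopening and read_rows are made up for this illustration and are not part of the original code.

```python
import csv
import os
import tempfile


def write_rows_reopening(csv_path, rows):
    """Reopen the CSV in "w" mode for every row, like the loop in the question."""
    for row in rows:
        # "w" truncates the file on each open, so earlier rows are discarded.
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(row)


def read_rows(csv_path):
    """Return the CSV contents as a list of rows."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Hypothetical demo file in a fresh temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows_reopening(demo_path, [["a", "1"], ["b", "2"], ["c", "3"]])
remaining_rows = read_rows(demo_path)
print(remaining_rows)  # only the row written last is still in the file
```

Three rows are written, but each open in "w" mode starts from an empty file, which matches the symptom of seeing only the last XML file in test2.csv.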

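Because you want a header:

Open the CSV before you start iterating over the XML files
Write your header
Loop over your XML files and write to the CSV
At the very bottom, after the loop has finished, close the CSV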
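The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a drop-in replacement: the function name export_names and its parser parameter are invented here, the decompose() cleanup from the question is omitted, the date is simply taken from the first date element in each file, and get_text() is used in place of .contents so that the name is written as plain text.

```python
import csv
import os

from bs4 import BeautifulSoup


def export_names(xml_dir, csv_path, parser="lxml"):
    """Write one CSV row per persName found in the .xml files of xml_dir."""
    # Open the CSV once, before the loop, so "w" truncates it only once.
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["date", "Name", "pref"])  # header, written once

        for filename in sorted(os.listdir(xml_dir)):
            if not filename.endswith(".xml"):
                continue
            fullpath = os.path.join(xml_dir, filename)
            with open(fullpath, encoding="utf-8") as xml_file:
                soup = BeautifulSoup(xml_file, parser)

            # Assumption: the first <date when="..."> is the one of interest.
            date_tag = soup.find("date")
            date = date_tag.get("when") if date_tag else None

            # HTML-style parsers lowercase tag names, hence "persname".
            for name in soup.find_all("persname"):
                writer.writerow([date, name.get_text(strip=True), name.get("ref")])
    # Leaving the with block closes the CSV after all files are processed.
```

With the paths from the question, a call might look like export_names(r'C:\programming1\my-app', "test2.csv"). Using a with block means the file is closed automatically once the loop is done, so no explicit close() is needed.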

Solution 1
0 2021-12-26 22:28:02
