XML To CSV Python 3.5.2

I am trying to convert the following XML to CSV. The catch is that any given entry might be missing a value, in which case it returns a NoneType. For example, in the XML shown below not every "entry" will have a "rule".

If this happens I would like the CSV file to contain either nothing or a generic value such as "EMPTY". I would like the CSV file to look something like this:

domain    serial         seqno   rule
1         43434343434    1       21
1         43434343434    1       21
1         43434343434    1       EMPTY

By using the list comprehension shown below I was able to avoid the NoneType error. However, formatting the data in the CSV is where I need some assistance.

rows = [cleanhtml(str(entry))
        for entry in soup.find_all("entry")
        if entry.find(header_list[counter]) is not None]

#!/usr/bin/env python3

import csv
import re
from bs4 import BeautifulSoup

html_results='''<response status="success"><result>
      <job>
        <tenq>09:48:24</tenq>
        <tdeq>09:48:24</tdeq>
        <tlast>18:00:00</tlast>
        <status>FIN</status>
        <id>5955</id>
        <cached-logs>1118</cached-logs>
      </job>
      <log>
        <logs count="100" progress="100">
          <entry logid="4343">
            <domain>1</domain>
            <serial>43434343434</serial>
            <seqno>0</seqno>
            <actionflags>0x0</actionflags>
            <type>EXAMPLE</type>
            <subtype>EXAMPLE</subtype>
            <config_ver>0</config_ver>
            <src>1.1.1.1</src>
            <dst>1.1.1.1</dst>
            <rule>Rule 21</rule>
          </entry>
      <log>
          <entry logid="4343">
            <domain>1</domain>
            <serial>43434343434</serial>
            <seqno>0</seqno>
            <actionflags>0x0</actionflags>
            <type>EXAMPLE</type>
            <subtype>EXAMPLE</subtype>
            <config_ver>0</config_ver>
            <src>1.1.1.1</src>
            <dst>1.1.1.1</dst>
            <rule>Rule 21</rule>
          </entry>'''

def cleanhtml(raw_html):
  tags = re.compile('<.*?>')
  cleantext = re.sub(tags, '', raw_html)
  return cleantext

soup = BeautifulSoup(html_results, 'html.parser')

header_list = ['domain',"serial","seqno","actionflags","type","subtype","config_ver","src","dst","rule"]

query_results = open("query_results.csv","w")

csvwriter = csv.writer(query_results)

csvwriter.writerow(header_list)

num_of_logs = soup.find("logs").get("count")

counter = 0

rows = [cleanhtml(str(entry)) for entry in soup.find_all("entry") if entry.find(header_list[counter]) is not None]

csvwriter.writerows(rows)

query_results.close()

You are not handling your entry subelements; you are merely turning each entry to text and removing the XML tag markup. You need to produce a list or dictionary with each subelement teased out separately.

If you produce a dictionary of the nested elements, then the csv.DictWriter() class can handle filling in empty columns for you, without additional coding:

def entry_to_dict(entry):
    return {tag.name: tag.get_text() for tag in entry.find_all()}

header_list = ['domain', 'serial', 'seqno', 'actionflags', 'type', 'subtype', 'config_ver', 'src', 'dst', 'rule']

soup = BeautifulSoup(html_results, 'html.parser')
with open("query_results.csv","w") as query_results:
    csvwriter = csv.DictWriter(query_results, header_list, restval='EMPTY')
    csvwriter.writeheader()
    csvwriter.writerows(entry_to_dict(entry) for entry in soup.find_all('entry'))
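
One small side note (not part of the original code): the csv module documentation recommends opening the output file with newline='' — e.g. open("query_results.csv", "w", newline='') — so the writer controls the line endings itself; otherwise you can end up with blank lines between rows on some platforms.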

Here, the restval argument tells the writer how to handle missing values in each row. header_list is passed in as the field names, so the writer knows what keys to expect in each row dictionary.

entry_to_dict() simply turns each nested element in an entry into a key-value pair in a dictionary, with the tag.get_text() function doing the work of turning element contents into text.
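
As a quick sketch of what that dictionary looks like for the first entry in your sample data (the variable name here is just illustrative, and key order may differ on Python 3.5, where dicts are unordered):

first_entry = soup.find('entry')
print(entry_to_dict(first_entry))
# {'domain': '1', 'serial': '43434343434', 'seqno': '0', 'actionflags': '0x0',
#  'type': 'EXAMPLE', 'subtype': 'EXAMPLE', 'config_ver': '0',
#  'src': '1.1.1.1', 'dst': '1.1.1.1', 'rule': 'Rule 21'}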

For your demo XML data, this produces:

>>> import sys
>>> csvwriter = csv.DictWriter(sys.stdout, header_list, restval='EMPTY')
>>> csvwriter.writeheader()
domain,serial,seqno,actionflags,type,subtype,config_ver,src,dst,rule
>>> csvwriter.writerows(entry_to_dict(entry) for entry in soup.find_all('entry'))
1,43434343434,0,0x0,EXAMPLE,EXAMPLE,0,1.1.1.1,1.1.1.1,Rule 21
1,43434343434,0,0x0,EXAMPLE,EXAMPLE,0,1.1.1.1,1.1.1.1,Rule 21

This doesn't actually contain any empty elements, but when I add some, you can see that EMPTY is used to fill those in:

>>> html_results += '''</log><log>
...           <entry logid="4343">
...             <domain>1</domain>
...             <serial>43434343434</serial>
...             <seqno>0</seqno>
...             <actionflags>0x0</actionflags>
...             <type>EXAMPLE</type>
...             <subtype>EXAMPLE</subtype>
...             <!-- incomplete entry, config_ver, src, dst and rule missing -->
...          </entry>
...        </log>'''
>>> soup = BeautifulSoup(html_results, 'html.parser')
>>> csvwriter = csv.DictWriter(sys.stdout, header_list, restval='EMPTY')
>>> csvwriter.writeheader()
domain,serial,seqno,actionflags,type,subtype,config_ver,src,dst,rule
>>> csvwriter.writerows(entry_to_dict(entry) for entry in soup.find_all('entry'))
1,43434343434,0,0x0,EXAMPLE,EXAMPLE,0,1.1.1.1,1.1.1.1,Rule 21
1,43434343434,0,0x0,EXAMPLE,EXAMPLE,0,1.1.1.1,1.1.1.1,Rule 21
1,43434343434,0,0x0,EXAMPLE,EXAMPLE,EMPTY,EMPTY,EMPTY,EMPTY

As a final note: consider installing the lxml library and using the xml parser in BeautifulSoup:

soup = BeautifulSoup(html_results, 'xml')

This ensures that your XML is parsed correctly at all times (the HTML parser is fault-tolerant in ways that an XML parser should not be, and is case-insensitive, which could cause issues with mixed-case XML data).
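
A small, hypothetical illustration of the case-sensitivity point (assuming lxml is installed, e.g. via pip install lxml): html.parser lowercases tag names, while the XML parser preserves them, so searches against mixed-case tags behave differently.

from bs4 import BeautifulSoup

sample = '<Entry><Rule>Rule 21</Rule></Entry>'
# html.parser lowercases the tag names, so a search for 'Rule' finds nothing:
print(BeautifulSoup(sample, 'html.parser').find('Rule'))  # None
# the XML parser keeps the original case:
print(BeautifulSoup(sample, 'xml').find('Rule'))          # <Rule>Rule 21</Rule>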
