Read XML row by row without loading the whole file into memory

Question

This is the structure of my XML:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="756" ViewCount="63468" Body="&lt;p&gt;I want to use a &lt;code&gt;Track-Bar&lt;/code&gt; to change a &lt;code&gt;Form&lt;/code&gt;'s opacity.&lt;/p&gt;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&lt;pre class=&quot;lang-cs prettyprint-override&quot;&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;When I build the application, it gives the following error:&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;pre class=&quot;lang-none prettyprint-override&quot;&gt;&lt;code&gt;Cannot implicitly convert type decimal to double&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;I have tried using &lt;code&gt;trans&lt;/code&gt; and &lt;code&gt;double&lt;/code&gt;, but then the &lt;code&gt;Control&lt;/code&gt; doesn't work. This code worked fine in a past VB.NET project.&lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="3072350" LastEditorDisplayName="Rich B" LastEditDate="2021-02-26T03:31:15.027" LastActivityDate="2021-11-15T21:15:29.713" Title="How to convert a Decimal to a Double in C#?" Tags="&lt;c#&gt;&lt;floating-point&gt;&lt;type-conversion&gt;&lt;double&gt;&lt;decimal&gt;" AnswerCount="12" CommentCount="4" FavoriteCount="59" CommunityOwnedDate="2012-10-31T16:42:47.213" ContentLicense="CC BY-SA 4.0" />
  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="313" ViewCount="22477" Body="&lt;p&gt;I have an absolutely positioned &lt;code&gt;div&lt;/code&gt; containing several children, one of which is a relatively positioned &lt;code&gt;div&lt;/code&gt;. When I use a &lt;code&gt;percentage-based width&lt;/code&gt; on the child &lt;code&gt;div&lt;/code&gt;, it collapses to &lt;code&gt;0 width&lt;/code&gt; on IE7, but not on Firefox or Safari.&lt;/p&gt;&#xA;&lt;p&gt;If I use &lt;code&gt;pixel width&lt;/code&gt;, it works. If the parent is relatively positioned, the percentage width on the child works.&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Is there something I'm missing here?&lt;/li&gt;&#xA;&lt;li&gt;Is there an easy fix for this besides the &lt;code&gt;pixel-based width&lt;/code&gt; on the child?&lt;/li&gt;&#xA;&lt;li&gt;Is there an area of the CSS specification that covers this?&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;" OwnerUserId="9" LastEditorUserId="9134576" LastEditorDisplayName="user14723686" LastEditDate="2021-01-29T18:46:45.963" LastActivityDate="2021-01-29T18:46:45.963" Title="Why did the width collapse in the percentage width child element in an absolutely positioned parent on Internet Explorer 7?" Tags="&lt;html&gt;&lt;css&gt;&lt;internet-explorer-7&gt;" AnswerCount="7" CommentCount="0" FavoriteCount="13" ContentLicense="CC BY-SA 4.0" />
</posts>

Can I load the rows one at a time without loading the whole XML file into memory? For example, to print all of the titles.

Answer 1

Provided the XML file is structured exactly as shown in the example, with each row element on a single line, BeautifulSoup can be used to parse the relevant lines. Something like this:

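from bs4 import BeautifulSoup as BS

with open('my.xml', encoding='utf-8') as xml:
    for line in map(str.strip, xml):
        # Only lines holding a <row .../> element are parsed.
        if line.startswith('<row'):
            soup = BS(line, 'lxml')
            if row := soup.find('row'):
                # The 'lxml' HTML parser lowercases attribute names.
                if title := row.get('title'):
                    print(title)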

Answer 2

"Lines" in XML are pretty irrelevant; the relevant units are things like elements, attributes, start tags, and end tags.

A streaming parser (often called a SAX parser, though strictly speaking SAX is a Java API) will deliver the document to the application incrementally: not one line at a time, but one syntactic unit at a time.
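
To make "one syntactic unit at a time" concrete, here is a minimal sketch using xml.etree.ElementTree.iterparse from the Python standard library. It is a pull-style streaming API rather than SAX itself. The function name iter_titles and the shortened two-row sample are assumptions made for illustration; in practice you would pass the path of the real file.

```python
import io
import xml.etree.ElementTree as ET

def iter_titles(source):
    """Yield the Title attribute of each <row> as soon as it has been parsed."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            title = elem.get("Title")
            if title is not None:
                yield title
            # Discard rows that are already processed so memory stays flat.
            root.clear()

# Hypothetical shortened sample standing in for the real file.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" Title="How to convert a Decimal to a Double in C#?" />
  <row Id="6" Title="Why did the width collapse?" />
</posts>
"""

titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

Clearing the root after each row is what keeps memory use independent of file size; without it, the finished row elements would pile up under the root.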

See, for example, the Python SAX parser.
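
A minimal sketch of that approach with the standard library's xml.sax module might look like the following. The class name TitleHandler and the shortened sample data are assumptions for illustration; the handler is called once per start tag, so the document is never held in memory as a whole.

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collect the Title attribute from every <row> start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # Called once per start tag; attrs behaves like a read-only mapping.
        if name == "row" and "Title" in attrs:
            self.titles.append(attrs["Title"])

# Hypothetical shortened sample standing in for the real file.
SAX_SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" Title="How to convert a Decimal to a Double in C#?" />
  <row Id="6" Title="Why did the width collapse?" />
</posts>
"""

handler = TitleHandler()
xml.sax.parse(io.BytesIO(SAX_SAMPLE), handler)
print(handler.titles)
```

The list is only there to make the result easy to inspect. For a very large file you would print or process each title inside startElement instead of collecting them, and pass the file path to xml.sax.parse in place of the in-memory sample.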

Answer 3

You can try something like this:

while line := file.readline():  # file: an already-open file object
    print(line, end='')

Answer 4

Yes, you can use open(). It returns a file object and does not read the file's contents into RAM; iterating over that object reads one line at a time. So you want to do something like this:

with open('file_name') as file:
    for row in file:
        print(row)
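
To get from raw lines to titles, each line can be handed to an XML parser on its own. The sketch below is an illustration under a strong assumption: every row element sits complete on exactly one line, as in the question's sample. If a row ever spans several lines, ET.fromstring raises a ParseError and a streaming parser is the safer choice. The function name titles_from_lines and the shortened sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def titles_from_lines(lines):
    """Yield Title from each line that holds one complete <row ... /> element."""
    for line in lines:
        line = line.strip()
        if line.startswith("<row"):
            row = ET.fromstring(line)  # parses just this one element
            title = row.get("Title")
            if title is not None:
                yield title

# Hypothetical shortened sample; with a real file, pass the open file object.
LINE_SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" Title="How to convert a Decimal to a Double in C#?" />
  <row Id="6" Body="&lt;p&gt;text&lt;/p&gt;" Title="Why did the width collapse?" />
</posts>
"""

line_titles = list(titles_from_lines(io.StringIO(LINE_SAMPLE)))
print(line_titles)
```

Because the parser decodes entities such as &lt; itself, attribute values come back already unescaped.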

4 answers

Answer 1: score 1, accepted, 2022-05-05 08:40:43

Answer 2: score 1, 2022-05-05 14:40:37

Answer 3: score 0, 2022-05-05 08:27:37

Answer 4: score 0, 2022-05-05 08:47:27
