
How to remove a specific line from a large file with Python?

For work I have to process an XML-based file: place it in another directory, change the extension (from .xaf to .xml), read some info from its lines, and remove two specific lines from the file (after it has been placed there).
I got everything working except for the last part. I need to remove two specific lines (or two specific parts of a line, if the content of the XML is written on a single line). That should not be a problem, since there are already many posts about it on Stack Overflow. The solution usually given is to read the source file and copy it line by line, skipping the line that needs to be deleted.
The problem is that the files I need to process are very big (anywhere from 100,000 to more than 5,000,000 lines), and there are a lot of files to process, so this method takes a long time.
Is there a way to copy the file and edit the content directly, instead of copying the file line by line? The parts that need to be deleted are always somewhere in the top 20 lines.

What I tried was copying the files from the source to the destination, and then opening the source file again to read the first 20 lines and copying those to the destination file. However, this meant the whole destination file was overwritten (so anything after those 20 lines was gone).

Does anyone have an idea how to handle this? Many thanks.

Example part:

<companyIdent>XXXXX</companyIdent>
<companyName>Company1</companyName>
<taxRegistrationCountry>NL</taxRegistrationCountry>
<taxRegIdent>123456789</taxRegIdent>
<streetAddress>
    <streetname>Address1</streetname>
    <city>CITY</city>
    <postalCode>1234AB</postalCode>
    <country>NL</country></streetAddress>
<customersSuppliers>
   <customerSupplier>
      <custSupID>C0001</custSupID>

I want to remove <streetAddress> and </streetAddress>. Only these two tags, not the content between them (that's why I was thinking of removing lines instead of parsing the file).

Using an event-based SAX parser, you can filter out tags with low memory usage and good performance:

from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# filter class which skips startElement and endElement events
# for tags configured
class MyFilter(XMLFilterBase):

    def __init__(self, tags_to_exclude, parent=None):
        super().__init__(parent)

        # tags to exclude
        self._tags_to_exclude = tags_to_exclude

    def startElement(self, name, attrs):
        if name not in self._tags_to_exclude:
            super().startElement(name, attrs)

    def endElement(self, name):
        if name not in self._tags_to_exclude:
            super().endElement(name)


# define tags to be excluded
tags_to_exclude = {'streetAddress'}

# create filter
reader = MyFilter(tags_to_exclude, make_parser())

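# parse source and write to target
# (the file encoding is set explicitly so that it matches the encoding
# XMLGenerator declares in the XML header; its default is iso-8859-1)
with open('target.xml', 'w', encoding='utf-8') as file:
    handler = XMLGenerator(file, encoding='utf-8')
    reader.setContentHandler(handler)
    reader.parse('source.xml')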
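
Since the question states that the two tags always sit somewhere in the first 20 lines, an alternative that avoids parsing the whole file could look like the sketch below. It is only an illustration under assumptions: the helper name copy_without_tags, its parameters and the 20-line default are hypothetical, the tags are assumed to carry no attributes, and the file is assumed to have real line breaks (if the whole XML is on one line, the first readline() call loads the entire file). Only the head of the file is edited line by line; the remainder is copied in large binary chunks with shutil.copyfileobj.

```python
import re
import shutil


# Hypothetical helper: the name, parameters and the 20-line default
# are assumptions made for this sketch, not part of the original answer.
def copy_without_tags(src, dst, tag="streetAddress", head_lines=20):
    """Copy src to dst, dropping <tag> and </tag> found in the first head_lines lines."""
    # Matches the bare opening or closing tag (no attributes assumed).
    pattern = re.compile(rb"</?" + re.escape(tag.encode()) + rb"\s*>")

    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Edit only the head of the file, line by line.
        for _ in range(head_lines):
            line = fin.readline()
            if not line:  # file is shorter than head_lines
                break
            stripped = pattern.sub(b"", line)
            # Drop lines that contained nothing but the removed tag.
            if stripped != line and not stripped.strip():
                continue
            fout.write(stripped)

        # Copy everything after the head in large chunks, without
        # looking at individual lines.
        shutil.copyfileobj(fin, fout)


if __name__ == "__main__":
    # File names taken from the answer above.
    copy_without_tags("source.xml", "target.xml")
```

The destination still has to be written out in full, because bytes cannot be removed from the start of a file without shifting everything that follows. The saving comes from skipping per-line work for the millions of lines after the head. Unlike the SAX filter, this does not check that the result is well-formed XML.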
