
Efficiently rewriting lines in a large text file with Python

I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file is a line for each "feature" that appears later in the file. They look like:

@attribute 'Diameter' numeric
@attribute 'Length' real
@attribute 'Qty' integer

Lines containing data that uses these attributes look like:

{0 0.86, 1 0.98, 2 7}

However, since my data is sparse, each record from my database may not have every attribute, and I don't know what the complete feature set is in advance. I could, in theory, iterate over my database records twice, the first time accumulating the feature set and the second time outputting my records, but I'm trying to find a more efficient method.

I'd like to try a method like the following pseudo-code:

fout = open('output.dat', 'w')
known_features = set()
for record in records:
    if record has unknown features:
        jump to top of file
        delete existing "@attribute" lines and write new lines
        jump to bottom of file
    fout.write(record)

It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?

I tried something like:

fout.seek(0)        # jump to the start of the file
for new_attribute in new_attributes:
    fout.write(new_attribute)   # overwrites whatever bytes are already there
fout.seek(0, 2)     # jump back to the end of the file

but this overwrites both the attribute lines and the data lines at the top of the file, rather than simply inserting new lines starting at the seek position I specify.

How do you obtain a word processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.

Why don't you get a list of all the features and their data types, and list them first? If a feature is missing from a record, replace it with a known value - NULL seems appropriate.

This way your records will be complete (in length), and you don't have to hop around the file.
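A minimal sketch of that idea, assuming a hypothetical get_feature_schema() that can return the full (name, type) list up front and records that behave like dicts keyed by feature name:

known_features = get_feature_schema()  # hypothetical: e.g. [('Diameter', 'numeric'), ('Length', 'real')]

with open('output.dat', 'w') as fout:
    for name, ftype in known_features:
        fout.write("@attribute '%s' %s\n" % (name, ftype))
    for record in records:
        # Emit every attribute for every record, padding missing ones
        # with NULL so all rows have the same length.
        pairs = ['%d %s' % (i, record.get(name, 'NULL'))
                 for i, (name, ftype) in enumerate(known_features)]
        fout.write('{%s}\n' % ', '.join(pairs))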

The other approach is to write two files. One contains all your features, the other all your rows. Once both files are generated, prepend the feature file to the data file.
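A minimal sketch of the two-file approach, discovering features on the fly; record_features() and format_record() are hypothetical stand-ins for the real database and formatting code:

import shutil

known_features = []  # (name, type) tuples in first-seen order
with open('data.tmp', 'w') as fdata:
    for record in records:
        for feature in record_features(record):    # hypothetical helper
            if feature not in known_features:
                known_features.append(feature)
        fdata.write(format_record(record, known_features))  # hypothetical helper

with open('output.dat', 'w') as fout:
    # Write the now-complete header first...
    for name, ftype in known_features:
        fout.write("@attribute '%s' %s\n" % (name, ftype))
    # ...then stream the data file in without loading it all into memory.
    with open('data.tmp') as fdata:
        shutil.copyfileobj(fdata, fout)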

FWIW, word processors load files into memory for editing and then write the entire file back out. This is why you can't load a file larger than the addressable/available memory in a word processor, or in any other program that isn't implemented as a stream reader.

Why not build the output in memory first (for example, as a dictionary), and then write it to the file once all the data is known?
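A minimal sketch of that idea, only workable when the records themselves fit in RAM (the question notes the final file does not); record_features() and format_record() are hypothetical helpers as above:

known_features = []
buffered = []
for record in records:
    for feature in record_features(record):     # hypothetical helper
        if feature not in known_features:
            known_features.append(feature)
    buffered.append(record)                     # hold every record in memory

with open('output.dat', 'w') as fout:
    for name, ftype in known_features:
        fout.write("@attribute '%s' %s\n" % (name, ftype))
    for record in buffered:
        fout.write(format_record(record, known_features))  # hypothetical helper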
