
Efficiently rewriting lines in a large text file with Python

I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file is a line for each "feature" that appears later in the file. They look like:

@attribute 'Diameter' numeric
@attribute 'Length' real
@attribute 'Qty' integer

Lines containing data that uses these attributes look like:

{0 0.86, 1 0.98, 2 7}

However, since my data is sparse, each record from my database may not have every attribute, and I don't know what the complete feature set is in advance. I could, in theory, iterate over my database records twice, the first time accumulating the feature set and the second time outputting my records, but I'm trying to find a more efficient method.

I'd like to try a method like the following pseudo-code:

fout = open('output.dat', 'w')
known_features = set()
for record in records:
    if record has unknown features:
        jump to top of file
        delete existing "@attribute" lines and write new lines
        jump to bottom of file
    fout.write(record)

It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?

I tried something like:

fout.seek(0)        # jump to the start of the file
for new_attribute in new_attributes:
    fout.write(new_attribute)   # overwrites whatever bytes are already there
fout.seek(0, 2)     # jump back to the end of the file

but this overwrites both the attribute lines and the data lines at the top of the file, rather than simply inserting new lines starting at the seek position I specify.

How do you obtain a word processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.

Why don't you get a list of all the features and their data types, and list them first? If a feature is missing from a record, replace it with a known value - NULL seems appropriate.

This way your records will be complete (in length), and you don't have to hop around the file.
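A minimal sketch of that idea, assuming a hypothetical get_feature_schema() that can return the full (name, type) list up front and records that behave like dicts keyed by feature name:

known_features = get_feature_schema()  # hypothetical: e.g. [('Diameter', 'numeric'), ('Length', 'real')]

with open('output.dat', 'w') as fout:
    for name, ftype in known_features:
        fout.write("@attribute '%s' %s\n" % (name, ftype))
    for record in records:
        # Emit every attribute for every record, padding missing ones
        # with NULL so all rows have the same length.
        pairs = ['%d %s' % (i, record.get(name, 'NULL'))
                 for i, (name, ftype) in enumerate(known_features)]
        fout.write('{%s}\n' % ', '.join(pairs))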

The other approach is to write two files. One contains all your features, the other all your rows. Once both files are generated, prepend the feature file to the data file.
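A minimal sketch of the two-file approach, discovering features on the fly; record_features() and format_record() are hypothetical stand-ins for the real database and formatting code:

import shutil

known_features = []  # (name, type) tuples in first-seen order
with open('data.tmp', 'w') as fdata:
    for record in records:
        for feature in record_features(record):    # hypothetical helper
            if feature not in known_features:
                known_features.append(feature)
        fdata.write(format_record(record, known_features))  # hypothetical helper

with open('output.dat', 'w') as fout:
    # Write the now-complete header first...
    for name, ftype in known_features:
        fout.write("@attribute '%s' %s\n" % (name, ftype))
    # ...then stream the data file in without loading it all into memory.
    with open('data.tmp') as fdata:
        shutil.copyfileobj(fdata, fout)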

FWIW, word processors load files into memory for editing and then write the entire file back out. This is why you can't load a file larger than the addressable/available memory in a word processor, or in any other program that isn't implemented as a stream reader.

Why not build the output in memory first (for example, as a dictionary), and then write it to the file once all the data is known?
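A minimal sketch of that idea, only workable when the records themselves fit in RAM (the question notes the final file does not); record_features() and format_record() are hypothetical helpers as above:

known_features = []
buffered = []
for record in records:
    for feature in record_features(record):     # hypothetical helper
        if feature not in known_features:
            known_features.append(feature)
    buffered.append(record)                     # hold every record in memory

with open('output.dat', 'w') as fout:
    for name, ftype in known_features:
        fout.write("@attribute '%s' %s\n" % (name, ftype))
    for record in buffered:
        fout.write(format_record(record, known_features))  # hypothetical helper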
