
Efficiently rewriting lines in a large text file with Python

I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file is a line for each "feature" that appears later in the file. These lines look like:

@attribute 'Diameter' numeric
@attribute 'Length' real
@attribute 'Qty' integer

Lines containing data that use these attributes look like:

{0 0.86, 1 0.98, 2 7}

However, since my data is sparse, each record from my database may not have every attribute, and I don't know the complete feature set in advance. I could, in theory, iterate over my database records twice, the first time accumulating the feature set, and the second time to output my records, but I'm trying to find a more efficient method.

I'd like to try a method like the following pseudo-code:

fout = open('output.dat', 'w')
known_features = set()
for record in records:
    if record has unknown features:
        jump to top of file
        delete existing "@attribute" lines and write new lines
        jump to bottom of file
    fout.write(record)

It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?

I tried something like:

fout.seek(0)
for new_attribute in new_attributes:
    fout.write(new_attribute)
fout.seek(0, 2)

but this overwrites the existing attribute lines and data lines at the top of the file, rather than simply inserting new lines starting at the seek position I specify.

How do you obtain a word-processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.

Why don't you get a list of all the features and their data types, and list them first? If a feature is missing, replace it with a known value; NULL seems appropriate.

This way your records will be complete (in length), and you don't have to hop around the file.
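If the feature list can be gathered up front (for example, from the database schema), a minimal sketch of this approach might look like the following; the feature list, fetch_records() and the '?' placeholder are illustrative assumptions, not part of the original question:

# Minimal sketch: a known, fixed feature list, with a placeholder for
# missing values. fetch_records() is a hypothetical DB iterator that
# yields dict-like records.
all_features = [('Diameter', 'numeric'), ('Length', 'real'), ('Qty', 'integer')]

with open('output.dat', 'w') as fout:
    # Header: one @attribute line per known feature, written once.
    for name, ftype in all_features:
        fout.write("@attribute '%s' %s\n" % (name, ftype))

    # Data: every record gets a value for every feature, so the records
    # are complete and the header never needs rewriting.
    for record in fetch_records():
        values = []
        for index, (name, _) in enumerate(all_features):
            value = record.get(name, '?')   # '?' stands in for NULL/missing
            values.append('%d %s' % (index, value))
        fout.write('{%s}\n' % ', '.join(values))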

The other approach is to write two files: one containing all your features, the other all your rows. Once both files are generated, put the feature file at the top, followed by the data file.
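A minimal sketch of the two-file approach, assuming a hypothetical fetch_records() iterator that yields dict-like records and treating every feature as numeric:

import shutil

# Pass over the records once, writing data rows to a temporary file
# while accumulating the feature set.
known_features = []            # ordered list of feature names
feature_index = {}             # feature name -> column index

with open('data.tmp', 'w') as data_out:
    for record in fetch_records():        # hypothetical DB iterator
        pairs = []
        for name, value in record.items():
            if name not in feature_index:
                feature_index[name] = len(known_features)
                known_features.append(name)
            pairs.append('%d %s' % (feature_index[name], value))
        data_out.write('{%s}\n' % ', '.join(pairs))

# The full feature set is now known: write the header, then stream the
# temporary data file after it without loading it into memory.
with open('output.dat', 'w') as fout, open('data.tmp', 'r') as data_in:
    for name in known_features:
        fout.write("@attribute '%s' numeric\n" % name)   # type assumed
    shutil.copyfileobj(data_in, fout)

shutil.copyfileobj copies the data file in fixed-size chunks, so neither file ever has to fit in memory.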

FWIW, word processors load files into memory for editing and then write the entire file back out. This is why you can't load a file larger than the addressable/available memory in a word processor, or in any other program that is not implemented as a stream reader.

Why not build the output in memory first (for example, as a dictionary) and then write it to the file once all the data is known?
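A minimal sketch of that suggestion, again assuming a hypothetical fetch_records() iterator and numeric features; note it only helps if the accumulated rows fit in RAM, which the question says is not the case here:

# Minimal sketch: collect everything in memory first, then write once.
feature_index = {}     # feature name -> column index, filled as features appear
rows = []              # each row is a list of (index, value) pairs

for record in fetch_records():            # hypothetical DB iterator
    row = []
    for name, value in record.items():
        index = feature_index.setdefault(name, len(feature_index))
        row.append((index, value))
    rows.append(row)

with open('output.dat', 'w') as fout:
    for name, index in sorted(feature_index.items(), key=lambda item: item[1]):
        fout.write("@attribute '%s' numeric\n" % name)   # type assumed
    for row in rows:
        fout.write('{%s}\n' % ', '.join('%d %s' % pair for pair in row))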
