
Comparing two large text files column by column in Python

I have two large tab-separated text files, each 36,000 rows x 3,000 columns. The column structure is the same in both files, but they may not be sorted.

I need to compare only the numeric columns between these two files (approx. 2,970 columns) and export the rows where the value differs in any pair of corresponding columns.

Problem: Memory issue

Things I tried:

1) Transposing data: Reshaping the data from wide to long and reading it chunk by chunk. Problem: The data bloats to more than a few million rows and Python throws a memory error.

2) Difflib: Difflib with generators, and without transposing, gave me output efficiently, but it compares the two files row by row. It doesn't split the rows of the tab-separated file into columns. (I need the columns separated because I will be performing column operations on the differing rows.)

3) Chunk and join: This is the third approach I am trying, in which I divide one file into chunks, repeatedly merge each chunk with the other file on the common keys, and find the differences in those chunks. This is going to be a shitty approach and it's going to take a lot of time, but I am unable to think of anything else.

Also: Questions of this type have been answered in the past, but they only dealt with processing one huge file.

Any suggestions for a better approach in Python will be greatly appreciated. Thank you.

First of all, if the files are that big, they should be read row by row.

Reading one file row by row is simple:

with open(...) as f:
    for row in f:
        ...

To iterate two files row by row, zip them:

import itertools

with open(...) as f1, open(...) as f2:
    for row1, row2 in itertools.izip(f1, f2):
        # compare rows, decide what to do with them
        pass

I used izip because, unlike zip in Python 2, it won't zip everything at once. In Python 3, izip no longer exists; use the built-in zip instead, which is lazy there. Either way, it goes row by row and yields the pairs.

The next question is comparing by column. Separate the columns:

columns = row.rstrip('\n').split('\t')  # drop the trailing newline, then split on tabs (\t)
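
As a rough sketch of that step, a row can be split and the numeric columns located by checking which values parse as floats. The helper names split_row and numeric_column_indexes are made up for illustration, and the sketch assumes that one sample data row is representative of the whole file (a text column that happens to contain something like "nan" would be misclassified).

```python
def split_row(row):
    """Split one tab-separated line into a list of column values."""
    # Strip the line ending first so the last column does not keep it.
    return row.rstrip("\r\n").split("\t")


def numeric_column_indexes(columns):
    """Return the indexes of the values that can be parsed as floats."""
    indexes = []
    for i, value in enumerate(columns):
        try:
            float(value)
        except ValueError:
            continue  # not numeric, skip this column
        indexes.append(i)
    return indexes


columns = split_row("r1\tabc\t1.5\t-3\n")
numeric_cols = numeric_column_indexes(columns)
```

Running the detection once on the first data row and reusing the resulting indexes avoids repeating the check for all 36,000 rows.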

Now pick the relevant columns and compare them. Then discard irrelevant rows and write the relevant ones to the output.
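
Putting the pieces together, a minimal Python 3 sketch could look like the following. It is only an illustration under several assumptions: the rows of both files are in the same order (if they are not, the files would need to be sorted or matched on a key first), every value in the chosen columns parses as a float, and the name differing_rows and its parameters are hypothetical.

```python
def differing_rows(lines1, lines2, numeric_cols, has_header=True):
    """Yield (line_number, cols1, cols2) for aligned rows whose numeric columns differ.

    lines1 and lines2 can be any iterables of tab-separated lines,
    for example two open file objects.
    """
    pairs = zip(lines1, lines2)  # lazy in Python 3: one pair of rows at a time
    if has_header:
        next(pairs, None)  # skip the header pair
    first_line = 2 if has_header else 1
    for line_no, (row1, row2) in enumerate(pairs, start=first_line):
        cols1 = row1.rstrip("\r\n").split("\t")
        cols2 = row2.rstrip("\r\n").split("\t")
        # Compare as floats so that "1" and "1.0" count as equal.
        if any(float(cols1[i]) != float(cols2[i]) for i in numeric_cols):
            yield line_no, cols1, cols2


file_a = ["id\tx\ty\n", "r1\t1.0\t2\n", "r2\t3\t4\n"]
file_b = ["id\tx\ty\n", "r1\t1\t2.0\n", "r2\t3\t5\n"]
result = list(differing_rows(file_a, file_b, numeric_cols=[1, 2]))
```

Passing two open files instead of the lists keeps only one pair of rows in memory at a time, and each yielded pair can be written to the output file with "\t".join(...). Note that zip stops at the end of the shorter file, so extra trailing rows in the longer one are silently ignored.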
