
Fastest way to find and replace in a large text file (single-line file or single-string file) in Python

Hi everyone, I am facing a problem with a slow find-and-replace in Python on a large text file (it is just a single-line / single-string file); it takes a lot of time to perform the task. I have an Excel file in which column "A" contains codes that appear in the text file and need to be replaced with the values in column "B", and there are around a million or more codes to replace. Can anyone recommend a faster way? Thanks in advance. I have tried both of the ways listed below.

# first way

import pandas as pd
import re

df = pd.read_excel("rep-codes.xlsx", header=None, index_col=False, dtype=str)
df.columns = ['A', 'B']

# re-reads and rewrites the whole file once per row, which is what makes this slow
for index, row in df.iterrows():
    with open('final.txt', 'r') as open_file:
        read_file = open_file.read()
    regex = re.compile(re.escape(row['A']))  # escape so each code is matched literally
    read_file = regex.sub(row['B'], read_file)
    with open('final.txt', 'w') as write_file:
        write_file.write(read_file)

# 2nd way

df = pd.read_excel("rep-codes.xlsx", header=None, index_col=False, dtype=str)
df.columns = ['A', 'B']

with open("final.txt", "rt") as fin:
    data = fin.read()

# one full pass over the string per row
for index, row in df.iterrows():
    data = data.replace(row['A'], row['B'])

with open("final.txt", "wt") as fout:
    fout.write(data)
  • Firstly, clarify the performance that would satisfy the business requirements. You can optimise forever, but at some point it's more effective to just let the thing run for however long it runs (overnight, if necessary), or otherwise throw brute force at it (rent a beefy machine from AWS or equivalent).

  • There is a replacement regex library, pyre2 (more generally, Google RE2), which can work faster in some circumstances, in particular on large amounts of text.
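
    It is meant to be close to a drop-in replacement, so a minimal way to try it (a sketch, assuming the pyre2 package is installed via pip install pyre2) is to swap the import and leave the rest of the code unchanged:

     try:
         import re2 as re  # wraps Google RE2 behind an interface modelled on the standard re module
     except ImportError:
         import re         # fall back to the standard library if pyre2 is not available

    Whether it actually helps depends on the patterns and the data, so it is worth benchmarking both on a realistic sample.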

  • Another algorithm would be to take all the words in column A and compile them into a single regex; this might work especially well in combination with pyre2. Something like:

     map = {}
     for index, row in df.iterrows():
         map[row['A']] = row['B']

     def repl(match_obj):
         # look up the replacement for whichever code was matched
         return map[match_obj.group(0)]

     # escape each code so it is treated as literal text inside the pattern
     regex = re.compile('|'.join(re.escape(key) for key in map))
     data = regex.sub(repl, data)
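
    One detail to watch with the combined pattern: re tries the alternatives left to right, so if one code is a prefix of another (say AB and ABC), the shorter one can win and the longer one is never replaced. Building the alternation with the keys sorted longest-first avoids that; a small tweak of the compile line above:

     regex = re.compile('|'.join(
         re.escape(key) for key in sorted(map, key=len, reverse=True)))
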
  • Another question is whether to do the replacement in memory, or directly to the output file. In memory, the string needs to be copied each time; directly to disk will involve library calls for each match.

    You'd have to measure, with real data, whether this is an advantage or a disadvantage.

    This approach could also be extended to handle files larger than memory.

    Instead of regex.sub you'd call regex.finditer; for each match object, you'd write out the section of the string up to match_obj.start(), followed by the replacement. Finally, write out the rest.

     map = {}
     for index, row in df.iterrows():
         map[row['A']] = row['B']

     regex = re.compile('|'.join(re.escape(key) for key in map))

     cur_pos = 0
     for match_obj in regex.finditer(data):
         out_file.write(data[cur_pos:match_obj.start()])  # text before the match
         out_file.write(map[match_obj.group(0)])          # the replacement value
         cur_pos = match_obj.end()
     out_file.write(data[cur_pos:])                       # whatever follows the last match

    I suspect in most cases this will be slower than the regex.sub() approach, but it may be worth trying.
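
    As one concrete way to avoid holding the whole file in a Python string, here is a minimal sketch (assuming the codes are plain ASCII/UTF-8 text and that the result may go to a new file, final-replaced.txt here) that memory-maps the input and lets finditer scan it lazily; a regex applied to a memory-mapped file has to be a bytes pattern:

     import mmap
     import re
     import pandas as pd

     df = pd.read_excel("rep-codes.xlsx", header=None, index_col=False, dtype=str)
     df.columns = ['A', 'B']

     # bytes keys and values, because matching against a mmap requires a bytes pattern
     repl_map = {row['A'].encode(): row['B'].encode() for _index, row in df.iterrows()}
     regex = re.compile(b'|'.join(re.escape(key) for key in repl_map))

     with open("final.txt", "rb") as fin, open("final-replaced.txt", "wb") as fout:
         with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mm:
             cur_pos = 0
             for match_obj in regex.finditer(mm):
                 fout.write(mm[cur_pos:match_obj.start()])  # text before the match
                 fout.write(repl_map[match_obj.group(0)])   # the replacement code
                 cur_pos = match_obj.end()
             fout.write(mm[cur_pos:])                       # tail after the last match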

If the .txt file is just a single column of data, then the operation should be as simple as this:

df = pd.read_excel("rep-codes.xlsx", header=None, index_col=False, dtype=str)
df.columns = ['A', 'B']

df['B'].to_csv('final.txt', index=False, header=False)  # write only the values, without the index or a header row

If the .txt file has multiple columns and you just need to swap the values of column A with column B:

df = pd.read_excel("rep-codes.xlsx", header=None, index_col=False, dtype=str)
df.columns = ['A', 'B']

txt_df = pd.read_csv('final.txt')
txt_df['A'] = df['B']
txt_df.to_csv('final.txt', index=False)  # avoid adding an extra index column on write

I'm also going to guess that there are some other factors not mentioned, like different column sizes and such. Let me know if anything else needs to change.
