
Merge two large strings in Python efficiently

I've been working on this for a few days now, and it seems nowhere has the answer I need.

In fear of this being marked as duplicate, I'll explain why the other questions don't work for me.

  • Any answer using difflib will not meet my needs (I describe more below). It is far too slow. Unless someone has a good optimization tip for the unified_diff function, I won't be able to use it.

  • I've tried researching how to send large strings to commands that expect files, but none of the options worked for me. I wouldn't mind using this option if I could get it to work (also described more below).

  • I don't mind being marked as duplicate so long as the other question genuinely solves my problem. I've scoured a few sites and haven't found a solution that works for me yet.

I want to merge two large strings in Python. The strings are about 1.5KB each. Assuming there are two strings, str1 and str2, I just want to return the merged string which is simply str1 with the added information of str2. I don't want anything to be removed.

For the most part, these strings will be nearly identical; most of the time they will be about 90% the same. The difference is that new information may have been added to the second string, and I would like to capture that information in the original one.

For example:

str1 = ("This is a very\n"
        "Long string and\n"
        "This is how it looks.")

str2 = ("This is a very\n"
        "This is my Example\n"
        "This is how it looks.")

result = ("This is a very\n"
          "Long string and\n"
          "This is my Example\n"
          "This is how it looks.")  # Third line was added to str1

The first way I solved this problem was with git diff. I'm on Windows, so I would write the strings to temporary files, run a git diff command on them, and delete the files immediately afterwards. The cmd function I made returns the output (a unified diff) as a string. I then post-process that string to remove the header that diffs always add. I was able to remove the '+' and '-' on each line by changing the output indicators to spaces (I removed the options I used from the code below for simplicity).

# The f1 and f2 text files are created here
# cmd is a function I wrote; it uses the os module to execute the command

output = cmd("git diff -U999999 -b --no-index f1.txt f2.txt")

#f1 and f2 text files are deleted here

I've tried difflib, but that was far too slow: it took about 8-10 minutes to produce one diff. I used the unified_diff function and passed the arguments both as strings and as lists. I even tried modifying the source code, but my changes didn't make it much faster.

I've also tried passing the strings directly to git diff or plain diff. That failed with errors complaining "Argument list too long". I even tried sending the string to stdout and using that as a file argument, and that didn't work either.

I don't mind using any of these options if one can be tweaked to work for my goal. Clearly, my current solution (the block of code above) is very inefficient, and I don't want to keep creating and deleting text files if it can be avoided.
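For reference, the merge described above can be sketched with difflib.SequenceMatcher applied directly to lists of lines, skipping unified_diff and its text formatting. This is only an illustration of the intended result, not a claim that it is fast enough for the inputs in question. The name merge_keep_all is hypothetical, and passing autojunk=False is my own assumption about a sensible setting.

```python
import difflib

def merge_keep_all(str1, str2):
    """Return str1 plus any lines that only appear in str2, keeping order."""
    a = str1.splitlines(keepends=True)
    b = str2.splitlines(keepends=True)
    # Compare whole lines rather than characters.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Lines from str1 are always kept.
        merged.extend(a[i1:i2])
        if tag != "equal":
            # For 'insert' and 'replace', also keep the lines from str2.
            merged.extend(b[j1:j2])
    return "".join(merged)

str1 = "This is a very\nLong string and\nThis is how it looks."
str2 = "This is a very\nThis is my Example\nThis is how it looks."
print(merge_keep_all(str1, str2))
```

Nothing is removed: lines unique to either string are emitted, with the lines from str1 first inside each changed region.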

If you want to roll your own solution, you could add each line to a list, one at a time, alternating between the first string and the second:

list_1 = "A\nB\nC\nD".split()
list_2 = "A\nE\nF\nD".split()
output = []

for i in range(len(list_1)):
    output.append(list_1[i])
    output.append(list_2[i])

for o in output:
    print(o)

>> A
>> A
>> B
>> E
>> C
>> F
>> D
>> D

Then you need to remove duplicates from the output list (without using sets, as sets will scramble the order up).

from collections import OrderedDict

# OrderedDict.fromkeys drops duplicates while keeping first-seen order
output = list(OrderedDict.fromkeys(output))

for o in output:
    print(o)

>> A
>> B
>> E
>> C
>> F
>> D

A few caveats I can think of:

  1. If len(list_1) != len(list_2), you will need to account for that.

  2. It's not clear to me what "merge" means in this context. For instance, if:

     list_1 == ["A", "B", "A", "D", "A", "C", "A", "D", "A", "B", "B", "B"]
     list_2 == ["B", "A", "C", "C", "A", "A", "D", "D", "A", "B", "C", "A", "D"]

It's not clear to me what the resulting merge should look like.
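For the first caveat, one possible way to handle lists of unequal length is itertools.zip_longest, which pads the shorter list instead of raising an IndexError. The sketch below is a minimal illustration under that assumption; interleave_unique is a hypothetical name, and the de-duplication step still drops repeated lines that may be legitimate, so the second caveat still applies.

```python
from itertools import zip_longest

def interleave_unique(text_1, text_2):
    """Alternate lines from both texts, then drop duplicates in order."""
    lines_1 = text_1.splitlines()
    lines_2 = text_2.splitlines()
    merged = []
    # zip_longest pads the shorter list with None once it runs out.
    for a, b in zip_longest(lines_1, lines_2, fillvalue=None):
        for line in (a, b):
            if line is not None:
                merged.append(line)
    # dict.fromkeys keeps the first occurrence of each line (Python 3.7+).
    return list(dict.fromkeys(merged))

result = interleave_unique("A\nB\nC\nD", "A\nE\nF\nD\nG")
print(result)
```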

I figured out the solution, with the help of @schwobaseggl's suggestion.

diff_match_patch is light years faster than difflib.

For anyone who may be stuck with a similar issue, here are some roadblocks I faced and how I fixed them:

  • When creating a diff_match_patch object, be sure to pass strings as arguments.
  • If your input is a single line, you don't have to worry about my next point. However, if your string spans multiple lines (separated by '\n'), you must follow a different protocol. The website is linked here. I'll also put it below in case the website ever goes down. The website linked uses Java, but I changed it to Python here.
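from diff_match_patch import diff_match_patch

def diff_lineMode(text1, text2):
    dmp = diff_match_patch()
    # Encode each unique line as one character so the diff runs per line.
    # (The Python port names these helpers without a trailing underscore.)
    lineText1, lineText2, lineArray = dmp.diff_linesToChars(text1, text2)
    diffs = dmp.diff_main(lineText1, lineText2, False)
    # Expand the per-line characters back into the original lines, in place.
    dmp.diff_charsToLines(diffs, lineArray)
    return diffs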
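
To get from the diffs to the merged string asked for in the question, one option is to join the text of every segment, since diff_main returns a list of (op, text) tuples where op is -1 (delete), 0 (equal), or 1 (insert). The sketch below uses hand-written diffs shaped like line-mode output, so it runs without the library installed. merge_from_diffs is a hypothetical helper name, not part of diff_match_patch.

```python
def merge_from_diffs(diffs):
    """Join every diff segment so nothing from either input is dropped."""
    # Each entry is (op, text): -1 = only in text1, 0 = in both, 1 = only in text2.
    return "".join(text for _op, text in diffs)

# Hand-written diffs shaped like line-mode output for the question's example.
example_diffs = [
    (0, "This is a very\n"),
    (-1, "Long string and\n"),
    (1, "This is my Example\n"),
    (0, "This is how it looks."),
]
merged = merge_from_diffs(example_diffs)
print(merged)
```

With the library installed, merge_from_diffs(diff_lineMode(str1, str2)) would apply the same join to real output. If a final line has no trailing '\n', adjacent segments may run together, so that case may need extra handling.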

Thank you everyone for your comments and answers!
