
Merging two almost identical strings

I have two objects. One is a list of (int, str) tuples, like this:

first_input = [
    (0  ,  "Lorem ipsum dolor sit amet, consectetur"),
    (1  ,  " adipiscing elit"),
    (0  ,  ". In pellentesque\npharetra ex, at varius sem suscipit ac. "),
    (-1 ,  "Suspendisse luctus\ncondimentum velit a laoreet. "),
    (0  ,  "Donec dolor urna, tempus sed nulla vitae, dignissim varius neque.")
]
# Note that the strings contain newlines `\n` on purpose.

The other object is a string, the result of a series of operations (*) that, by design, yield the concatenation of all the strings above with some additional newlines \n inserted.

(*: operations that obviously cannot be performed while preserving the list-of-tuples structure)

For instance:

second_input = "Lorem ipsum dolor sit amet,\nconsectetur adipiscing elit. In pellentesque\npharetra ex, at varius sem\nsuscipit ac. Suspendisse luctus\ncondimentum velit a laoreet. Donec dolor urna, tempus sed\nnulla vitae, dignissim varius neque."
# Note that there are 3 new newlines,  here ^ for instance
# but also in "sem\nsuscipit" and "sed\nnulla"

My goal is to go back to the first structure while keeping the additional newlines. In my example, I would get:

expected_output = [
    (0  ,  "Lorem ipsum dolor sit amet,\nconsectetur"),  # new newline here
    (1  ,  " adipiscing elit"),
    (0  ,  ". In pellentesque\npharetra ex, at varius sem\nsuscipit ac. "), # new newline here
    (-1 ,  "Suspendisse luctus\ncondimentum velit a laoreet. "),
    (0  ,  "Donec dolor urna, tempus sed\nnulla vitae, dignissim varius neque.") # new newline here
]

Is there a smart way to do this, other than reconstructing the strings with a character-by-character comparison?

(NB: if a new \n falls on the boundary between two strings, I don't care which of the two tuples it ends up in. E.g. getting [(0, "foo\n"), (1, "bar")] or [(0, "foo"), (1, "\nbar")] makes no difference.)


Edit: what I want to avoid is doing something like this:

position = 0
output = []
for tup in first_input:
    reconstructed_string = ""
    for letter in tup[1]:
        if letter == second_input[position]:
            reconstructed_string = reconstructed_string + letter
        else:
            reconstructed_string = reconstructed_string + second_input[position]
        position += 1
    output.append((tup[0], reconstructed_string))
# Note: this is hastily written to give you an idea, I have no idea if it would work properly, probably not
# Well, it does seem to work without bug, at least in my example. That's unexpected lol. Anyway, if you can think of a better solution...!

That is, going through each character of the strings and comparing them to rebuild the strings character by character.

I think the easiest way would be to apply whatever operations you perform on the combined string to the individual pieces instead, but I guess you already thought of this. Alternatively, rather than inserting any newline characters, you could generate a list of the positions at which they would go. Keeping track of the lengths of the string pieces, this could look as follows, assuming the positions at which a ' ' is to be replaced by '\n' are stored in the variable posis:

# Newline positions for your sample. Each entry is the index of the replaced
# space minus the number of newlines placed before it (see inserted_newlines
# below); the last entry only marks the end of the full string.
posis = [27, 98, 187, 227]
lengths = [len(string) for _, string in first_input]
covered_distance = 0  # total length of the strings already processed
j = 0  # index into posis
output = []
rel_pos = posis[0] - covered_distance  # position relative to the current string
inserted_newlines = 0  # number of newlines placed so far
for i, (n, string) in enumerate(first_input):
    while rel_pos < lengths[i]:
        # replace the character at the relative position with a newline
        string = (string[:rel_pos + inserted_newlines] + '\n'
                  + string[rel_pos + inserted_newlines + 1:])
        j += 1  # advance to the next newline position
        rel_pos = posis[j] - covered_distance  # update the relative position
        inserted_newlines += 1  # keep track of the newlines placed
    output.append((n, string))  # store the resulting string
    covered_distance += lengths[i]  # update the number of characters passed
    rel_pos = posis[j] - covered_distance

This is not very pretty, but it works for the sample. To test it properly I would need more information on the possible cases, and perhaps on the operations that determine the newline positions.
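If, as in the question's sample, every new \n takes the place of exactly one space, the combined string has the same length as the concatenated pieces, and the length bookkeeping reduces to slicing second_input at the cumulative piece lengths. Here is a minimal sketch under that assumption. The helper name split_like is made up for illustration, and the function refuses to run when the lengths differ, which is what happens if newlines are inserted rather than substituted.

```python
from itertools import accumulate

def split_like(pieces, combined):
    """Cut `combined` at the cumulative lengths of the strings in `pieces`.

    Assumption: each new newline REPLACED one character, so lengths match.
    """
    lengths = [len(text) for _, text in pieces]
    if sum(lengths) != len(combined):
        raise ValueError("lengths differ: newlines were inserted, not substituted")
    ends = list(accumulate(lengths))   # end offset of every piece
    starts = [0] + ends[:-1]           # start offset of every piece
    return [(tag, combined[a:b]) for (tag, _), a, b in zip(pieces, starts, ends)]

first_input = [
    (0, "Lorem ipsum dolor sit amet, consectetur"),
    (1, " adipiscing elit"),
    (0, ". In pellentesque\npharetra ex, at varius sem suscipit ac. "),
    (-1, "Suspendisse luctus\ncondimentum velit a laoreet. "),
    (0, "Donec dolor urna, tempus sed nulla vitae, dignissim varius neque."),
]
second_input = "Lorem ipsum dolor sit amet,\nconsectetur adipiscing elit. In pellentesque\npharetra ex, at varius sem\nsuscipit ac. Suspendisse luctus\ncondimentum velit a laoreet. Donec dolor urna, tempus sed\nnulla vitae, dignissim varius neque."

output = split_like(first_input, second_input)
```

No positions list is needed in this case, since the newlines are carried over by the slices themselves.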

Here is how I would do it. The code was written hastily, so it is rough:

import re

first_input = [
    (0, "Lorem ipsum dolor sit amet, consectetur"),
    (1, " adipiscing elit"),
    (0, ". In pellentesque\npharetra ex, at varius sem suscipit ac. "),
    (-1, "Suspendisse luctus\ncondimentum velit a laoreet. "),
    (0, "Donec dolor urna, tempus sed nulla vitae, dignissim varius neque.")
]
second_input = "Lorem ipsum dolor sit amet,\n consectetur adipiscing elit. In pellentesque\npharetra ex, at varius sem\n suscipit ac. Suspendisse luctus\ncondimentum velit a laoreet. Donec dolor urna, tempus sed\n nulla vitae, dignissim varius neque."

# Strip the newlines so each piece can be located in the combined string.
first_sanitized = [x[1].replace('\n', '') for x in first_input]
second_sanitized = second_input.replace('\n', '')
# Newline positions in newline-free coordinates: the k-th newline
# sits just before character (position - k) of second_sanitized.
newline_positions = [m.start() - k for k, m in enumerate(re.finditer('\n', second_input))]

new_output = []
search_from = 0  # where the next piece starts in second_sanitized
newlines_so_far = 0  # newlines already assigned to earlier pieces
for i, first_str in enumerate(first_sanitized):
    index = second_sanitized.index(first_str, search_from)
    search_from = index + len(first_str)
    number_of_newlines_in_between = sum(1 for x in newline_positions if index <= x < search_from)
    start = index + newlines_so_far  # shift back into second_input coordinates
    new_string = second_input[start:start + len(first_str) + number_of_newlines_in_between]
    newlines_so_far += number_of_newlines_in_between
    new_output.append((first_input[i][0], new_string))

OK, given that NO CHARACTERS are replaced or modified (as the OP stated), here is what I came up with:

first_input_no_newline = list(map(lambda x: (x[0], x[1].replace('\n', '')), first_input))

expected_output = []
for item in first_input_no_newline:
    next_index = len(item[1])

    second_input_copy = second_input
    offset = 0
    while True:
        amount = second_input_copy[:next_index].count("\n")
        if not amount:
            next_index += offset
            break
        offset += amount
        second_input_copy = second_input_copy.replace('\n', '', amount)

    expected_output.append((item[0], second_input[:next_index]))
    second_input = second_input[next_index:]

print(expected_output)

Explanation: you don't have to keep track of the newlines or anything like that. The newlines in first_input don't really matter either, because all of them are present in the second input (plus some more).

So just take the length of each item of first_input_no_newline. This would also be the length of the matching substring of second_input if that substring contained no newlines. If it does contain newlines, keep counting and removing them from a copy of second_input, then add the total as an offset when cutting the original second_input.

Input sample (the OP's original input, fixed by adding the missing whitespace characters between some phrases):

first_input = [
    (0, "Lorem ipsum dolor sit amet, consectetur"),
    (1, " adipiscing elit"),
    (0, ". In pellentesque\npharetra ex, at varius sem suscipit ac. "),
    (-1, "Suspendisse luctus\ncondimentum velit a laoreet. "),
    (0, "Donec dolor urna, tempus sed nulla vitae, dignissim varius neque.")
]

second_input = "Lorem ipsum dolor sit amet, \nconsectetur adipiscing elit. In pellentesque\npharetra ex, at varius sem \nsuscipit ac. Suspendisse luctus\ncondimentum velit a laoreet. Donec dolor urna, tempus sed \nnulla vitae, dignissim varius neque."

Output:

[
    (0, 'Lorem ipsum dolor sit amet, \nconsectetur'), 
    (1, ' adipiscing elit'), 
    (0, '. In pellentesque\npharetra ex, at varius sem \nsuscipit ac. '), 
    (-1, 'Suspendisse luctus\ncondimentum velit a laoreet. '), 
    (0, 'Donec dolor urna, tempus sed \nnulla vitae, dignissim varius neque.')
]
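
The counting idea described in this answer can also be written without consuming second_input, by keeping a cursor and widening each slice until it holds the expected number of non-newline characters. This is only a sketch: the function name resplit is made up, and it assumes, as stated above, that the only difference between the two inputs is added \n characters.

```python
def resplit(pieces, combined):
    """Re-cut `combined` so each slice equals one piece once newlines are ignored."""
    result = []
    pos = 0  # start of the current slice in `combined`
    for tag, text in pieces:
        bare = text.replace("\n", "")
        wanted = len(bare)  # characters the slice must hold, newlines excluded
        end = pos + wanted
        while True:
            newlines = combined.count("\n", pos, end)
            if end - pos - newlines == wanted:
                break
            end = pos + wanted + newlines  # widen the slice by the newlines seen so far
        if combined[pos:end].replace("\n", "") != bare:
            raise ValueError(f"piece {text!r} does not match at offset {pos}")
        result.append((tag, combined[pos:end]))
        pos = end
    return result

first_input = [
    (0, "Lorem ipsum dolor sit amet, consectetur"),
    (1, " adipiscing elit"),
    (0, ". In pellentesque\npharetra ex, at varius sem suscipit ac. "),
    (-1, "Suspendisse luctus\ncondimentum velit a laoreet. "),
    (0, "Donec dolor urna, tempus sed nulla vitae, dignissim varius neque."),
]
second_input = "Lorem ipsum dolor sit amet, \nconsectetur adipiscing elit. In pellentesque\npharetra ex, at varius sem \nsuscipit ac. Suspendisse luctus\ncondimentum velit a laoreet. Donec dolor urna, tempus sed \nnulla vitae, dignissim varius neque."

result = resplit(first_input, second_input)
```

A newline that falls exactly on a boundary ends up at the start of the following piece, which the question allows. Newlines after the last piece are left out by this sketch.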
