Eliminate overlap between two text blocks using python

I have two text files that overlap slightly, e.g.:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

As you can see, the last sentence of text1 and the first sentence of text2 overlap slightly. I would like to get rid of this overlap by deleting from text2 the words that also appear at the end of the last sentence of text1.

To do so, I can extract the last sentence of text1:

text1_last_sentence = list(filter(None,text1.split(".")))[-1]

And the first sentence of text2:

text2_first_sentence = text2.split(".")[0]

... but now the question is:

How do I find the part of the first sentence of text2 that should stay in text2, and then put everything back together?

EDIT 1 :

The expected output:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

EDIT 2 :

Here is the complete code:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy.""" 

text1_last_sentence = list(filter(None,text1.split(".")))[-1]
text2_first_sentence = text2.split(".")[0]

print(text1_last_sentence, "\n")
print(text2_first_sentence, "\n")

Output:

The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in

theory or investigate a phenomenon in greater detail

Here is a way to do it that finds the largest possible overlap:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

def remove_overlap(text1, text2):
    """Returns the part of text2 that doesn't overlap with text1"""

    words1 = text1.split()
    words2 = text2.split()

    # all appearances of the last word of text1 in text2
    last_word_appearances = [index for index, word in enumerate(words2) if word == words1[-1]]
    # try the latest appearance first, so we find the largest possible overlap
    for n in reversed(last_word_appearances):
        # are the first n+1 words of text2 the same as the last n+1 words of text1?
        if words2[:n+1] == words1[-(n+1):]:
            return ' '.join(words2[n+1:])
    else:
        # the loop ended without returning: no overlap found
        return text2


remove_overlap(text1, text2)
# 'greater detail.There are still some deficiencies in [...]
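Note that the word-based function rejoins the remaining words with single spaces, so any newlines or repeated spaces in text2 are normalized. If the original whitespace matters, a character-level variant is one option. The following is only a sketch: the name strip_overlap and the short sample strings are made up for illustration, and it assumes the overlap is an exact character-for-character match.

```python
def strip_overlap(text1, text2):
    """Return text2 without the longest prefix that is also a suffix of text1."""
    # Try the longest candidate first so the largest overlap wins.
    for size in range(min(len(text1), len(text2)), 0, -1):
        if text1.endswith(text2[:size]):
            return text2[size:]
    # No overlap found: text2 is returned unchanged.
    return text2


a = "investigate a phenomenon in"
b = "a phenomenon in greater detail."
print(repr(strip_overlap(a, b)))  # ' greater detail.'
```

Because it compares characters rather than words, this version can also match a partial word (for example a single shared letter), so it may be worth requiring a minimum overlap length in practice.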

This is a bit hacky, but it works:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy.""" 

text1_ls = list(filter(None,text1.split(".")))[-1]
text2_fs = text2.split(".")[0]

temp2 = text2_fs.split(" ")

for i in range(1, len(temp2)):  
    if " ".join(temp2[:i]) not in text1_ls:
        text2_fs = " ".join(temp2[(i - 1):])
        break

print(text1_ls, "\n")
print(text2_fs, "\n")

Basically, you take larger and larger prefixes of text2_fs until one is no longer a substring of text1_ls. That tells you that the last word of that prefix is the first word that is not part of the overlap, so everything from that word onward is kept.
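The snippet above only prints the trimmed first sentence; the question also asks how to put everything back together. One possible way, sketched here with a hypothetical helper rebuild_text2 and a shortened sample text, is to swap the trimmed sentence in for the original first sentence using str.partition:

```python
def rebuild_text2(text2, trimmed_first_sentence):
    """Swap the first sentence of text2 for its trimmed version."""
    # partition splits at the first "." only; sep is "" if there is no "."
    _first, sep, rest = text2.partition(".")
    return trimmed_first_sentence + sep + rest


text2 = "theory or investigate a phenomenon in greater detail.There are still some deficiencies."
print(rebuild_text2(text2, "greater detail"))
# greater detail.There are still some deficiencies.
```

This assumes, like the rest of this answer, that the first sentence ends at the first "." in text2.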

This might not address all corner cases, but it works for the text in the question:

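# first word of text2; its last occurrence in text1 is assumed to start the overlap
first_word_text2 = text2.split()[0]
# length of the overlapping tail of text1
pos = len(text1) - text1.rfind(first_word_text2)
print(text2[pos:].strip())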
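Two of the corner cases are easy to guard against: rfind returns -1 when the first word of text2 does not occur in text1, and even when it does occur, the tail of text1 may not actually match the start of text2. A guarded sketch of the same idea could look like this (the function name drop_overlap_by_first_word and the shortened sample strings are illustrative only):

```python
def drop_overlap_by_first_word(text1, text2):
    """Remove from text2 the tail of text1 that starts at text2's first word."""
    words = text2.split()
    if not words:
        return text2
    start = text1.rfind(words[0])
    if start == -1:
        # The first word of text2 never occurs in text1: nothing to remove.
        return text2
    tail = text1[start:]
    if not text2.startswith(tail):
        # The candidate tail does not match the start of text2: no overlap.
        return text2
    return text2[len(tail):].strip()


t1 = "a proposed theory or investigate a phenomenon in"
t2 = "theory or investigate a phenomenon in greater detail."
print(drop_overlap_by_first_word(t1, t2))  # greater detail.
```

It still looks only at the last occurrence of the first word, so an overlap in which that word is repeated will not be fully removed.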
