How to remove a repetitive pattern with the same beginning and ending in text files

Question

I am working with multiple txt files that contain repetitive sentences with the format below:

"[TEXT1]File. Title:[TEXT2]____________ [TEXT3]File. Title:[TEXT4]____________[TEXT5]"

*TEXT: contains words, \n, \t, numbers, spaces, and punctuation

I want to remove all sentences that start with 'File. Title:' and end with '____________' from the text. This is the output I'm looking for:

"[TEXT1][TEXT3][TEXT5]"

The actual text looks like:

"xxxxx \n\t, \tFile. Title:\tVersion 2.0\t\n____________"

Unfortunately, the code I used removed everything between the first occurrence of "File. Title" and the last occurrence of "Version 2.0". I'm wondering if there's a solution that can better solve my problem?

Here's the code that I used.

text = re.sub('File. Title:\s.*\sVersion 2.0','',text, flags = re.DOTALL)
text = text.replace("____________", "")

Thank you!

Answer 1

s = "[TEXT1]File. Title:[TEXT2]____________[TEXT3]File. Title:[TEXT4]____________[TEXT5]"

def filter_texts(s):
    start = 'File. Title:'
    end = '____________'
    s2 = s.replace(start, f'splitmarker{start}').replace(end, f'{end}splitmarker')
    s2 = s2.split('splitmarker')
    s2 = filter(lambda ss: not (ss.startswith(start) and ss.endswith(end)), s2)
    s2 = ''.join(s2)
    return s2

print(filter_texts(s))

prints

[TEXT1][TEXT3][TEXT5]

This code replaces each start marker (i.e. 'File. Title:') with the split marker concatenated with the start marker, and each end marker (i.e. '____________') with the end marker concatenated with the split marker. The split marker is simply a string that (hopefully) does not occur anywhere else in the text, here set to 'splitmarker'. When the string is then split on the split marker, the resulting list of texts can be filtered directly by the desired condition (i.e. keep a text if it does not start with the start marker or does not end with the end marker). Note that while this does the trick, there are probably much more elegant solutions.
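To make the splitting step easier to follow, here is a small sketch that exposes the intermediate list before filtering. The variable names (marker, pieces, kept) are only illustrative, and it assumes, as above, that 'splitmarker' never appears in the real text.

```python
start = 'File. Title:'
end = '____________'
marker = 'splitmarker'  # assumption: this string never occurs in the input

s = "[TEXT1]File. Title:[TEXT2]____________[TEXT3]File. Title:[TEXT4]____________[TEXT5]"

# Put the marker in front of every start and behind every end,
# so each unwanted block becomes its own piece after splitting.
marked = s.replace(start, marker + start).replace(end, end + marker)
pieces = marked.split(marker)

# Drop only the pieces that both start with `start` and end with `end`.
kept = [p for p in pieces if not (p.startswith(start) and p.endswith(end))]
result = ''.join(kept)

print(pieces)
print(result)
```

With the sample string, pieces alternates between the text to keep and the 'File. Title:...____________' blocks, which is why a plain startswith/endswith check is enough to filter them.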

Answer 2

You can try using a regular expression to identify such lines, then split them.
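One possible reading of this suggestion, as a rough sketch rather than a definitive fix: match each block with a non-greedy quantifier (.*?) so the match stops at the first '____________' after each 'File. Title:' instead of running to the last one, which is what the greedy .* in the question does. The function name remove_blocks is made up for illustration, and the sketch assumes every 'File. Title:' is followed by a closing '____________'.

```python
import re

START = 'File. Title:'
END = '____________'

# re.escape makes the '.' in START literal; '.*?' is non-greedy, and
# re.DOTALL lets '.' match the \n and \t that occur inside a block.
BLOCK = re.compile(re.escape(START) + r'.*?' + re.escape(END), flags=re.DOTALL)

def remove_blocks(text):
    """Remove every START ... END block, leaving the surrounding text."""
    return BLOCK.sub('', text)

sample = "[TEXT1]File. Title:[TEXT2]____________[TEXT3]File. Title:[TEXT4]____________[TEXT5]"
print(remove_blocks(sample))
```

Because the end marker is part of the pattern, the separate text.replace("____________", "") call from the question is no longer needed. If the underscore run in the real files can be longer than the marker shown here, the extra underscores would be left behind, so the pattern may need adjusting.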
