
How to remove a repetitive pattern with the same beginning and ending in text files

I am working with multiple txt files that contain repetitive sentences with the format below:

"[TEXT1]File. Title:[TEXT2]____________[TEXT3]File. Title:[TEXT4]____________[TEXT5]"

*TEXT: contains words, \n, \t, numbers, spaces, and punctuation

I want to remove every sentence that starts with 'File. Title:' and ends with '____________' from the text. This is the output I'm looking for:

"[TEXT1][TEXT3][TEXT5]"

The actual text looks like:

"xxxxx \n\t, \tFile. Title:\tVersion 2.0\t\n____________"

Unfortunately, the code I used removed everything between the first occurrence of "File. Title" and the last occurrence of "Version 2.0". I'm wondering if there's a solution that can better solve my problem?

Here's the code that I used.

import re

text = re.sub(r'File. Title:\s.*\sVersion 2.0', '', text, flags=re.DOTALL)
text = text.replace("____________", "")

Thank you!

s = "[TEXT1]File. Title:[TEXT2]____________[TEXT3]File. Title:[TEXT4]____________[TEXT5]"

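def filter_texts(s):
    start = 'File. Title:'
    end = '____________'
    # Put a split marker before every start and after every end.
    s2 = s.replace(start, f'splitmarker{start}').replace(end, f'{end}splitmarker')
    # Splitting on the marker isolates each start...end block as its own piece.
    s2 = s2.split('splitmarker')
    # Drop the pieces that both start with `start` and end with `end`.
    s2 = filter(lambda ss: not (ss.startswith(start) and ss.endswith(end)), s2)
    s2 = ''.join(s2)
    return s2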

print(filter_texts(s))

prints

[TEXT1][TEXT3][TEXT5]

This code replaces each start marker ('File. Title:') with the split marker followed by the start marker, and each end marker ('____________') with the end marker followed by the split marker. The split marker is simply a string that (hopefully) does not occur anywhere else in the text, here set to 'splitmarker'. When the string is then split on the split marker, every start-to-end block becomes its own list element, so the resulting list can be filtered directly by the desired condition: keep a piece if it does not start with the start marker or does not end with the end marker. Note that while this does the trick, there probably exist much more elegant solutions.
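If the text might itself contain the string 'splitmarker', one possible variation is to skip the marker and scan for the delimiters with str.find instead. This is only a sketch of a different technique, not the code above; the function name strip_blocks and its parameters are made up for illustration.

```python
def strip_blocks(text, start="File. Title:", end="____________"):
    """Remove every start...end block by scanning with str.find.

    Hypothetical helper: no marker string is inserted into the text.
    """
    kept = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break  # no more blocks
        j = text.find(end, i + len(start))
        if j == -1:
            break  # unterminated block: leave the rest of the text untouched
        kept.append(text[pos:i])   # keep the text before the block
        pos = j + len(end)         # resume right after the end delimiter
    kept.append(text[pos:])        # keep whatever follows the last block
    return "".join(kept)


s = "[TEXT1]File. Title:[TEXT2]____________[TEXT3]File. Title:[TEXT4]____________[TEXT5]"
print(strip_blocks(s))  # [TEXT1][TEXT3][TEXT5]
```

A 'File. Title:' that is never followed by '____________' is left in place rather than deleted.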

You can try using a regular expression to identify such lines, then split them.
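As a rough sketch of what such a regular expression could look like: the code in the question removes too much because .* is greedy, so a non-greedy .*? between the two delimiters is one way around that. The names START, END, PATTERN and remove_blocks below are illustrative and not from the question.

```python
import re

# Hypothetical names, chosen only for this example.
START = "File. Title:"
END = "____________"

# re.escape makes the "." in START a literal dot.
# ".*?" is non-greedy, so each match stops at the nearest END.
# re.DOTALL lets "." match newlines, since the blocks contain \n and \t.
PATTERN = re.compile(re.escape(START) + r".*?" + re.escape(END), flags=re.DOTALL)


def remove_blocks(text):
    """Delete every START...END block, delimiters included."""
    return PATTERN.sub("", text)


s = "[TEXT1]File. Title:[TEXT2]____________[TEXT3]File. Title:[TEXT4]____________[TEXT5]"
print(remove_blocks(s))  # [TEXT1][TEXT3][TEXT5]
```

This assumes the closing line is exactly twelve underscores, as in the question; if the real files use a longer run, the extra underscores would be left behind and the pattern would need adjusting.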

