I am trying to use a regex to remove a word (ORIGIN), whitespace, special characters and standalone numbers, but not digits that are part of a word/string. E.g.:
ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//
My pattern removes every digit, including the 1 in malwmrllp1:
import re
text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d+\\b\s*$+\sORIGIN$\W+]', '', text_file)
print(new_txt, len(new_txt))
My output is:
malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109
The desired output is: malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
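Depending on whether you want underscores kept in the result, try re.findall with raw-string notation. Your current pattern is a single character class, so it removes individual characters (every digit included) rather than whole tokens. Match the tokens you want to keep instead: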
\b(?!(?:ORIGIN|[_\d]+)\b)\w+
See an online demo
\b - Word boundary;
(?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with a nested non-capture group that matches either ORIGIN or 1+ underscores/digits before a trailing word boundary;
\w+ - 1+ word characters.
import re
text_file = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//"""
new_txt=''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))
Prints:
malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
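To illustrate why the character class in the question drops the 1 in malwmrllp1, here is a small sketch. The `sample` string is a shortened excerpt of the question's input, and the variable names are illustrative only.

```python
import re

# Shortened excerpt of the question's input (an assumption for brevity).
sample = "ORIGIN\n1 malwmrllp1 lallalwgpd\n//"

# Inside [...] each item is one candidate character, so this class means
# "a digit, whitespace, a non-word character, or any of + * $ O R I G N".
# Note that \b inside a class is a backspace, not a word boundary.
char_class = r'[\b\d+\b\s*$+\sORIGIN$\W+]'
stripped = re.sub(char_class, '', sample)
print(stripped)  # malwmrllplallalwgpd

# Outside a class, \b\d+\b only matches digit runs that form a whole token,
# so the 1 at the end of malwmrllp1 is left alone.
standalone_digits = r'\b\d+\b'
without_numbers = re.sub(standalone_digits, '', sample)
print(repr(without_numbers))
```

The same class would also delete the letters O, R, I, G and N from any other uppercase text, which is another reason to match whole tokens instead.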
Using RE for this is an interesting academic exercise but extending the functionality is fraught with danger unless one is very familiar with the technique.
This answer may look long-winded but you should be able to see how easy it would be to extend it so that other tokens/patterns can be excluded or included. It's also readily maintainable because anyone else having to modify the code isn't going to get a migraine while trying to figure out how the RE works.
FILENAME = 'mytext.txt'
def keep(t):
    if t.isdigit() or t == 'ORIGIN' or t == '//':
        return False
    return True

with open(FILENAME) as f:
    new_txt = ''.join(filter(keep, f.read().split()))
print(new_txt, len(new_txt))
Output:
malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
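As a rough sketch of how this could be extended, the excluded tokens might live in a set so that adding another marker is a one-line change. `EXCLUDED_TOKENS` and `clean` are hypothetical names, and the input is inlined here instead of being read from mytext.txt so the example runs on its own.

```python
# Hypothetical set of literal tokens to drop; extend it as needed.
EXCLUDED_TOKENS = {"ORIGIN", "//"}

def keep(token):
    """Keep a token unless it is purely numeric or explicitly excluded."""
    return not (token.isdigit() or token in EXCLUDED_TOKENS)

def clean(text):
    """Split on whitespace, drop unwanted tokens, and join the rest."""
    return "".join(filter(keep, text.split()))

# Inlined copy of the question's sample data (assumption: same content as the file).
text = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//"""

new_txt = clean(text)
print(new_txt, len(new_txt))
```

Because the text is split on whitespace first, a token such as malwmrllp1 is tested as a whole, so its trailing digit is never considered for removal.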
Another idea:
new_txt = re.sub('[\\W_]+|\\b(?:\\d+|ORIGIN)\\b', '', text_file)
This strips every run of non-word characters and underscores, as well as any run of digits or the word "ORIGIN" that stands between word boundaries.
See this demo at tio.run (the regex is very basic; explanation at regex101).
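For comparison, here is a hedged sketch that runs this substitution next to the re.findall pattern from the earlier answer, using raw strings for both. The constant names and the `probe` string are illustrative assumptions, not part of either answer.

```python
import re

text = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//"""

# Same patterns as in the answers, written as raw strings.
SUB_PATTERN = r'[\W_]+|\b(?:\d+|ORIGIN)\b'
FINDALL_PATTERN = r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+'

via_sub = re.sub(SUB_PATTERN, '', text)
via_findall = ''.join(re.findall(FINDALL_PATTERN, text))
print(via_sub == via_findall, len(via_sub))  # True 110

# The two approaches differ once a kept word contains an underscore.
probe = "ab_c 12 x"
sub_probe = re.sub(SUB_PATTERN, '', probe)
findall_probe = ''.join(re.findall(FINDALL_PATTERN, probe))
print(sub_probe)      # abcx
print(findall_probe)  # ab_cx
```

On the sample data both give the same 110-character string; which one to prefer depends on whether underscores inside words should survive.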