简体   繁体   中英

Remove all numbers except for the ones combined to string using python regex

Trying to use a regex function to remove a word, whitespaces, special characters and numbers but not the one combined with to a word/string. Eg

ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//

The \W+ removes all numbers including 1 in malwmrll1

import re

text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d+\\b\s*$+\sORIGIN$\W+]', '', text_file)

print(new_txt, len(new_txt))

My output is:

malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109

The desired output should be: malwmrll1plallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

Right, depending on your desired result showing underscores at all or not, try to use re.findall and raw-string notation. You currently use a character class that makes no sense:


\b(?!(?:ORIGIN|[_\d]+)\b)\w+

See an online demo


  • \b - Word-boundary;
  • (?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with nested non-capture group to match either ORIGIN or 1+ underscore/digit combinations before a trailing word-boundary;
  • \w+ - 1+ word-characters.

import re
  
text_file = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

new_txt=''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))    
print(new_txt, len(new_txt))

Prints:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

Using RE for this is an interesting academic exercise but extending the functionality is fraught with danger unless one is very familiar with the technique.

This answer may look long-winded but you should be able to see how easy it would be to extend it so that other tokens/patterns can be excluded or included. It's also readily maintainable because anyone else having to modify the code isn't going to get a migraine while trying to figure out how the RE works.

FILENAME = 'mytext.txt'

def keep(t):
    if t.isdigit() or t == 'ORIGIN' or t == '//':
        return False
    return True

with open(FILENAME) as f:
    new_txt = ''.join(filter(keep, f.read().split()))
    print(new_txt, len(new_txt))

Output:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

Another idea:

new_txt = re.sub('[\\W_]+|\\b(?:\\d+|ORIGIN)\\b', '', text_file)

Strip out all non word characters + underscore OR digits / "ORIGIN" between word boundaries .

See this demo at tio.run (the regex is very basic, explanation at regex101 )

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM