
How to exclude some words from text with a regular expression?

I want a regular expression that finds all Latin-script (English) words except a few special ones. That is, I want to delete every Latin-script word from my text except "@USER", "@USERS" and "http". For example, this sentence:

"Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."

should become:

"سلام. من حسین هستم. @USER @USERS http."

I tried this code, but it doesn't work:

import re

def remove_eng(sents):
    new_list = []
    for string in sents:
        string = str(string)
        # Blank out every whitespace-separated word that matches the pattern
        new_list.append(' '.join(re.sub(r'^[a-zA-Z].*[a-zA-Z].?$', r'', w)
                                 for w in string.split()))
    return new_list

The output looks like this:

[' سلام. من حسین هستم. @USER @USERS    a  ']

I don't know how to exclude '@USER', '@USERS' and 'http'. Could anyone help me? Thanks.

Use a normal for loop instead of a list comprehension; then you can use if/else to skip the excluded words before you apply the regex.

import re

def remove_eng(sents):
    new_list = []

    for old_string in sents:
        old_words = old_string.split()
        new_words = []
        
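        for word in old_words:
            # Keep the protected tokens untouched
            if word in ('@USER', '@USERS', 'http'):
                new_words.append(word)
            else:
                # Strip a leading run of ASCII letters and . ? ! from the word
                result = re.sub(r'^[a-zA-Z.?!]+', r'', word)
                if result:  # skip words that became empty
                    new_words.append(result)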
                    
        new_string = " ".join(new_words)
        new_list.append(new_string)

    return new_list

# --- main ---

data = ["Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."]

print(remove_eng(data))

Result:

['سلام. من حسین هستم. @USER @USERS http']
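
If you would rather do it in one pass with a single regular expression, as the question asks, a negative lookahead can protect the special tokens. This is only a sketch under my own assumptions: the names remove_eng_regex, KEEP and LATIN_TOKEN are made up for illustration, the set of trailing punctuation is a guess, and only tokens made up purely of ASCII letters (plus that punctuation) are removed.

```python
import re

# Tokens to keep even though they are written in Latin letters (assumed list).
KEEP = ("@USER", "@USERS", "http")

# Alternation of the protected tokens, escaped so they are matched literally.
keep_alt = "|".join(re.escape(w) for w in KEEP)

LATIN_TOKEN = re.compile(
    r"(?<!\S)"                     # token starts at string start or after whitespace
    rf"(?!(?:{keep_alt})(?!\S))"   # the whole token is not one of the protected ones
    r"[A-Za-z]+[.?!,;:]*"          # ASCII letters plus optional trailing punctuation
    r"(?!\S)\s*"                   # token ends here; also swallow the following whitespace
)

def remove_eng_regex(sents):
    # Delete every matching token, then trim whitespace left at the edges.
    return [LATIN_TOKEN.sub("", str(s)).strip() for s in sents]

data = ["Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."]
print(remove_eng_regex(data))
```

For the sample sentence this prints the same list as the loop version above. Unlike the loop, a token that mixes letters with other characters (for example "http://example.com" or "Hello,world") is left as it is, because the pattern only matches when the whole whitespace-separated token is Latin letters followed by punctuation.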


 