Strip punctuation with regex - python

I need to use regex to strip punctuation at the start and end of a word. It seems like regex would be the best option for this. I don't want punctuation removed from words like 'you're', which is why I'm not using .replace().

You don't need a regular expression for this task. Use str.strip with string.punctuation:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> '!Hello.'.strip(string.punctuation)
'Hello'

>>> ' '.join(word.strip(string.punctuation) for word in "Hello, world. I'm a boy, you're a girl.".split())
"Hello world I'm a boy you're a girl"
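
One caveat with the one-liner above: a token made up only of punctuation (such as "..." or "--") strips down to an empty string, which leaves a doubled space after joining. A minimal sketch that filters those out, assuming whitespace-separated tokens; the helper name strip_words is hypothetical and not part of the original answer:

```python
import string

def strip_words(text):
    """Strip leading/trailing punctuation from each whitespace-separated token."""
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop tokens that consisted only of punctuation and are now empty.
    return [word for word in stripped if word]

words = strip_words("Wait ... you're here -- really?!")
print(words)
```

Because str.strip only touches the two ends of each token, the apostrophe inside "you're" is preserved.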

I think this function is a helpful and concise way to remove punctuation. Note that it expects a list of words and removes punctuation everywhere, including inside words:

import re
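def remove_punct(words):
    # `words` is expected to be an iterable of strings, e.g. text.split()
    new_words = []
    for word in words:
        w = re.sub(r'[^\w\s]', '', word)  # remove everything except word characters and whitespace
        w = re.sub(r'_', '', w)  # \w also matches '_', so remove underscores separately
        new_words.append(w)
    return new_words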
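
The two substitutions can also be folded into a single precompiled pattern. This is only a sketch of the same idea; the names NON_WORD and remove_punct_single_pass are made up for illustration:

```python
import re

# Anything that is not a word character or whitespace, plus the underscore
# (which \w would otherwise keep).
NON_WORD = re.compile(r"[^\w\s]|_")

def remove_punct_single_pass(words):
    """Remove punctuation and underscores from every string in `words`."""
    return [NON_WORD.sub("", word) for word in words]

cleaned = remove_punct_single_pass(["snake_case", "you're", "(done)"])
print(cleaned)
```

Like the function above, this also removes the apostrophe in "you're", so it does not meet the original requirement of touching only the start and end of a word.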

If you insist on using regex, I recommend this solution. Be aware that it removes every punctuation character, including the apostrophe in "he's":

import re
import string
p = re.compile("[" + re.escape(string.punctuation) + "]")
print(p.sub("", "\"hello world!\", he's told me."))
### hello world hes told me
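
To keep apostrophes inside words, as the question asks, the character class can be anchored to the start and end of each token. A minimal sketch, assuming tokens are separated by whitespace; EDGE_PUNCT and strip_edges are hypothetical names:

```python
import re
import string

_PUNCT = re.escape(string.punctuation)
# Match runs of punctuation only at the very start or the very end of a token.
EDGE_PUNCT = re.compile(rf"^[{_PUNCT}]+|[{_PUNCT}]+$")

def strip_edges(word):
    """Remove leading and trailing punctuation, leaving inner characters alone."""
    return EDGE_PUNCT.sub("", word)

sentence = "\"Hello, world!\", he's told me."
result = " ".join(strip_edges(word) for word in sentence.split())
print(result)
```

This gives the same result as word.strip(string.punctuation) for each token, so the regex is only worth it if you need to extend the pattern later.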

Note also that you can pass your own punctuation marks:

my_punct = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '.',
           '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', 
           '`', '{', '|', '}', '~', '»', '«', '“', '”']

punct_pattern = re.compile("[" + re.escape("".join(my_punct)) + "]")
re.sub(punct_pattern, "", "I've been vaccinated against *covid-19*!") # the "-" symbol should remain
### Ive been vaccinated against covid-19
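
The same construction can be wrapped in a small helper so that different character sets are easy to compare. This is a sketch only; make_punct_pattern is a hypothetical name, and it assumes a non-empty collection of single characters:

```python
import re

def make_punct_pattern(chars):
    """Compile a character class matching any one of the given characters."""
    # re.escape keeps characters such as ']', '^', '\\' and '-' literal
    # inside the square brackets.
    return re.compile("[" + re.escape("".join(chars)) + "]")

text = "I've been vaccinated against *covid-19*!"
no_hyphen = make_punct_pattern("!*'")      # hyphen is not in the set, so it stays
with_hyphen = make_punct_pattern("!*'-")   # hyphen is in the set, so it is removed

print(no_hyphen.sub("", text))
print(with_hyphen.sub("", text))
```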

You can remove punctuation from the words of a text file, or of a particular string, as follows:

new_data = []
with open('/home/rahul/align.txt', 'r') as f:
    all_words = f.read().split()

# You can add and remove punctuation characters as per your choice.
# The raw string keeps the backslash literal.
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# To clean a string instead of a file, set all_words = new_str.split(),
# e.g. new_str = "My name$#@ is . rahul -~"
for word in all_words:
    # Strip punctuation from both ends of the word
    stripped = word.strip(punctuations)
    # Skip tokens that consisted only of punctuation
    if stripped:
        new_data.append(stripped)

print(new_data)
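
For the in-memory string mentioned in the comments above, a minimal sketch might look like this. It assumes that stripping punctuation from the ends of each token is what is wanted, and the variable names words and cleaned are illustrative:

```python
# Raw string so the backslash is kept as a literal character.
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

new_str = "My name$#@ is . rahul -~"

# Strip punctuation from both ends of every whitespace-separated token.
words = [word.strip(punctuations) for word in new_str.split()]
# Drop tokens that were nothing but punctuation, then rebuild the string.
cleaned = " ".join(word for word in words if word)

print(cleaned)
```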

PS - Fix the indentation as required. Hope this helps!!

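Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the original source. For any questions, contact yoyou2525@163.com.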
