Best way to strip punctuation from a string

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?

From an efficiency perspective, you're not going to beat this (Python 2):

s.translate(None, string.punctuation)

In Python 3, use the following code instead:

s.translate(str.maketrans('', '', string.punctuation))

It performs raw string operations in C with a lookup table. Short of writing your own C code, there's not much that will beat that.
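For reuse, the Python 3 form can be wrapped in a small helper that builds the table once. This is only a sketch: the names strip_punctuation and _PUNCT_TABLE are made up for illustration, and it covers ASCII punctuation only, since that is all string.punctuation contains.

```python
import string

# Build the deletion table once; the third argument to str.maketrans
# lists the characters that translate() should delete.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("string. With. Punctuation?"))  # string With Punctuation
```

Building the table at import time keeps the per-call cost down to the translate() call itself.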

If speed isn't a worry, another option is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as you can see from the timings below. For this type of problem, doing it at as low a level as possible pays off.

Timing code:

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)
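One detail worth knowing about this pattern: \w matches letters, digits, and the underscore, so underscores survive. The sketch below shows the difference with a sample string made up for illustration; the variable names are likewise just for the example.

```python
import re

s = "snake_case, string. With. Punctuation?"

# \w includes "_", so the underscore is kept here.
keeps_underscore = re.sub(r"[^\w\s]", "", s)

# Adding "_" as an alternative removes it as well.
drops_underscore = re.sub(r"[^\w\s]|_", "", s)

print(keeps_underscore)  # snake_case string With Punctuation
print(drops_underscore)  # snakecase string With Punctuation
```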

To make this easy to use, here is a summary of how to strip punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for a detailed description.

Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

myString.translate(None, string.punctuation)  # Python 2 str only

I usually use something like this:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s

You can generalize and strip other types of characters as well:

''.join(ch for ch in s if category(ch)[0] not in 'SP')

This also strips characters like ~*+§$, which may or may not be "punctuation" depending on one's point of view.
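A Python 3 sketch of the same idea, with the set of major categories passed as a parameter. The function name strip_categories and its major_classes argument are hypothetical, chosen only for this example.

```python
from unicodedata import category

def strip_categories(text, major_classes="P"):
    """Drop characters whose Unicode general category starts with one of major_classes."""
    return "".join(ch for ch in text if category(ch)[0] not in major_classes)

sample = "String — with «quotes» and $ymbols..."
print(strip_categories(sample))        # punctuation removed, "$" kept
print(strip_categories(sample, "SP"))  # punctuation and symbols removed
```

Passing "P" removes punctuation only, while "SP" also removes symbols such as currency signs.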

Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; code points (integers) are looked up in that mapping, and anything mapped to None is removed.

To remove (some?) punctuation then, use:

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping, setting all values to None based on the sequence of keys.
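To make the shape of that mapping concrete, here is a short Python 3 sketch. It also checks that the three-argument form of str.maketrans produces the same table; the sample string and the name same_table are only for illustration.

```python
import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))

# Keys are integer code points; every value is None, meaning "delete".
print(remove_punct_map[ord("?")])  # None

# The three-argument form of str.maketrans builds the same mapping.
same_table = str.maketrans("", "", string.punctuation)
print(same_table == remove_punct_map)  # True

result = "string. With. Punctuation?".translate(remove_punct_map)
print(result)  # string With Punctuation
```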

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see JF Sebastian's answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))
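A usage sketch for that table in Python 3, with a sample string made up for illustration. The range end here is sys.maxunicode + 1 so that the last code point is scanned too. Building the table walks every code point, so it is worth doing once and reusing.

```python
import sys
import unicodedata

# Every code point whose category starts with "P" maps to None,
# which str.translate treats as "delete".
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

s = "String — with «punctuation»… and 「brackets」!"
stripped = s.translate(remove_punct_map)

# Removing a dash that had spaces on both sides leaves a double space,
# so collapse whitespace afterwards if that matters.
print(" ".join(stripped.split()))
```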

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex
s = u"string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

Personally, I believe this is the best way to remove punctuation from a string in Python because:

  • It removes all Unicode punctuation.
  • It's easily modifiable, e.g. you can remove the \p{S} if you want to remove punctuation but keep symbols like $.
  • You can get really specific about what you want to keep and what you want to remove; for example, \p{Pd} will only remove dashes.
  • This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.
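If installing the third-party regex module is not an option, roughly the same effect can be sketched with the standard library's unicodedata module. This is an approximation in Python 3, not the regex above: the function name replace_non_text and its drop parameter are invented for this example, and the default "CMPSZ" mirrors the five major categories used in the pattern.

```python
import unicodedata

def replace_non_text(text, drop="CMPSZ"):
    """Replace characters in the given major categories with single spaces."""
    spaced = "".join(
        " " if unicodedata.category(ch)[0] in drop else ch
        for ch in text
    )
    # split() with no arguments collapses runs of spaces and trims both ends.
    return " ".join(spaced.split())

s = "string. With.\tSome・Really Weird、Non?ASCII。 「(Punctuation)」?"
print(replace_non_text(s))
```

Dropping a letter from the drop string keeps that class, for example "CMPZ" leaves symbols such as $ in place.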

I haven't seen this answer yet. Just use a regex; it removes every character that is not a word character (\w), a digit (\d), or a whitespace character (\s):

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^\w\d\s]+', '', s)

Here's a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

This might not be the best solution, but this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

Here is a function I wrote. It's not very efficient, but it is simple, and you can add or remove any punctuation that you desire:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # Rebuild the list once per punctuation mark.
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList

import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

Just as an update, I rewrote the @Brian example in Python 3 and moved the regex compile step inside the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing, can't share a regex object between your workers, and need to run the re.compile step at each worker. Also, I was curious to time two different implementations of maketrans for Python 3:

table = str.maketrans({key: None for key in string.punctuation})

vs

table = str.maketrans('', '', string.punctuation)

Plus I added another method that uses a set, where I take advantage of the intersection function to reduce the number of iterations.

This is the complete code:

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565

In less strict cases, a one-liner may help:

''.join([c for c in s if c.isalnum() or c.isspace()])

>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']

Here's a solution without regex.

import string

input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))    
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()

Output>> where and or then
  • Replaces the punctuation with spaces
  • Replaces multiple spaces in between words with a single space
  • Removes the trailing spaces, if any, with strip()
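The code above is Python 2 (string.maketrans and the print statement). The same steps can be sketched in Python 3 roughly as follows; the variable name cleaned is just for this example.

```python
import string

input_text = "!where??and!!or$$then:)"

# Map every ASCII punctuation character to a space.
punctuation_replacer = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)

# split() with no arguments collapses runs of spaces and drops
# leading and trailing ones, so no separate strip() is needed.
cleaned = " ".join(input_text.translate(punctuation_replacer).split())
print(cleaned)  # where and or then
```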

Why do none of you use this?

 ''.join(filter(str.isalnum, s)) 

Too slow?
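One thing to watch with this filter: str.isalnum is False for whitespace, so the spaces between words disappear too. A short Python 3 sketch of the difference, with variable names chosen only for illustration:

```python
s = "string. With. Punctuation?"

# str.isalnum is False for spaces, so the words run together.
joined = "".join(filter(str.isalnum, s))
print(joined)  # stringWithPunctuation

# Keeping whitespace as well preserves the word boundaries.
spaced = "".join(ch for ch in s if ch.isalnum() or ch.isspace())
print(spaced)  # string With Punctuation
```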

# FIRST METHOD
# Storing all punctuation characters in a variable
punctuation = '!?,.:;"\')(_-'
newstring = ''  # Creating empty string
word = raw_input("Enter string: ")
for i in word:
    if i not in punctuation:
        newstring += i
print("The string without punctuation is: " + newstring)

# SECOND METHOD
word = raw_input("Enter string: ")
punctuation = '!?,.:;"\')(_-'
newstring = word.translate(None, punctuation)
print("The string without punctuation is: " + newstring)


# Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is: hello welcome topythonprogramminglanguage

with open('one.txt', 'r') as myFile:
    str1 = myFile.read()
    print(str1)

punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]

# Replace each punctuation mark with a space.
for i in punctuation:
    str1 = str1.replace(i, " ")

# Split into words once, after all replacements are done.
myList = str1.split(" ")
print(str1)
for i in myList:
    print(i, end='\n')
    print("____________")

Here's one other easy way to do it using RegEx:

import re

punct = re.compile(r'(\w+)')

sentence = 'This ! is : a # sample $ sentence.' # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence) 
'This is a sample sentence'

Try this :)

import regex
regex.sub(r'\p{P}','', s)

I was looking for a really simple solution. Here's what I got:

import re 

s = "string. With. Punctuation?" 
s = re.sub(r'[\W\s]', ' ', s)

print(s)
'string  With  Punctuation '

Apparently I can't supply edits to the selected answer, so here's an update which works for Python 3. The translate approach is still the most efficient option when doing non-trivial transformations.

Credit for the original heavy lifting goes to @Brian above. And thanks to @ddejohn for his excellent suggestion for improving the original test.

#!/usr/bin/env python3

"""Determination of most efficient way to remove punctuation in Python 3.

Results in Python 3.8.10 on my system using the default arguments:

set       : 51.897
regex     : 17.901
translate :  2.059
replace   : 13.209
"""

import argparse
import re
import string
import timeit

parser = argparse.ArgumentParser()
parser.add_argument("--filename", "-f", default=argparse.__file__)
parser.add_argument("--iterations", "-i", type=int, default=10000)
opts = parser.parse_args()
with open(opts.filename) as fp:
    s = fp.read()
exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile(f"[{re.escape(string.punctuation)}]")

def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)

def test_regex(s):  # From Vinko's solution, with fix.
    return regex.sub("", s)

def test_translate(s):
    return s.translate(table)

def test_replace(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

opts = dict(globals=globals(), number=opts.iterations)
solutions = "set", "regex", "translate", "replace"
for solution in solutions:
    elapsed = timeit.timeit(f"test_{solution}(s)", **opts)
    print(f"{solution:<10}: {elapsed:6.3f}")

Considering Unicode. Code checked in Python 3.

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))

You can also do this:

import string
' '.join(word.strip(string.punctuation) for word in 'text'.split())
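Note that str.strip only trims characters from the two ends of each word, so punctuation inside a word stays. A small sketch with a sample sentence made up for illustration:

```python
import string

text = "Hello, world! It's a well-known (fact)."

# Only leading and trailing punctuation of each word is removed.
edges_only = " ".join(word.strip(string.punctuation) for word in text.split())
print(edges_only)  # Hello world It's a well-known fact
```

Whether keeping the apostrophe and hyphen is desirable depends on what the text is used for afterwards.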

The question does not have a lot of specifics, so the approach I took is to come up with a solution with the simplest interpretation of the problem: just remove the punctuation.

Note that the solutions presented don't account for contracted words (e.g., you're) or hyphenated words (e.g., anal-retentive), where it is debated whether they should be treated as punctuation, nor do they account for non-English character sets or anything like that, because those specifics were not mentioned in the question. Someone argued that space is punctuation, which is technically correct, but to me it makes zero sense in the context of the question at hand.

# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))

# using list comprehension
''.join('' if c in string.punctuation else c for c in s)
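If contractions and hyphenated words should survive, one possible tweak is to subtract those characters from the exclusion set. This is a sketch under the assumption that apostrophes and hyphens always count as part of a word; the names keep and exclude are illustrative only.

```python
import string

# Assumption for illustration: apostrophes and hyphens belong to words.
keep = set("'-")
exclude = set(string.punctuation) - keep

s = "You're anal-retentive, aren't you?"
result = "".join(c for c in s if c not in exclude)
print(result)  # You're anal-retentive aren't you
```

The trade-off is that stand-alone quotes and dashes are kept as well, since the check looks at single characters rather than their context.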

When you deal with Unicode strings, I suggest using the PyPI regex module because it supports both Unicode property classes (like \p{X} / \P{X}) and POSIX character classes (like [:name:]).

Just install the package by typing pip install regex (or pip3 install regex) in your terminal and hitting ENTER.

In case you need to remove punctuation and symbols of any kind (that is, anything other than letters, digits and whitespace), you can use

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class

See a Python demo online:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

Here, I added a whitespace \s pattern to the character class.

For serious natural language processing (NLP), you should let a library like SpaCy handle punctuation through tokenization, which you can then manually tweak to your needs.

For example, how do you want to handle hyphens in words? Exceptional cases like abbreviations? Begin and end quotes? URLs? In NLP it's often useful to separate out a contraction like "let's" into "let" and "'s" for further processing.

SpaCy example tokenization
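To give a feel for the "let" / "'s" split without installing anything, here is a toy tokenizer built on the standard re module. It is not SpaCy's tokenizer and handles none of the harder cases listed above (abbreviations, URLs, quotes); TOKEN_RE and simple_tokenize are hypothetical names for this sketch.

```python
import re

# Toy pattern, not SpaCy's rules: a word, or an apostrophe clitic such
# as "'s", or any single remaining punctuation character.
TOKEN_RE = re.compile(r"\w+|'\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = simple_tokenize("Let's go, it's late.")
print(tokens)  # ['Let', "'s", 'go', ',', 'it', "'s", 'late', '.']

# Dropping punctuation then becomes a filter over tokens.
words = [t for t in tokens if any(c.isalnum() for c in t)]
print(words)  # ['Let', "'s", 'go', 'it', "'s", 'late']
```

Working on tokens rather than characters is what lets a real library keep "'s" while still discarding the comma and the full stop.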

Remove stop words from the text file using Python

print('====THIS IS HOW TO REMOVE STOP WORDS====')

with open('one.txt', 'r') as myFile:
    str1 = myFile.read()

    stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"

    myList = []
    myList.extend(str1.split(" "))

    for i in myList:
        if i not in stop_words:
            print("____________")
            print(i, end='\n')

This is how to change our documents to uppercase or lowercase.

print('@@@@This is lower case@@@@')

with open('students.txt', 'r') as myFile:
    str1 = myFile.read()

print(str1.lower())

print('*****This is upper case****')

with open('students.txt', 'r') as myFile:
    str1 = myFile.read()

print(str1.upper())

I like to use a function like this:

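import string

def scrub(abc):
    # Trim punctuation from the end, then from the start; the "abc and"
    # guard stops each loop once the string is empty.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc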
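The built-in str.strip accepts a string of characters to remove from both ends, so the same trimming can be sketched as a single call. The name scrub_with_strip is made up for this example.

```python
import string

def scrub_with_strip(text):
    """Strip leading and trailing ASCII punctuation; inner punctuation is kept."""
    return text.strip(string.punctuation)

print(scrub_with_strip("...hello, world!!"))  # hello, world
```

It also handles strings that are empty or consist only of punctuation, returning an empty string.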
