
Removing punctuation and spaces in a string without using regex

I used import string and string.punctuation, but I realized I still have '…' after calling split(). I also get '', and I don't understand why it is there after calling strip(). As far as I understand, strip() removes only the leading and trailing spaces, so spaces in the middle of a string should not matter:

>>> s = 'a dog    barks    meow!   @  … '
>>> s.strip()
'a dog    barks    meow!   @  …'


>>> import string
>>> k = []
>>> for item in s.split():
...     k.append(item.strip(string.punctuation))
... 
>>> k
['a', 'dog', 'barks', 'meow', '', '…']

I would like to get rid of '' and '…'. The final output I'd like is ['a', 'dog', 'barks', 'meow'].

I would like to avoid regex, but if that's the only solution I will consider it. For now I'm more interested in solving this without resorting to regex.

You can remove punctuation by retaining only alphanumeric characters and whitespace, then splitting on whitespace:

s = 'a dog    barks    meow!   @  …'
print(''.join(c for c in s if c.isalnum() or c.isspace()).split())

This outputs:

['a', 'dog', 'barks', 'meow']
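
As a minimal sketch, the same filter can be wrapped in a small helper. The name words_only is hypothetical and not part of the original answer. One thing to be aware of: because every non-alphanumeric character is dropped, punctuation inside a word disappears as well, so the two halves of a contraction are glued together.

```python
def words_only(text):
    """Keep letters, digits and whitespace, then split on whitespace."""
    kept = ''.join(c for c in text if c.isalnum() or c.isspace())
    return kept.split()

print(words_only('a dog    barks    meow!   @  …'))  # ['a', 'dog', 'barks', 'meow']
print(words_only("don't stop"))                       # ['dont', 'stop']
```

If inner punctuation should be preserved, stripping each token at its ends only is the safer choice.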

I used the following:

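import string

s = 'a dog    barks    Meow!   @  … '

# string.punctuation is ASCII-only, so add the ellipsis character explicitly
p = string.punctuation + '…'

k = []
for item in s.split():
    k.append(item.strip(p).lower())

# drop the empty strings left behind by tokens that were pure punctuation
k = [x for x in k if x]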
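
The two symptoms in the question have separate causes, which the short check below tries to make visible. string.punctuation holds only ASCII punctuation, so the ellipsis character '…' (U+2026) is not in it, and a token made up entirely of punctuation, such as '@', is stripped down to an empty string. The variable names here are illustrative only.

```python
import string

ellipsis_char = '…'  # U+2026, a single non-ASCII character

# The ellipsis is not part of the ASCII-only punctuation set
ellipsis_is_ascii_punct = ellipsis_char in string.punctuation
print(ellipsis_is_ascii_punct)        # False

# A token consisting only of punctuation strips down to ''
stripped_at = '@'.strip(string.punctuation)
print(repr(stripped_at))              # ''

# strip() only touches the ends of a token
stripped_word = 'meow!'.strip(string.punctuation)
print(repr(stripped_word))            # 'meow'
```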

Building on the accepted answer to this question:

import itertools

k = []
for ok, grp in itertools.groupby(s, lambda c: c.isalnum()):
    if ok:
        k.append(''.join(list(grp)))

Or the same as a one-liner (apart from the import):

k = [''.join(list(grp)) for ok, grp in itertools.groupby(s, lambda c: c.isalnum()) if ok]

itertools.groupby() iterates over the characters of the string s and groups consecutive characters (grp) by the value (ok) of the lambda expression. The if ok test discards the groups that do not satisfy the lambda. Each group is an iterator of characters, which is joined to get back the word; the list() call is not strictly required, because ''.join() accepts any iterable.
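
To see what groupby() produces before the filtering step, the sketch below prints every group together with its key. It uses a shorter sample string chosen for illustration, and passes str.isalnum directly as the key function, which behaves the same as the lambda.

```python
import itertools

sample = 'a dog!  @'

# Materialize each group as a string so it can be inspected
groups = [(ok, ''.join(grp)) for ok, grp in itertools.groupby(sample, str.isalnum)]

for ok, text in groups:
    print(ok, repr(text))
# True 'a'
# False ' '
# True 'dog'
# False '!  @'
```

Only the groups whose key is True are words; runs of spaces and punctuation form the False groups and are discarded.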

isalnum() essentially means "is alphanumeric". Depending on your use case, you might prefer isalpha(). In both cases, for this input:

s = 'a 狗    barks    meow!   @  …'

the output is:

['a', '狗', 'barks', 'meow']
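
The two predicates agree on that input because it contains no digits. A minimal sketch of where they differ, using a made-up sample string and a hypothetical helper named split_words:

```python
import itertools

def split_words(text, keep):
    """Return the runs of consecutive characters for which keep(c) is true."""
    return [''.join(grp) for ok, grp in itertools.groupby(text, keep) if ok]

sample = 'route 66, exit 12b!'
print(split_words(sample, str.isalnum))  # ['route', '66', 'exit', '12b']
print(split_words(sample, str.isalpha))  # ['route', 'exit', 'b']
```

With isalpha(), digits act as separators, so a mixed token such as '12b' is cut down to its letters.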

(For experts: this is a reminder that words are not separated by non-word characters in every language.)


 