
Best way to .clean and .strip long string?

Desired outcome: ["This", "is", "a", "random", "sentence"]

text = "Th,is is a? random!! sentence..."  # Edited: added a comma inside a word

clean_text = text.split()

for clean in clean_text:
    double_clean_text = clean.strip(",.!?")
    print(double_clean_text)

I managed to clean the words, but how do I get them all back into a list?

Is this an efficient way to do it?

You can do the following:

print(" ".join([clean.strip(",.!?") for clean in clean_text]))

You can use a list comprehension:

print([t.strip(",.!?") for t in text.split()])

Try this:

clean_text = text.split()
print([clean.strip(",.!?") for clean in clean_text])

Or:

clean_text = text.split()
res = []
for clean in clean_text:
    double_clean_text = clean.strip(",.!?")
    res.append(double_clean_text)
print(res)

I would recommend using the regular expression "\w+" to find all words:

import re

result = re.findall(r"\w+", text)  # raw string avoids an invalid-escape warning

Instead of assigning to a new variable, assign the cleaned result back to the list.

text = "This, is a? random!! sentence..."

clean_text = text.split()

for i, clean in enumerate(clean_text):
    clean_text[i] = clean.strip(",.!?")

Then you can use ' '.join to (mostly) restore the string to its original form:

cleaned_text = ' '.join(clean_text)

I say "mostly" because split discards information about how much whitespace separated the words in the original string. That may be fine, but it is worth being aware of.
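To illustrate the caveat, here is a small sketch. The sample string `spaced` is made up for this illustration and is not from the question. It shows that a split/join round trip collapses every run of whitespace into a single space:

```python
# Hypothetical input with irregular spacing (not from the question).
spaced = "This,   is a?\trandom!!  sentence..."

# split() with no arguments splits on any run of whitespace,
# so the original spacing cannot be recovered afterwards.
words = [w.strip(",.!?") for w in spaced.split()]
rejoined = " ".join(words)

print(words)     # ['This', 'is', 'a', 'random', 'sentence']
print(rejoined)  # This is a random sentence
```

If the exact spacing matters, the original string has to be kept around; the word list alone does not carry it.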

The whole thing can be written using a single list comprehension.

clean_text = ' '.join([clean.strip(",.!?") for clean in text.split()])

Either use re: put simply, r'\w+' greedily captures runs of word characters (letters, digits, and underscores).

>>> import re
>>> text = "This, is a? random!! sentence..."
>>> re.findall(r'\w+', text)
['This', 'is', 'a', 'random', 'sentence']   
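One thing worth checking before relying on `\w+` is what counts as a word character. The sketch below uses a made-up `sample` string (an assumption for illustration, not part of the question) to show that digits and underscores are kept, while an apostrophe splits a word in two:

```python
import re

# Hypothetical sample chosen to expose the edge cases.
sample = "Don't split user_name or 42nd, please!"

# \w matches letters, digits, and the underscore; anything else ends a match.
tokens = re.findall(r"\w+", sample)

print(tokens)
# ['Don', 't', 'split', 'user_name', 'or', '42nd', 'please']
```

For plain prose like the question's sentence this makes no difference, but contractions and identifiers may need a different pattern.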

Or you could use str.strip and str.split. An easy way to supply all punctuation to strip is string.punctuation. This splits the text on whitespace, then removes leading and trailing punctuation from each substring.

>>> from string import punctuation
>>> text = "This, is a? random!! sentence..."
>>> [s.strip(punctuation) for s in text.split()]
['This', 'is', 'a', 'random', 'sentence']   
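Note that str.strip only trims the ends of each substring, so it would leave the comma in the question's edited input ("Th,is") untouched. As a possible alternative, and not something the answers above propose, the sketch below deletes every punctuation character with str.translate before splitting:

```python
from string import punctuation

# The question's edited input, with a comma inside the first word.
text = "Th,is is a? random!! sentence..."

# strip() only trims the ends, so the inner comma stays.
stripped = [s.strip(punctuation) for s in text.split()]

# Deleting all punctuation first also handles the inner comma.
# str.maketrans("", "", chars) builds a table that maps each char to None.
table = str.maketrans("", "", punctuation)
translated = text.translate(table).split()

print(stripped)    # ['Th,is', 'is', 'a', 'random', 'sentence']
print(translated)  # ['This', 'is', 'a', 'random', 'sentence']
```

The trade-off is that translate also removes punctuation you might want to keep inside words, such as hyphens or apostrophes.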

Since you already got pretty good answers, I'd like to introduce regular expressions:

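import re
text = "This, is a? random!! sentence..."
# The text ends with delimiters, so re.split returns a trailing empty string; drop it
clean_list = [word for word in re.split(r'[.,?! ]+', text) if word]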

The characters inside the square brackets are the ones you want to split on and strip.


 