Best way to .clean and .strip long string?
Desired Outcome Is = ["This", "is", "a", "random", "sentence"]
text = "Th,is is a? random!! sentence..."  # Edited: added a comma inside a word
clean_text = text.split()
for clean in clean_text:
    double_clean_text = clean.strip(",.!?")
    print(double_clean_text)
I managed to clean the words, but how do I get them all back into a list?
Is this an efficient way to do it?
You can do the following:
print(" ".join([clean.strip(",.!?") for clean in clean_text]))
You can use a list comprehension:
print([t.strip(",.!?") for t in text.split()])
Try this:
clean_text = text.split()
print([clean.strip(",.!?") for clean in clean_text])
Or:
clean_text = text.split()
res = []
for clean in clean_text:
    double_clean_text = clean.strip(",.!?")
    res.append(double_clean_text)
print(res)
I would recommend using the regular expression \w+ to find all words:
import re
result = re.findall(r"\w+", text)  # raw string so \w is not treated as a string escape
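One caveat with this approach, shown here as a small sketch: \w+ treats any punctuation as a word boundary, so with the question's edited text (the comma inside "Th,is") the first word comes back in two pieces. The variable names below are only illustrative.

```python
import re

# The question's edited text has a comma inside the first word.
edited = "Th,is is a? random!! sentence..."
# The text used in the other answers has the comma after the word.
original = "This, is a? random!! sentence..."

words_edited = re.findall(r"\w+", edited)
words_original = re.findall(r"\w+", original)

print(words_edited)    # ['Th', 'is', 'is', 'a', 'random', 'sentence']
print(words_original)  # ['This', 'is', 'a', 'random', 'sentence']
```

So this works as-is when punctuation only appears between or around words, not inside them.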
Instead of assigning to a new variable, assign the cleaned result back into the list.
text = "This, is a? random!! sentence..."
clean_text = text.split()
for i, clean in enumerate(clean_text):
    clean_text[i] = clean.strip(",.!?")
Then you can use ' '.join to (mostly) restore the text to its original form as a single string:
cleaned_text = ' '.join(clean_text)
I say "mostly" because split discards information about how much whitespace separated the words in the original string. That may be fine, but it is worth being aware of.
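To illustrate the point about lost whitespace, here is a minimal sketch using a made-up input (the string `spaced` is hypothetical, not from the question) with runs of several spaces:

```python
# Hypothetical input with irregular spacing between words.
spaced = "This,   is a?  random!!"

# split() with no arguments splits on any run of whitespace.
words = [w.strip(",.!?") for w in spaced.split()]

# Joining with a single space cannot recover the original spacing.
rejoined = " ".join(words)

print(words)     # ['This', 'is', 'a', 'random']
print(rejoined)  # This is a random
```

Every run of spaces collapses to one space after the round trip.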
The whole thing can be written using a single list comprehension.
clean_text = ' '.join([clean.strip(",.!?") for clean in text.split()])
Either use re, where the pattern r'\w+' greedily captures runs of word characters (letters, digits, and the underscore).
>>> import re
>>> text = "This, is a? random!! sentence..."
>>> re.findall(r'\w+', text)
['This', 'is', 'a', 'random', 'sentence']
Or you could use str.strip and str.split. An easy way to supply all punctuation characters to strip is string.punctuation. This splits the text on whitespace, then removes punctuation from both ends of each substring.
>>> from string import punctuation
>>> text = "This, is a? random!! sentence..."
>>> [s.strip(punctuation) for s in text.split()]
['This', 'is', 'a', 'random', 'sentence']
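Note that strip only trims characters from the ends of each substring, so punctuation inside a word (as in the question's edited "Th,is") survives. One possible alternative, sketched here rather than taken from the answers above, is to delete every punctuation character with str.translate before splitting:

```python
from string import punctuation

# The question's edited text, with a comma inside the first word.
text = "Th,is is a? random!! sentence..."

# strip() leaves the interior comma in place.
stripped = [s.strip(punctuation) for s in text.split()]

# A translation table whose third argument lists characters to delete.
table = str.maketrans("", "", punctuation)
translated = text.translate(table).split()

print(stripped)    # ['Th,is', 'is', 'a', 'random', 'sentence']
print(translated)  # ['This', 'is', 'a', 'random', 'sentence']
```

The trade-off is that this also removes apostrophes and hyphens inside words, so "don't" would become "dont".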
Since you already got pretty good answers, I'd like to introduce regular expressions:
import re
text = "This, is a? random!! sentence..."
# re.split leaves an empty string after the trailing "...", so filter it out
clean_list = [w for w in re.split(r'[.,?! ]+', text) if w]
The characters inside the square brackets are the ones you want to split on and strip.