Remove strings from elements of a list - Python

I have a list c which has 353000 elements. Each element is a parsed string. A sample of this list is:

print(c[25:50])
['aluminum co of america', 'aluminum co of america', 'aluminum co of america', 'aluminum company of america', 'aluminum company of america', 'aluminum co of america', 'aluminum company of america', 'aluminum company of america', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'ace cash express, inc.', 'ace cash express, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.']

I counted the frequency of words in the list:

from collections import Counter

# Collect every whitespace-separated word from every element of c
r = []
for e in c:
    r.extend(e.split())

count = Counter(r)

So, the six most frequent words in the list are:

{'inc.': 18670, 'corporation': 9255, 'company': 2632, 'group,': 1190, '&': 1158, 'financial': 1025}

I would like to remove these words from the elements of my list. For example, if I have "aluminum corporation of america", the output should be "aluminum of america". How can I do this?

# Using Generator Expression with `Counter` to speed it up a little bit
from collections import Counter
count = Counter(item for e in c for item in e.split())

# Get most frequently used words
words = {item for item, cnt in count.most_common(6)}

# Filter the `words` out of each element of `c` and rebuild the strings
cleaned = [" ".join([item for item in e.split() if item not in words]) for e in c]
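For a quick check on a small input, the same idea can be wrapped in a function. This is only a sketch: the name remove_common_words, the parameter n, and the sample list (including the made-up entry 'acme corporation') are assumptions for illustration and are not part of the original answer.

```python
from collections import Counter


def remove_common_words(names, n):
    """Drop the n most frequent whitespace-separated tokens from every string."""
    # Count tokens across all strings (hypothetical helper, not from the answer)
    count = Counter(token for name in names for token in name.split())
    # A set gives constant-time membership tests while filtering
    common = {token for token, _ in count.most_common(n)}
    return [" ".join(t for t in name.split() if t not in common) for name in names]


# Illustrative sample data; 'acme corporation' is invented for this sketch
sample = [
    "aluminum corporation of america",
    "ace cash express, inc.",
    "airtran holdings, inc.",
    "acme corporation",
]
cleaned_sample = remove_common_words(sample, 2)
print(cleaned_sample)
```

Tokens are compared exactly as split() returns them, so punctuation attached to a neighbouring word (the comma in 'express,') stays in place after 'inc.' is removed.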

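You could use regular expressions to substitute an empty string for the words you want to delete:

import re
# Escape each word (the . in 'inc.' is a regex metacharacter) and match it
# only as a whole whitespace-delimited token, together with trailing spaces
words = [word for word, _ in count.most_common(6)]
p = re.compile(r'(?<!\S)(?:' + '|'.join(map(re.escape, words)) + r')(?!\S)\s*')
cleaned = [p.sub('', item).strip() for item in c]

Edit: the words have to be escaped because of the . and & they contain, so the pattern is a bit more complex than a plain join of the raw words.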
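To see why the escaping matters, here is a small sketch comparing an unescaped and an escaped pattern. The sample string is made up for illustration; only re.compile, re.escape and the sub method from the standard library are used.

```python
import re

# Invented sample string for this sketch
sample = "zinc oxide inc."

# Unescaped: '.' matches any character, so 'inc.' also matches 'inc ' inside 'zinc '
unescaped = re.compile("inc.")
unescaped_result = unescaped.sub("", sample)

# Escaped: re.escape turns '.' into a literal dot, so only the real 'inc.' matches
escaped = re.compile(re.escape("inc."))
escaped_result = escaped.sub("", sample)

print(repr(unescaped_result))
print(repr(escaped_result))
```

The unescaped pattern damages the unrelated word 'zinc', while the escaped one removes only the literal token.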
