If I have this list of strings:
['fsuy3,fsddj4,fsdg3,hfdh6,gfdgd6,gfdf5',
'fsuy3,fsuy3,fdfs4,sdgsdj4,fhfh4,sds22,hhgj6,xfsd4a,asr3']
(big list)
how can I remove all words that occur in fewer than 1% or more than 60% of the strings?
You can use a collections.Counter:

from collections import Counter

counts = Counter(mylist)

and then:

newlist = [s for s in mylist if 0.01 < counts[s] / len(mylist) < 0.60]

(in Python 2.x, use float(counts[s]) / len(mylist) to avoid integer division)
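For example, a minimal self-contained sketch (the toy list here is illustrative, not from the question):

```python
from collections import Counter

# hypothetical list of whole strings; 'a' makes up 4/6 ~ 67% of the entries
mylist = ['a', 'a', 'a', 'a', 'b', 'c']
counts = Counter(mylist)
# keep strings whose relative frequency is strictly between 1% and 60%
newlist = [s for s in mylist if 0.01 < counts[s] / len(mylist) < 0.60]
# 'a' (~67%) is dropped; 'b' and 'c' (~17% each) survive
```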
If you're talking about the comma-separated words, then you can use a similar approach:
words = [l.split(',') for l in mylist]
counts = Counter(word for l in words for word in l)
newlist = [[s for s in l if 0.01 < counts[s]/len(mylist) < 0.60] for l in words]
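Applied to the two sample strings from the question, this gives (note that with only two strings the 1% lower bound can never trigger, so only the 60% cap matters here):

```python
from collections import Counter

mylist = ['fsuy3,fsddj4,fsdg3,hfdh6,gfdgd6,gfdf5',
          'fsuy3,fsuy3,fdfs4,sdgsdj4,fhfh4,sds22,hhgj6,xfsd4a,asr3']
words = [l.split(',') for l in mylist]
counts = Counter(word for l in words for word in l)
# fsuy3 occurs 3 times across 2 strings (ratio 1.5 > 0.60), so it is filtered out;
# every other word occurs once (ratio 0.5), which is inside the bounds
newlist = [[s for s in l if 0.01 < counts[s] / len(mylist) < 0.60] for l in words]
```

Note that this counts every occurrence of a word (fsuy3 counts twice on the second line) but divides by the number of strings, so the ratio can exceed 1.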
The straightforward solution:

# words is assumed to be the flat list of all words pooled from the strings
occurrences = {}
for word in words:
    if word not in occurrences:
        occurrences[word] = 1
    else:
        occurrences[word] += 1

result = [word for word in words if 0.01 <= occurrences[word] / len(words) <= 0.6]
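A quick sanity check with a hypothetical flat word list (here 'a' is 70% of the words, so it falls outside the bounds):

```python
# hypothetical flat list of all words pooled from the strings
words = ['a'] * 7 + ['b', 'c', 'c']
occurrences = {}
for word in words:
    if word not in occurrences:
        occurrences[word] = 1
    else:
        occurrences[word] += 1
result = [word for word in words if 0.01 <= occurrences[word] / len(words) <= 0.6]
# 'a' (70%) is dropped; 'b' (10%) and 'c' (20%) are kept
```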
I'm going to guess you want this:

from collections import Counter

# break up by ',' and remove duplicate words on each line
st = [set(s.split(',')) for s in mylist]
# count how many lines each word appears on
count = Counter(word for line in st for word in line)
# work out which words are allowed (a set makes the membership test fast)
allowed = {s for s in count if 0.01 < count[s] / len(mylist) < 0.60}
# for each row in the original list, keep only the allowed words
result = [[w for w in s.split(',') if w in allowed] for s in mylist]
print(result)
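Run against the two sample strings, the set-based de-duplication changes the arithmetic: fsuy3 appears twice on the second line, but after splitting each line into a set it counts once per line, giving a document frequency of 2/2 = 1.0, which is above the 60% cap (this is a sketch of the same steps, not output from the question's full list):

```python
from collections import Counter

mylist = ['fsuy3,fsddj4,fsdg3,hfdh6,gfdgd6,gfdf5',
          'fsuy3,fsuy3,fdfs4,sdgsdj4,fhfh4,sds22,hhgj6,xfsd4a,asr3']
st = [set(s.split(',')) for s in mylist]
count = Counter(word for line in st for word in line)
allowed = {s for s in count if 0.01 < count[s] / len(mylist) < 0.60}
result = [[w for w in s.split(',') if w in allowed] for s in mylist]
# fsuy3 is on both lines (2/2 = 100% > 60%), so it is removed from both rows;
# every other word is on exactly one line (50%) and is kept
```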