
Selecting specific elements that contain a certain word from a list in python

I want to do a sentiment analysis, but only on the elements of a list that contain a certain word. The elements are comments, and I only want to analyse the comments that contain that word.

For example, my list is:

comments = ["nice blog","i like your blog","nivea is a nice product","i like nivea"]

How do I create a list where only the comments that contain the word 'nivea' are added?

So I want my final list to be:

commentsfinal = ["nivea is a nice product","i like nivea"]

I tried several ways of counting the comments in which nivea is mentioned (the number of comments, not the total number of nivea mentions). Each way gave a different result. Could anyone explain which one is right, and why?

First try:

niveaucountlist=[]
match="nivea"

for comment in allcomments:
    niveacount=0
    for word in comment.split():
        if word in match:
            niveacount+=1
        niveacountlist.append(niveacount)

total=sum(niveacount)

This gave me a result of 4547 comments.

Second try: I built a list in which each comment is represented by the number of times nivea is mentioned in it. I got a list like:

niveacountlist=[1,0,0,1,2,0]

Then I removed all the elements with the value zero (because those are the comments that are not about nivea):

niveacountlistpos=[x for x in niveacountlist if x != 0]
print(len(niveacountlistpos))

This resulted in 3771 comments.

Last try: My last try was what was suggested in the answers to my first question. I used a regular expression:

import re
nivealist=[x for x in allcomments if re.search("nivea",x)]

This resulted in 2583 comments.

So what is happening here? Can someone explain why the results are all different?

Another (last) question is about how I counted the total number of nivea mentions (the sum of all the times nivea appears in the comments). I joined all the comments into a single string (called allwords) and then did this:

match="nivea"
niveacount1=0
for word in allwords:
    niveacount1+=1
print(niveacount1)

Is this correct, or is there a better way to do it?

You can use a list comprehension with the in operator to test whether a substring is present.

nivea_comments = [c for c in comments if "nivea" in c]
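One thing to keep in mind is that the in test is case-sensitive, so a comment that spells the brand "Nivea" would be skipped. A minimal sketch, assuming case-insensitive matching is wanted (the capitalised comment in the sample list is made up for illustration):

```python
# Sample data; the capitalised "Nivea" entry is a made-up addition for illustration.
comments = ["nice blog", "i like your blog", "nivea is a nice product", "I like Nivea"]

# Case-sensitive substring test: only exact lower-case "nivea" matches.
exact = [c for c in comments if "nivea" in c]

# Lower-case each comment before testing so "Nivea" or "NIVEA" also match.
any_case = [c for c in comments if "nivea" in c.lower()]

print(exact)
print(any_case)
```

The original comments are kept unchanged in the result; only the test itself works on the lower-cased copy.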

If you're into functional programming, you'll recognise this as a filter. In Python 3, filter returns an iterator, so wrap it in list() to get a list:

nivea_comments = list(filter(lambda c: "nivea" in c, comments))

Using a regular expression and a list comprehension, for example:

import re
new_list = [x for x in comments if re.search('nivea', x)]
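A plain re.search('nivea', x) is still a substring match, so it would also accept a different word that happens to contain those letters. If whole-word matching is what you need, a possible sketch uses the \b word-boundary anchor together with re.IGNORECASE (the sample comments and the name pattern are assumptions for illustration):

```python
import re

# Made-up sample; "niveau" contains "nivea" as a substring but is a different word.
comments = ["nivea is a nice product", "I like Nivea!", "the niveau is high"]

# \b marks a word boundary, so "niveau" no longer matches;
# re.IGNORECASE lets "Nivea" match as well.
pattern = re.compile(r"\bnivea\b", re.IGNORECASE)

substring_hits = [c for c in comments if re.search("nivea", c)]
word_hits = [c for c in comments if pattern.search(c)]

print(substring_hits)
print(word_hits)
```

Compiling the pattern once is optional here; it mainly keeps the comprehension readable.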

First try:

Your first try doesn't give the right count because word in match checks whether the word occurs inside match, not the other way round. If the comment contains the word 'i', the test asks whether 'i' appears in 'nivea'. It does, so the counter is incremented by 1. That's why the count is too high.
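To make the operand order concrete, here is a small sketch on the sample list from the question. The helper name count_word is hypothetical, and comparing with == is just one way to fix the test (it assumes words are separated by whitespace and carry no punctuation):

```python
match = "nivea"
comments = ["nice blog", "i like your blog", "nivea is a nice product", "i like nivea"]

# `word in match` asks: is this word a substring of "nivea"?
print("i" in match)     # True, because "nivea" contains the letter i
print("blog" in match)  # False

# Per-comment counts with the reversed test from the first try.
buggy = [sum(1 for word in c.split() if word in match) for c in comments]

def count_word(comment, match):
    """Count the words in a comment that are exactly equal to match."""
    return sum(1 for word in comment.split() if word == match)

fixed = [count_word(c, match) for c in comments]

print(buggy)
print(fixed)
```

With the reversed test, short words such as "i" and "a" are counted as hits, which inflates the numbers.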

Second try:

The second try gives a different answer because you ask for the length of the list with len(), not the sum of all the values in the list. It also has the same problem as the first try, which is why this value is still higher than the last try's.

As for your last question: no, that is not a good way to do it. Iterating over a string with a for loop visits every character, not every word or comment, and the loop never compares anything with match, so it just counts characters. For example:
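The difference between the two quantities can be seen on the example list from the question. This is only an illustration of len() versus sum(); the variable names are assumptions:

```python
# Per-comment mention counts, as in the example from the question.
niveacountlist = [1, 0, 0, 1, 2, 0]

# Number of comments that mention nivea at least once.
comments_with_mention = len([x for x in niveacountlist if x != 0])

# Total number of mentions across all comments.
total_mentions = sum(niveacountlist)

print(comments_with_mention)
print(total_mentions)
```

A comment that mentions nivea twice adds 1 to the first number but 2 to the second, so the two only agree when no comment repeats the word.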

s = "This is a check"
for word in s:
    print(word)

This prints:

T
h
i
s

etc.

So it is better to use a list comprehension, as mentioned above.
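For the total number of mentions, one possible approach is str.count, which counts non-overlapping occurrences of a substring. A sketch under the assumption that substring matches are acceptable (the sample comments are made up, and allwords is built here only to mirror the question):

```python
# Made-up sample; the last comment mentions nivea twice.
comments = ["nice blog", "nivea is a nice product", "i like nivea, nivea is great"]

# Sum the occurrences in each comment.
total_mentions = sum(c.count("nivea") for c in comments)

# The same count on one joined string, as in the question's allwords.
allwords = " ".join(comments)
total_in_string = allwords.count("nivea")

# What the loop in the question computes: one step per character.
char_steps = sum(1 for ch in allwords)

print(total_mentions)
print(total_in_string)
print(char_steps == len(allwords))
```

Because the loop in the question increments once per character, its result is just the length of the string, not the number of mentions.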

