
Count occurrences of elements in string from a list?

I'm trying to count the number of occurrences of verbal contractions in some speeches I've gathered. One particular speech looks like this:

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

So, in this case, I'd like to count four (4) contractions. I have a dictionary of contractions, and here are the first few entries:

contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}

My code looks something like this, to begin with:

count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count

I'm not getting anywhere with this, however, as the code's iterating over every single letter, as opposed to whole words.

Use str.split() to split your string on whitespace:

for word in speech.split():

This will split on arbitrary whitespace: spaces, tabs, newlines, and a few more exotic whitespace characters, in runs of any length.

You may need to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:

from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1

I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
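To see what the loop above does to individual words, here is a small sketch. The question's real dictionary is truncated, so `sample_contractions` is a made-up stand-in, and `count_contractions` is just a hypothetical helper name wrapping the same loop.

```python
from string import punctuation

# Hypothetical stand-in for the question's (truncated) contractions dict.
sample_contractions = {
    "i've": "I have",
    "we're": "we are",
    "you've": "you have",
}

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")


def count_contractions(text, contractions):
    """Count whitespace-separated words that are keys of `contractions`."""
    count = 0
    for word in text.lower().split():
        # strip() only trims the ends, so the apostrophe inside "i've" survives.
        if word.strip(punctuation) in contractions:
            count += 1
    return count


print("economy,".strip(punctuation))                    # economy
print(count_contractions(speech, sample_contractions))  # 4
```

One caveat: the apostrophe is itself part of `string.punctuation`, so a word that begins or ends with one (such as `'tis`) would lose it and no longer match its dictionary key.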

You're iterating over a string, so the items are individual characters. To get the words out of a string you can use a naive method such as str.split(), which gives you a list of strings to iterate over (the words, split on the argument of str.split(); by default it splits on whitespace). There is also re.split(), which is more powerful, but I don't think you need regexes to split this text.

At the very least you have to lowercase your string with str.lower(), or else put every possible variant (including capitalized ones) in the dictionary. I strongly recommend the first option; the second isn't really practical. You also need to remove the punctuation. This is still a naive approach, though. If you need something more sophisticated, split the text with a word tokenizer. NLTK is a good starting point for that; see the NLTK tokenizer. But I strongly suspect that tokenization is not your main problem here and doesn't really stand in the way of solving your question. :)

speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...

# With re you can define more advanced patterns, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# is still enough for you.
import re

def abbreviation_counter(input_text, abbreviation_dict):   
    count = 0
    # What you want is a list of words; str.split() does this job for you.
    # Splitting on whitespace is the default, so you can also omit the " "
    # argument. But if you really need better methods (see the answer text
    # above), you have to use a word tokenizer tool or write your own.
    for word in input_text.split(" "):
        # Also clean the word (remove ',', ';', ...) afterwards. The advantage of
        # using re over `from string import punctuation` is that you have more
        # control over what you remove: you can easily add or drop any
        # punctuation mark. That can be very handy. It can also be overkill.
        # If it is, just stick to Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1

    return count

print(abbreviation_counter(speech, contractions))
# Output: 2. Yeah, it worked: I've included "i've" in your dict :)
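If you do want to try the regex route, one possible sketch is to pull the words out directly with `re.findall()` instead of splitting first and cleaning afterwards. The pattern and the `count_with_regex` name are my own assumptions for illustration, not part of the answer's code, and the pattern only handles ASCII letters with a straight apostrophe.

```python
import re

# One or more letters, optionally followed by an apostrophe and more letters,
# so "i've" stays in one piece while commas, full stops and dashes are skipped.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_with_regex(input_text, abbreviation_dict):
    """Count tokens matched by WORD_PATTERN that are keys of the dict."""
    words = WORD_PATTERN.findall(input_text.lower())
    return sum(1 for word in words if word in abbreviation_dict)


speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
contractions = {
    "ain't": ["am not", "are not", "is not", "has not", "have not"],
    "aren't": ["are not", "am not"],
    "i've": ["i have"],
}

print(count_with_regex(speech, contractions))  # 2
```

Because the pattern decides what a word is, there is no separate punctuation-removal step. A real tokenizer such as the one in NLTK would be the next step up if this turns out to be too crude.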

It's a little bit frustrating to post an answer at the same time as Martijn Pieters does ;), but I hope I've still added some value for you. That's why I've edited my answer to include some hints for future work as well.

A for loop in Python iterates over all elements in an iterable. In the case of strings the elements are the characters.
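A tiny sketch of what that means in practice; `phrase` and the one-entry `demo_contractions` dict are made up purely for this demonstration:

```python
phrase = "We're ok"

# Iterating over a str yields one-character strings, not words.
items = [item for item in phrase]
print(items)

# Hypothetical one-entry dict, only for this demonstration.
demo_contractions = {"we're": "we are"}

# No single character equals the key "we're", so nothing is ever counted.
hits = [item for item in phrase.lower() if item in demo_contractions]
print(len(hits))  # 0
```

This is why the original loop never finds a match: it compares single characters against multi-character dictionary keys.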

You need to split the string into a list (or tuple) of strings that contain the words. You can use .split(delimiter) for this.

Your problem is quite common, so Python has a shortcut: speech.split() splits at any number of spaces/tabs/newlines, so you only get your words in the list.

So your code should look like this:

count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)

speech.split(" ") works too, but it only splits on single space characters, not on tabs or newlines, and if there are double spaces you get empty strings in the resulting list.
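The difference between the two calls is easy to check on a small string. The `text` value here is just an illustrative example with a double space, a tab and a newline in it:

```python
text = "one  two\tthree\nfour"

# No argument: split on runs of any whitespace and drop empty strings.
default_split = text.split()
print(default_split)   # ['one', 'two', 'three', 'four']

# Explicit " ": split at every single space only; tabs and newlines stay put.
space_split = text.split(" ")
print(space_split)     # ['one', '', 'two\tthree\nfour']
```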


 