
Count occurrences of elements in string from a list?

I'm trying to count the number of occurrences of verbal contractions in some speeches I've gathered. One particular speech looks like this:

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

So, in this case, I'd like to count four (4) contractions. I have a dictionary of contractions, and here are the first few entries:

contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}

My code looks something like this, to begin with:

count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count

I'm not getting anywhere with this, however, as the code is iterating over every single letter, as opposed to whole words.

Use str.split() to split your string on whitespace:

for word in speech.split():

This will split on arbitrary whitespace; this means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.

You may need to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:

from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1

I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
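As a small illustration of that last point, the sketch below applies str.strip(punctuation) to a few made-up tokens (the `samples` list is hypothetical, not taken from the question). It only trims characters at the two ends, so the apostrophe inside a contraction survives:

```python
from string import punctuation

# Hypothetical tokens, as they might come out of speech.split().
samples = ["state.", "economy,", "\"I've", "you've!"]

# str.strip(chars) removes any of the given characters from both ends only,
# so an apostrophe in the middle of a word is left alone.
cleaned = [s.strip(punctuation) for s in samples]
print(cleaned)  # ['state', 'economy', "I've", "you've"]
```

Note that the apostrophe is itself in string.punctuation, so a token that ends in one (for example talkin') would lose it.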

You're iterating over a string, so the items are characters. To get the words from a string you can use a naive method like str.split(), which does this for you: you then iterate over a list of strings (the words, split on the argument of str.split(); the default is to split on whitespace). There is also re.split(), which is more powerful, but I don't think you need regexes to split this text.

At a minimum, you have to lowercase your string with str.lower(), or put all possible occurrences (including capitalized forms) in the dictionary. I strongly recommend the first alternative; the latter isn't really practicable. Removing the punctuation is also necessary. But this is still naive. If you need a more sophisticated method, you have to split the text with a word tokenizer. NLTK is a good starting point for that; see the nltk tokenizer. But I strongly suspect that this is not your main problem and does not really affect solving your question. :)
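To give a rough idea of what a tokenizer does differently from split-and-strip, here is a minimal sketch using only the standard re module. It is not NLTK's tokenizer; the pattern `WORD_RE` and the helper `tokenize` are assumptions made up for this illustration, and the pattern only handles ASCII letters and the straight apostrophe:

```python
import re

# Assumed pattern: runs of letters, optionally joined by straight apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Return lowercase word tokens, keeping contractions such as "we're" whole."""
    return WORD_RE.findall(text.lower())

tokens = tokenize("We're headed in the right direction - you've all been a great help.")
print(tokens)
```

Because the pattern picks out the words instead of splitting on separators, the dash and the final period never end up in the result, so no separate punctuation-stripping step is needed.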

speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...

# With re you can define more advanced patterns, but
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# may still be enough for you.
import re

def abbreviation_counter(input_text, abbreviation_dict):   
    count = 0
    # What you want is a list of words; str.split() does this job for you.
    # Passing " " splits on single spaces; omit the argument to split on any
    # whitespace. If you really need better methods (see the answer text above),
    # you have to use a word tokenizer or write your own.
    for word in input_text.split(" "):
        # Also clean the word (remove ',', ';', ...) afterwards. The advantage of
        # using re over `from string import punctuation` is that you have more
        # control over what you remove: you can easily add or drop any
        # punctuation mark. That can be very handy, but it can also be
        # overkill. If so, just stick to Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1

    return count

print(abbreviation_counter(speech, contractions))
# 2 (yeah, it worked - I've included I've in your list :)

It's a little bit frustrating to give an answer at the same time as Martijn Pieters does ;), but I hope I have still added some value for you. That's why I've edited my answer to give you some additional hints for future work.

A for loop in Python iterates over all elements in an iterable. In the case of strings the elements are the characters.
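A quick way to see this for yourself (a minimal sketch; the variable name `letters` is just for illustration):

```python
# Iterating over a string yields one-character strings, not words.
letters = [ch for ch in "I've"]
print(letters)  # ['I', "'", 'v', 'e']
```

None of those single characters is a key in contractions, which is why the original loop never counts anything.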

You need to split the string into a list (or tuple) of strings that contain the words. You can use .split(delimiter) for this.

Your problem is quite common, so Python has a shortcut: speech.split() splits at any number of spaces/tabs/newlines, so you only get your words in the list.

So your code should look like this:

count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)

speech.split(" ") works too, but it only splits on single space characters, not on tabs or newlines, and if there are double spaces you'd get empty elements in your resulting list.
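The difference is easy to see on a short string with a double space and a tab in it (the sample `text` is made up for this sketch):

```python
# Hypothetical sample: two spaces after "I've", then a tab after "changed".
text = "I've  changed\tthe path"

with_space = text.split(" ")   # splits on every single space character
with_default = text.split()    # splits on any run of whitespace

print(with_space)    # ["I've", '', 'changed\tthe', 'path']
print(with_default)  # ["I've", 'changed', 'the', 'path']
```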

