
Count occurrences of elements in string from a list?

I'm trying to count the number of times verbal contractions occur in some speeches I've gathered. One particular speech looks like this:

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

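So, in this case, I would like to count four (4) contractions. I have a list of contractions (stored as a dictionary), and here are the first few entries: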

contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}

My code looks something like this:

count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count

However, I'm getting nowhere with this, because the code iterates over every single letter instead of whole words.

Use str.split() to split your string into words:

for word in speech.split():

This splits on arbitrary whitespace; that means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.
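As a small illustration of that whitespace behaviour, here is a sketch using a made-up sample string (it is not taken from the question):

```python
# Hypothetical sample text mixing double spaces, a tab, and newlines.
sample = "I've  changed\tthe\n\npath "

# With no argument, split() treats any run of whitespace as one separator
# and ignores leading and trailing whitespace.
words = sample.split()
print(words)
```

No empty strings appear in the result, even though the sample has consecutive whitespace characters.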

You may need to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:

from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1

I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
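Putting those pieces together, a minimal end-to-end sketch might look like the following. The three dictionary entries and the function name count_contractions are assumptions made for the demo, since the question only shows the first few entries of its dictionary:

```python
from string import punctuation

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

# Assumed entries for illustration only; the question's full dictionary is elided.
contractions = {
    "i've": "I have",
    "we're": "we are",
    "you've": "you have",
}

def count_contractions(text, known):
    """Count whitespace-separated words that appear as keys in `known`."""
    count = 0
    for word in text.lower().split():
        # strip() only trims the ends, so the apostrophe inside "i've" survives,
        # while the trailing comma in "economy," is removed.
        word = word.strip(punctuation)
        if word in known:
            count += 1
    return count

print(count_contractions(speech, contractions))  # 4
```

The lone "-" in the speech becomes an empty string after stripping, which is simply not found in the dictionary.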

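You are iterating over a string, so the items are characters. To get the words from a string you can use a naive method like str.split(), which gives you a list of strings to iterate over (the words, split on the argument of str.split(), which defaults to whitespace). There is also re.split(), which is more powerful, but I don't think you need regular expressions to split your text.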

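At a minimum, you should lowercase your string with str.lower(), or else put every possible variant (including capitalized ones) into the dictionary. I strongly recommend the first option; the latter isn't really practical. Removing punctuation is also necessary. But this is still naive. If you need a more sophisticated method, you have to split the text with a word tokenizer. NLTK is a good starting point for that; see the nltk tokenizer. But I have a strong feeling that this is not your main problem and won't really get in the way of solving your question. :)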
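If plain splitting turns out to be too naive but a full tokenizer feels like overkill, a regular expression can serve as a middle ground. The sketch below uses re.findall (rather than re.split or NLTK), and both the helper name tokenize and the pattern are assumptions chosen for this demo:

```python
import re

def tokenize(text):
    """Return lowercase words, keeping one internal apostrophe (e.g. "we're")."""
    # Letters, optionally followed by an apostrophe and more letters.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("We're headed in the right direction - you've all been a great help.")
print(tokens)
```

Punctuation such as "-" and "." never ends up in the token list, so no separate stripping step is needed. The pattern only handles ASCII letters and the straight apostrophe.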

speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...

# With re you can define more advanced patterns, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# is still enough for you.
import re

def abbreviation_counter(input_text, abbreviation_dict):   
    count = 0
    # What you want is a list of words; str.split() does this job for you.
    # Splitting on " " works here; calling split() with no argument splits on
    # any run of whitespace instead. If you really need better methods (see
    # the answer text above), you have to use a word tokenizer or write your own.
    for word in input_text.split(" "):
        # and also clean word (remove ',', ';', ...) afterwards. The advantage of 
        # using re over `from string import punctuation` is that you have more
        # control in what you want to remove. That means that you can add or
        # remove easily any punctuation mark. It could be very handy. It could be
        # also overpowered. If the latter is the case, just stick to Martijn Pieters
        # solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1

    return count

print(abbreviation_counter(speech, contractions))
# Output: 2 (yeah, it worked - I've included I've in your list :)

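It's a bit frustrating to answer at the same time as Martijn Pieters ;), but I hope I've still provided some value for you. That's why I've edited my post to give you some further suggestions.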

A for loop in Python iterates over all elements of an iterable. In the case of a string, the elements are its characters.
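A quick way to see this for yourself is sketched below; the one-entry dictionary is an assumption made for the demo:

```python
contractions = {"i've": "I have"}  # assumed single entry, for illustration

# Iterating over a string yields one character at a time.
chars = [ch for ch in "I've"]
print(chars)  # ['I', "'", 'v', 'e']

# No single character is a key in the dictionary, so nothing is ever counted.
matches = sum(1 for ch in "I've" if ch in contractions)
print(matches)  # 0
```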

You need to split the string into a list (or tuple) of strings containing the words. You can use .split(delimiter) for this.

Your problem is quite common, so Python has a shortcut: speech.split() splits on any number of spaces/tabs/newlines, so you get only your words in the list.

So your code should look like this:

count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)

speech.split(" ") would work too, but it splits only on spaces, not on tabs or newlines, and if there are double spaces you get empty elements in the resulting list.
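The difference is easy to check with a made-up string (not from the question) that contains a double space and a tab:

```python
text = "the  right\tdirection"  # two spaces, then a tab

by_space = text.split(" ")   # splits on every single space only
by_default = text.split()    # splits on any run of whitespace

print(by_space)    # ['the', '', 'right\tdirection']
print(by_default)  # ['the', 'right', 'direction']
```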

