Count occurrences of elements in string from a list?
I'm trying to count the number of times verbal contractions occur in some speeches I've gathered. One particular speech looks like this:
speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")
So, in this case, I would like to count four (4) contractions. I have a list of contractions, and here are the first few terms:
contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}
My code looks like this:
count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print(count)
I'm getting nowhere with this, however, because the code iterates over every single letter rather than over whole words.
Use str.split() to split your string on whitespace:
for word in speech.split():
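This splits on arbitrary whitespace; that means spaces, tabs, newlines and a few more exotic whitespace characters, and any number of them in a row.
You may want to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation: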
from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1
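I use the str.strip() method here; it removes every character found in the string.punctuation string from the start and end of a word, and leaves characters inside the word (such as the apostrophe in can't) alone.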
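As a rough end-to-end check of the approach above, here is a minimal sketch run against the speech from the question. The `clean` helper and the three-entry `sample_contractions` dict are illustrative assumptions, not part of the original code; the real dictionary is much longer.

```python
from string import punctuation

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

# Hypothetical sample entries, just enough to cover this speech.
sample_contractions = {"i've": "i have", "we're": "we are", "you've": "you have"}

def clean(word):
    # Lowercase, then drop punctuation from both ends only;
    # the apostrophe inside "i've" is untouched.
    return word.lower().strip(punctuation)

count = sum(1 for word in speech.split() if clean(word) in sample_contractions)
print(count)  # 4
```

Note that a token made up only of punctuation, such as the lone "-", is reduced to an empty string, which simply never matches a dictionary key.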
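You are iterating over a string, so the items are characters. To get the words out of a string you can use a naive method such as str.split() (you then iterate over a list of strings: the words produced by splitting on the argument of str.split(), which defaults to whitespace), or re.split(), which is more powerful. But I don't think you need regular expressions just to split the text.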
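For comparison, this is roughly what the re.split() route could look like. The sample sentence and the set of delimiter characters are assumptions chosen for illustration; adjust the character class to whatever punctuation your speeches contain.

```python
import re

text = "I've changed the path, and I've increased jobs; we're done."

# Split on runs of whitespace and/or the listed punctuation marks.
# The apostrophe is deliberately not in the class, so contractions stay whole.
# A delimiter at the very end yields an empty string, hence the filter.
words = [w for w in re.split(r"[\s,;.!?]+", text) if w]
print(words)
```

The delimiters are consumed by the split, so no separate punctuation-stripping step is needed for the characters listed in the pattern.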
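At the very least you should lowercase the string with str.lower(), or else put every possible variant (including the capitalized ones) into the dictionary. I strongly recommend the first option; the second isn't practical. Removing punctuation is needed for the same reason. But this is still naive. If you need a more sophisticated approach, you have to split the text with a word tokenizer. NLTK is a good starting point, see the nltk tokenizers. But I strongly suspect that this isn't your main problem and won't really affect how you solve your question. :)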
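If a full tokenizer is more than you need, a small regex-based stand-in may be enough. The sketch below is not NLTK; the `TOKEN` pattern and the `contraction_counts` helper are assumptions made up for this example. It lowercases the text, extracts word-like tokens, and tallies each known contraction separately.

```python
import re
from collections import Counter

# Assumed token shape: runs of ASCII letters, optionally joined by
# straight apostrophes (curly apostrophes are not handled here).
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)*")

def contraction_counts(text, known):
    # Lowercase first so "I'VE" and "I've" both match the key "i've".
    tokens = TOKEN.findall(text.lower())
    return Counter(t for t in tokens if t in known)

counts = contraction_counts("I've said it: we're late, and I'VE had enough.",
                            {"i've", "we're"})
print(counts["i've"], counts["we're"], sum(counts.values()))  # 2 1 3
```

Keeping a per-contraction tally costs nothing extra and lets you see which contractions a speaker favours, while sum(counts.values()) still gives the overall total.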
speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...
# With re you can define advanced regexes, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# is still enough for you.
import re
def abbreviation_counter(input_text, abbreviation_dict):
    count = 0
    # What you want is a list of words. str.split() does this job for you.
    # " " is passed explicitly here; with no argument, split() splits on any
    # run of whitespace instead. If you really need better methods (see the
    # answer text above), you have to use a word tokenizer tool or write
    # your own.
    for word in input_text.split(" "):
        # Also clean the word (remove ',', ';', ...) afterwards. The advantage
        # of using re over `from string import punctuation` is that you have
        # more control over what you want to remove: you can easily add or
        # remove any punctuation mark. That can be very handy. It can also be
        # overkill; if so, just stick to Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1
    return count

print(abbreviation_counter(speech, contractions))
2 # yeah, it worked - I've included I've in your list :)
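It's a bit frustrating to post an answer at the same time as Martijn Pieters ;), but I hope I've still added some value for you. That's why I've edited the question to give you some further hints.
A for loop in Python iterates over all elements of an iterable. For a string, the elements are the characters.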
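A quick way to see this for yourself (the sample string and the one-entry dictionary are made up for this illustration):

```python
sample = "I've won"
sample_contractions = {"i've": "i have"}  # hypothetical one-entry dictionary

# Looping over a string yields one character per iteration.
items = [item for item in sample]
print(items)  # ['I', "'", 'v', 'e', ' ', 'w', 'o', 'n']

# No single character is ever a key, so the original loop counts nothing.
print(any(item.lower() in sample_contractions for item in items))  # False
```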
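You need to split the string into a list (or tuple) of strings containing the words. You can use .split(delimiter) for this.
Your problem is quite common, so Python has a shortcut: speech.split() splits on any run of spaces/tabs/newlines, so you get only the words in your list.
So your code should look like this: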
count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)
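speech.split(" ") would also work, but it splits only on single spaces, not on tabs or newlines, and if the text contains double spaces you get empty elements in the resulting list.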
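The difference is easy to see on a small string that mixes whitespace kinds (the sample text is made up for this illustration):

```python
text = "We're  headed\tin the\nright direction"  # two spaces, a tab, a newline

no_arg = text.split()
print(no_arg)      # ["We're", 'headed', 'in', 'the', 'right', 'direction']

space_arg = text.split(" ")
print(space_arg)   # ["We're", '', 'headed\tin', 'the\nright', 'direction']
```

With split(" "), the double space leaves an empty string behind, and the words separated by a tab or newline stay glued together, so neither would match a dictionary key.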