How would I isolate specific words containing a specific character?
So I'm creating an analytics bot for my EPQ that counts the number of times a specific hashtag is used. How would I go about checking whether a word in a string of other words contains a #?
A first approach can check whether a string contains a substring using the in operator, and gather a count for each unique word using a dictionary:
texts = ["it's friday! #TGIF", "My favorite day! #TGIF"]
counts = {}
for text in texts:
    for word in text.split(" "):
        if "#" not in word:
            continue
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
print(counts)
# {'#TGIF': 2}
This could be improved further with str.casefold() to normalize text with different casings.
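As a rough sketch of the str.casefold() idea, the loop can fold each word before using it as a dictionary key, so that "#TGIF" and "#tgif" land in the same bucket. The sample texts here are made up for illustration, and the variable name key is just a placeholder:

```python
# Hypothetical sample data: the same hashtag written with different casing
texts = ["it's friday! #TGIF", "My favorite day! #tgif"]

counts = {}
for text in texts:
    for word in text.split():
        if "#" not in word:
            continue
        # casefold() gives a case-insensitive key for the hashtag
        key = word.casefold()
        counts[key] = counts.get(key, 0) + 1

print(counts)
# {'#tgif': 2}
```

Using dict.get with a default of 0 also replaces the separate "if word not in counts" check.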
You already have a decent answer, so it really just comes down to what kind of data you want to end up with. Here's another solution, using Python's re module on similar data:
import re
texts = ["it's friday! #TGIF #foo", "My favorite day! #TGIF"]
# Raw string so that \w is passed to the regex engine unchanged
finds = [re.findall(r'#(\w+)', text) for text in texts]
print(finds)
Regex takes some getting used to. The pattern '#(\w+)' 'captures' (with the parentheses) the 'word' (\w+, one or more letters, digits, or underscores) after any hash character ('#'). It results in a list of hashtags for each 'document' in the dataset:
[['TGIF', 'foo'], ['TGIF']]
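One practical difference between splitting on whitespace and using the regex shows up when a hashtag is followed by punctuation. The comparison below is only an illustration on a made-up sentence; the names split_tags and regex_tags are placeholders:

```python
import re

# Hypothetical text where hashtags are followed by punctuation
text = "Finally the weekend! #TGIF, #foo."

# Whitespace splitting keeps whatever is attached to the word
split_tags = [word for word in text.split() if "#" in word]
print(split_tags)
# ['#TGIF,', '#foo.']

# \w+ stops at the first character that is not a letter, digit, or underscore
regex_tags = re.findall(r"#(\w+)", text)
print(regex_tags)
# ['TGIF', 'foo']
```

If the counts should not treat "#TGIF," and "#TGIF" as different hashtags, the regex version is probably the easier starting point.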
Then you could get the total counts with this trick:
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(finds))
Yielding this dictionary-like thing:
Counter({'TGIF': 2, 'foo': 1})
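Since the original goal was to count one specific hashtag, the pieces can be put together roughly like this. The function name hashtag_counts is hypothetical, and folding the text to one case before matching is an assumption about what the bot should treat as "the same" hashtag:

```python
import re
from collections import Counter
from itertools import chain


def hashtag_counts(texts):
    """Count every hashtag across all texts, ignoring case."""
    finds = (re.findall(r"#(\w+)", text.casefold()) for text in texts)
    return Counter(chain.from_iterable(finds))


texts = ["it's friday! #TGIF #foo", "My favorite day! #TGIF"]
counts = hashtag_counts(texts)

print(counts["tgif"])         # 2
print(counts["monday"])       # 0, a Counter returns 0 for missing keys
print(counts.most_common(1))  # [('tgif', 2)]
```

Looking up a hashtag that never appeared gives 0 instead of raising KeyError, which is convenient for an analytics bot that checks arbitrary tags.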
test = " if a word in a string of other words contains a #"
if "#" in test:
    print("yes")
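The check above only says whether a # appears somewhere in the whole string. To isolate the individual words that contain the character, the same in test can be applied per word. This is a minimal sketch; the helper name words_containing and the sample sentence are made up for illustration:

```python
def words_containing(text, char):
    """Return the whitespace-separated words of text that contain char."""
    return [word for word in text.split() if char in word]


test = "happy #friday everyone #TGIF"
print(words_containing(test, "#"))
# ['#friday', '#TGIF']
```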