How would I isolate specific words containing a specific character?

So I'm creating an analytics bot for my EPQ that counts the number of times a specific hashtag is used. How would I go about checking whether a word in a string of other words contains a #?

A first approach can check whether a string contains a substring using in, and gather a count for each unique word using a dictionary:

texts = ["it's friday! #TGIF", "My favorite day! #TGIF"]
counts = {}

for text in texts:
    for word in text.split(" "):
        if "#" not in word:
            continue
        if word not in counts:
            counts[word] = 0
        counts[word] += 1

print(counts)
# {'#TGIF': 2}

This could be improved further by:

  • using str.casefold() to normalize text with different casings
  • using regex to ignore certain chars, e.g. '#tgif!' should be parsed as '#tgif'
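Both improvements can be combined in a few lines. This is only a minimal sketch: the second sample text (with the lowercase '#tgif!') is made up for illustration, and it assumes a hashtag is a '#' followed by word characters, which may not match every platform's rules.

```python
import re

# Sample data; the second string is a hypothetical variant with
# different casing and trailing punctuation.
texts = ["it's friday! #TGIF", "My favorite day! #tgif!"]
counts = {}

for text in texts:
    # \w+ stops at punctuation, so '#tgif!' is matched as '#tgif'
    for tag in re.findall(r"#\w+", text):
        tag = tag.casefold()  # '#TGIF' and '#tgif' become the same key
        counts[tag] = counts.get(tag, 0) + 1

print(counts)
# {'#tgif': 2}
```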

You already have a decent answer, so it really just comes down to what kind of data you want to end up with. Here's another solution, using Python's re module on the same data:

import re

texts = ["it's friday! #TGIF #foo", "My favorite day! #TGIF"]

# Collect the hashtags (without the leading '#') found in each text
finds = [re.findall(r'#(\w+)', text) for text in texts]
print(finds)

Regex takes some getting used to. The '#(\w+)' 'captures' (with the parentheses) the 'word' (\w+) after any hash character ('#'). It results in a list of hashtags for each 'document' in the dataset:

[['TGIF', 'foo'], ['TGIF']]
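To make the role of the parentheses concrete, here is a small comparison on one of the sample strings. The variable names are just for illustration; the point is that re.findall returns only the captured group when the pattern contains one, and the whole match otherwise.

```python
import re

text = "it's friday! #TGIF #foo"

with_group = re.findall(r"#(\w+)", text)   # only the captured part
without_group = re.findall(r"#\w+", text)  # the whole match, '#' included

print(with_group)     # ['TGIF', 'foo']
print(without_group)  # ['#TGIF', '#foo']
```

Which form you want depends on whether the '#' should be part of the key you count.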

Then you could get the total counts with this trick:

from collections import Counter
from itertools import chain

Counter(chain.from_iterable(finds))

Yielding this dictionary-like thing:

Counter({'TGIF': 2, 'foo': 1})

# Simplest check: does the string contain a '#' anywhere?
test = " if a word in a string of other words contains a #"
if "#" in test:
    print("yes")
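
The check above only says whether the string contains a '#' somewhere. If the goal is to isolate the words themselves, the same in test could be applied per word. A rough sketch, assuming words are separated by whitespace (the sample string is borrowed from the earlier answers):

```python
test = "it's friday! #TGIF #foo"

# Keep only the whitespace-separated words that contain a '#'
hashtag_words = [word for word in test.split() if "#" in word]

print(hashtag_words)  # ['#TGIF', '#foo']
```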


 