How to find the count of a word in a string

Question

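I have a string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only prints counts of characters: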

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except KeyError:
            d[word] = 1

    for k in d.keys():
        print("%s: %d" % (k, d[k]))
print(countWord("Hello I am going to I with Hello am"))

I want to learn how to count the words.

Answer 1

If you want to find the count of an individual word, just use count:

input_string.count("Hello")

Use collections.Counter and split() to tally all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
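
As a sketch of how the resulting Counter can be queried, using the string from the question. Lowercasing is an added assumption here so that Hello and hello are tallied together. Note that split() is what produces words; iterating over the string directly, as the loop in the question does, visits one character at a time.

```python
from collections import Counter

input_string = "Hello I am going to I with hello am"

# split() breaks the string on runs of whitespace, yielding whole words.
words = input_string.lower().split()
word_count = Counter(words)

# A Counter behaves like a dict, except that words which never occur
# report 0 instead of raising KeyError.
print(word_count["hello"])
print(word_count["goodbye"])

# most_common(n) lists the n most frequent words with their counts.
print(word_count.most_common(3))
```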

Answer 2

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())

Answer 3

from collections import Counter
import re

def countWords(text):
    # Lowercase, then keep runs of word characters and apostrophes
    return Counter(re.findall(r"[\w']+", text.lower()))

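Using re.findall is more versatile than split: the pattern drops punctuation while keeping contractions such as "don't" and "I'll" intact.

Demo (using your example):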

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to make many queries like this, this does O(N) work only once, rather than O(N*#queries) work.
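
To illustrate the point about repeated queries, here is a minimal sketch that tokenizes once and then answers several lookups from the same Counter. The sample text and the query words are made up for illustration.

```python
import re
from collections import Counter

text = "Don't panic. I'll count words, and I'll count them once; don't recount!"

# One O(N) pass over the text: tokenize and tally every word.
counts = Counter(re.findall(r"[\w']+", text.lower()))

# Each later query is a dictionary lookup rather than a rescan of the text.
for query in ("don't", "i'll", "count", "recount", "missing"):
    print(query, counts[query])
```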

Answer 4

The vector of word occurrence counts is called a bag-of-words.

Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 is the default: keep every term found in at least one document
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names_out() replaces the older get_feature_names()
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

Output:

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI, a related question: How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?
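
Note that i is absent from the output above. By default CountVectorizer lowercases the text and keeps only tokens of two or more word characters (its default token_pattern is r"(?u)\b\w\w+\b"). The following sketch approximates that default tokenization with the standard library only; it illustrates the behaviour and is not scikit-learn's actual implementation.

```python
import re
from collections import Counter

text = "Hello I am going to I with hello am"

# Same regular expression as CountVectorizer's default token_pattern:
# a token is a run of at least two word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

tokens = token_pattern.findall(text.lower())
counts = Counter(tokens)

# Print in alphabetical order, matching the vocabulary order above.
for tag in sorted(counts):
    print(counts[tag], tag)
```

To count single-character words such as i with CountVectorizer itself, a different token_pattern would have to be passed in.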

Answer 5

Here is an alternative, case-insensitive approach:

sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

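It matches by converting both the string and the target to lowercase.

PS: This takes care of the "am ham".count("am") == 2 problem with str.count() pointed out by @DSM below :)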
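
The substring pitfall mentioned here can be seen directly, together with a regular-expression alternative that matches whole words only. This is a sketch: the helper name count_whole_word is illustrative, and it treats a word as text delimited by \b word boundaries.

```python
import re


def count_whole_word(text, word):
    """Count case-insensitive occurrences of word as a whole word in text."""
    # re.escape guards against regex metacharacters in the search word;
    # \b anchors the match at a word boundary on both sides.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Substring count: also matches the "am" inside "ham".
print("am ham".count("am"))
# Whole-word count: only the standalone "am" is matched.
print(count_whole_word("am ham", "am"))
```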

Answer 6

To treat Hello and hello as the same word, regardless of case:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

Answer 7

You can split the string into its elements and count them, which gives the total number of words:

count = len(my_string.split())

Answer 8

You can use the Python regular expression library re to find all matches of a substring and return them as a list.

import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:

2

Answer 9

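def countSub(pat, string):
    # Count occurrences of pat in string, including overlapping ones
    result = 0
    for i in range(len(string) - len(pat) + 1):
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break
        else:
            # The inner loop finished without a mismatch: full match at i
            result += 1
    return result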
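
Unlike str.count, which skips past each match it finds, the function above counts overlapping occurrences. The comparison below shows the difference, using a regular-expression lookahead as another way to count overlapping matches of a non-empty pattern. The name count_overlapping is illustrative and not part of the answer.

```python
import re


def count_overlapping(pat, string):
    """Count occurrences of pat in string, including overlapping ones."""
    if not pat:
        return 0  # assumption: an empty pattern is treated as no match
    # A zero-width lookahead matches at every start position where pat
    # occurs without consuming characters, so overlaps are all counted.
    return len(re.findall("(?=" + re.escape(pat) + ")", string))


# str.count only sees non-overlapping matches.
print("aaaa".count("aa"))
# The lookahead version also counts the match starting at index 1.
print(count_overlapping("aa", "aaaa"))
```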

Answer 10

If you want to find the total number of words in a given string, this is the simplest code you can use:

def word_count(text):
    # split() with no arguments splits on any run of whitespace
    return len(text.split())
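
One detail worth noting: split() with no arguments splits on any run of whitespace and ignores leading and trailing spaces, so a padded string is still counted correctly. A short sketch contrasting it with split(" "), using a padded variant of the question's string (the extra spaces are added here for illustration):

```python
padded = " Hello I am  going to I with hello am "

# No-argument split() collapses runs of whitespace and drops empty strings.
print(len(padded.split()))

# split(" ") splits on every single space, so the padding and the double
# space produce empty strings that inflate the count.
print(len(padded.split(" ")))
```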

Answer 1: 43, accepted, 2012-07-02 20:05:03
Answer 2: 6, 2012-07-02 20:05:06
Answer 3: 5, 2012-07-02 20:05:02
Answer 4: 3, 2015-08-11 23:40:15
Answer 5: 2, 2012-07-02 20:05:19
Answer 6: 2, 2012-07-02 20:14:35
Answer 7: 1, 2020-01-23 10:02:52
Answer 8: 0, 2016-09-09 20:06:11
Answer 9: 0
Answer 10: 0, 2022-08-05 11:56:38
