Real word count in NLTK

Question

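The NLTK book has a couple of word count examples, but in reality they are token counts, not word counts. For instance, Chapter 1, "Counting Vocabulary", says that the following gives a word count: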

text = nltk.Text(tokens)
len(text)

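However, it doesn't: it gives a count of words and punctuation marks together. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is: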

word_average_length =(len(string_of_text)/len(text))

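However, this would be off because:

len(string_of_text) is a character count, including spaces
len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.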
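
To make the discrepancy concrete, here is a small sketch. The sentence is made up, and both lists are written by hand to stand in for what a tokenizer would return, so treat the numbers as an illustration only:

```python
# Hypothetical sample; the token list is hand-written to mimic a tokenizer.
string_of_text = "Hi, there."
text = ["Hi", ",", "there", "."]   # 4 tokens, 2 of them punctuation
words = ["Hi", "there"]            # what a "real" word list would look like

# Naive formula: characters (including the space) divided by tokens.
naive = len(string_of_text) / len(text)

# What is actually wanted: characters inside words divided by words.
actual = sum(len(w) for w in words) / len(words)

print(naive, actual)
```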

Am I missing something here? This must be a very common NLP task...

Answer 1

Tokenization using nltk

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It includes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)

Returns

['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
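
From a token list like this, the word count and the average word length follow directly. The sketch below uses the standard library's re.findall with the same \w+ pattern as a stand-in, so it runs without NLTK installed; for this pattern it should give the same tokens as the tokenizer above. The sample sentence is made up for illustration.

```python
import re

# Made-up sample sentence; re.findall(r"\w+", ...) stands in for the tokenizer.
sample = "The cat sat. The cat slept, then purred!"
words = re.findall(r"\w+", sample)

word_count = len(words)                                    # punctuation is never matched
average_length = sum(len(w) for w in words) / word_count  # characters per word

print(word_count, average_length)
```

Note that \w+ splits on every non-word character, so abbreviations such as "U.S." (and contractions such as "don't") end up as several tokens and inflate the count.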

Answer 2

Remove the punctuation

Use a regular expression to filter out the punctuation

>>> import re
>>> from collections import Counter
>>>
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile(r'.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})

Average number of characters

Sum the lengths of the words, then divide by the number of words.

>>> float(sum(map(len, filtered))) / len(filtered)
3.75

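Or you could make use of the counts you have already computed to avoid some re-computation. This multiplies the length of each word by the number of times we saw it, then sums all of that up.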

>>> float(sum(len(w)*c for w,c in counts.items())) / len(filtered)  # Python 3: items(), not iteritems()
3.75
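
The count-based version can be wrapped in a small helper. This is only a sketch: the function name average_word_length is made up here, and it takes the total from the Counter itself so the filtered list is not needed. It also guards against an empty input, which the one-liners above would turn into a ZeroDivisionError.

```python
from collections import Counter

def average_word_length(counts):
    """Mean length of the words tallied in a Counter; 0.0 if it is empty."""
    total_words = sum(counts.values())
    if total_words == 0:
        return 0.0
    # Each distinct word contributes its length once per occurrence.
    return sum(len(w) * c for w, c in counts.items()) / total_words

# 'this' appears twice, so it is weighted twice in the average.
counts = Counter(['this', 'is', 'a', 'sentence', 'this'])
print(average_word_length(counts))
```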

Answer 3

Remove the punctuation (with no regex)

Same solution as dhg's, but it tests whether a given token is alphanumeric instead of using a regex pattern.

>>> from collections import Counter
>>>
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> filtered = [w for w in text if w.isalnum()]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})

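Advantages:

Works better with non-English languages, since "À".isalnum() is True while bool(nonPunct.match("à")) is False (an "à" is not a punctuation mark, at least in French).
No need to use the re package.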
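
One caveat worth noting: str.isalnum() requires every character to be alphanumeric, so tokens with an internal hyphen or apostrophe are dropped as well. The sketch below compares the ASCII regex, the strict isalnum() test, and a more lenient variant that keeps any token containing at least one alphanumeric character. The token list is made up for illustration, and the lenient variant is a suggestion rather than part of the approach above.

```python
import re

nonPunct = re.compile(r'.*[A-Za-z0-9].*')  # ASCII-only pattern from the regex answer
tokens = ['À', 'la', 'carte', ',', 'well-known', "n't", '--', '.']

# ASCII regex: misses 'À' because it contains no ASCII letter or digit.
ascii_regex = [w for w in tokens if nonPunct.match(w)]

# Strict: every character must be alphanumeric (Unicode-aware).
strict = [w for w in tokens if w.isalnum()]

# Lenient: at least one alphanumeric character is enough.
lenient = [w for w in tokens if any(ch.isalnum() for ch in w)]

print(ascii_regex)
print(strict)
print(lenient)
```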

Answer 4

Remove punctuation

from string import punctuation

# `text` is assumed to be a list of tokens (e.g. the output of a tokenizer)
punctuations = list(punctuation)
punctuations.append("''")
punctuations.append("--")
punctuations.append("``")
text = [word for word in text if word not in punctuations]

Average number of characters per word in a text

from collections import Counter

# `text` is the punctuation-free list of tokens from the previous step
word_count = Counter(text)
sum(len(x) * y for x, y in word_count.items()) / len(text)
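
An exact-match filter like the one above only drops tokens that are equal to an entry in the list, so a multi-character token such as '...' would slip through. A possible variant, sketched below with a hand-written token list (made up for illustration, standing in for tokenizer output), treats a token as punctuation when every character in it is a punctuation character. This also covers the three two-character tokens appended above.

```python
from string import punctuation

def is_punct(token):
    """True when the token consists only of ASCII punctuation characters."""
    return all(ch in punctuation for ch in token)

# Hand-written tokens standing in for tokenizer output.
tokens = ['``', 'Hello', ',', 'world', '...', 'again', '.', "''"]

# Exact-match filter, as in the answer: '...' survives.
punctuations = list(punctuation) + ["''", "--", "``"]
exact = [t for t in tokens if t not in punctuations]

# Character-wise filter: '...' is dropped too.
words = [t for t in tokens if not is_punct(t)]
average = sum(len(w) for w in words) / len(words)

print(exact)
print(words, average)
```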
