How to classify text based on common words
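This question is about classifying text based on common words, and I am not sure I am approaching the problem correctly. I have texts in a "Description" column and a unique ID in an "ID" column. I want to iterate through the descriptions, compare them based on the percentage or frequency of the words they have in common, and then classify the descriptions by assigning them another ID. Please see the example below.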
# importing pandas as pd
import pandas as pd

# creating a dataframe (each description must be a single, unbroken string literal)
df = pd.DataFrame({
    'ID': ['12', '54', '88', '9'],
    'Description': [
        'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
        'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic',
        'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
        'A television set or television receiver, more commonly called a television, TV, TV set, or telly',
    ],
})
ID Description
12 Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
54 Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88 Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9 A television set or television receiver, more commonly called a television, TV, TV set, or telly
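For example, the descriptions for IDs 12 and 54 have more than 75% of their words in common, so they should get the same ID. The output would look like this: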
ID Description
12 Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
12 Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88 Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9 A television set or television receiver, more commonly called a television, TV, TV set, or telly
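The row-by-row comparison described above can be sketched as follows. This is only a minimal illustration, not a definitive method: the helper names `word_set`, `overlap` and `relabel` are hypothetical, and the similarity measure (shared unique words divided by the size of the smaller word set) is one assumed reading of "percentage of common words". By this particular measure the two bacterium descriptions share 9 of 14 unique words (about 64%), so the sketch uses a threshold of 0.6 rather than 0.75.

```python
import re


def word_set(text):
    """Return the set of lowercase word tokens found in text."""
    return set(re.findall(r"\w+", text.lower()))


def overlap(a, b):
    """Share of the smaller word set that also appears in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def relabel(ids, descriptions, threshold):
    """Give each row the ID of the first earlier row it overlaps with enough."""
    word_sets = [word_set(d) for d in descriptions]
    new_ids = []
    for i, current in enumerate(word_sets):
        label = ids[i]  # keep the original ID unless a match is found
        for j in range(i):
            if overlap(current, word_sets[j]) >= threshold:
                label = new_ids[j]
                break
        new_ids.append(label)
    return new_ids


ids = ['12', '54', '88', '9']
descriptions = [
    'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
    'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic',
    'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
    'A television set or television receiver, more commonly called a television, TV, TV set, or telly',
]

# The threshold is an assumption for this sample data and needs tuning on real data.
new_ids = relabel(ids, descriptions, threshold=0.6)
print(new_ids)
```

With a pandas DataFrame, the same function could be applied as `df['ID'] = relabel(df['ID'].tolist(), df['Description'].tolist(), threshold=0.6)`. Removing stop words before building the word sets would likely make the measure more meaningful, since words such as "is" and "a" count as matches here.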
Here is what I tried, using two different dataframes, Risk1 and Risk2. I am not iterating over the rows here, which I also need to do:
import codecs
import re
import copy
import collections
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
import matplotlib.pyplot as plt
%matplotlib inline
nltk.download('stopwords')
from nltk.corpus import stopwords
# creating dataframe 1
Risk1_df = pd.DataFrame({'ID': ['12'],
                         'Description': ['Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes']})
# creating dataframe 2
Risk2_df = pd.DataFrame({'ID': ['54'],
                         'Description': ['Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic']})
# text of each dataframe as a single string, which is what get_text_counter expects
Risk1 = ' '.join(Risk1_df['Description'])
Risk2 = ' '.join(Risk2_df['Description'])
esw = stopwords.words('english')
esw.append('would')
word_pattern = re.compile(r"^\w+$")
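def get_text_counter(text):
    # tokenize first; passing the whole text to the stemmer would treat it as a single word
    tokens = [token.lower() for token in WordPunctTokenizer().tokenize(text)]
    # keep word-like tokens that are not stop words
    tokens = [token for token in tokens if word_pattern.match(token) and token not in esw]
    # stem each remaining token individually
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    return collections.Counter(tokens), len(tokens)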
def make_df(counter, size):
    # counter is a list of (word, count) pairs, e.g. from Counter.most_common()
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data=np.array([abs_freq, rel_freq]).T, index=index,
                      columns=['Absolute Frequency', 'Relative Frequency'])
    df.index.name = 'Most_Common_Words'
    return df
Risk1_counter, Risk1_size = get_text_counter(Risk1)
make_df(Risk1_counter.most_common(500), Risk1_size)
Risk2_counter, Risk2_size = get_text_counter(Risk2)
make_df(Risk2_counter.most_common(500), Risk2_size)
all_counter = Risk1_counter + Risk2_counter
all_df = make_df(all_counter.most_common(1000), 1)
most_common_words = all_df.index.values
df_data = []
for word in most_common_words:
    Risk1_c = Risk1_counter.get(word, 0) / Risk1_size
    Risk2_c = Risk2_counter.get(word, 0) / Risk2_size
    d = abs(Risk1_c - Risk2_c)
    df_data.append([Risk1_c, Risk2_c, d])
dist_df = pd.DataFrame(data=df_data, index=most_common_words,
                       columns=['Risk1 Relative Freq', 'Risk2 Relative Freq', 'Relative Freq Difference'])
dist_df.index.name = 'Most Common Words'
dist_df.sort_values('Relative Freq Difference', ascending = False, inplace=True)
dist_df.head(500)
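A better approach may be to use a sentence similarity algorithm from NLP. A good starting point is Google's Universal Sentence Encoder (USE), as shown in this Python notebook. If the pretrained Google USE does not work well, there are other sentence embeddings (for example, InferSent from Facebook). Another option is to use word2vec and average the vectors obtained for each word in the sentence.
You want to find the cosine similarity between the sentence embeddings, and then relabel the categories whose similarity is above some threshold (for example, 0.8). You will have to try different similarity thresholds to get the best matching performance.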
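The cosine-similarity and relabeling step can be sketched as below. To keep the example self-contained, `toy_embed` is a hypothetical stand-in that builds plain word-count vectors; it is not the Universal Sentence Encoder, InferSent or word2vec, and in practice you would replace it with vectors from one of those models. The names `cosine` and `relabel_by_similarity` are also illustrative. With these toy vectors the two bacterium descriptions score about 0.73, so the demo uses a threshold of 0.7; real sentence embeddings would give different scores and need their own tuning.

```python
import math
import re
from collections import Counter


def toy_embed(texts):
    """Stand-in embedding (assumption): word-count vectors over a shared vocabulary."""
    counts = [Counter(re.findall(r"\w+", t.lower())) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relabel_by_similarity(ids, vectors, threshold=0.8):
    """Give each row the ID of the first earlier row whose vector is similar enough."""
    new_ids = []
    for i, vec in enumerate(vectors):
        label = ids[i]  # keep the original ID unless a match is found
        for j in range(i):
            if cosine(vec, vectors[j]) >= threshold:
                label = new_ids[j]
                break
        new_ids.append(label)
    return new_ids


ids = ['12', '54', '88', '9']
descriptions = [
    'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
    'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic',
    'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
    'A television set or television receiver, more commonly called a television, TV, TV set, or telly',
]

vectors = toy_embed(descriptions)
# 0.7 is chosen for these toy vectors only; tune the threshold for real embeddings.
new_ids = relabel_by_similarity(ids, vectors, threshold=0.7)
print(new_ids)
```

Because `relabel_by_similarity` only needs a list of vectors, swapping in real sentence embeddings should require no change to the comparison logic, only a different threshold.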