
How to Classify Text Based on Common Words

This question is about classifying text based on common words, and I'm not sure I'm approaching it the right way. I have text in a "Description" column and unique IDs in an "ID" column. I want to iterate over the descriptions, compare them by the percentage (or frequency) of words they share, and group similar descriptions together by assigning them the same ID. See the example below...

    # importing pandas as pd
    import pandas as pd

    # creating a dataframe
    df = pd.DataFrame({
        'ID': ['12', '54', '88', '9'],
        'Description': [
            'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
            'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic',
            'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
            'A television set or television receiver, more commonly called a television, TV, TV set, or telly',
        ],
    })
ID     Description
12  Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
54  Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88  Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9   A television set or television receiver, more commonly called a television, TV, TV set, or telly

For example, descriptions 12 and 54 share more than 75% of their words, so they would get the same ID. The output would be something like this:

ID     Description
12  Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
12  Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88  Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9   A television set or television receiver, more commonly called a television, TV, TV set, or telly
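The grouping described above can be sketched in plain Python with a simple word-overlap measure. This is only a sketch under stated assumptions: words are split on whitespace with commas stripped, overlap is measured against the smaller word set, and the threshold is lowered to 0.5 for illustration, because with this naive tokenization descriptions 12 and 54 actually share only about 58% of their words.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['12', '54', '88', '9'],
                   'Description': [
    'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
    'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic',
    'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
    'A television set or television receiver, more commonly called a television, TV, TV set, or telly']})

def overlap(a, b):
    # Fraction of the smaller description's word set shared with the other.
    wa = set(a.lower().replace(',', '').split())
    wb = set(b.lower().replace(',', '').split())
    return len(wa & wb) / min(len(wa), len(wb))

# Relabel each row with the ID of the first earlier row that is similar enough.
threshold = 0.5  # illustrative; tune on real data
ids = df['ID'].tolist()
for i in range(len(df)):
    for j in range(i):
        if overlap(df['Description'][i], df['Description'][j]) > threshold:
            ids[i] = ids[j]
            break
df['ID'] = ids
print(df)  # rows 12 and 54 now share ID '12'
```

This pairwise loop is O(n²) in the number of rows, so it only suits small dataframes; it is meant to show the relabelling logic, not a production approach.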

Here I tried it with two different dataframes, Risk1 and Risk2, but I haven't iterated over the rows, which I also need to do:

import codecs
import re
import copy
import collections
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
import matplotlib.pyplot as plt

%matplotlib inline

nltk.download('stopwords')

from nltk.corpus import stopwords

# creating dataframe 1
df1 = pd.DataFrame({'ID': ['12'],
                    'Description': ['Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes']})
# creating dataframe 2
df2 = pd.DataFrame({'ID': ['54'],
                    'Description': ['Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic']})

# the raw description strings compared below
Risk1 = df1['Description'][0]
Risk2 = df2['Description'][0]

esw = stopwords.words('english')
esw.append('would')

word_pattern = re.compile(r"^\w+$")

def get_text_counter(text):
    tokens = WordPunctTokenizer().tokenize(PorterStemmer().stem(text))
    tokens = list(map(lambda x: x.lower(), tokens))
    tokens = [token for token in tokens if re.match(word_pattern, token) and token not in esw]
    return collections.Counter(tokens), len(tokens)

def make_df(counter, size):
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data = np.array([abs_freq, rel_freq]).T, index=index, columns=['Absolute Frequency', 'Relative Frequency'])
    df.index.name = 'Most_Common_Words'
    return df

Risk1_counter, Risk1_size = get_text_counter(Risk1)
make_df(Risk1_counter.most_common(500), Risk1_size)

Risk2_counter, Risk2_size = get_text_counter(Risk2)
make_df(Risk2_counter.most_common(500), Risk2_size)

all_counter = Risk1_counter + Risk2_counter
all_df = make_df(all_counter.most_common(1000), 1)
most_common_words = all_df.index.values


df_data = []
for word in most_common_words:
    Risk1_c = Risk1_counter.get(word, 0) / Risk1_size
    Risk2_c = Risk2_counter.get(word, 0) / Risk2_size
    d = abs(Risk1_c - Risk2_c)
    df_data.append([Risk1_c, Risk2_c, d])
dist_df= pd.DataFrame(data = df_data, index=most_common_words,
                    columns=['Risk1 Relative Freq', 'Risk2 Relative Freq', 'Relative Freq Difference'])
dist_df.index.name = 'Most Common Words'
dist_df.sort_values('Relative Freq Difference', ascending = False, inplace=True)


dist_df.head(500) 

A better approach might be to use a sentence-similarity algorithm from NLP. A good starting point is Google's Universal Sentence Encoder, as shown in this Python notebook. If the pretrained Google USE doesn't work well, there are other sentence embeddings (e.g., InferSent from Facebook). Another option is to use word2vec and average the vectors obtained for each word in a sentence.

You want to compute the cosine similarity between the sentence embeddings and then relabel the categories whose similarity is above some threshold (e.g., 0.8). You will have to experiment with different similarity thresholds to get the best matching performance.
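The threshold-and-relabel step can be sketched with scikit-learn, using TF-IDF vectors as a stand-in for sentence embeddings (an assumption of this sketch; with a real embedding model such as USE you would replace the vectorizer and use a higher threshold, around 0.8):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({'ID': ['12', '54', '88', '9'],
                   'Description': [
    'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
    'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic',
    'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
    'A television set or television receiver, more commonly called a television, TV, TV set, or telly']})

# TF-IDF vectors stand in for sentence embeddings in this sketch.
vectors = TfidfVectorizer(stop_words='english').fit_transform(df['Description'])
sim = cosine_similarity(vectors)  # pairwise cosine-similarity matrix

# Relabel each row with the ID of the first earlier row above the threshold.
threshold = 0.4  # tuned for this toy data; try ~0.8 with real embeddings
ids = df['ID'].tolist()
for i in range(len(df)):
    for j in range(i):
        if sim[i, j] >= threshold:
            ids[i] = ids[j]
            break
df['ID'] = ids
```

Because TF-IDF only matches surface words, it misses paraphrases that an embedding model would catch; the relabelling loop itself is unchanged whichever vectors you plug in.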
