How to use sklearn TfidfVectorizer on a pandas DataFrame
I am working with a tab-separated file that looks like this:
0 abch7619 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 42Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat…..........
1 uewl0928 Duis aute irure d21olor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excep3teur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
0 ahwb3612 Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur
1 llll2019 adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur???? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?
0 jdne2319 At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.
1 asbq0918 Et harum quidem rerum facilis est et expedita distinctio................................ Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. Temporibus autem quibusdam et aut
My goal is to produce a DataFrame that looks like this:
classification ID word1 word2 word3 word4
foo foo foo foo foo foo
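where each word in the TSV's long text field appears as a feature (column), and its value is that word's TF-IDF score.
I could attempt this by hand, but I would rather use sklearn's TfidfVectorizer to produce it. However, I need to preprocess the text in that field so that it follows certain guidelines.
So far I can read the .tsv file, create a DataFrame, and preprocess the text. What I am having trouble with is combining my text-formatting functions and then passing the result to TfidfVectorizer.
Here is what I have: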
import nltk, string, csv, operator, re, collections, sys, struct, zlib, ast, io, math, time
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from collections import defaultdict, Counter
from bs4 import BeautifulSoup as soup
from math import sqrt
from itertools import islice
import pandas as pd
# This function removes numbers from an array
def remove_nums(arr):
    # Declare a regular expression
    pattern = '[0-9]'
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]
    # Return the tokens
    return tokens

def main():
    tsv_file = "filepath"
    print(tsv_file)
    csv_table=pd.read_csv(tsv_file, sep='\t')
    csv_table.columns = ['rating', 'ID', 'text']
    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    vocab = get_words(new)
    print(vocab)

main()
This produces:
['decent', 'terribly', 'inconsistent', 'food', 'ive', 'great', 'dishes', 'terrible', 'ones', 'love', 'chaat', 'times', 'great', 'fried', 'greasy', 'mess', 'bad', 'way', 'good', 'way', 'usually', 'matar', 'paneer', 'great', 'oversalted', 'peas', 'plain', 'bad', 'dont', 'know', 'coinflip', 'good', 'food', 'oversalted', 'overcooked', 'bowl', 'either', 'way', 'portions', 'generous', 'looks', 'arent', 'everything', 'little', 'divito', 'looks', 'little', 'scary', 'looking', 'like', 'ive', 'said', 'cant', 'judge', 'book', 'cover', 'necessarily', 'kind', 'place', 'take', 'date', 'unless', 'shes', 'blind', 'hungry', 'man', 'oh', 'man', 'food', 'ever', 'good', 'ordered', 'breakfast', 'lunch', 'dinner', 'fantastico', 'make', 'homemade', 'corn', 'tortillas', 'several', 'salsas', 'breakfast', 'burritos', 'world', 'cost', 'mcdonalds', 'meal', 'family', 'eats', 'frequently', 'frankly', 'tired',
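However, I am not sure whether this is the right format for TfidfVectorizer to work properly. When I tried using it, I ran the following code, which executed without errors: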
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(csv_table['text'])
df = pd.DataFrame(data=feature_matrix.todense(), columns=tfidf.get_feature_names())
print(df)
But it only gave me results like this:
(0, 4147) 0.09801030349526582
(0, 4482) 0.11236176486916101
(0, 6304) 0.13511683683910816
: :
(1998, 11298) 0.08469000607646575
(1998, 500) 0.10185473904595721
(1998, 3196) 0.07801251063240894
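And I don't know what I'm looking at. How can I use TfidfVectorizer to achieve my goal, namely creating a feature matrix with a TF-IDF value for each word (after applying my cleaning logic)?
The output of fit_transform is a sparse matrix, so you need to convert it to dense form. To include your cleaning steps, you could try: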
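s = pd.Series(csv_table['text'])
# Clean each document separately, then re-join its tokens into one string
corpus = s.apply(lambda doc: ' '.join(get_words(doc)))
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df = pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)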
Basically, what you need to do is apply your cleaning routine (get_words) to each document in csv_table['text'] (the elements of s) before passing it to fit_transform.
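To get all the way to the layout in the question (the label and ID columns followed by one column per word), the approach above can be extended roughly as follows. This is only a sketch under some assumptions: clean_text and STOP_WORDS are hypothetical stand-ins for get_words and the NLTK stop-word list, so the example runs without downloading NLTK data, and the three rows are made up for illustration.

```python
import re
import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a tiny illustrative stop-word set, not NLTK's English list.
STOP_WORDS = {"the", "a", "is", "and", "was"}


def clean_text(text):
    """Hypothetical stand-in for get_words(): lowercase, strip punctuation
    and digits, drop stop words and tokens shorter than two characters."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    tokens = (tok.translate(table) for tok in text.lower().split())
    return [tok for tok in tokens if len(tok) > 1 and tok not in STOP_WORDS]


# Made-up rows in the same three-column layout as the TSV.
csv_table = pd.DataFrame({
    "rating": [0, 1, 0],
    "ID": ["abch7619", "uewl0928", "ahwb3612"],
    "text": [
        "The food was great, 10 out of 10!",
        "Terrible service and the food was cold.",
        "Great service, great food.",
    ],
})

# Clean every document on its own, then re-join the tokens into a string.
corpus = csv_table["text"].apply(lambda doc: " ".join(clean_text(doc)))

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# get_feature_names_out() exists from scikit-learn 1.0; fall back otherwise.
try:
    feature_names = vectorizer.get_feature_names_out()
except AttributeError:
    feature_names = vectorizer.get_feature_names()

# Dense TF-IDF table that shares the row index of the original frame.
tfidf_df = pd.DataFrame(X.toarray(), columns=feature_names, index=csv_table.index)

# Put the label and ID columns in front of the word columns.
result = pd.concat([csv_table[["rating", "ID"]], tfidf_df], axis=1)
print(result)
```

Each line of the sparse printout in the question, such as (0, 4147) 0.098..., is a (document row, vocabulary column) pair followed by its TF-IDF weight; converting to a dense array simply fills in the zeros. For a large corpus the dense table can use a lot of memory, so it may be worth keeping X sparse for modelling and building the DataFrame only for inspection. Note also that if a word in the corpus happens to match one of the leading column names (for example "rating"), the concatenated frame would end up with duplicate column names.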