How to use TfidfVectorizer in PySpark

Question

I am very new to PySpark and have some questions about PySpark DataFrames.

I am trying to implement the TF-IDF algorithm. I did it once with a pandas DataFrame. However, I started using PySpark and now everything has changed :( I can't index a PySpark DataFrame the way I would a pandas one, i.e. dataframe['ColumnName']. When I write and run the code, it says the dataframe is not iterable. This is a massive problem for me and I have not solved it yet. The current problem is as follows:

With pandas:


import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(pandasDF['name'])
tfidf_tran = tfidf.transform(pandasDF['name'])

With PySpark:

tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(SparkDF['name'])
tfidf_tran = tfidf.transform(SparkDF['name'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19992/3734911517.py in <module>
     13 vocabulary = list(vocabulary)
     14 tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
---> 15 tfidf.fit(dataframe['name'])
     16 tfidf_tran = tfidf.transform(dataframe['name'])
     17 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
   1821         self._check_params()
   1822         self._warn_for_unused_params()
-> 1823         X = super().fit_transform(raw_documents)
   1824         self._tfidf.fit(X)
   1825         return self

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1200         max_features = self.max_features
   1201 
-> 1202         vocabulary, X = self._count_vocab(raw_documents,
   1203                                           self.fixed_vocabulary_)
   1204 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1110         values = _make_int_array()
   1111         indptr.append(0)
-> 1112         for doc in raw_documents:
   1113             feature_counter = {}
   1114             for feature in analyze(doc):

E:\Anaconda\lib\site-packages\pyspark\sql\column.py in __iter__(self)
    458 
    459     def __iter__(self):
--> 460         raise TypeError("Column is not iterable")
    461 
    462     # string methods

TypeError: Column is not iterable

Answer 1

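TF-IDF is term frequency multiplied by inverse document frequency. Spark's MLlib has no single TF-IDF vectorizer for PySpark DataFrames, but it has two feature transformers that together give you TF-IDF. HashingTF gives you the term frequencies. IDF is fitted on those term frequencies and, when it transforms them, rescales each one by its inverse document frequency, so you do not multiply the two by hand. Chaining the two should give you an output matrix close to what you expected from the TfidfVectorizer you originally specified.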

Solution 1
0 2022-09-24 04:57:01
