How to use TfidfVectorizer in PySpark

Question

I am very new to PySpark and have some problems with PySpark DataFrames.

I am trying to implement the TF-IDF algorithm. I did it once with a pandas DataFrame. However, I started using PySpark and now everything has changed :( I can't use a PySpark DataFrame the way I would a pandas one, i.e. dataframe['ColumnName']. When I write and run the code, it says the dataframe is not iterable. This is a massive problem for me and I have not solved it yet. The current problem is as follows:

Using pandas:


tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(pandasDF['name'])
tfidf_tran = tfidf.transform(pandasDF['name'])

Using PySpark:

tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(SparkDF['name'])
tfidf_tran = tfidf.transform(SparkDF['name'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19992/3734911517.py in <module>
     13 vocabulary = list(vocabulary)
     14 tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
---> 15 tfidf.fit(dataframe['name'])
     16 tfidf_tran = tfidf.transform(dataframe['name'])
     17 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
   1821         self._check_params()
   1822         self._warn_for_unused_params()
-> 1823         X = super().fit_transform(raw_documents)
   1824         self._tfidf.fit(X)
   1825         return self

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1200         max_features = self.max_features
   1201 
-> 1202         vocabulary, X = self._count_vocab(raw_documents,
   1203                                           self.fixed_vocabulary_)
   1204 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1110         values = _make_int_array()
   1111         indptr.append(0)
-> 1112         for doc in raw_documents:
   1113             feature_counter = {}
   1114             for feature in analyze(doc):

E:\Anaconda\lib\site-packages\pyspark\sql\column.py in __iter__(self)
    458 
    459     def __iter__(self):
--> 460         raise TypeError("Column is not iterable")
    461 
    462     # string methods

TypeError: Column is not iterable

Answer 1

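TF-IDF is the term frequency multiplied by the inverse document frequency. Spark MLlib has no explicit TF-IDF vectorizer for DataFrames, but it has two feature transformers that together give you TF-IDF. With HashingTF, you get the term frequencies. With IDF, fitted on those term-frequency vectors, you get the inverse document frequencies, and its transform step multiplies the two for you. The output matrix should be close to what you expected from the TfidfVectorizer you started with, though not identical: HashingTF hashes terms into buckets instead of using your fixed vocabulary, and scikit-learn's defaults use a different IDF smoothing and L2-normalize each row.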
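
A minimal sketch of that approach, assuming the text lives in a string column called name as in the question. The function name add_tfidf and the intermediate column names tokens, tf and tfidf are my own choices for illustration, not anything Spark requires. The pyspark imports sit inside the function only so that the small pure-Python helper spark_style_tfidf can be run without a Spark install; that helper mirrors the arithmetic as I understand it from the IDF documentation (raw count times log((m + 1) / (df + 1))), ignores hashing, and is there just so the numbers can be checked by hand.

```python
import math
from collections import Counter


def add_tfidf(df, input_col="name", num_features=1 << 18):
    """Return df with extra 'tokens', 'tf' and 'tfidf' columns (names are assumptions)."""
    # Imported lazily so the reference helper below works without pyspark.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    # Tokenizer lowercases the text and splits it on whitespace.
    tokenizer = Tokenizer(inputCol=input_col, outputCol="tokens")
    # HashingTF maps each token to a bucket index and counts occurrences.
    hashing_tf = HashingTF(
        inputCol="tokens", outputCol="tf", numFeatures=num_features
    )
    # IDF is an estimator: fit() learns document frequencies from the 'tf'
    # column, and the fitted model rescales each TF vector by the IDF weights.
    idf = IDF(inputCol="tf", outputCol="tfidf")

    model = Pipeline(stages=[tokenizer, hashing_tf, idf]).fit(df)
    return model.transform(df)


def spark_style_tfidf(docs):
    """Pure-Python reference for the same arithmetic, keyed by term instead of hash bucket."""
    tokenized = [doc.lower().split() for doc in docs]
    m = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}
    # TF is the raw count of the term within the document.
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
```

With an active SparkSession, add_tfidf(SparkDF).select("name", "tfidf") should show one sparse vector per row. If the fixed vocabulary matters to you, CountVectorizer in pyspark.ml.feature keeps an explicit vocabulary instead of hashing and may be a closer fit than HashingTF.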

Solution 1
0 2022-09-24 04:57:01

如何在 PySpark 中使用 TfidfVectorizer

问题描述

1 个解决方案

解决方案1 0 2022-09-24 04:57:01

解决方案1
0 2022-09-24 04:57:01