
How To Use TfidfVectorizer With PySpark

I am very new to using PySpark and have some issues with the PySpark DataFrame.

I'm trying to implement the TF-IDF algorithm. I did it with a pandas DataFrame once. However, I started using PySpark and now everything has changed: I can't use a PySpark DataFrame like dataframe['ColumnName']. When I write and run the code, it says the column is not iterable. This is a massive problem for me and has not been solved yet. The current problem is below:

With Pandas:


from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(pandasDF['name'])
tfidf_tran = tfidf.transform(pandasDF['name'])

With PySpark:

tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(SparkDF['name'])
tfidf_tran = tfidf.transform(SparkDF['name'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19992/3734911517.py in <module>
     13 vocabulary = list(vocabulary)
     14 tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
---> 15 tfidf.fit(dataframe['name'])
     16 tfidf_tran = tfidf.transform(dataframe['name'])
     17 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
   1821         self._check_params()
   1822         self._warn_for_unused_params()
-> 1823         X = super().fit_transform(raw_documents)
   1824         self._tfidf.fit(X)
   1825         return self

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1200         max_features = self.max_features
   1201 
-> 1202         vocabulary, X = self._count_vocab(raw_documents,
   1203                                           self.fixed_vocabulary_)
   1204 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1110         values = _make_int_array()
   1111         indptr.append(0)
-> 1112         for doc in raw_documents:
   1113             feature_counter = {}
   1114             for feature in analyze(doc):

E:\Anaconda\lib\site-packages\pyspark\sql\column.py in __iter__(self)
    458 
    459     def __iter__(self):
--> 460         raise TypeError("Column is not iterable")
    461 
    462     # string methods

TypeError: Column is not iterable

TF-IDF is the term frequency multiplied by the inverse document frequency. There isn't an explicit TF-IDF vectorizer for DataFrames in PySpark's MLlib, but it does have two useful models that will get you to TF-IDF. Using HashingTF, you'd get the term frequencies. Using IDF, you'd get the inverse document frequencies. Multiply the two together, and you should have an output matrix matching what you would expect from the TfidfVectorizer you specified originally.
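A minimal sketch of that HashingTF + IDF pipeline, assuming the text column is called name as in the question (the sample rows, the Tokenizer step, and the numFeatures value are illustrative assumptions, not part of the original post):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.getOrCreate()

# Toy rows standing in for SparkDF; only the 'name' column is assumed here.
SparkDF = spark.createDataFrame(
    [("alpha beta gamma",), ("beta beta delta",), ("gamma delta",)],
    ["name"],
)

# Split the text column into a list of tokens (lowercases and splits on whitespace).
tokenizer = Tokenizer(inputCol="name", outputCol="words")
words = tokenizer.transform(SparkDF)

# Term frequencies via feature hashing; numFeatures is the hash-bucket count.
hashing_tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18)
tf = hashing_tf.transform(words)

# IDF is fit on the TF vectors; its transform() scales each TF value by the
# IDF weight, i.e. the multiplication described above.
idf_model = IDF(inputCol="tf", outputCol="tfidf").fit(tf)
tfidf = idf_model.transform(tf)

tfidf.select("name", "tfidf").show(truncate=False)

Note that HashingTF hashes tokens into buckets rather than building an explicit vocabulary, so if the fixed vocabulary passed to the original TfidfVectorizer matters, pyspark.ml.feature.CountVectorizer (which exposes a real vocabulary) can be used in place of HashingTF before the IDF step.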
