How To Use TfidfVectorizer With PySpark
I am very new to PySpark and am having some issues with the PySpark DataFrame. I'm trying to implement the TF-IDF algorithm. I did it once with a pandas DataFrame. However, I started using PySpark and now everything has changed: I can't use a PySpark DataFrame like
dataframe['ColumnName']
. When I write and run the code, it says the dataframe is not iterable. This is a massive problem for me and has not been solved yet. The current problem is below:
With Pandas:
tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(pandasDF['name'])
tfidf_tran = tfidf.transform(pandasDF['name'])
With PySpark:
tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(SparkDF['name'])
tfidf_tran = tfidf.transform(SparkDF['name'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19992/3734911517.py in <module>
13 vocabulary = list(vocabulary)
14 tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
---> 15 tfidf.fit(dataframe['name'])
16 tfidf_tran = tfidf.transform(dataframe['name'])
17
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
1821 self._check_params()
1822 self._warn_for_unused_params()
-> 1823 X = super().fit_transform(raw_documents)
1824 self._tfidf.fit(X)
1825 return self
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1200 max_features = self.max_features
1201
-> 1202 vocabulary, X = self._count_vocab(raw_documents,
1203 self.fixed_vocabulary_)
1204
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1110 values = _make_int_array()
1111 indptr.append(0)
-> 1112 for doc in raw_documents:
1113 feature_counter = {}
1114 for feature in analyze(doc):
E:\Anaconda\lib\site-packages\pyspark\sql\column.py in __iter__(self)
458
459 def __iter__(self):
--> 460 raise TypeError("Column is not iterable")
461
462 # string methods
TypeError: Column is not iterable
Tf-idf is the term frequency multiplied by the inverse document frequency. There isn't an explicit tf-idf vectorizer for DataFrames in PySpark's MLlib, but it has two useful models that together get you to tf-idf. Using the HashingTF, you get the term frequencies. Using the IDF, you get the inverse document frequencies. Multiply the two together, and you should have an output matrix matching what you would expect from the TfidfVectorizer you originally specified.