How To Use TfidfVectorizer With PySpark
I am very new to PySpark and have run into a problem with PySpark DataFrames.
I am trying to implement the TF-IDF algorithm. I did it once with a pandas DataFrame. However, since I started using PySpark everything has changed :( I can't use a PySpark DataFrame the way I use a pandas one, e.g. dataframe['ColumnName']. When I write and run the code, it says the dataframe is not iterable. This is a massive problem for me and I haven't solved it yet. The current problem is as follows:
With Pandas:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(pandasDF['name'])
tfidf_tran = tfidf.transform(pandasDF['name'])
With PySpark:
tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(SparkDF['name'])
tfidf_tran = tfidf.transform(SparkDF['name'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19992/3734911517.py in <module>
13 vocabulary = list(vocabulary)
14 tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
---> 15 tfidf.fit(dataframe['name'])
16 tfidf_tran = tfidf.transform(dataframe['name'])
17
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
1821 self._check_params()
1822 self._warn_for_unused_params()
-> 1823 X = super().fit_transform(raw_documents)
1824 self._tfidf.fit(X)
1825 return self
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1200 max_features = self.max_features
1201
-> 1202 vocabulary, X = self._count_vocab(raw_documents,
1203 self.fixed_vocabulary_)
1204
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1110 values = _make_int_array()
1111 indptr.append(0)
-> 1112 for doc in raw_documents:
1113 feature_counter = {}
1114 for feature in analyze(doc):
E:\Anaconda\lib\site-packages\pyspark\sql\column.py in __iter__(self)
458
459 def __iter__(self):
--> 460 raise TypeError("Column is not iterable")
461
462 # string methods
TypeError: Column is not iterable
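
The traceback shows the failure happening in scikit-learn's `for doc in raw_documents` loop, which reaches `Column.__iter__` in PySpark. The sketch below imitates that interaction using only the standard library, so it runs without Spark or scikit-learn. `LazyColumn`, `count_documents`, and the sample rows are hypothetical stand-ins invented for illustration, not PySpark or scikit-learn APIs.

```python
from collections import namedtuple


class LazyColumn:
    """Hypothetical stand-in for a Spark Column: an expression, not a container of values."""

    def __init__(self, name):
        self.name = name

    def __iter__(self):
        # Mirrors the behaviour visible at the bottom of the traceback.
        raise TypeError("Column is not iterable")


def count_documents(raw_documents):
    """Hypothetical stand-in for the `for doc in raw_documents` loop in the traceback."""
    return sum(1 for _ in raw_documents)


# Assumed sample data, shaped like rows collected from a single 'name' column.
Row = namedtuple("Row", ["name"])
collected_rows = [Row("red apple"), Row("green apple"), Row("yellow banana")]

# 1) Passing the lazy column reproduces the error from the question.
try:
    count_documents(LazyColumn("name"))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# 2) Materialising the values into a plain list of strings gives the loop
#    something it can iterate over.
documents = [row.name for row in collected_rows]
n_docs = count_documents(documents)

print(error_message)
print(documents, n_docs)
```

With a real Spark session, the analogous step would be something like `documents = [row["name"] for row in SparkDF.select("name").collect()]`, or `SparkDF.select("name").toPandas()["name"]`, before calling `tfidf.fit`. Both pull every row onto the driver, so they only suit data that fits in local memory. For larger data, the transformers in `pyspark.ml.feature` (for example `Tokenizer`, `CountVectorizer`, and `IDF`) compute TF-IDF inside Spark instead.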