How do I change my code so that strings are not converted to float?

Question

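I am trying to write code that detects fake news. Unfortunately, I keep getting the same error message. Could someone please explain where I am going wrong? I took some lines of code from https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/ and some from https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk . When I try to combine the two pieces of code (removing the duplicated parts), I get an error message.

Code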

%matplotlib inline
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import itertools
import json
import csv
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier  
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

df = pd.read_csv(r"C:\Users\johnrambo\Downloads\fake_news(1).csv", sep=',', header=0, engine='python', escapechar='\\')

X_train, X_test, y_train, y_test = train_test_split(df['headline'], is_sarcastic_1, test_size = 0.2, random_state = 7)

clf = MultinomialNB().fit(X_train, y_train)

predicted = clf.predict(X_test)

print("MultinomialNB Accuracy:", metrics.accuracy_score(y_test, predicted))

Error

ValueError                                Traceback (most recent call last)
<ipython-input-8-e1f11a702626> in <module>
     21 X_train, X_test, y_train, y_test = train_test_split(df['headline'], is_sarcastic_1, test_size = 0.2, random_state = 7)
     22 
---> 23 clf = MultinomialNB().fit(X_train, y_train)
     24 
     25 predicted = clf.predict(X_test)

~\Anaconda\lib\site-packages\sklearn\naive_bayes.py in fit(self, X, y, sample_weight)
    586         self : object
    587         """
--> 588         X, y = check_X_y(X, y, 'csr')
    589         _, n_features = X.shape
    590 

~\Anaconda\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    717                     ensure_min_features=ensure_min_features,
    718                     warn_on_dtype=warn_on_dtype,
--> 719                     estimator=estimator)
    720     if multi_output:
    721         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    494             try:
    495                 warnings.simplefilter('error', ComplexWarning)
--> 496                 array = np.asarray(array, dtype=dtype, order=order)
    497             except ComplexWarning:
    498                 raise ValueError("Complex data not supported\n"

~\Anaconda\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

~\Anaconda\lib\site-packages\pandas\core\series.py in __array__(self, dtype)
    946             warnings.warn(msg, FutureWarning, stacklevel=3)
    947             dtype = "M8[ns]"
--> 948         return np.asarray(self.array, dtype)
    949 
    950     # ----------------------------------------------------------------------

~\Anaconda\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

~\Anaconda\lib\site-packages\pandas\core\arrays\numpy_.py in __array__(self, dtype)
    164 
    165     def __array__(self, dtype=None):
--> 166         return np.asarray(self._ndarray, dtype=dtype)
    167 
    168     _HANDLED_TYPES = (np.ndarray, numbers.Number)

~\Anaconda\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'experts caution new car loses 90% of value as soon as you drive it off cliff'

First few rows of the data

Excel file: fake news

This is what I get when I run df.head().to_dict():

{'is_sarcastic': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1}, 'headline': {0: 'thirtysomething scientists unveil doomsday clock of hair loss', 1: 'dem rep. totally nails why congress is falling short on gender, racial equality', 2: 'eat your veggies: 9 deliciously different recipes', 3: 'inclement weather prevents liar from getting to work', 4: "mother comes pretty close to using word 'streaming' correctly"}, 'article_link': {0: 'https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205', 1: 'https://www.huffingtonpost.com/entry/donna-edwards-inequality_us_57455f7fe4b055bb1170b207', 2: 'https://www.huffingtonpost.com/entry/eat-your-veggies-9-delici_b_8899742.html', 3: 'https://local.theonion.com/inclement-weather-prevents-liar-from-getting-to-work-1819576031', 4: 'https://www.theonion.com/mother-comes-pretty-close-to-using-word-streaming-cor-1819575546'}}

Answer 1

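I assume the df['headline'] column contains text data. MultinomialNB expects numeric features, so fit() tries to cast each headline to float and fails. You need to convert the text into a numeric representation first, and only then pass it to the machine learning model.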
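To make the conversion step concrete, here is a minimal sketch of turning raw headlines into a numeric matrix. The two headlines are sample values standing in for df['headline'], and the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample headlines standing in for df['headline'] (illustrative only).
headlines = [
    "inclement weather prevents liar from getting to work",
    "eat your veggies: 9 deliciously different recipes",
]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse matrix of word counts:
# one row per headline, one column per vocabulary word.
X_counts = vectorizer.fit_transform(headlines)

print(X_counts.shape)  # (number of headlines, vocabulary size)
print(X_counts.dtype)  # an integer dtype, which MultinomialNB can consume
```

A matrix like X_counts is what fit() should receive instead of the raw strings.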

You may want to look at sklearn's CountVectorizer and TfidfTransformer for this.
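As a rough sketch of how these two classes could be combined with the classifier from the question, the example below chains them in a scikit-learn pipeline. The small DataFrame is toy data modeled on the sample rows in the question, not the real CSV, and the column names headline and is_sarcastic are assumed to match the file.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data modeled on the question's sample rows (assumption: the real
# DataFrame has 'headline' and 'is_sarcastic' columns like these).
df = pd.DataFrame({
    "headline": [
        "thirtysomething scientists unveil doomsday clock of hair loss",
        "dem rep. totally nails why congress is falling short on gender, racial equality",
        "eat your veggies: 9 deliciously different recipes",
        "inclement weather prevents liar from getting to work",
        "mother comes pretty close to using word 'streaming' correctly",
    ],
    "is_sarcastic": [1, 0, 0, 1, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["headline"], df["is_sarcastic"], test_size=0.2, random_state=7
)

# Text -> word counts -> TF-IDF weights -> Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, predicted)
print("MultinomialNB Accuracy:", accuracy)
```

Using a pipeline means the vocabulary is learned from the training split only and the same transformation is applied to the test split automatically. With only five rows the accuracy is meaningless; the point is that fit() no longer raises the ValueError.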
