Sklearn Pipeline ValueError: could not convert string to float
I'm playing around with sklearn and NLP for the first time, and thought I understood everything I was doing until I didn't know how to fix this error. Here is the relevant code (largely adapted from http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html ):
import os

import numpy as np
from pandas import DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from xgboost import XGBClassifier

def read_files(path):
    for article in os.listdir(path):
        with open(os.path.join(path, article)) as f:
            text = f.read()
        yield os.path.join(path, article), text

def build_data_frame(path, classification):
    rows = []
    index = []
    for filename, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(filename)
    df = DataFrame(rows, index=index)
    return df

data = DataFrame({'text': [], 'class': []})
for path, classification in SOURCES:  # SOURCES is a list of tuples
    data = data.append(build_data_frame(path, classification))
data = data.reindex(np.random.permutation(data.index))

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('svd', TruncatedSVD(algorithm='randomized', n_components=300)),
        ])),
        ('words', Pipeline([('wscaler', StandardScaler())])),
    ])),
    ('clf', XGBClassifier(silent=False)),
])

classifier.fit(data['text'].values, data['class'].values)
The data loaded into the DataFrame is preprocessed text with all stopwords, punctuation, unicode, capitals, etc. taken care of. This is the error I'm getting once I call fit on the classifier, where the ... represents one of the documents that should have been vectorized in the pipeline:
ValueError: could not convert string to float: ...
I first thought the TfidfVectorizer() was not working, causing an error in the SVD step, but after I extracted each step out of the pipeline and ran them sequentially, the same error only came up on XGBClassifier.fit().
Even more confusing to me, I tried to take this script apart step by step in the interpreter, but when I tried to import either read_files or build_data_frame, the same ValueError came up with one of my strings, and this was merely after:
from classifier import read_files
I have no idea how that could be happening. If anyone has any idea what my glaring errors may be, I'd really appreciate it. I'm trying to wrap my head around these concepts on my own, but coming across a problem like this leaves me feeling pretty incapacitated.
The first part of your pipeline is a FeatureUnion. A FeatureUnion passes all the data it receives, in parallel, to all of its internal parts. The second part of your FeatureUnion is a Pipeline containing a single StandardScaler. That's the source of the error.
This is your data flow:
X --> classifier, Pipeline
|
| <== X is passed to FeatureUnion
\/
features, FeatureUnion
|
| <== X is duplicated and passed to both parts
______________|__________________
| |
| <=== X contains text ===> |
\/ \/
text, Pipeline words, Pipeline
| |
| <=== Text is passed ===> |
\/ \/
tfidf, TfidfVectorizer wscaler, StandardScaler <== Error
| |
| <==Text converted to floats |
\/ |
svd, TruncatedSVD |
| |
| |
\/____________________________\/
|
|
\/
clf, XGBClassifier
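The parallel split in the diagram can be reproduced with a tiny sketch (the branch names and toy data below are made up for illustration): FeatureUnion feeds the same X to every branch and concatenates the branch outputs column-wise.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [2.0], [3.0]])

# Two toy branches: one passes X through unchanged, one doubles it.
union = FeatureUnion([
    ('identity', FunctionTransformer(lambda a: a)),
    ('doubled', FunctionTransformer(lambda a: a * 2)),
])

# Each branch received the full X; the outputs are stacked side by side.
print(union.fit_transform(X))
# [[1. 2.]
#  [2. 4.]
#  [3. 6.]]
```

In your pipeline the second branch is the StandardScaler, so it receives the raw text, not the tfidf/SVD output of the first branch.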
Since text is passed to StandardScaler, the error is thrown: StandardScaler can only work with numerical features.
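You can reproduce the error in isolation with just StandardScaler and a couple of strings (the toy documents below are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# StandardScaler validates its input as numeric, so raw text triggers
# the same ValueError the full pipeline raises; the message begins
# with "could not convert string to float".
docs = np.array([["some preprocessed document"], ["another document"]])
try:
    StandardScaler().fit(docs)
except ValueError as e:
    print(e)
```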
Just as you convert text to numbers using TfidfVectorizer before sending it to TruncatedSVD, you need to do the same before StandardScaler, or else provide only numerical features to it.
Looking at the description in the question, did you intend to apply StandardScaler to the results of TruncatedSVD?
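If scaling the SVD output is what you were after, one possible rearrangement is a single sequential Pipeline that runs the scaler after TruncatedSVD, whose output is already dense and numeric. A minimal sketch, with a made-up toy corpus and LogisticRegression standing in for XGBClassifier so it runs without xgboost installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, purely for illustration.
docs = ["spam spam offer now", "meeting notes attached",
        "free offer click now", "project schedule update"]
labels = [1, 0, 1, 0]

classifier = Pipeline([
    ('tfidf', TfidfVectorizer()),           # text -> sparse tf-idf matrix
    ('svd', TruncatedSVD(n_components=2)),  # sparse -> dense, low-dimensional
    ('scaler', StandardScaler()),           # input is now numeric, so scaling works
    ('clf', LogisticRegression()),
])
classifier.fit(docs, labels)
print(classifier.predict(["free offer now"]))
```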