How do I process my data so that I can run a random forest on it?

Question

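I want to train a random forest on a set of matrices (for example, the first link below). I want to classify them as "g" or "b" (good or bad, a or b, 1 or 0, it doesn't matter).

I have called the script randfore.py. I am currently using 10 examples, but I will use a much larger dataset once I actually have this up and running.

Here is the code: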

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import os

import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

working_dir = os.getcwd() # Grabs the working directory

directory = working_dir+"/fakesourcestuff/" ## The actual directory where the files are located

sources = list() # Just sets up a list here which is going to become the input for the random forest

for i in range(10):
    cutoutfile = pd.read_csv(directory+ "image2_with_fake_geotran_subtracted_corrected_cutout_" + str(i) +".dat", dtype=object) ## Where we get the input data for the random forest from
    sources.append(cutoutfile) # add it to our sources list

targets = pd.read_csv(directory + "faketargets.dat",sep='\n',header=None, dtype=object) # Reads in our target data... either "g" or "b" (Good or bad)


sources = pd.DataFrame(sources) ## I convert the list to a dataframe to avoid the "ValueError: cannot copy sequence with size 99 to array axis with dimension 1" error. Necessary?

# Training sets
X_train = sources[:8] # Inputs
y_train = targets[:8] # Targets

# Random Forest
rf = RandomForestClassifier(n_estimators=10)
rf_fit = rf.fit(X_train, y_train)

Here is the current error output:

Traceback (most recent call last):
  File "randfore.py", line 31, in <module>
    rf_fit = rf.fit(X_train, y_train)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

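I tried setting dtype=object, but that did not help. I am just not sure what kind of manipulation I need to do to make this work.

I think the problem is that the files I append to sources are not just numbers but a mix of numbers, commas, and various square brackets (basically one big matrix). Is there a natural way to import this? The square brackets in particular are probably a problem.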
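One way to deal with the brackets and commas is to strip them before converting the text to numbers. The following is only a minimal sketch: the helper name parse_matrix_text and the sample string are hypothetical, and the sample is a made-up stand-in for the real file contents, whose exact layout is assumed here. It is written for Python 3.

```python
def parse_matrix_text(text):
    """Return every number found in `text` as a flat list of floats.

    Hypothetical helper: assumes the text contains only numbers,
    commas, square brackets, and whitespace.
    """
    # Replace brackets and commas with spaces so only numeric tokens remain.
    cleaned = text.translate(str.maketrans("[],", "   "))
    # Split on whitespace and convert each token to a float.
    return [float(token) for token in cleaned.split()]


# Made-up stand-in for the contents of one cutout file.
sample = "[[1, 2, 3],\n [4.5, -6, 7e-1]]"
features = parse_matrix_text(sample)
print(features)
```

Each file would then yield one flat list of numbers, which can serve as a single row of the input to the classifier.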

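Before I converted sources to a DataFrame, I got the following error:

ValueError: cannot copy sequence with size 99 to array axis with dimension 1

This is because my input is 100 lines long, while my targets have dimensions of 10 rows and 1 column.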
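The mismatch can be illustrated with array shapes. scikit-learn estimators expect the inputs as a 2-D array of shape (n_samples, n_features) and the targets as an array with one entry per sample, so each matrix has to become a single row. A minimal sketch, assuming 10 samples and using made-up 4x5 matrices in place of the real cutouts (the real dimensions are not assumed to match):

```python
import numpy as np

# Hypothetical sizes: 10 samples, each a 4x5 matrix.
n_samples, n_rows, n_cols = 10, 4, 5

# Made-up matrices standing in for the parsed cutout files.
matrices = [
    np.arange(n_rows * n_cols, dtype=float).reshape(n_rows, n_cols) + i
    for i in range(n_samples)
]

# Flatten each matrix into one row, then stack the rows into a 2-D array.
X = np.vstack([m.ravel() for m in matrices])

# One label per sample, as a 1-D array.
y = np.array(list("gbgbgbgbgb"))

print(X.shape)  # (n_samples, n_rows * n_cols)
print(y.shape)  # (n_samples,)
```

With this layout the number of rows in X equals the number of labels in y, which is the condition the fit call checks.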

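Here are the contents of the first file read into cutoutfile (they all have exactly the same style), to be used as input: https://pastebin.com/tkysqmVu

Here are the contents of faketargets.dat, the targets: https://pastebin.com/632RBqWc

Any ideas? Help is greatly appreciated. I am sure there is a lot of fundamental confusion going on here.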

Answer 1

Try writing:

X_train = sources.values[:8] # Inputs
y_train = targets.values[:8] # Targets

Hope this solves your problem!
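For reference, here is a minimal sketch of the training step once the inputs are a plain numeric array. It is not taken from the answer above: the data is random, the sizes are hypothetical, and it assumes each sample has already been flattened into one numeric row. Taking .values only helps if the underlying cells are already numeric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data: 10 samples with 20 numeric features each.
rng = np.random.RandomState(0)
X = rng.rand(10, 20)
y = np.array(["g", "b", "g", "b", "g", "b", "g", "b", "g", "b"])

# Same split as in the question: the first 8 samples are used for training.
X_train, y_train = X[:8], y[:8]
X_test = X[8:]

# Fit the forest on a 2-D numeric array and a 1-D label array.
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

# Predict labels for the two held-out samples.
predictions = rf.predict(X_test)
print(predictions)
```

If the targets come from a single-column DataFrame, targets.values.ravel() gives the 1-D label array used here.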

Solution 1
0 2017-07-20 10:34:15