
How can I manipulate my data to allow a random forest to run on it?

I want to train a random forest on a bunch of matrices (see the first link below for an example). I want to classify each one as either "g" or "b" (good or bad, a or b, 1 or 0, the labels themselves don't matter).

I've called the script randfore.py. I am currently using 10 examples, but I will use a much bigger data set once I get this up and running.

Here is the code:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import os

import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

working_dir = os.getcwd() # Grabs the working directory

directory = working_dir+"/fakesourcestuff/" ## The actual directory where the files are located

sources = list() # Just sets up a list here which is going to become the input for the random forest

for i in range(10):
    cutoutfile = pd.read_csv(directory+ "image2_with_fake_geotran_subtracted_corrected_cutout_" + str(i) +".dat", dtype=object) ## Where we get the input data for the random forest from
    sources.append(cutoutfile) # add it to our sources list

targets = pd.read_csv(directory + "faketargets.dat",sep='\n',header=None, dtype=object) # Reads in our target data... either "g" or "b" (Good or bad)


sources = pd.DataFrame(sources) ## I convert the list to a dataframe to avoid the "ValueError: cannot copy sequence with size 99 to array axis with dimension 1" error. Necessary?

# Training sets
X_train = sources[:8] # Inputs
y_train = targets[:8] # Targets

# Random Forest
rf = RandomForestClassifier(n_estimators=10)
rf_fit = rf.fit(X_train, y_train)

Here is the current error output:

Traceback (most recent call last):
  File "randfore.py", line 31, in <module>
    rf_fit = rf.fit(X_train, y_train)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

I tried setting dtype=object, but it hasn't helped. I'm just not sure what sort of manipulation I need to perform to make this work.

I think the problem is that the files I'm appending to sources aren't just numbers but a mix of numbers, commas, and various square brackets (each one is basically a big matrix). Is there a natural way to import this? The square brackets in particular are probably an issue.

Before I converted sources to a DataFrame, I was getting the following error:

ValueError: cannot copy sequence with size 99 to array axis with dimension 1

This is due to the dimensions of my input (100 lines long) and my target, which has 10 rows and 1 column.

Here are the contents of the first cutout file that's read in (they're all in exactly the same style), to be used as the input: https://pastebin.com/tkysqmVu

And here are the contents of faketargets.dat, the targets: https://pastebin.com/632RBqWc

Any ideas? Help greatly appreciated. I am sure there is a lot of fundamental confusion going on here.

Try writing:

X_train = sources.values[:8] # Inputs
y_train = targets.values[:8] # Targets

I hope this will solve your problem!
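If slicing .values alone still raises "setting an array element with a sequence", the underlying issue may be that scikit-learn expects X to be a 2-D numeric array of shape (n_samples, n_features), so each matrix has to be flattened into one row of numbers. Below is a minimal sketch of that idea, not a definitive fix. It assumes each file is a text dump of numbers, commas, and square brackets (as the question describes; the pastebin contents are not reproduced here) and that every file holds the same number of values. The helper names matrix_text_to_features and build_feature_matrix are hypothetical, and the random matrices stand in for the real cutout files.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Matches integers, decimals, and scientific notation such as -1.5e-3.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def matrix_text_to_features(text):
    """Hypothetical helper: pull every number out of a bracketed matrix
    dump and return them as one flat 1-D float array."""
    return np.array([float(tok) for tok in NUMBER.findall(text)], dtype=float)


def build_feature_matrix(texts):
    """Hypothetical helper: stack one flattened row per matrix into a
    2-D array of shape (n_samples, n_features)."""
    rows = [matrix_text_to_features(t) for t in texts]
    lengths = sorted({len(r) for r in rows})
    if len(lengths) != 1:
        raise ValueError(
            "every matrix must flatten to the same length, got %s" % lengths
        )
    return np.vstack(rows)


# Stand-in data (an assumption): ten 3x3 matrices rendered as bracketed
# text, plus ten made-up "g"/"b" labels. With real data, each text would
# come from reading one cutout file as a string.
rng = np.random.RandomState(0)
texts = [str(rng.rand(3, 3).tolist()) for _ in range(10)]
labels = np.array(list("gbgbgbgbgb"))

X = build_feature_matrix(texts)  # one row per matrix, one column per value
y = labels                       # 1-D array, one label per matrix

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X[:8], y[:8])             # train on the first 8 examples
predictions = rf.predict(X[8:])  # predict the 2 held-out examples
```

With the real files, each text could be obtained with something like open(path).read(), and the labels read as a plain list of stripped lines from faketargets.dat so that y is 1-D with one entry per file. Flattening discards the 2-D layout of each matrix, which is fine for a random forest as long as every file is flattened in the same order.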

