I'm trying to read a dataset for binary classification from a .txt file.
+1 1:-0.882353 2:-0.0653266 3:0.147541 4:-0.373737 5:-1 6:-0.0938897 7:-0.797609 8:-0.933333
This is an example row.
And this is the code I use to parse the file:
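import csv
# f is an already-open file object for the dataset
reader = csv.reader(f)
# Each row holds a single string; split it on spaces and drop the last element
res = [row[0].split(" ")[:-1] for row in reader]
# The first token is the class label (+1 or -1)
labels = [int(r[0]) for r in res]
# Strip the two-character "i:" prefix from every feature token
patterns = [[float(p[2:]) for p in r[1:]] for r in res]
res = [LabeledExample(p, l) for p, l in zip(patterns, labels)]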
LabeledExample is a class from a framework I'm using. This works perfectly for what I need, but to feed the data to scikit-learn I need to do this:
X=[ example.pattern for example in training_set]
Y=[ example.label for example in training_set]
where training_set is a list of LabeledExample objects. This usually works as intended with other datasets, but this time, when I try to fit a model with this dataset, it raises this error:
File "/home/chobeat/git/yaplf/yaplf/testsandbox/ensembleexperiment.py", line 29, in ensembletreeexp
clf.fit(X,Y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/forest.py", line 257, in fit
check_ccontiguous=True)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 230, in check_arrays
array = np.ascontiguousarray(array, dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 548, in ascontiguousarray
return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
While debugging, I checked the shape of the X array, and it is not what it is supposed to be.
It should be (768, 8) but it is (768,). For other datasets it works as intended, but here it does not. I went back to the parsing code and checked the types of basically everything and, as far as I can see, patterns is a list of lists of floats, as it should be, and there are no meaningful differences between the buggy parsed dataset and the others. I did find, however, that the call to split introduces the behaviour: before I split the big string I have an array of shape (768, 1), and after the split, instead of (768, 8), I have (768,), even though it is still a list of lists.
This is libsvm / svmlight format. There is a reader for that in scikit-learn: sklearn.datasets.load_svmlight_file
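A minimal sketch of how that reader might be used here. The in-memory bytes stand in for the real .txt file, and the second row is a made-up example of a line that omits some features. Passing n_features=8 and zero_based=False is an assumption based on the example row above, which has eight features numbered from 1.

```python
import io

from sklearn.datasets import load_svmlight_file

# Stand-in for the real file: the first row is the example from the question,
# the second row is hypothetical and leaves out several features.
data = (
    b"+1 1:-0.882353 2:-0.0653266 3:0.147541 4:-0.373737 "
    b"5:-1 6:-0.0938897 7:-0.797609 8:-0.933333\n"
    b"-1 1:0.5 3:0.25 8:-1\n"
)

# load_svmlight_file accepts a path or a file-like object opened in binary mode.
# It returns a scipy sparse matrix and a NumPy array of labels.
X_sparse, y = load_svmlight_file(io.BytesIO(data), n_features=8, zero_based=False)

# Convert to a dense array; features omitted in the file become 0.0.
X = X_sparse.toarray()
print(X.shape)  # (2, 8)
```

With the real dataset, the path to the .txt file can be passed in place of the BytesIO object.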
OK, I found the problem: there were empty values in the dataset that broke my parsing.
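For anyone who wants to keep a hand-written parser, here is a rough sketch of one way to make it tolerate such rows. It assumes that "empty values" means lines that omit some index:value pairs (which the sparse format allows) and that an omitted feature should be treated as 0.0. The helper name parse_sparse_line and the second sample row are hypothetical.

```python
from collections import Counter


def parse_sparse_line(line, n_features):
    """Parse one 'label index:value ...' line into (label, dense feature list)."""
    tokens = line.split()  # split() without arguments ignores repeated spaces
    label = int(tokens[0])
    features = [0.0] * n_features  # assumption: an omitted feature means 0.0
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index) - 1] = float(value)  # file indices are 1-based
    return label, features


lines = [
    "+1 1:-0.882353 2:-0.0653266 3:0.147541 4:-0.373737 "
    "5:-1 6:-0.0938897 7:-0.797609 8:-0.933333",
    "-1 1:0.5 3:0.25 8:-1",  # hypothetical row with omitted features
]

# Diagnostic: keeping only the values that are present gives rows of different
# lengths, which NumPy cannot turn into a rectangular 2-D array.
naive = [[float(t.split(":")[1]) for t in l.split()[1:]] for l in lines]
row_lengths = Counter(len(r) for r in naive)
print(row_lengths)

# Tolerant parsing: every row ends up with exactly n_features entries.
parsed = [parse_sparse_line(l, 8) for l in lines]
labels = [label for label, _ in parsed]
patterns = [features for _, features in parsed]
```

If row_lengths contains more than one distinct length, the nested list is ragged, which would explain a shape of (768,) instead of (768, 8).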