
String split+numpy.array = weird behaviour

I'm trying to read a dataset for binary classification from a .txt file.

+1 1:-0.882353 2:-0.0653266 3:0.147541 4:-0.373737 5:-1 6:-0.0938897 7:-0.797609 8:-0.933333

This is an example row.

And this is the code I use to parse the file.

    reader=csv.reader(f)
    res=[row[0].split(" ")[:-1] for row in reader]
    labels=[int(r[0]) for r in res]
    patterns=[[float(p[2:]) for p in r[1:]] for r in res]
    res=[LabeledExample(p,l) for p,l in zip(patterns,labels)]

LabeledExample is a class from a framework I'm using. This works perfectly for what I need, but if I try to feed the data to scikit-learn, I need to do this.

    X=[example.pattern for example in training_set]
    Y=[example.label for example in training_set]

where training_set is a list of LabeledExample. This usually works as intended with other datasets, but this time, if I try to fit a model with this dataset, it raises this error:

 File "/home/chobeat/git/yaplf/yaplf/testsandbox/ensembleexperiment.py", line 29, in ensembletreeexp
    clf.fit(X,Y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/forest.py", line 257, in fit
    check_ccontiguous=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 230, in check_arrays
    array = np.ascontiguousarray(array, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 548, in ascontiguousarray
    return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.

While debugging, I checked the shape of the X array, and it's not what it is supposed to be.

It should be (768,8) but it is (768,). For other datasets it works as intended, but here it does not. I went back to the parsing code and checked the types of basically everything, and as far as I can see, patterns is a list of lists of floats, as it should be, and there are no meaningful differences between the badly parsed dataset and the others. I found out, though, that the call to "split" introduces the behaviour. Before I split the big string, I have an array of shape (768,1), and after the split, instead of (768,8), I get (768,), even though it is still a list of lists.

This is libsvm / svmlight format. There is a reader for that in scikit-learn: sklearn.datasets.load_svmlight_file
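
A minimal sketch of how that reader might be used, assuming scikit-learn is installed. The second data row, the in-memory file and the n_features=8 / zero_based=False settings are illustrative assumptions, not taken from the question; with a real dataset you would pass the path of the .txt file instead.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Two rows in libsvm / svmlight format. The first is the example row from
# the question; the second is a made-up row with missing features.
data = (
    b"+1 1:-0.882353 2:-0.0653266 3:0.147541 4:-0.373737 "
    b"5:-1 6:-0.0938897 7:-0.797609 8:-0.933333\n"
    b"-1 1:0.5 3:0.25 8:1\n"
)

# load_svmlight_file accepts a path or a file-like object opened in binary mode.
# zero_based=False says that feature indices in the file start at 1.
X, y = load_svmlight_file(BytesIO(data), n_features=8, zero_based=False)

# X is a SciPy sparse matrix; features absent from a row are stored as zeros.
X_dense = X.toarray()
print(X.shape)
print(y)
```

Because the reader places each value by its feature index, rows with missing features still yield a rectangular matrix, which is what clf.fit expects.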

Ok, found out the problem. There were empty values in the dataset that broke my parsing.
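
This matches the symptom: when the inner lists do not all have the same length, NumPy cannot build a 2-D float array from them, which is where "setting an array element with a sequence" comes from. A rough sketch of a hand-written parser that places each value by its feature index, so every row has the same length, is shown below. The function name parse_line, the default of 8 features, the fill value of 0.0 and the second sample row are all assumptions made for illustration.

```python
def parse_line(line, n_features=8, fill=0.0):
    """Parse one 'label idx:value ...' line into (label, dense_row).

    Hypothetical helper: missing or empty features are replaced by `fill`.
    """
    tokens = line.split()  # split() without arguments drops empty tokens
    label = int(tokens[0])
    row = [fill] * n_features
    for token in tokens[1:]:
        index, _, value = token.partition(":")
        if value == "":  # e.g. "3:" with nothing after the colon
            continue
        row[int(index) - 1] = float(value)  # indices in the file are 1-based
    return label, row


lines = [
    # example row from the question
    "+1 1:-0.882353 2:-0.0653266 3:0.147541 4:-0.373737 "
    "5:-1 6:-0.0938897 7:-0.797609 8:-0.933333",
    # made-up row with missing features
    "-1 1:0.5 3:0.25 8:1",
]

parsed = [parse_line(line) for line in lines]
labels = [label for label, _ in parsed]
patterns = [row for _, row in parsed]

# Sanity check before handing the data to scikit-learn:
# a single length here means the list of lists is rectangular.
row_lengths = {len(row) for row in patterns}
print(row_lengths)
```

Checking row_lengths on the original parsing output would also have pointed at the offending rows directly, since more than one length in the set means the data is ragged.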
