Python pandas read_csv: can we load a string column as NumPy arrays in one line?
I am confused by the parameters of pandas' read_csv.
I want to build a classifier with Support Vector Machines. To use the classifier I need both X and Y as NumPy arrays. I have a CSV file with two columns:
the first column is a number (the target), for instance 1 or 0
the second column is a vector (the features) with " " as separator, for instance 12 32 63 73 563 34.
The problem I ran into:
values from the first column are loaded as NumPy integers ('numpy.int64' in the output below)
values from the second column are loaded as 'str', while I want them to be NumPy arrays.
import pandas as pd
import numpy as np

DF = pd.read_csv("C:\\STUFF\\foo.csv")
df = DF.head(2)

X = df["firstcol"]
target = X.values
for i in target:
    print(type(i))

Y = df["secondcol"]
feature = Y.values
for j in feature:
    print(type(j))

So the output is:
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'str'>
<class 'str'>
The question is: what is the fastest appropriate way to convert the second column into NumPy arrays?
Try this:
df["secondcol"].apply(lambda x: np.array(x.split()).astype(int))
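The one-liner above yields a Series whose cells are separate arrays, while a classifier usually expects a single 2-D matrix. A minimal sketch of that extra step, assuming every row holds the same number of values; the inline sample data and the names csv_text, feature_matrix and target are made up for illustration, and only the column names come from the question.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for foo.csv (column names from the question).
csv_text = """firstcol,secondcol
1,12 32 63 73 563 34
0,12 32 63 73 563 33
"""
df = pd.read_csv(io.StringIO(csv_text))

# Each cell becomes a 1-D integer array, as in the answer above.
features = df["secondcol"].apply(lambda x: np.array(x.split()).astype(int))

# Stack the per-row arrays into one (n_rows, n_values) matrix.
# Assumption: all rows contain the same number of values.
feature_matrix = np.vstack(features.tolist())
target = df["firstcol"].to_numpy()

print(feature_matrix.shape)
print(target)
```

If rows can have different lengths, np.vstack raises an error, so the data would need padding or filtering first.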
This works for me:
t = ['12 32 63 73 563 34']
y = [int(x) for s in t for x in s.split(" ")]
print(y)
prints: [12, 32, 63, 73, 563, 34]
This only works if all the cells are in the format you specified and contain no letters.
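If some cells might contain stray letters, one possible workaround (not part of the answer above) is to let pandas do the conversion and mark bad tokens as missing. A minimal sketch, assuming hypothetical sample cells; the names cells, parts and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical cells; the second one contains a stray non-numeric token.
cells = pd.Series(["12 32 63", "7 x 9"])

# Split on whitespace into one column per value (the values are still strings).
parts = cells.str.split(expand=True)

# Convert column by column; tokens that are not numbers become NaN
# instead of raising ValueError.
numbers = parts.apply(pd.to_numeric, errors="coerce")

print(numbers)
```

Rows containing NaN can then be dropped or inspected before the data is handed to the classifier.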
Suppose your CSV file looks like this:
1,12 32 63 73 563 34
2,12 32 63 73 563 33
4,12 32 63 73 563 35
a more logical way to read it is:
df=pd.read_csv('data.csv',header=None,sep='[ ,]',engine='python',index_col=0)
Then your data is directly in columns, with the first column as the index. Each row behaves like a NumPy array.
In [4]: df
Out[4]:
    1   2   3   4    5   6
0
1  12  32  63  73  563  34
2  12  32  63  73  563  33
4  12  32  63  73  563  35

In [5]: df.loc[4]
Out[5]:
1     12
2     32
3     63
4     73
5    563
6     35
Name: 4, dtype: int64

In [6]: df.loc[4].values
Out[6]: array([ 12,  32,  63,  73, 563,  35], dtype=int64)

In [7]: df.loc[4].sum()
Out[7]: 778
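For the classifier in the question, the same regex-separator read can feed the target and feature arrays directly. A minimal sketch, assuming the three sample rows above inlined through io.StringIO; unlike the call above it omits index_col=0 so that the first column stays available as the target, and the names target and features are illustrative.

```python
import io

import pandas as pd

# The three sample rows from above, inlined so no file is needed.
csv_text = "1,12 32 63 73 563 34\n2,12 32 63 73 563 33\n4,12 32 63 73 563 35\n"

# sep is a regular expression matching either a space or a comma,
# which is why the python engine is requested.
df = pd.read_csv(io.StringIO(csv_text), header=None, sep="[ ,]", engine="python")

# Column 0 holds the target; the remaining columns form one 2-D feature array.
target = df[0].to_numpy()
features = df.drop(columns=0).to_numpy()

print(target)
print(features.shape)
```

This only gives a rectangular array when every line has the same number of space-separated values.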