
When should I use numpy.genfromtxt instead of pandas.read_csv to read a csv file?

I was recently extracting images from a .csv file. The file contains a column named pixels that holds 48x48 values per row as a string. Since it is a .csv file, I used pandas.read_csv, planning to convert the pixels column to PIL images later on.

import pandas as pd
data = pd.read_csv('fer2013.csv') # fer2013 competition dataset.
data.head()

        emotion pixels  Usage
    0   0   70 80 82 72 58 58 60 63 54 58 60 48 89 115 121...   Training
    1   0   151 150 147 155 148 133 111 140 170 174 182 15...   Training
    2   2   231 212 156 164 174 138 161 173 182 200 106 38...   Training
    3   4   24 32 36 30 32 23 19 20 30 41 21 22 32 34 21 1...   Training
    4   6   4 0 0 0 0 0 0 0 0 0 0 0 3 15 23 28 48 50 58 84...

But I saw someone else in the discussions use numpy.genfromtxt to load the csv file:

import numpy as np
data = np.genfromtxt('fer2013.csv',delimiter=',',dtype=None)

However, I don't understand what numpy.genfromtxt is for. I also looked at the examples in the numpy.genfromtxt docs on the SciPy site.

I found the dtype naming options great, but those are available in pd.read_csv too!


It would be great if someone could explain the need for and use of the numpy.genfromtxt loader, and where it has an advantage over other methods for reading a file.

You can find the data here: fer2013 competition on Kaggle

As I understand it, the pandas reader is an optimized parser written in C and is faster in many situations. genfromtxt is an older Python function with weaker type inference, and you can forget about it if you have pandas.

In [45]: df=pd.DataFrame(np.arange(10**6).reshape(1000,1000))

In [46]: df.to_csv("data.csv")

In [47]: %time v=np.genfromtxt("data.csv",delimiter=',',dtype=int,skip_header=1)
Wall time: 5.62 s

In [48]: %time u=pd.read_csv("data.csv",engine='python')
Wall time: 3.97 s

In [49]: %time u=pd.read_csv("data.csv")
Wall time: 781 ms
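For anyone who wants to rerun this comparison outside IPython, here is a minimal sketch. It assumes a smaller 200x200 array and a temporary file so it finishes quickly, and it writes the CSV with index=False (unlike the session above) so that all three readers return exactly the original array. The helper name timed is hypothetical, and the timings will vary by machine.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Assumption: a 200x200 array instead of 1000x1000, to keep the run short.
arr = np.arange(200 * 200).reshape(200, 200)

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.csv")
# index=False keeps the row index out of the file, so every reader sees
# exactly the 200 data columns.
pd.DataFrame(arr).to_csv(path, index=False)


def timed(label, func):
    """Hypothetical helper: run func once, print the elapsed time, return the result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result


v = timed(
    "np.genfromtxt",
    lambda: np.genfromtxt(path, delimiter=",", dtype=int, skip_header=1),
)
u_py = timed(
    "pd.read_csv (python engine)",
    lambda: pd.read_csv(path, engine="python").to_numpy(),
)
u_c = timed(
    "pd.read_csv (C engine)",
    lambda: pd.read_csv(path).to_numpy(),
)
```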

The docs describe the engine option:

engine : {'c', 'python'}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
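As one illustration of "more feature-complete", a regular-expression separator is handled by the python engine rather than the C engine. The sketch below uses a made-up two-column sample, not the fer2013 data.

```python
import io

import pandas as pd

# Hypothetical sample: fields separated by a semicolon with optional blanks.
text = "a ; b\n1 ; 2\n3 ; 4\n"

# A multi-character regular expression as the separator is parsed by the
# python engine; the C engine does not support it.
df = pd.read_csv(io.StringIO(text), sep=r"\s*;\s*", engine="python")
print(df)
```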

I can't download the linked dataset, but I tried to recreate it from your header:

In [2]: cat stack53997674.csv
emotion, pixels,  Usage
0,   "70 80 82 72 58 58 60 63 54 58 60 48 89 115 121",   Training
0,   "151 150 147 155 148 133 111 140 170 174 182 15",   Training
2,   "231 212 156 164 174 138 161 173 182 200 106 38",   Training
4,   "24 32 36 30 32 23 19 20 30 41 21 22 32 34 21 1",   Training
6,   "4 0 0 0 0 0 0 0 0 0 0 0 3 15 23 28 48 50 58 84",   Testing

With pandas:

In [11]: df = pd.read_csv("stack53997674.csv")
In [12]: df
Out[12]: 
   emotion     ...             Usage
0        0     ...          Training
1        0     ...          Training
2        2     ...          Training
3        4     ...          Training
4        6     ...           Testing

[5 rows x 3 columns]
In [13]: df.dtypes
Out[13]: 
emotion     int64
 pixels    object
  Usage    object
dtype: object

values is a 2d object dtype array, with strings in the 2nd column:

In [20]: df.values[:,1]
Out[20]: 
array(['   "70 80 82 72 58 58 60 63 54 58 60 48 89 115 121"',
       '   "151 150 147 155 148 133 111 140 170 174 182 15"',
       '   "231 212 156 164 174 138 161 173 182 200 106 38"',
       '   "24 32 36 30 32 23 19 20 30 41 21 22 32 34 21 1"',
       '   "4 0 0 0 0 0 0 0 0 0 0 0 3 15 23 28 48 50 58 84"'],
      dtype=object)
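The leading blanks and literal quotes in these strings come from the spaces after each comma in my recreated file, so they may not appear with the real dataset. If they do, a minimal sketch of one way to avoid them is the skipinitialspace option of pd.read_csv. The sample below is a shortened copy of the recreated file, read from memory rather than from disk.

```python
import io

import pandas as pd

# A shortened copy of the recreated file, with blanks after each comma.
text = '''emotion, pixels,  Usage
0,   "70 80 82",   Training
6,   "4 0 0",   Testing
'''

# skipinitialspace=True drops the blanks that follow each delimiter, which
# also lets the parser treat the double quotes as field quoting.
df = pd.read_csv(io.StringIO(text), skipinitialspace=True)
print(df.columns.tolist())
print(df["pixels"].tolist())
```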

With genfromtxt:

In [21]: data = np.genfromtxt("stack53997674.csv", delimiter=',', names=True,
    ...:                      dtype=None, encoding=None, autostrip=True)
In [22]: data
Out[22]: 
array([(0, '"70 80 82 72 58 58 60 63 54 58 60 48 89 115 121"', 'Training'),
       (0, '"151 150 147 155 148 133 111 140 170 174 182 15"', 'Training'),
       (2, '"231 212 156 164 174 138 161 173 182 200 106 38"', 'Training'),
       (4, '"24 32 36 30 32 23 19 20 30 41 21 22 32 34 21 1"', 'Training'),
       (6, '"4 0 0 0 0 0 0 0 0 0 0 0 3 15 23 28 48 50 58 84"', 'Testing')],
      dtype=[('emotion', '<i8'), ('pixels', '<U48'), ('Usage', '<U8')])
In [23]: data['pixels']
Out[23]: 
array(['"70 80 82 72 58 58 60 63 54 58 60 48 89 115 121"',
       '"151 150 147 155 148 133 111 140 170 174 182 15"',
       '"231 212 156 164 174 138 161 173 182 200 106 38"',
       '"24 32 36 30 32 23 19 20 30 41 21 22 32 34 21 1"',
       '"4 0 0 0 0 0 0 0 0 0 0 0 3 15 23 28 48 50 58 84"'], dtype='<U48')

pixels is a 1d array of string dtype. Each result can be converted to the other's format, and both will require similar processing to produce images.
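As a rough sketch of that shared processing, each pixel string can be stripped, split on whitespace, and reshaped into a 2d array. The function name pixels_to_image is hypothetical, and the sample is a made-up 4x4 image rather than a real 48x48 fer2013 row.

```python
import numpy as np

# Hypothetical stand-in for one entry of the pixels column: a 4x4 image
# instead of 48x48, with the stray blanks and quotes seen in the output above.
raw = '   "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15"'
side = 4  # assumption: this would be 48 for the real fer2013 rows


def pixels_to_image(s, side):
    """Parse a space-separated pixel string into a (side, side) uint8 array."""
    cleaned = s.strip().strip('"')
    values = [int(v) for v in cleaned.split()]
    return np.array(values, dtype=np.uint8).reshape(side, side)


img = pixels_to_image(raw, side)
print(img.shape, img.dtype)
```

The same function works on strings from either df.values[:, 1] or data['pixels'], and a uint8 array like this can be passed to PIL.Image.fromarray to get a PIL image.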


 