
How to speed up importing dataframes into pandas

I understand that one of the reasons why pandas can be relatively slow importing CSV files is that it needs to scan the entire content of a column before guessing the type (see the discussions around the mostly deprecated low_memory option for pandas.read_csv). Is my understanding correct?

If it is, what would be a good format for storing a dataframe, one that explicitly specifies data types so pandas doesn't have to guess (SQL is not an option for now)?

Any option in particular from those listed here?

My dataframes have floats, integers, dates, strings and Y/N, so formats that support only numeric values won't do.

One option is to use numpy.genfromtxt with delimiter=',', names=True, then to initialize the pandas dataframe with the numpy array. The numpy array will be structured, and the pandas constructor should automatically set the field names.

In my experience this performs well.
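
A minimal sketch of that approach (the filename data.csv and the utf-8 encoding are illustrative assumptions, not part of the original answer):

import numpy as np
import pandas as pd

# names=True takes the field names from the CSV's header row;
# dtype=None makes genfromtxt infer a type per column, yielding a
# structured array instead of a plain 2-D float array.
arr = np.genfromtxt('data.csv', delimiter=',', names=True,
                    dtype=None, encoding='utf-8')

# The DataFrame constructor picks up the structured array's field
# names and per-field dtypes, so no further type guessing is needed.
df = pd.DataFrame(arr)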

You can improve the efficiency of importing from a CSV file by specifying column names and their datatypes in your call to pandas.read_csv. If you have existing column headers in the file, you probably don't have to specify the names and can just use those, but I like to skip the header and specify names for completeness:

import pandas as pd
import numpy as np

fname = 'data.csv'  # placeholder path; substitute your own file
col_names = ['a', 'b', 'whatever', 'your', 'names', 'are']
col_types = {k: np.int32 for k in col_names}  # create the type dict
col_types['a'] = 'object'  # can change whichever ones you like
df = pd.read_csv(fname,
                 header=None,   # since we are specifying our own names
                 skiprows=[0],  # if you *do* have a header row, skip it
                 names=col_names,
                 dtype=col_types)

On a large sample dataset comprising mostly integer columns, this was about 20% faster for me than specifying dtype='object' in the call to pd.read_csv.

I would consider either HDF5 format or Feather format. Both of them are pretty fast (Feather might be faster, but HDF5 is more feature-rich, for example supporting reads from disk by index), and both of them store the column types, so they don't have to guess dtypes and they don't have to convert data types (for example strings to numeric or strings to datetimes) when loading data.
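
A round trip in either format might look like this (the filenames and the sample dataframe are placeholders; Feather requires pyarrow and HDF5 requires PyTables):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})  # any dataframe

# Feather: column types are stored in the file, so reading it back
# restores the exact dtypes without any inference.
df.to_feather('df.feather')
df = pd.read_feather('df.feather')

# HDF5: the key identifies this dataframe inside the file; with
# format='table' subsets can also be read back by index/query.
df.to_hdf('df.h5', key='df', mode='w', format='table')
df = pd.read_hdf('df.h5', 'df')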

Here are some speed comparisons:
