
Python numpy MemoryError - Loading multiple CSV files into HDF5 store and reading into DataFrame

(Using Python 3.3 and Pandas 0.12)

My question consists of two parts.

First

I'm trying to iteratively read/append multiple csv files - about 8GB in total - into an HDF5 store, based on this solution and this solution for creating a unique index. I started doing this because I read that it would produce a file that is fast to access and relatively small, and could therefore be read into memory. However, it turns out I end up with an h5 file that is 18GB. My (Windows) laptop has 8GB of RAM. My first question is: why is the resulting h5 file much larger than the sum of the original csv files? My second question is: why do I not get a unique index on the table?

My code is the following:

def to_hdf(path):
    """ Function that reads multiple csv files to HDF5 Store """
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)

    # Reading csv files from list_files function
    with pd.get_store(path) as store:
        for f in list_files():
            try:
                # Creating reader in chunks -- reduces memory load
                df = pd.read_csv(f, encoding='utf-8', chunksize=50000, index_col=False)
                try:
                    nrows = store.get_storer('ta_store').nrows
                except:
                    nrows = 0
                # Looping over chunks and storing them in store file, node name 'ta_data'
                for chunk in df:
                    # Append chunk to store called 'ta_data'
                    store.append('ta_data', chunk, index=False, min_itemsize={'Placement Ref': 50, 'Click Ref': 50})
            # Print filename if corrupt (i.e. CParserError)
            except (parser.CParserError, ValueError) as detail:
                print(f, detail)

    print("Finished reading to HDF5 store, continuing processing data.")

Second

The second part of my script reads the HDF5 store into a Pandas DataFrame. Why? Because I need to do some data transformations and filtering to get the final data that I want to output to a csv file. However, any attempt to read the HDF5 store results in a MemoryError, using the following piece of code:

def read_store(filename, node):
    df = pd.read_hdf(filename, node)
    # Some data transformation and filtering code below

Another example of when this error occurred was when I wanted to print the store to show that the index is not unique, using the following function:

def print_store(filename, node):
    store = pd.HDFStore(filename)
    print(store.select(node))

My question here is first of all how I can overcome this MemoryError issue. I'm guessing I need to reduce the size of the hdf5 file, but I'm quite new to programming/python/pandas so I would be very happy to receive any input. Secondly, I'm wondering whether reading the store into a Pandas DataFrame is the most efficient way to do my data transformations (creating one new column) and filtering (based on string and datetime values).

Any help is very much appreciated! Thanks :)

Edit

As requested, a censored sample from a csv file (first) and the output of ptdump -av (below).

csv sample

A               B   C               D       E           F           G         H                       I                   J       K               L                               M           N       O
4/28/2013 0:00  1   4/25/2013 20:34 View    Anon 2288 optional1   Optional2   Anon | 306742    252.027.323-306742  8.05    10303:41916417  14613669178715620788:10303      Duplicate   Anon  Display
4/28/2013 0:00  2   4/27/2013 13:40 View    Anon 2289 optional1   Optional2   Anon | 306742    252.027.323-306742  8.05    10303:41916417  14613669178715620788:10303      Duplicate   Anon  Display
4/28/2013 0:00  1   4/27/2013 23:41 View    Anon 5791 optional1   Optional2   Anon | 304142    478.323.464-304142  20.66   10304:37464168  14613663710835083509:10305      Duplicate   Anon  Display
4/28/2013 0:00  1   4/27/2013 16:18 View    Anon 4300 optional1   Optional2   Anon | 304142    196.470.934-304142  3.12    10303:41916420  15013670724970033908:291515610  Normal      Anon  Display

ptdump -av

/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/ta_data (Group) ''
  /ta_data._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['F', 'G'],
    encoding := 'UTF-8',
    index_cols := [(0, 'index')],
    info := {'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'values_block_2', 'F', 'G']]
/ta_data/table (Table(41957511,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=30, shape=(11,), dflt=b'', pos=3),
  "F": StringCol(itemsize=50, shape=(), dflt=b'', pos=4),
  "G": StringCol(itemsize=50, shape=(), dflt=b'', pos=5)}
  byteorder := 'little'
  chunkshape := (288,)
  /ta_data/table._v_attrs (AttributeSet), 27 attributes:
   [CLASS := 'TABLE',
    G_dtype := 'bytes400',
    G_kind := ['G'],
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'values_block_1',
    FIELD_3_FILL := b'',
    FIELD_3_NAME := 'values_block_2',
    FIELD_4_FILL := b'',
    FIELD_4_NAME := 'F',
    FIELD_5_FILL := b'',
    FIELD_5_NAME := 'G',
    NROWS := 41957511,
    F_dtype := 'bytes400',
    F_kind := ['F'],
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['J'],
    values_block_1_dtype := 'int64',
    values_block_1_kind := ['B'],
    values_block_2_dtype := 'bytes240',
    values_block_2_kind := ['E', 'O', 'A', 'H', 'C', 'D', 'L', 'N', 'M', 'K', 'I']]

Example transformation and filtering

df['NewColumn'] = df['I'].str.split('-').str[0]

mask = df.groupby('NewColumn').E.transform(lambda x: x.nunique() == 1).astype('bool')
df = df[mask]
  • You need to parse the dates in the csv; try adding parse_dates=['A','C'] when you read_csv. If you then do df.get_dtype_counts(), these columns should show up as datetime64[ns]; otherwise they are stored as strings, which take a lot of storage space and are not easy to work with.

  • The min_itemsize argument specifies the minimum size of the string columns 'F' and 'G'. It only guarantees that your strings don't exceed this limit, but it makes ALL rows of those columns that width. If you can lower it, it will cut your storage size.

  • You are not creating a unique index; there is a line missing from the code above. Add df.index = Series(df.index) + nrows after reading each chunk from read_csv (see the combined sketch after this list).

  • You need to iterate over the hdf in chunks, just as you do with the csv files; see here, and see the docs on compression here.
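
Pulling those points together, a minimal sketch of the rewritten write loop might look like this. The date column names 'A'/'C' come from the first bullet; the itemsize of 25, the blosc compression settings and the csv_files argument are assumptions you would adapt to your own data:

import os
import pandas as pd

def to_hdf(path, csv_files):
    """ Sketch: append csv files to an HDF5 store with parsed dates,
        tighter string columns and a globally unique integer index """
    if os.path.exists(path):
        os.remove(path)

    # complevel/complib turn on compression (see the compression docs)
    with pd.HDFStore(path, complevel=9, complib='blosc') as store:
        for f in csv_files:
            # parse_dates stores 'A' and 'C' as datetime64[ns] instead of strings
            reader = pd.read_csv(f, encoding='utf-8', chunksize=50000,
                                 index_col=False, parse_dates=['A', 'C'])
            for chunk in reader:
                # rows already in the store (0 if the node does not exist yet)
                nrows = store.get_storer('ta_data').nrows if 'ta_data' in store else 0
                # the missing line: shift the index so every stored row gets
                # a unique, monotonically increasing integer label
                chunk.index = pd.Series(range(nrows, nrows + len(chunk)))
                # keep min_itemsize as small as your longest string allows
                store.append('ta_data', chunk, index=False,
                             min_itemsize={'Placement Ref': 25, 'Click Ref': 25})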

It's not clear what your filtering is actually going to do; can you explain a bit more? You need to thoroughly understand how HDF storage works (e.g. you can append rows, but not columns); you will likely need to create a results table to which you append the transformed/filtered rows. You also need to understand how the indexing works: you need a way to access those rows (a globally unique index will do that, but depending on the structure of your data it might not be necessary).
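
As a rough sketch of that pattern (the 'ta_results' node name, the chunk size and the min_itemsize value are assumptions; also note that the group-wise filter is only exact if groups never span chunk boundaries):

import pandas as pd

def process_store(filename, csv_out, chunksize=500000):
    """ Sketch: stream the big table in chunks, append the transformed and
        filtered rows to a results node, then dump that node to csv """
    with pd.HDFStore(filename) as store:
        # start from a clean results node
        if 'ta_results' in store:
            store.remove('ta_results')
        for chunk in store.select('ta_data', chunksize=chunksize):
            chunk['NewColumn'] = chunk['I'].str.split('-').str[0]
            # keep groups where column E has a single unique value
            # (only exact per chunk -- groups must not span chunk boundaries)
            mask = chunk.groupby('NewColumn').E.transform(
                lambda x: x.nunique() == 1).astype('bool')
            # you may need min_itemsize entries for other string columns too
            store.append('ta_results', chunk[mask],
                         min_itemsize={'NewColumn': 25})
        # the filtered result should be small enough to read in one go
        store.select('ta_results').to_csv(csv_out, index=False)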
