How to add new categories to HDF5 in pandas?

Question

Answered: It appears that this datatype is not suited for adding arbitrary strings to an HDF5 store.

Background

I work with a script that generates single rows of results and appends them to a file on disk, one iteration at a time. To speed things up, I decided to use HDF5 containers rather than .csv. Benchmarking then revealed that strings slow HDF5 down. I was told this can be mitigated by converting the strings to the categorical dtype.

Issue

I have not been able to append categorical rows with new categories to HDF5. Also, I don't know how to control the dtype of cat.codes, which AFAIK can be done somehow.

Reproducible example:

1 - Create a large dataframe with categorical data

import pandas as pd
import numpy as np
from pandas import HDFStore, DataFrame
import random, string

dummy_data = [''.join(random.sample(string.ascii_uppercase, 5)) for i in range(100000)]
df_big = pd.DataFrame(dummy_data, columns = ['Dummy_Data'])
df_big['Dummy_Data'] = df_big['Dummy_Data'].astype('category')

2 - Create one row to append

df_small = pd.DataFrame(['New_category'], columns = ['Dummy_Data'])
df_small['Dummy_Data'] = df_small['Dummy_Data'].astype('category')

3 - Save (1) to HDF and try to append (2)

df_big.to_hdf('h5_file.h5', \
      'symbols_dict', format = "table", data_columns = True, append = False, \
       complevel = 9, complib ='blosc')

df_small.to_hdf('h5_file.h5', \
      'symbols_dict', format = "table", data_columns = True, append = True, \
       complevel = 9, complib ='blosc')

This results in the following exception:

ValueError: invalid combinate of [values_axes] on appending data [name->Dummy_Data,cname->Dummy_Data,dtype->int8,kind->integer,shape->(1,)] vs current table [name->Dummy_Data,cname->Dummy_Data,dtype->int32,kind->integer,shape->None]

My fixing attempts

I tried to adjust the dtype of cat.codes:

df_big['Dummy_Data'] = df_big['Dummy_Data'].cat.codes.astype('int32')
df_small['Dummy_Data'] = df_small['Dummy_Data'].cat.codes.astype('int32')

When I do this, the error disappears, but so does the categorical dtype:

df_test = pd.read_hdf('h5_file.h5', key='symbols_dict')
df_test.info()  # info() prints its report itself; the original referenced an undefined df_mydict

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100001 entries, 0 to 0       # The appending worked now
Data columns (total 1 columns):
Dummy_Data    100001 non-null int32      # Categorical dtype gone
dtypes: int32(1)                         # I need to change dtype of cat.codes of categorical    
memory usage: 1.1 MB                     # Not of categorical itself

In addition, df_small.info() does not show the dtype of cat.codes in the first place, which makes this difficult to debug. What am I doing wrong?
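Although info() hides it, the dtype of the codes can be read directly from .cat.codes.dtype. The following is a minimal sketch, using made-up category names rather than the data above, of how pandas picks the smallest integer type that fits the number of categories. This would explain the int8 vs int32 mismatch in the exception: one category on the appended row, tens of thousands in the stored table.

```python
import pandas as pd

# A single category: the codes fit into the smallest integer type.
few = pd.Series(["New_category"], dtype="category")

# 40,000 distinct (illustrative) categories: too many for int8 or int16.
many = pd.Series([f"cat_{i}" for i in range(40_000)], dtype="category")

few_codes_dtype = few.cat.codes.dtype
many_codes_dtype = many.cat.codes.dtype
print(few_codes_dtype, many_codes_dtype)
```

Because the width follows from the number of categories, casting cat.codes with astype() returns a plain integer Series and does not change the categorical itself.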

Questions

1. How do I properly change the dtype of cat.codes?
2. How do I properly append categorical data to HDF5 in Python?

Answer 1

If it helps, here is a rewrite of the beginning of your code. It works for me. Note that it skips the conversion to categorical, so the column is stored as plain strings.

import pandas as pd
from pandas import HDFStore, DataFrame
import random, string


def create_dummy(nb_iteration):

    dummy_data = [''.join(random.sample(string.ascii_uppercase, 5)) for i in range(nb_iteration)]
    df = pd.DataFrame(dummy_data, columns = ['Dummy_Data'])

    return df

df_small= create_dummy(53)
df_big= create_dummy(100000)

df_big.to_hdf('h5_file.h5', \
  'symbols_dict', format = "table", data_columns = True, append = False, \
  complevel = 9, complib ='blosc')

df_small.to_hdf('h5_file.h5', \
  'symbols_dict', format = "table", data_columns = True, append = True, \
  complevel = 9, complib ='blosc')

df_test = pd.read_hdf('h5_file.h5', key='symbols_dict')  # read back the file and key written above
df_test

Answer 2

I am not an expert on this, but from what I have seen, at least in the h5py module (http://docs.h5py.org/en/latest/high/dataset.html), HDF5 supports NumPy datatypes, which do not include any categorical datatype.

The same goes for PyTables, which pandas uses.

The categorical datatype is introduced and used in the pandas datatypes, and is described as follows:

A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R).

So what might be happening is that every time you add a new category, all existing categories have to be re-read from the HDF5 store so that pandas can reindex them?
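To illustrate why a new category touches the whole column, here is a small in-memory sketch (no HDF5 involved, sample values invented) that uses pandas.api.types.union_categoricals to merge an existing categorical with a new row. The merged result carries a new category list and a freshly encoded set of codes, which is the kind of re-encoding a stored table would also need.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Stand-ins for the data already stored and the row to append.
existing = pd.Categorical(["ABCDE", "FGHIJ", "ABCDE"])
new_row = pd.Categorical(["New_category"])

# The union keeps the existing categories and adds the unseen one at the end.
combined = union_categoricals([existing, new_row])

print(list(combined.categories))
print(list(combined.codes))
```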

From the docs in general, however, it appears that this datatype is not suited for adding arbitrary strings to an HDF5 store, unless you are sure that after a couple of additions there will be no new categories.
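If the full set of categories really is known up front, one option is to declare it once and reuse it for every frame. The sketch below assumes a hypothetical fixed vocabulary (the list named known) and only checks, in memory, that both frames end up with identical categories and the same codes dtype. I have not verified the HDF5 append itself here.

```python
import pandas as pd

# Hypothetical fixed vocabulary, decided before any data is written.
known = ["ABCDE", "FGHIJ", "New_category"]
shared_dtype = pd.CategoricalDtype(categories=known)

df_big = pd.DataFrame({"Dummy_Data": ["ABCDE", "FGHIJ", "ABCDE"]})
df_small = pd.DataFrame({"Dummy_Data": ["New_category"]})

# Both frames use the same dtype object, so their categories match exactly.
df_big = df_big.astype({"Dummy_Data": shared_dtype})
df_small = df_small.astype({"Dummy_Data": shared_dtype})

same_categories = df_big["Dummy_Data"].cat.categories.equals(
    df_small["Dummy_Data"].cat.categories
)
same_code_dtype = (
    df_big["Dummy_Data"].cat.codes.dtype == df_small["Dummy_Data"].cat.codes.dtype
)
print(same_categories, same_code_dtype)
```

Any value outside the declared list becomes NaN on conversion, so this only fits the "no new categories" case described above.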

As an additional note, unless your application demands extremely high performance, storing the data in SQL might be a better option: SQL has better support for strings, for one thing. For example, while SQLite was found to be slower than HDF5 in some tests, those tests did not include processing strings. Jumping from CSV to HDF5 sounds like jumping from a horse cart to a rocket, but perhaps a car or an airplane would work just as well (or better, since it has more options, to stretch the analogy)?
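As a rough sketch of the SQL route, the following appends rows of arbitrary strings to a SQLite table through pandas and the standard-library sqlite3 module. The table name is borrowed from the question's key, the sample values are invented, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

import pandas as pd

# ":memory:" keeps the sketch self-contained; pass a file path to store on disk.
conn = sqlite3.connect(":memory:")

df_big = pd.DataFrame({"Dummy_Data": ["ABCDE", "FGHIJ"]})
df_small = pd.DataFrame({"Dummy_Data": ["New_category"]})

# Create the table, then append a row containing a previously unseen string.
df_big.to_sql("symbols_dict", conn, if_exists="replace", index=False)
df_small.to_sql("symbols_dict", conn, if_exists="append", index=False)

# Read everything back in insertion order.
df_test = pd.read_sql("SELECT Dummy_Data FROM symbols_dict ORDER BY rowid", conn)
conn.close()
print(df_test)
```

No dtype bookkeeping is needed here because the column is stored as text, though whether this is fast enough depends on the workload.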
