
Converting pandas.DataFrame to bytes

I need convert the data stored in a pandas.DataFrame into a byte string where each column can have a separate data type (integer or floating point).我需要将存储在pandas.DataFrame中的数据转换为字节字符串,其中每一列都可以有一个单独的数据类型(整数或浮点数)。 Here is a simple set of data:下面是一组简单的数据:

import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

and df looks something like this:

    a            b                  c
0   10  18446744073709551615    1.324000e+10
1   15  230498234019            3.141590e+00
2   20  32094812309             2.341341e+02

The DataFrame knows the type of each column (df.dtypes), so I'd like to do something like this:

data_to_pack = [tuple(record) for _, record in df.iterrows()]
data_array = np.array(data_to_pack, dtype=list(zip(df.columns, df.dtypes)))
data_bytes = data_array.tobytes()

This typically works fine, but in this case it fails because of the maximum value stored in df['b'][0]. The second line above, which converts the list of tuples to an np.array with the given set of types, raises the following error:

OverflowError: Python int too large to convert to C long

The error results (I believe) from the first line, which extracts each record as a Series with a single data type (defaulting to float64), and the representation chosen in float64 for the maximum uint64 value is not directly convertible back to uint64.
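The precision loss described above can be demonstrated directly; a minimal sketch with NumPy alone:

```python
import numpy as np

big = np.iinfo('u8').max      # 18446744073709551615, the maximum uint64 value
as_float = np.float64(big)    # float64 has only a 53-bit mantissa, so this rounds

# The round trip through float64 does not recover the original integer
print(int(as_float) == big)
# False
```

Any integer above 2**53 can land between representable float64 values, which is why going through a float64 Series loses the uint64 column.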

1) Since the DataFrame already knows the types of each column, is there a way to avoid building a list of tuples as input to the typed numpy.array constructor? Or is there a better way than the one outlined above to preserve the type information in such a conversion?

2) Is there a way to go directly from a DataFrame to a byte string representing the data, using the type information for each column?

You can use df.to_records() to convert your DataFrame to a NumPy recarray, then call .tobytes() (the replacement for the deprecated .tostring()) to convert this to a string of bytes:

rec = df.to_records(index=False)

print(repr(rec))
# rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
#  (20, 32094812309, 234.1341)],
#           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

s = rec.tobytes()
rec2 = np.frombuffer(s, dtype=rec.dtype)

print(np.all(rec2 == rec))
# True
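As a check, the byte string can be round-tripped all the way back into a DataFrame with the per-column dtypes intact; a self-contained sketch of the full cycle:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

rec = df.to_records(index=False)            # structured array carrying the dtypes
raw = rec.tobytes()                         # serialize to a plain byte string
rec2 = np.frombuffer(raw, dtype=rec.dtype)  # parse the bytes back using the saved dtype

df2 = pd.DataFrame.from_records(rec2)       # rebuild the DataFrame
print(df2.dtypes)                           # a: uint8, b: uint64, c: float64
print((df2 == df).all().all())
# True
```

Note that the receiver must know rec.dtype to decode the bytes; the byte string itself carries no type information.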
Another option is to serialize the DataFrame to JSON and encode the result as bytes:

import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df_byte = df.to_json().encode()
print(df_byte)
