
Converting pandas.DataFrame to bytes

I need convert the data stored in a pandas.DataFrame into a byte string where each column can have a separate data type (integer or floating point).我需要将存储在pandas.DataFrame中的数据转换为字节字符串,其中每一列都可以有一个单独的数据类型(整数或浮点数)。 Here is a simple set of data:下面是一组简单的数据:

import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

and df looks something like this:

    a            b                  c
0   10  18446744073709551615    1.324000e+10
1   15  230498234019            3.141590e+00
2   20  32094812309             2.341341e+02

The DataFrame knows the type of each column (df.dtypes), so I'd like to do something like this:

data_to_pack = [tuple(record) for _, record in df.iterrows()]
data_array = np.array(data_to_pack, dtype=list(zip(df.columns, df.dtypes)))
data_bytes = data_array.tobytes()

This typically works fine, but in this case it fails because of the maximum value stored in df['b'][0]. The second line above, which converts the list of tuples to an np.array with the given set of types, raises the following error:

OverflowError: Python int too large to convert to C long

The error results (I believe) from the first line, which extracts each record as a Series with a single data type (defaulting to float64), and the representation chosen in float64 for the maximum uint64 value is not directly convertible back to uint64.
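The precision loss described above can be demonstrated directly; a minimal sketch with NumPy alone:

```python
import numpy as np

big = np.iinfo('u8').max      # 18446744073709551615, the maximum uint64 value
as_float = np.float64(big)    # float64 has only a 53-bit mantissa, so this rounds

# The round trip through float64 does not recover the original integer
print(int(as_float) == big)
# False
```

Any integer above 2**53 can land between representable float64 values, which is why going through a float64 Series loses the uint64 column.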

1) Since the DataFrame already knows the types of each column, is there a way to avoid building a list of tuples as input to the typed numpy.array constructor? Or is there a better way than the one outlined above to preserve the type information in such a conversion?

2) Is there a way to go directly from a DataFrame to a byte string representing the data, using the type information for each column?

You can use df.to_records() to convert your DataFrame to a NumPy recarray, then call .tobytes() (the replacement for the deprecated .tostring()) to convert this to a string of bytes:

rec = df.to_records(index=False)

print(repr(rec))
# rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
#  (20, 32094812309, 234.1341)],
#           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

s = rec.tobytes()
rec2 = np.frombuffer(s, dtype=rec.dtype)

print(np.all(rec2 == rec))
# True
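As a check, the byte string can be round-tripped all the way back into a DataFrame with the per-column dtypes intact; a self-contained sketch of the full cycle:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

rec = df.to_records(index=False)            # structured array carrying the dtypes
raw = rec.tobytes()                         # serialize to a plain byte string
rec2 = np.frombuffer(raw, dtype=rec.dtype)  # parse the bytes back using the saved dtype

df2 = pd.DataFrame.from_records(rec2)       # rebuild the DataFrame
print(df2.dtypes)                           # a: uint8, b: uint64, c: float64
print((df2 == df).all().all())
# True
```

Note that the receiver must know rec.dtype to decode the bytes; the byte string itself carries no type information.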
Another option is to serialize the DataFrame to JSON and encode the result as bytes:

import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df_byte = df.to_json().encode()
print(df_byte)
