Pandas fails with correct data type while reading a SAS file

I have a SAS dataset, and when I run it in SAS I get the following output:

[image: SAS output]

I also have the following Python code, which reads the .sas7bdat file and displays the output, i.e. the first five observations.

import pandas as pd
file_name = "cars.sas7bdat"
my_df = pd.read_sas(file_name)
my_df = my_df.head()
print(my_df)

[image: output of the Python code]

As you can see, it doesn't work correctly when it comes to integer data types. The CYL and WGT variables are integers, but they are not displayed correctly when I use pandas' read_sas function.

Any idea what is going on here?

SAS represents all numbers as 64-bit (8-byte) floating-point numbers, but you can save disk space by telling it to store fewer than 8 bytes per value. The dataset you posted did this for CYL and WGT.

[image]

When SAS reads the dataset back from disk, it sets the missing least significant bytes to binary zeros. Apparently read_sas did not account for this, and instead of setting the missing bytes to binary zeros it did something else. Hence the seemingly random data.
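The truncation described above can be sketched with Python's standard `struct` module. This is only an illustration, not the actual SAS or pandas code: the helper names are made up, the bytes are shown in big-endian order for readability (the real on-disk order depends on the platform that wrote the file), and the stored length of 3 bytes is an assumption, since the post does not state the length used for CYL and WGT.

```python
import struct

def truncate_double(value: float, length: int) -> bytes:
    """Keep only the `length` most significant bytes of a big-endian IEEE 754 double."""
    if not 3 <= length <= 8:
        raise ValueError("length must be between 3 and 8")
    return struct.pack(">d", value)[:length]

def restore_double(stored: bytes, filler: bytes = b"\x00") -> float:
    """Pad the truncated bytes back to 8 bytes with `filler` and decode them."""
    missing = 8 - len(stored)
    padded = stored + (filler * missing)[:missing]
    return struct.unpack(">d", padded)[0]

def max_exact_integer(length: int) -> int:
    """Largest n such that every integer in 0..n survives truncation to `length` bytes.

    A double has 1 sign bit and 11 exponent bits, so `length` bytes leave
    8 * length - 12 explicit mantissa bits, plus one implicit leading bit.
    """
    return 2 ** (8 * length - 11)

stored = truncate_double(8.0, 3)        # hypothetical stored length of 3 bytes
print(stored.hex(" "))                  # 40 20 00
print(restore_double(stored))           # zero padding restores 8.0
print(restore_double(stored, b"\xff"))  # non-zero padding gives a value slightly above 8
print(max_exact_integer(3))             # 8192
```

Small integers survive the truncation exactly because all of their significant bits sit in the leading bytes. The value only goes wrong if the reader fills the dropped bytes with something other than zeros.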

The first value of CYL is 8, which in IEEE floating point has the hex code

40 20 00 00 00 00 00 00

The value you displayed, 8.000046, would instead have this hex code:

40 20 00 06 07 80 FD C1
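To check the two hex codes, they can be decoded with the standard `struct` module. This is a minimal sketch; `decode_double` is a hypothetical helper name, and the bytes are read in big-endian order, matching the way they are written above.

```python
import struct

def decode_double(hex_bytes: str) -> float:
    """Decode 8 space-separated hex bytes as a big-endian IEEE 754 double."""
    raw = bytes.fromhex(hex_bytes)
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack(">d", raw)[0]

clean = decode_double("40 20 00 00 00 00 00 00")
noisy = decode_double("40 20 00 06 07 80 FD C1")

print(clean)          # 8.0
print(noisy)          # a value slightly above 8
print(noisy - clean)  # the error comes from the five trailing bytes
```

The first three bytes of both codes are identical. Only the trailing bytes differ, which fits the explanation that the bytes SAS left out of the file were filled with something other than zeros.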

I finally solved the issue. This definitely looks like a pandas bug. I used the sas7bdat library directly instead, installing it with:

pip install sas7bdat

Then I ran the following code:

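from sas7bdat import SAS7BDAT

# Placeholder: directory containing the file, including a trailing separator.
file_path = ""
file_name = file_path + "cars.sas7bdat"

# Read the file with the sas7bdat library and convert it to a pandas DataFrame.
foo = SAS7BDAT(file_name)
my_df = foo.to_data_frame()
my_df = my_df.head()
print(my_df)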

After running the above code, I get the following output in Python:

[image: output of the sas7bdat code]

So I get the output with the correct data types displayed.

I hope the pandas developers find a solution for the bug described above.
