
How to use a .sas or .sps metadata file to read a CSV as a Pandas dataframe?

I have a big CSV file that comes with two metadata description files. One has a .sas extension and the other a .sps extension. Opening them, I can see that they describe the CSV data format and the possible categories of each column. For example, a column with values 1 or 2 is mapped to yes and no.

How can I use these metadata files to help me read the CSV file?

I can easily read it using read_csv, but these files would be useful for automatically creating my columns with the possible categories. I could write a parser for them, but there must be a package or function that does it. Maybe I'm not using the correct search terms.

Here is the .sas file (sorry, it is in Portuguese):

proc format;
Value $SG_AREA
        CH='Ciências Humanas'
        CN='Ciências da Natureza'
        LC='Linguagens e Códigos'
        MT='Matemática';

Value $TP_LINGUA
        0='Inglês'
        1='Espanhol';

Value $IN_ITEM_ADAPTADO
        0='Não'
        1='Sim';


DATA WORK.ITENS_2018;
INFILE 'C:\ITENS_PROVA_2018.csv' /*local do arquivo*/
        LRECL=33
        FIRSTOBS=2
        DLM=';'
        MISSOVER
        DSD ;

INPUT
        CO_POSICAO       : BEST2.
        SG_AREA          : $CHAR2.
        CO_ITEM          : BEST6.
        TX_GABARITO      : $CHAR1.
        CO_HABILIDADE    : BEST2.
        TX_COR           : $CHAR7.
        CO_PROVA         : BEST3.
        TP_LINGUA        : $CHAR1.
        IN_ITEM_ADAPTADO : $CHAR1. ;

ATTRIB  SG_AREA          FORMAT = $SG_AREA20.;         
ATTRIB  TP_LINGUA        FORMAT = $TP_LINGUA8.;       
ATTRIB  IN_ITEM_ADAPTADO FORMAT = $IN_ITEM_ADAPTADO3.;

LABEL
CO_POSICAO='Posição do Item na Prova'
SG_AREA='Área de Conhecimento do Item'
CO_ITEM='Código do Item'
TX_GABARITO='Gabarito do Item'
CO_HABILIDADE='Habilidade do Item'
TX_COR='Cor da Prova'
CO_PROVA='Identificador da Prova'
TP_LINGUA='Língua Estrangeira '
IN_ITEM_ADAPTADO='Item pertencente à prova adaptada para Ledor'

;RUN;

And here is the equivalent .sps file:

GET DATA
  /TYPE=TXT
  /FILE= "C:\ITENS_PROVA_2018.csv" /*local do arquivo*/ 
  /DELCASE=LINE
  /DELIMITERS=";"
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /IMPORTCASE= ALL
  /VARIABLES=
CO_POSICAO F2.0
SG_AREA A2
CO_ITEM F6.0
TX_GABARITO A1
CO_HABILIDADE F2.0
TX_COR A7
CO_PROVA F3.0
TP_LINGUA A1       
IN_ITEM_ADAPTADO A1.
CACHE.
EXECUTE.
DATASET NAME ITENS_18 WINDOW=FRONT.

VARIABLE LABELS
CO_POSICAO  Posição do Item na Prova
SG_AREA     Área de Conhecimento do Item
CO_ITEM     Código do Item
TX_GABARITO Gabarito do Item
CO_HABILIDADE   Habilidade do Item
TX_COR      Cor da Prova
CO_PROVA    Identificador da Prova
TP_LINGUA       Língua Estrangeira
IN_ITEM_ADAPTADO    Item pertencente à prova adaptada para Ledor.


VALUE LABELS
SG_AREA
        "CH"    Ciências Humanas
        "CN"    Ciências da Natureza
        "LC"    Linguagens e Códigos
        "MT"    Matemática
/TP_LINGUA
        0   Inglês
        1   Espanhol
/IN_ITEM_ADAPTADO
        0   Não
        1   Sim.

You can see that they describe the metadata for each column.

.sas is the program file extension for SAS and is designed to be used via SAS. It is essentially a command file serving as a dictionary file.

.sps is the program file extension for SPSS and is designed to be used via SPSS. It is essentially a command file serving as a dictionary file. I'd give a handy link here too, but SPSS is an IBM product and their documentation is a hellish landscape none should tread.

What you're trying to do should be possible despite that. Pandas by itself is insufficient, as it has no built-in functions for these command files. Pandas support for SAS extends only to data files (.sas7bdat and XPORT), and for SPSS only to .sav data files.

Python (and Pandas) can open .sas and .sps files because they are plain text, but has no built-in way to interpret them.
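Because the files are plain text, a small hand-written parser is also an option, as the question suggests. The following is only a minimal sketch, not a general SAS parser: it assumes the `Value $NAME ... ;` layout shown in the question, that no label contains a semicolon or a single quote, and that each format is named after the column it applies to (true for the ATTRIB lines in this file). The function names and the sample CSV rows are made up for illustration.

```python
import io
import re

import pandas as pd

# PROC FORMAT section of the question's .sas file (Portuguese accents written out).
SAS_TEXT = """
proc format;
Value $SG_AREA
        CH='Ciências Humanas'
        CN='Ciências da Natureza'
        LC='Linguagens e Códigos'
        MT='Matemática';

Value $TP_LINGUA
        0='Inglês'
        1='Espanhol';

Value $IN_ITEM_ADAPTADO
        0='Não'
        1='Sim';
"""

# Hypothetical sample rows in the layout described by the INPUT statement.
CSV_TEXT = """CO_POSICAO;SG_AREA;CO_ITEM;TX_GABARITO;CO_HABILIDADE;TX_COR;CO_PROVA;TP_LINGUA;IN_ITEM_ADAPTADO
1;LC;11111;A;5;AZUL;447;0;0
2;LC;22222;C;7;AZUL;447;1;0
46;CH;33333;B;12;AZUL;447;;1
"""


def parse_sas_value_formats(text):
    """Return {format_name: {code: label}} for every `Value $NAME ... ;` block."""
    formats = {}
    # A block starts at "Value $NAME" and runs up to the next semicolon.
    for name, body in re.findall(r"(?is)\bvalue\s+\$?(\w+)(.*?);", text):
        # Inside a block, every entry looks like CODE='label'.
        formats[name] = dict(re.findall(r"(\w+)\s*=\s*'([^']*)'", body))
    return formats


def apply_value_formats(df, formats):
    """Replace coded columns with labelled categoricals (assumes format name == column name)."""
    out = df.copy()
    for column, mapping in formats.items():
        if column in out.columns:
            labels = out[column].map(mapping)  # codes without a label become NaN
            out[column] = pd.Categorical(labels, categories=list(mapping.values()))
    return out


formats = parse_sas_value_formats(SAS_TEXT)

# Read the coded columns as strings so that "0" in the CSV matches the key "0".
raw = pd.read_csv(io.StringIO(CSV_TEXT), sep=";", dtype={name: str for name in formats})
df = apply_value_formats(raw, formats)
print(df.dtypes)
```

The coded columns are read as strings on purpose: the keys recovered from the .sas file are text, so letting pandas infer integers for TP_LINGUA would make every lookup miss.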


Here are two paths you can take to get what you're after.

1) Install a trial version of SAS or SPSS, use it to read the data, and then export it in an alternative format.

2) Install and attempt to use the pyreadstat package alongside Pandas.

It sounds like pandas is your preferred framework, and for that to work you'll need to extend what it can do, in this case with pyreadstat. It is designed to work with SAS and SPSS data files and processes them far more efficiently than pandas by itself. This solution comes with a caveat.

Pyreadstat is itself a wrapper around ReadStat. Quoting the pyreadstat readme file:

This module is a wrapper around the excellent Readstat C library by Evan Miller. 
Readstat is the library used in the back of the R library Haven, 
meaning pyreadstat is a python equivalent to R Haven.

If you look only at the pyreadstat files, you won't find anything touching on .sas, .sps, or dictionary files in general. Instead, you'll want to look at the readme for ReadStat, which has a section specifically covering such circumstances.

I have not yet tested whether the ReadStat commands and functions for dictionary files are available through pyreadstat, so there is a possibility this will not work.

If you attempt this solution and it fails for you, follow up in the thread and I'll help you troubleshoot.

The clean way would be to export your SAS data as either XPORT or SAS7BDAT format files.

Afterwards you can use the pandas function pandas.read_sas:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html

import pandas as pd

df = pd.read_sas('test.sas7bdat')

If you have large files, you can use the chunksize parameter to read only a given number of lines at a time; read_sas then returns an iterator instead of a dataframe. Alternatively, you can use the iterator parameter to return an iterator for reading the file incrementally.
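As a rough illustration of chunked reading, here is a minimal sketch. It uses read_csv on an in-memory sample rather than read_sas, because no .sas7bdat file is available to demonstrate with; read_sas accepts a chunksize argument in the same way and is iterated the same way. The sample rows are made up, laid out like the CSV described in the question.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical sample rows; the real file would be passed as a path instead.
CSV_TEXT = """CO_POSICAO;SG_AREA;CO_ITEM;TX_GABARITO;CO_HABILIDADE;TX_COR;CO_PROVA;TP_LINGUA;IN_ITEM_ADAPTADO
1;LC;11111;A;5;AZUL;447;0;0
2;LC;22222;C;7;AZUL;447;1;0
46;CH;33333;B;12;AZUL;447;;1
91;CN;44444;D;3;AZUL;447;;0
136;MT;55555;E;9;AZUL;447;;0
"""

rows_per_chunk = []
area_counts = Counter()

# With chunksize set, read_csv returns an iterator of dataframes
# instead of loading the whole file at once.
reader = pd.read_csv(io.StringIO(CSV_TEXT), sep=";", chunksize=2)
for chunk in reader:
    rows_per_chunk.append(len(chunk))
    area_counts.update(chunk["SG_AREA"])  # aggregate without keeping the chunk around

print(rows_per_chunk)
print(dict(area_counts))
```

Only one chunk is held in memory at a time, so any per-chunk result worth keeping has to be accumulated explicitly, as the counter does here.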
