Reading a SAS file to get meta information

Question

I am very new to data science technologies and am currently working on reading a SAS file (.sas7bdat).

I am able to read the file using:

from sas7bdat import SAS7BDAT

with SAS7BDAT('/dbfs/mnt/myMntScrum1/sasFile.sas7bdat') as f:
    for row in f:
        print(row)

Printing each row shows all the data.

When we view SAS files in a SAS viewer, we can see metadata, e.g. label information and the variables (column names) used in the actual data.

How can I read this metadata in Spark (Databricks) using Python?

Answer 1

Did you try pyreadstat?

It can read the metadata directly.

    import pyreadstat

    df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')
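To illustrate what can be done with the returned `meta` object, here is a minimal sketch that pairs each variable name with its label through the `column_names` and `column_labels` attributes. The helper `variable_table` and the `demo_meta` stand-in (with made-up names and labels) are hypothetical and only there so the snippet runs without a SAS file; in practice you would pass the `meta` returned by `pyreadstat.read_sas7bdat`.

```python
from types import SimpleNamespace


def variable_table(meta):
    """Pair each variable name with its label (None if it has no label)."""
    return list(zip(meta.column_names, meta.column_labels))


# Stand-in for the `meta` object returned by pyreadstat.read_sas7bdat;
# the names and labels here are made up for illustration.
demo_meta = SimpleNamespace(
    column_names=["ID", "AGE", "SEX"],
    column_labels=["Subject identifier", "Age in years", None],
)

rows = variable_table(demo_meta)
for name, label in rows:
    print(f"{name:<5} {label or '(no label)'}")
```

With a real file, `variable_table(meta)` should give one (name, label) pair per column of `df`.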

Answer 2

Most data analysis in Python is done with the pandas library, which has a function called read_sas that preserves metadata such as the column names. Unless you are required to use Spark, I strongly recommend pandas. Here is a set of instructions for SAS users: https://blog.dominodatalab.com/pandas-for-sas-users-part-1/

Answer 3

You can use an external Spark package called spark-sas7bdat to read sas_file_name.sas7bdat.

Installation instructions for a Spark application are at https://spark-packages.org/package/saurfang/spark-sas7bdat and there are some examples on its GitHub page: https://github.com/saurfang/spark-sas7bdat

Then just use the Spark read method:

# The trailing backslash continues the method chain onto the next line.
df = spark.read.format("com.github.saurfang.sas.spark") \
          .load("path to the sas_file_name.sas7bdat", inferLong=True)

Answer 4

If you are interested in the metadata only, you can use pyreadstat and pass the metadataonly parameter as True. It will not read any data, just the metadata, so the size of the file has no impact on the time required to read the metadata.

import pyreadstat

df, meta = pyreadstat.read_sas7bdat('/dbfs/mnt/myMntScrum1/sasFile.sas7bdat', metadataonly=True)

Note that df will be an empty dataframe when metadataonly=True is passed. Omit the parameter if you want both the data and the metadata.

You can access the variable labels using meta.column_names_to_labels, which gives a dictionary where the variable name is the key and the variable label is the value.

Other useful metadata are meta.number_columns, meta.number_rows, meta.file_encoding, meta.file_label, etc.

You can find the complete list of available metadata in the pyreadstat documentation.
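The attributes mentioned in this answer can be gathered into one dictionary, as in the sketch below. The function names `summarize_metadata` and `read_sas_metadata` are hypothetical helpers, and `fake_meta` is a stand-in with made-up values so the snippet runs without pyreadstat or a SAS file; only `read_sas_metadata` touches pyreadstat.

```python
from types import SimpleNamespace


def summarize_metadata(meta):
    """Collect a few pyreadstat metadata fields into a plain dict."""
    return {
        "rows": meta.number_rows,
        "columns": meta.number_columns,
        "encoding": meta.file_encoding,
        "file_label": meta.file_label,
        # variable name -> variable label
        "labels": dict(meta.column_names_to_labels),
    }


def read_sas_metadata(path):
    """Read only the metadata of a .sas7bdat file (requires pyreadstat)."""
    import pyreadstat  # imported lazily so the helper above works without it

    _, meta = pyreadstat.read_sas7bdat(path, metadataonly=True)
    return summarize_metadata(meta)


# Stand-in object with made-up values, used only to show the output shape.
fake_meta = SimpleNamespace(
    number_rows=3,
    number_columns=2,
    file_encoding="UTF-8",
    file_label="Demo table",
    column_names_to_labels={"AGE": "Age in years", "SEX": "Sex of subject"},
)
summary = summarize_metadata(fake_meta)
for name, label in summary["labels"].items():
    print(f"{name}: {label}")
```

On Databricks, a call such as `read_sas_metadata('/dbfs/mnt/myMntScrum1/sasFile.sas7bdat')` would be the real-file equivalent, assuming pyreadstat is installed on the cluster.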

4 answers

Answer 1
2 2019-04-04 14:02:00

Answer 2
1 2018-05-31 21:16:03

Answer 3
0 Accepted 2020-11-11 21:28:08

Answer 4
0 2021-06-10 00:46:17
