
Create a data frame from nested partitioned files in S3 and load the partition column into the schema

The nested S3 partition files are laid out as below:

s3://alex/covid/exposure_date=2021-02-01/aec600b87b9a467d9395d2f5c0e2eeaa.parquet
s3://alex/covid/exposure_date=2020-05-01/dec600b87b9a467d9395d2f5c0e2eeaa.parquet
s3://alex/covid/exposure_date=2021-06-01/efe600b87b9a467d9395d2f5c0e2eeaa.parquet
s3://alex/covid/exposure_date=2021-08-01/acd600b87b9a467d9395d2f5c0e2eeaa.parquet

I tried different ways to read them:

Way 1:

df = spark.read.parquet("s3:/alex/covid/**/*.parquet")

Way 2:

df=spark.read.option("recursiveFileLookup","true").parquet("s3:/alex/covid/covid_contact_mock_data.parquet/")

Way 3:

df = spark.read.parquet("s3://alex/covid//**/*.parquet")

I am expecting the partition column to also be one of the data frame columns.
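For reference, Spark's partition discovery normally surfaces exposure_date as a column when the read points at the base path rather than at the leaf files; a minimal sketch, assuming the bucket layout above and a working S3 filesystem configuration:

# Reading the base path lets Spark parse the exposure_date=YYYY-MM-DD
# directory names into a partition column automatically.
df = spark.read.parquet("s3://alex/covid/")
df.printSchema()  # schema includes exposure_date alongside the file's own columns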

Locally I tried to do the same but got an error:

df = spark.read.parquet("C:/Users/mthma/covid_contact_mock_data.parquet/exposure_date=2020-12-03/*")

The error:

Exception in thread "globPath-ForkJoinPool-6-worker-115" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1215)
        at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1420)
        at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
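(As an aside: on Windows this UnsatisfiedLinkError usually means the Hadoop native layer is missing rather than anything being wrong with the path. A common workaround, assuming winutils.exe and hadoop.dll have been placed in C:/hadoop/bin, is to point HADOOP_HOME there before the SparkSession is created:)

import os

# C:/hadoop is an assumed location; its bin/ must contain winutils.exe
# and hadoop.dll. Set these before building the SparkSession.
os.environ["HADOOP_HOME"] = "C:/hadoop"
os.environ["PATH"] += os.pathsep + "C:/hadoop/bin"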

You are having trouble with wildcards.

At least in the local case, you can expand them to individual filenames with this:

import glob

folder = "C:/Users/mthma/covid_contact_mock_data.parquet/exposure_date=2020-12-03"
for file in glob.glob(f'{folder}/*'):
    df = spark.read.parquet(file)
    print(df.head())

https://docs.python.org/3/library/glob.html#glob.glob
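One caveat with reading leaf files individually: it bypasses Spark's partition discovery, so exposure_date will not appear in the schema. A sketch of one way to restore it, parsing the value out of the directory name and attaching it as a literal column (reusing the folder path above):

import glob
import os
from pyspark.sql.functions import lit

folder = "C:/Users/mthma/covid_contact_mock_data.parquet/exposure_date=2020-12-03"
# The partition value is encoded in the directory name itself.
exposure_date = os.path.basename(folder).split("=", 1)[1]

for file in glob.glob(f'{folder}/*'):
    df = spark.read.parquet(file).withColumn("exposure_date", lit(exposure_date))
    print(df.head())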

S3 is different. You should use the boto3 library to query S3 for individual path names.
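A sketch of that approach, assuming the bucket and prefix from the question and credentials already configured for boto3:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Enumerate every parquet object under the covid/ prefix, page by page.
paths = []
for page in paginator.paginate(Bucket="alex", Prefix="covid/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            paths.append(f"s3://alex/{obj['Key']}")

# Each path still carries exposure_date=... in it, so the partition value
# can be parsed out the same way as in the local example above.
print(paths)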
