Create a DataFrame from nested partitioned files in S3 and load it with the partition column in the schema
The nested S3 partition files look like this:
s3://alex/covid/exposure_date=2021-02-01/aec600b87b9a467d9395d2f5c0e2eeaa.parquet
s3://alex/covid/exposure_date=2020-05-01/dec600b87b9a467d9395d2f5c0e2eeaa.parquet
s3://alex/covid/exposure_date=2021-06-01/efe600b87b9a467d9395d2f5c0e2eeaa.parquet
s3://alex/covid/exposure_date=2021-08-01/acd600b87b9a467d9395d2f5c0e2eeaa.parquet
I tried several ways to read them:
Way 1:
df = spark.read.parquet("s3:/alex/covid/**/*.parquet")
Way 2:
df=spark.read.option("recursiveFileLookup","true").parquet("s3:/alex/covid/covid_contact_mock_data.parquet/")
Way 3:
df = spark.read.parquet("s3://alex/covid//**/*.parquet")
I expect the partition column to also appear as one of the DataFrame columns.
I tried the same thing locally, but I get an error:
df = spark.read.parquet("C:/Users/mthma/covid_contact_mock_data.parquet/exposure_date=2020-12-03/*")
The error:
Exception in thread "globPath-ForkJoinPool-6-worker-115" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1215)
at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1420)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
You are having trouble with wildcards.
At least in the local case, you can expand them into individual filenames like this:
import glob

folder = "C:/Users/mthma/covid_contact_mock_data.parquet/exposure_date=2020-12-03"
for file in glob.glob(f'{folder}/*'):
    df = spark.read.parquet(file)  # one DataFrame per file
    print(df.head())
https://docs.python.org/3/library/glob.html#glob.glob
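Reading one file at a time, as in the loop above, gives Spark only the file itself, so the exposure_date directory name does not become a column. The sketch below is one possible way to keep it: expand the wildcard with glob, then pass every file in a single call together with Spark's basePath read option. The helper names list_partition_files and read_with_partition_column are made up for illustration, spark is assumed to be an existing SparkSession, and the layout is assumed to have exactly one key=value directory level. Whether this also avoids the UnsatisfiedLinkError on Windows is untested.

```python
import glob
import os


def list_partition_files(base_dir):
    """Return every .parquet file one partition level below base_dir."""
    # Matches <base_dir>/<key>=<value>/<file>.parquet (hypothetical layout
    # with a single partition level, as in the question).
    pattern = os.path.join(base_dir, "*=*", "*.parquet")
    return sorted(glob.glob(pattern))


def read_with_partition_column(spark, base_dir):
    """Read all partition files in one call, keeping the partition column."""
    files = list_partition_files(base_dir)
    if not files:
        raise FileNotFoundError(f"no parquet files found under {base_dir}")
    # basePath tells Spark where partition discovery starts, so the
    # exposure_date=... directory name is exposed as a DataFrame column.
    return spark.read.option("basePath", base_dir).parquet(*files)
```

With a working session this would be called as read_with_partition_column(spark, "C:/Users/mthma/covid_contact_mock_data.parquet"). If directory listing works in your environment, pointing spark.read.parquet at the base directory itself normally discovers the partition column as well.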
S3 is different. You should use the boto3 library to query S3 for individual path names.
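A minimal sketch of that S3 listing, assuming boto3 is installed and AWS credentials are already configured. The function name list_parquet_uris and its client and scheme parameters are hypothetical, introduced only for this example. It uses boto3's list_objects_v2 paginator, and the default "s3" scheme simply mirrors the paths in the question (some Spark setups expect "s3a" instead).

```python
def list_parquet_uris(bucket, prefix, client=None, scheme="s3"):
    """List every .parquet object under bucket/prefix as a full URI."""
    if client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3
        client = boto3.client("s3")

    # list_objects_v2 returns at most 1000 keys per call; the paginator
    # follows the continuation tokens for us.
    paginator = client.get_paginator("list_objects_v2")
    uris = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no matching objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(".parquet"):
                uris.append(f"{scheme}://{bucket}/{key}")
    return uris
```

The resulting list could then be handed to Spark in one call, for example spark.read.option("basePath", "s3://alex/covid/").parquet(*list_parquet_uris("alex", "covid/")), where the basePath option is what should let exposure_date show up as a column.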