
Problem loading csv into DataFrame in PySpark

I'm trying to aggregate a bunch of CSV files into one and output it to S3 in ORC format using an ETL job in AWS Glue. My aggregated CSV looks like this:

header1,header2,header3
foo1,foo2,foo3
bar1,bar2,bar3

I have a string representation of the aggregated CSV called aggregated_csv, whose content is header1,header2,header3\nfoo1,foo2,foo3\nbar1,bar2,bar3. I've read that PySpark has a straightforward way to convert CSV files into DataFrames (which I need so that I can leverage Glue's ability to easily output in ORC). Here is a snippet of what I've tried:

def f(glueContext, aggregated_csv, schema):
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
        #agg_file.seek(0)
        df = glueContext.read.csv(agg_file, schema=schema, header="true")
        df.show()

I've tried it both with and without the seek() call. When I don't call seek(), the job completes successfully but df.show() doesn't display any data other than the headers. When I do call seek(), I get the following exception:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-48-255.us-west-2.compute.internal:8020/user/root/header1,header2,header3\n;'

Since seek seems to change the behavior, and since the headers from my CSV appear in the exception string, I'm assuming the problem is somehow related to where the file cursor is when I pass the file to glueContext.read.csv(), but I'm not sure how to resolve it. If I uncomment the seek(0) call and add an agg_file.read() command, I can see the entire contents of the file as expected. What do I need to change so that I can successfully read a CSV file that I've just written into a Spark DataFrame?

I think you're passing the wrong argument to the csv function. I believe GlueContext.read returns a DataFrameReader, and DataFrameReader.csv() takes a path as its first argument, whereas you're passing a file-like object. Pass the path instead, and read only after the with block has exited so the data has been flushed and the file closed:

def f(glueContext, aggregated_csv, schema):
    # Write the aggregated CSV to disk first; the file is flushed and
    # closed when the with block exits.
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
    # Read by path, not by file object.
    df = glueContext.read.csv('somefile', schema=schema, header="true")
    df.show()

BUT, if all you want is to write an ORC file, and you already have the data as aggregated_csv, you can create a DataFrame directly out of a list of tuples:

df = spark.createDataFrame([('foo1','foo2','foo3'), ('bar1','bar2','bar3')], ['header1', 'header2', 'header3'])
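
If the rows aren't hard-coded, a minimal sketch (assuming a SparkSession named spark and the exact string layout shown above, with no quoted or embedded commas) could build the same DataFrame from the aggregated_csv string itself:

# Hedged sketch: split the aggregated_csv string into header and data rows.
# Assumes newline-separated rows and comma-separated fields with no quoting.
rows = aggregated_csv.strip().split('\n')
header = rows[0].split(',')
data = [tuple(r.split(',')) for r in rows[1:]]
df = spark.createDataFrame(data, header)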

Then, if you need a Glue DynamicFrame, use the fromDF function:

from awsglue.dynamicframe import DynamicFrame
dynF = DynamicFrame.fromDF(df, glueContext, 'myFrame')
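
For completeness, here is a hedged sketch of writing that DynamicFrame out as ORC through Glue; the S3 path is a placeholder, and this assumes the standard glueContext.write_dynamic_frame.from_options API:

# Write the DynamicFrame to S3 as ORC via Glue (placeholder path).
glueContext.write_dynamic_frame.from_options(
    frame=dynF,
    connection_type="s3",
    connection_options={"path": "s3://path"},
    format="orc")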

ONE MORE BUT: you don't need Glue to write ORC; Spark is totally capable of it. Just use the DataFrameWriter.orc() function:

df.write.orc('s3://path')
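
If the target prefix may already exist, a save mode can be set as well (a hedged example using 'overwrite'; other modes such as 'append' are also available):

# Overwrite any existing data at the placeholder path before writing ORC.
df.write.mode('overwrite').orc('s3://path')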
