Reading parquet file in Spark from S3

Question

I am reading data from S3 in Parquet format and then processing it as a DataFrame. How can I efficiently iterate over the rows of a DataFrame? I know that collect loads the data into memory, so, although my DataFrame is not big, I would prefer to avoid loading the complete data set into memory. How could I optimize the code below? Also, I am using indices to access columns in the DataFrame. Can I access them by column name (I know the names)?

DataFrame parquetFile = sqlContext.read().parquet("s3n://"+this.aws_bucket+"/"+this.aws_key_members);
parquetFile.registerTempTable("mydata");
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM mydata");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx)
{
   Map<String, String> props = new HashMap<>();
   props.put("field1", rddRows[rowIdx].get(0).toString());
   props.put("field2", rddRows[rowIdx].get(1).toString());
   // further processing
}

Answer 1

You can use the map function in Spark. It lets you process every row of the DataFrame without collecting the Dataset/DataFrame.

Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map((MapFunction<Row, String>) row -> "Name:" + row.getString(0), Encoders.STRING());

namesDS.show();

If the operations you perform on each row are complex, you can move them into a separate function.

 // Map function
 Row doSomething(Row row) {
     // Get a column; COLUMN is assumed here to hold the column name
     String field = row.getAs(COLUMN);
     // Construct a new row and add all the existing/modified columns to it.
     // This placeholder returns the input row unchanged.
     return row;
 }

This function can now be passed to the DataFrame's map function:

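// The encoder schema must describe the rows that doSomething returns
StructType structType = namesDF.schema();
Dataset<Row> result = namesDF.map((MapFunction<Row, Row>) row -> doSomething(row),
        RowEncoder.apply(structType));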
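
On the second part of the question (accessing columns by name), here is a minimal sketch, assuming Spark 2.x or later with a SparkSession. It reads the columns with row.getAs(name) and walks the rows with toLocalIterator(), which brings one partition at a time to the driver instead of the whole result as collect() does. The class name, the sampleData helper and its two rows are hypothetical stand-ins for the Parquet data on S3; field1 and field2 are the names used in the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class IterateByNameSketch {

    // Builds the props map for one row, reading columns by name instead of by index.
    static Map<String, String> toProps(Row row) {
        Map<String, String> props = new HashMap<>();
        // The explicit <Object> keeps Java from resolving to String.valueOf(char[]).
        props.put("field1", String.valueOf(row.<Object>getAs("field1")));
        props.put("field2", String.valueOf(row.<Object>getAs("field2")));
        return props;
    }

    // Hypothetical in-memory stand-in for the Parquet data read from S3.
    static Dataset<Row> sampleData(SparkSession spark) {
        StructType schema = new StructType()
                .add("field1", DataTypes.StringType)
                .add("field2", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("a", 1),
                RowFactory.create("b", 2));
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("iterate-by-name")
                .getOrCreate();
        try {
            Dataset<Row> events = sampleData(spark);
            // Pulls one partition at a time to the driver, unlike collect().
            Iterator<Row> it = events.toLocalIterator();
            while (it.hasNext()) {
                Map<String, String> props = toProps(it.next());
                System.out.println(props);
                // further processing
            }
        } finally {
            spark.stop();
        }
    }
}
```

Note that getAs by name only works on rows that carry a schema, which is the case for rows coming out of a DataFrame. If the per-row processing does not need to happen on the driver, the map approach above keeps the work on the executors.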

Source : https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

solution1
0 2019-11-19 04:16:53
