I am reading data from S3 in the Parquet format and then processing it as a DataFrame. How can I efficiently iterate over the rows of a DataFrame? I know that the collect method loads the data into memory, so although my DataFrame is not big, I would prefer to avoid loading the complete data set into memory. How could I optimize the code below? Also, I am using indices to access columns in the DataFrame. Can I access them by column name instead (I know the names)?
DataFrame parquetFile = sqlContext.read().parquet("s3n://"+this.aws_bucket+"/"+this.aws_key_members);
parquetFile.registerTempTable("mydata");
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM mydata");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx)
{
    Map<String, String> props = new HashMap<>();
    props.put("field1", rddRows[rowIdx].get(0).toString());
    props.put("field2", rddRows[rowIdx].get(1).toString());
    // further processing
}
You can use the map function in Spark. It processes every row of the DataFrame without collecting the Dataset/DataFrame to the driver.
Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map(
    (MapFunction<Row, String>) row -> "Name:" + row.getString(0),
    Encoders.STRING());
namesDS.show();
If the per-row operations are complex, you can move them into a separate function.
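// Map function
Row doSomething(Row row) {
    // Get a column by name (COLUMN holds the column name)
    String field = row.getAs(COLUMN);
    // Construct a new row containing all the existing/modified columns,
    // e.g. with RowFactory.create(...). Here the row is returned unchanged.
    return row;
}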
This function can now be called from the DataFrame's map function:
StructType structType = namesDF.schema();
Dataset<Row> result = namesDF.map(
    (MapFunction<Row, Row>) row -> doSomething(row),
    RowEncoder.apply(structType));
Source: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
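On the second part of the question (access by column name), the per-row logic from the question could be pulled out into a small helper that looks values up by name. The following is only a sketch in plain Java, not part of the Spark API: the class `RowProps`, the method `toProps`, and the `getByName` accessor are hypothetical names introduced for illustration, and a `HashMap` stands in for a row so the example runs without Spark. It also handles null values, which the original `get(0).toString()` call would not.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not a Spark class): builds the props map for one row.
final class RowProps {

    // getByName is an assumed by-name accessor for a single row.
    // Null column values are kept as null instead of throwing on toString().
    static Map<String, String> toProps(List<String> columns,
                                       Function<String, Object> getByName) {
        Map<String, String> props = new HashMap<>();
        for (String column : columns) {
            Object value = getByName.apply(column);
            props.put(column, value == null ? null : value.toString());
        }
        return props;
    }

    public static void main(String[] args) {
        // A plain map stands in for a row here so the sketch runs without Spark.
        Map<String, Object> fakeRow = new HashMap<>();
        fakeRow.put("field1", 42);
        fakeRow.put("field2", null);

        Map<String, String> props =
                toProps(Arrays.asList("field1", "field2"), fakeRow::get);
        System.out.println(props);
    }
}
```

With Spark on the classpath, the accessor could be `name -> row.getAs(name)`, and `toProps` could be called from inside a `map` or `foreach` function so that rows are processed on the executors rather than collected into an array on the driver.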