How do I project Parquet files in Spark?

Question

I load a data set from Parquet files as

val sqc = new org.apache.spark.sql.SQLContext(sc)
val data = sqc.parquetFile("f1,f2,f3,f4,f5")

Here the files "fN" all share the columns "c1" and "c2", but some of them may also have other columns.

Thus, when I do

data.registerAsTable("MyTable")

I get the error:

java.lang.RuntimeException: could not merge metadata: key pig.schema has conflicting values

The question is: how do I get those parquet files into a single table with just two columns?

I.e., how do I project them?

It would seem reasonable to load the "fN" files one by one, project each of them, then merge the results using unionAll.

Answer 1

The rough equivalent of a projection on a SchemaRDD is .select(), which takes one or more Expression instances and returns a new SchemaRDD containing only the selected fields. After doing the selects you can use unionAll as suggested, e.g.:

val sqc = new org.apache.spark.sql.SQLContext(sc)
import sqc._  
val file1 = sqc.parquetFile("file1").select('field1, 'field2)
val file2 = sqc.parquetFile("file2").select('field1, 'field2)
val all_files = file1.unionAll(file2)

The import sqc._ is required to load the implicit functions that build Expression instances from symbols.
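The same pattern can be extended to all five files from the question. The following is only a sketch, assuming the Spark 1.x SchemaRDD API used in this thread, an existing SparkContext named sc, and the paths "f1" to "f5" and columns "c1" and "c2" exactly as given in the question; the names paths and projected are made up for illustration.

```scala
import org.apache.spark.sql.{SQLContext, SchemaRDD}

val sqc = new SQLContext(sc) // sc: the existing SparkContext (assumed, as in the question)
import sqc._                 // implicits that turn 'c1 into an Expression

// Paths taken from the question; replace with the real locations.
val paths = Seq("f1", "f2", "f3", "f4", "f5")

// Load each file on its own and keep only the two shared columns.
val projected: Seq[SchemaRDD] =
  paths.map(p => sqc.parquetFile(p).select('c1, 'c2))

// Every projected SchemaRDD now has the same two columns,
// so they can be combined pairwise with unionAll.
val data = projected.reduce(_ unionAll _)
data.registerAsTable("MyTable")
```

Because each file is loaded separately, the conflicting metadata is never merged across files, which is what the single parquetFile call over all paths appears to trip over.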

Answer 2

Do you know how these files are generated?

If you do, then you already know the schemas and can categorize the files accordingly.

Otherwise I don't think there is another way: you need to load them one by one. Once the data is loaded into SchemaRDDs, you can call unionAll on them, but only if they share the same schema.

Check the sample code from the GitHub project https://github.com/pankaj-infoshore/spark-twitter-analysis where the Parquet files are handled.

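// Assumes sqlContext is an existing org.apache.spark.sql.SQLContext.
import sqlContext._

val path = "/home/infoshore/java/Trends/urls"
val files = new java.io.File(path).listFiles()
// Each Parquet data set is stored as a directory under `path`.
val parquetFiles = files.filter(file => file.isDirectory).map(file => file.getName)
// Load every data set as its own SchemaRDD.
val tweetsRDD = parquetFiles.map(pfile => sqlContext.parquetFile(path + "/" + pfile))
// unionAll works here only because all the files share one schema.
val allTweets = tweetsRDD.reduce((s1, s2) => s1.unionAll(s2))
allTweets.registerAsTable("tweets")
sqlContext.cacheTable("tweets")
val popularHashTags = sqlContext.sql("SELECT hashtags, usersMentioned, Url FROM tweets")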

Check how I have called unionAll. You cannot call unionAll on SchemaRDDs that represent different schemas.

Let me know if you need specific help.

Regards, Pankaj

2 answers

solution1
3 ACCEPTED 2015-01-21 18:26:05

solution2
1 2015-01-21 06:00:15
