Spark Dataframe from SQL Query

I'm attempting to use Apache Spark to load the results of a (large) SQL query with multiple joins and sub-selects into a DataFrame, as discussed in Create Spark Dataframe from SQL Query.

Unfortunately, my attempts to do so result in an error from Parquet:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

I have seen suggestions online implying that this error occurs when the DataFrame is empty. However, the same query returns plenty of rows in DBeaver.

Here is an example query:

(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
 FROM DBO.TEMP
 WHERE BUSINESS_DATE = '2019-06-18'
   AND STORE_NBR IN (999)
 ORDER BY BUSINESS_DATE) as reports

My Spark code looks like this:

val reportsDataFrame = spark
  .read
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()

reportsDataFrame.show(10)

I read in the linked answer that it is possible to run queries against an entire database with this method: if you set the "dbtable" option to an aliased subquery when you first build your DataFrame, Spark will push that query down to the database. You can see I've done this by aliasing the entire query "as reports".

I don't believe this is a duplicate question. I've researched this specific problem extensively and have not found anyone else online hitting this Parquet error when running a query this way.

It seems the consensus is that one should not run SQL queries this way and should instead use the many methods on Spark's DataFrames to filter, group, and aggregate data. However, it would be very valuable for us to be able to use raw SQL, even if it incurs a performance penalty.

A quick look at your code tells me you are missing .format("jdbc"):

val reportsDataFrame = spark
  .read
  .format("jdbc")
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()

This should work, provided you have the username and password set to connect to the database.
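For example, the credentials and driver class can be passed as additional options. This is a sketch: the user, password, and driver values below are placeholders, so substitute the ones for your DB2 instance.

```scala
// Sketch: the same read with explicit credentials and the DB2 JDBC driver.
// "db2user", "db2password", and db2JdbcUrl/queries.reports come from your
// own configuration; the values shown here are placeholders.
val reportsDataFrame = spark
  .read
  .format("jdbc")
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .option("user", "db2user")                      // placeholder
  .option("password", "db2password")              // placeholder
  .option("driver", "com.ibm.db2.jcc.DB2Driver")  // DB2 JDBC driver class
  .load()
```

The driver JAR must also be on the classpath (e.g. via --jars or --driver-class-path when submitting the job).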

A good resource to learn more about JDBC sources in Spark: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
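Those docs also describe a `query` option (available since Spark 2.4) that accepts a bare SELECT statement, so the `(...) as reports` aliasing wrapper around your query is not needed. A sketch, assuming a hypothetical `queries.reportsSelect` field that holds the unwrapped SELECT:

```scala
// Sketch: using the `query` option instead of an aliased `dbtable` subquery.
// `queries.reportsSelect` is a hypothetical field holding the raw SELECT
// without the surrounding "(...) as reports" wrapper.
val reportsDataFrame = spark
  .read
  .format("jdbc")
  .option("url", db2JdbcUrl)
  .option("query", queries.reportsSelect)
  .load()
```

Note that `query` and `dbtable` are mutually exclusive; specify only one of them.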
