
Spark Dataframe from SQL Query

I'm attempting to use Apache Spark to load the results of a (large) SQL query with multiple joins and sub-selects into a Spark DataFrame, as discussed in Create Spark Dataframe from SQL Query.

Unfortunately, my attempts to do so result in an error from Parquet:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

I have seen information from Google implying that this error occurs when a DataFrame is empty. However, the same query loads plenty of rows in DBeaver.

Here is an example query:

(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
    FROM DBO.TEMP 
    WHERE  BUSINESS_DATE = '2019-06-18' 
    AND   STORE_NBR IN (999) 
    ORDER BY BUSINESS_DATE) as reports

My Spark code looks like this:

val reportsDataFrame = spark
  .read
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()

reportsDataFrame.show(10)

I read in the previous answer that it is possible to run queries against an entire database using this method, specifically by setting the "dbtable" parameter to an aliased query when you first build your DataFrame in Spark. You can see I've done this by aliasing the entire query "as reports".
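As a minimal sketch of that convention (the helper name `asDbTable` is my own, not part of any Spark API): a JDBC source will accept an arbitrary query as "dbtable" provided the query is parenthesized and given an alias, exactly as in the query above.

```scala
// Hypothetical helper (not a Spark API): wrap an arbitrary SQL query so it
// can be passed as the "dbtable" option. JDBC sources require the subquery
// to be parenthesized and aliased.
def asDbTable(query: String, alias: String): String =
  s"($query) AS $alias"

// Example: the question's query, flattened onto one line for brevity.
val reports = asDbTable(
  "SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY FROM DBO.TEMP WHERE BUSINESS_DATE = '2019-06-18'",
  "reports"
)
```

The resulting string can then be passed directly as the value of the "dbtable" option.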

I don't believe this to be a duplicate question. I've extensively researched this specific problem and have not found anyone facing the same issue online, in particular the Parquet error resulting from running the query.

It seems the consensus is that one should not be running SQL queries this way and should instead use the many methods on Spark's DataFrames to filter, group by, and aggregate data. However, it would be very valuable for us to be able to use raw SQL instead, even if it incurs a performance penalty.

A quick look at your code tells me you are missing .format("jdbc"):

val reportsDataFrame = spark
  .read
  .format("jdbc")
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()

This should work, provided you have a username and password set to connect to the database.
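As a sketch of how those credentials can be supplied (the values and the DB2 driver class below are placeholders for illustration, adjust them for your environment), the same connection settings can be passed through java.util.Properties to Spark's spark.read.jdbc(url, table, properties) overload:

```scala
import java.util.Properties

// Connection properties for the JDBC reader. The values below are
// placeholders; substitute your real DB2 credentials.
val connectionProperties = new Properties()
connectionProperties.setProperty("user", "db2user")     // placeholder username
connectionProperties.setProperty("password", "db2pass") // placeholder password
// Driver class for IBM DB2; the driver jar must be on Spark's classpath.
connectionProperties.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")

// With a SparkSession available, the read would then be:
//   val reportsDataFrame = spark.read.jdbc(db2JdbcUrl, queries.reports, connectionProperties)
```

Passing credentials this way keeps them out of the option-by-option builder chain, which can be convenient when the same properties are reused across several reads.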

A good resource to learn more about JDBC sources in Spark: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
