Running a spark.sql query without any optimization (AWS Glue, Athena)

Question

I need to run a SQL query against Athena on AWS from a Glue Job (spark.sql).

My query is simple:

# Triple quotes are needed because the SQL string spans several lines.
df = spark.sql("""select * from hashes
               where year='2019' and month='10' and day='08'
               and myhashes in (%s) order by timestamp desc""" % (
               ",".join("'" + str(x) + "'" for x in myhashes)))

This code produces a string such as

select * from hashes where year='2019' 
     and month='10' and day='08' 
     and myhashes in (
    '06SN931', 
    '06SN931', 
    '06SP317', 
    ...........
    '86X0297'
    )

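It runs fine in Athena.

However, when I run it in the Glue Job, Spark appears to rewrite the query from IN to OR syntax, for example

where .... day ='08' and (myhashes = '06XH8V3' or myhashes = '06X68P4' or my.....) and it produces an error.

Here is the exception: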
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:759)
    ... 64 more


Caused by: MetaException(message:1 validation error detected: Value 'year = '2019' and month = '10' and day = '08' and (myhashes = '06XH58V3' or myhashes = '06X658P4' or myhashes = '45X42051' or myhashes = '15S03560' or myhashes = '10S2868' or myhashes = '416S2661' or myhashes = 'dDSD' or myhashes = 'DSSD' or myhashes = '13XE639' or myhashes = '06X668N7' or myhashes = '06X364T2' or 
.......
myhashes = '96S652207' or myhashes = '06X26365M' or myhashes = '10X560c89' or myhashes = '06X01N8' or )' 


at 'expression' failed to satisfy constraint: Member must have length less than or equal to 2048 (Service: AWSGlue; Status Code: 400; Error Code: ValidationException; Request ID: 83f7bc7b-0d10-11ea-9a8c-fdfadfa2a22b))
            at com.amazonaws.glue.catalog.converters.CatalogToHiveConverter.getHiveException(CatalogToHiveConverter.java:100)
            at com.amazonaws.glue.catalog.converters.CatalogToHiveConverter.wrapInHiveException(CatalogToHiveConverter.java:88)
            at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getCatalogPartitions(GlueMetastoreClientDelegate.java:948)
            at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getPartitions(GlueMetastoreClientDelegate.java:911)
            at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.listPartitionsByFilter(AWSCatalogMetastoreClient.java:1179)
            at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2255)
            ... 69 more

        End of LogType:stdout

Is there a way to disable Spark's internal SQL optimization?

Answer 1

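The error message indicates that the filter is too long: the 'expression' value sent to AWS Glue in the getPartitionsByFilter call exceeds 2048 characters. AWS Athena and AWS Glue have different constraints, which is why the same query runs in Athena but fails in the Glue Job.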
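To get a feel for how quickly the limit is reached, here is a minimal sketch that rebuilds a filter in the same shape as the one shown in the exception and measures its length. The helper names (`expanded_filter`, `fits_glue_limit`) are made up for illustration, and the exact text Spark generates may differ slightly, so treat the result as an estimate.

```python
# Limit quoted in the ValidationException from the question.
GLUE_EXPRESSION_LIMIT = 2048


def expanded_filter(hashes, year="2019", month="10", day="08"):
    """Approximate the OR-expanded filter shown in the exception message.

    Hypothetical helper: it mimics the shape of the logged expression,
    it is not what Spark itself calls.
    """
    partition_part = "year = '%s' and month = '%s' and day = '%s'" % (
        year, month, day)
    # One "myhashes = '...'" term per value, joined with " or ".
    or_part = " or ".join("myhashes = '%s'" % h for h in hashes)
    return "%s and (%s)" % (partition_part, or_part)


def fits_glue_limit(hashes):
    """True if the approximated filter stays within the 2048-character limit."""
    return len(expanded_filter(hashes)) <= GLUE_EXPRESSION_LIMIT
```

With 7-character values like the ones in the question, each extra value adds about 24 characters, so under these assumptions the filter crosses the limit after roughly 80 values.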

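If possible, once the number of values to compare grows large, filter the table ("hashes") by joining it with a table that contains the myhashes values instead of using SQL IN.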
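The join suggested above could look roughly like this in PySpark. This is a sketch, not a tested Glue job: the function name `filter_by_hashes` and its parameters are hypothetical, and it assumes `spark` is the session the Glue Job already provides and that the table has the `year`, `month`, `day`, `myhashes` and `timestamp` columns used in the question.

```python
def filter_by_hashes(spark, table_name, wanted_hashes, year, month, day):
    """Return rows of `table_name` whose myhashes value is in `wanted_hashes`.

    Hypothetical helper illustrating the join approach; `spark` is an
    existing SparkSession passed in by the caller.
    """
    # Small lookup DataFrame with one row per distinct wanted value.
    # An explicit schema string keeps this working for an empty list too.
    wanted_df = spark.createDataFrame(
        [(str(h),) for h in set(wanted_hashes)], "myhashes string"
    )
    # Keep only the short date filter as a literal predicate.
    day_df = spark.table(table_name).where(
        "year = '%s' and month = '%s' and day = '%s'" % (year, month, day)
    )
    # The inner join on the shared column replaces the long IN list.
    return day_df.join(wanted_df, on="myhashes", how="inner").orderBy(
        "timestamp", ascending=False
    )
```

It would be called as `df = filter_by_hashes(spark, "hashes", myhashes, "2019", "10", "08")`. The values now travel as data rather than as query text, so the literal filter stays the same length no matter how many values there are.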

Solution 1
0 2019-11-22 23:23:31
