Spark DataFrame ORC Hive table reading issue

I am trying to read a Hive table in Spark. Below is the Hive table format:

# Storage Information       
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde   
InputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat    
Compressed: No  
Num Buckets:    -1  
Bucket Columns: []  
Sort Columns:   []  
Storage Desc Params:        
    field.delim \u0001
    serialization.format    \u0001

When I try to read it using Spark SQL with the below command:

val c = hiveContext.sql("""select  
        a
    from c_db.c cs 
    where dt >=  '2016-05-12' """)
c.show

I am getting the below warning:

18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,

The read starts but it is very slow and eventually hits a network timeout.

When I try to read the Hive table directory directly, I get the below error.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true") 
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()

org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31, _col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

I could convert the Hive table to TextInputFormat, but that would be my last option, as I would like to keep the benefit of OrcInputFormat's compression of the table size.

I would really appreciate your suggestions.

I found a workaround: read the table like this:

val schema = spark.table("db.name").schema

spark.read.schema(schema).orc("/path/to/table")
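
With the metastore schema applied on top of the ORC files, the named columns from the table definition should resolve instead of the positional _colN names. A minimal sketch against the table and path from the question (names taken from the question, not tested against the actual data):

// Pull the column names and types from the Hive metastore definition
val schema = spark.table("c_db.c").schema

// Read the ORC files directly, but with the table's schema, so that
// the named column 'a' resolves instead of _colN
val c = spark.read.schema(schema).orc("/a/warehouse/c_db.db/c")
c.select("a").where("dt >= '2016-05-12'").show()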

The issue generally occurs with large tables, as the read fails at the maximum field length. I enabled reading through the metastore schema ( set spark.sql.hive.convertMetastoreOrc=true; ) and it worked for me.
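
As a rough sketch (reusing the question's hiveContext), the setting can be applied either through a SQL statement or directly on the context configuration before running the query:

// Option 1: set it through a SQL statement
hiveContext.sql("set spark.sql.hive.convertMetastoreOrc=true")

// Option 2: set it on the context configuration
hiveContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")

// With the conversion enabled, Spark reads the ORC files with its own reader
// plus the metastore schema, so named columns like 'a' should resolve
val c = hiveContext.sql("select a from c_db.c where dt >= '2016-05-12'")
c.show()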

I think the table doesn't have named columns, or if it does, Spark probably isn't able to read the names. You can use the default column names that Spark has given, as mentioned in the error, or you can set column names in the Spark code: use printSchema and the toDF method to rename the columns. But yes, you will need the mapping. This might require selecting and showing columns individually.
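
A minimal sketch of that rename approach, assuming you know which position maps to which name; the mapping below is a hypothetical example, not taken from the actual table:

// Read the ORC files directly; the columns come back as _col0, _col1, ...
val raw = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
raw.printSchema()  // inspect the positional column names

// Rename individual columns once the mapping is known
// (hypothetical mapping, for illustration only)
val partial = raw.withColumnRenamed("_col0", "a")
partial.select("a").show()

// Or rename everything at once with toDF, passing one name per column in order:
// val named = raw.toDF("a", "b", "dt", /* ...one name for every remaining _colN... */)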

Setting the ( set spark.sql.hive.convertMetastoreOrc=true; ) conf works, but it tries to modify the metadata of the Hive table. Can you please explain what it is going to modify and whether it affects the table? Thanks.
