Hadoop Hive SerDe row format for a space-delimited file with quoted strings

Question

I'm trying to create a Hive table for a log file with the following format.

Log file:

#Software: 1
#Version: 1
#Start-Date: xx
#Date: xx
#Fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation
#Remark: 3215330049 "SHPROD24A" "10.0.16.162" "main"
2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED "unavailable" -  407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" 10.0.10.198 529 296 - "unavailable" "unavailable"

Note:

The first 6 lines of every log file are comment lines (starting with a '#').
Every non-comment line has 27 fields. Some fields are plain space-delimited strings. Others are quoted strings that contain spaces within the field, e.g. "str ing".

Simple space-delimited parsing breaks because of the quoted string fields. For this reason, I'm trying to use a RegexSerDe pattern in the row format.

HiveQL Create Table Query:

CREATE TABLE test (date_field STRING, 
time_field  STRING, 
time_taken  STRING, 
c_ip  STRING, 
cs_username  STRING, 
cs_auth_group  STRING, 
x_exception_id  STRING, 
sc_filter_result  STRING, 
cs_categories  STRING, 
csReferer  STRING, 
sc_status  STRING, 
s_action  STRING, 
cs_method  STRING, 
rsContent_Type  STRING, 
cs_uri_scheme  STRING, 
cs_host  STRING, 
cs_uri_port  STRING, 
cs_uri_path  STRING, 
cs_uri_query  STRING, 
cs_uri_extension  STRING, 
csUser_Agent  STRING, 
s_ip  STRING, 
sc_bytes  STRING, 
cs_bytes  STRING, 
x_virus_id  STRING, 
x_bluecoat_application_name  STRING, 
x_bluecoat_application_operation  STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^\"]*\"|'[^']*'|[\S]+)+"
)
STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES ("skip.header.line.count"="6");

Results: Running

SELECT * FROM test LIMIT 10;

gives me this error: Failed with exception

java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns

I'm confused because I've got 27 fields in my table (verified with DESCRIBE), and I've got 27 matches on the regex. I've got a table property to ignore the first 6 lines, so the comments shouldn't be a problem here. Given that, the error message doesn't make much sense to me.

I've tested the regex on https://regex101.com/ with positive results. The matches break the line down into the fields I want.
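That check can be reproduced with a short Python sketch. It assumes Python's `re` module behaves like Java's regex engine for this pattern, and the names `LINE`, `QUESTION_REGEX`, `TOKEN` and `FULL_REGEX` are illustrative only. Alongside the number of matches it reports the number of capturing groups, which is the quantity the error message talks about. The last part is one possible way (an assumption, not a confirmed fix) to get a pattern with one group per column: repeat the token group 27 times, separated by whitespace.

```python
import re

# Sample data line from the log excerpt above (note the double space before 407).
LINE = (
    '2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED '
    '"unavailable" -  407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" '
    '10.0.10.198 529 296 - "unavailable" "unavailable"'
)

# The pattern from the CREATE TABLE statement, without the HiveQL string escaping.
QUESTION_REGEX = r'''("[^"]*"|'[^']*'|[\S]+)+'''

pattern = re.compile(QUESTION_REGEX)

# A global search finds one match per field ...
matches = [m.group(0) for m in pattern.finditer(LINE)]
print("matches found:", len(matches))

# ... but the pattern itself contains a single pair of capturing parentheses.
print("capturing groups:", pattern.groups)

# Hypothetical alternative: one token group per column, joined by whitespace,
# so that a single match over the whole line yields 27 groups.
TOKEN = r'''("[^"]*"|'[^']*'|\S+)'''
FULL_REGEX = r'\s+'.join([TOKEN] * 27)

full = re.compile(FULL_REGEX)
m = full.fullmatch(LINE)
print("capturing groups in the joined pattern:", full.groups)
print("first field:", m.group(1))
print("User-Agent field:", m.group(21))
```

Matches and groups are different things: a repeated group such as `(...)+` still counts as one group, however many times it matches along the line.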

I've tried swapping in various other regex patterns, without any luck.

Any suggestions or hints about what could be going wrong here?

Thanks in advance!

Answer 1

I tried the regex below and can get 26 fields out of it; I'm not sure which field is missing.

^(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(".*")\s(\S+)\s(\S+)\s(\S+)\s(\S)\s(".*")\s(\S*)
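To see where the missing field goes, the groups in the regex above can be counted and inspected against the sample line from the question. This is a minimal Python sketch, assuming Python's `re` treats these constructs the same way as Java's regex engine; `LINE`, `ANSWER_REGEX` and `FIXED_REGEX` are illustrative names. In this sketch the first group, `(\S+\s\S+)`, captures the date and the time together, so splitting it into two groups is one way to reach 27.

```python
import re

# Sample data line from the question (note the double space before 407).
LINE = (
    '2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED '
    '"unavailable" -  407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" '
    '10.0.10.198 529 296 - "unavailable" "unavailable"'
)

# The regex from this answer, split across lines only for readability.
ANSWER_REGEX = (
    r'^(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)'
    r'\s*(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)'
    r'\s(".*")\s(\S+)\s(\S+)\s(\S+)\s(\S)\s(".*")\s(\S*)'
)

answer = re.compile(ANSWER_REGEX)
am = answer.match(LINE)
print("groups in the answer's regex:", answer.groups)
print("group 1:", am.group(1))  # date and time end up in the same group

# Hypothetical variant: capture date and time separately.
FIXED_REGEX = ANSWER_REGEX.replace(r'^(\S+\s\S+)', r'^(\S+)\s(\S+)', 1)

fixed = re.compile(FIXED_REGEX)
fm = fixed.match(LINE)
print("groups in the variant:", fixed.groups)
print("date:", fm.group(1), "time:", fm.group(2))
```

The rest of the pattern is tailored to this one sample line: `(\S)` accepts only a single character, and the `(".*")` groups require quotes, so a line with a longer x-virus-id or an unquoted `-` in those positions would not match.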
