Hadoop Hive SerDe Row Format for String Quoted Space delimited file
I'm trying to create a Hive table for a log file with the following format.
Log file:
#Software: 1
#Version: 1
#Start-Date: xx
#Date: xx
#Fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation
#Remark: 3215330049 "SHPROD24A" "10.0.16.162" "main"
2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED "unavailable" - 407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" 10.0.10.198 529 296 - "unavailable" "unavailable"
Note:
Simple space-delimited parsing breaks because of the quirk that some fields are quoted strings. For this reason, I'm trying to use a SerDe RegEx pattern in the row format.
HiveQL Create Table Query:
CREATE TABLE test (date_field STRING,
time_field STRING,
time_taken STRING,
c_ip STRING,
cs_username STRING,
cs_auth_group STRING,
x_exception_id STRING,
sc_filter_result STRING,
cs_categories STRING,
csReferer STRING,
sc_status STRING,
s_action STRING,
cs_method STRING,
rsContent_Type STRING,
cs_uri_scheme STRING,
cs_host STRING,
cs_uri_port STRING,
cs_uri_path STRING,
cs_uri_query STRING,
cs_uri_extension STRING,
csUser_Agent STRING,
s_ip STRING,
sc_bytes STRING,
cs_bytes STRING,
x_virus_id STRING,
x_bluecoat_application_name STRING,
x_bluecoat_application_operation STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\"[^\"]*\"|'[^']*'|[\S]+)+"
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES ("skip.header.line.count"="6");
Results: Running
SELECT * FROM test LIMIT 10;
gives me this error: Failed with exception
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns
I'm confused because I've got 27 fields in my table (verified with DESCRIBE), and I've got 27 matches on the Regex. I've got a table property to ignore the first 6 lines, so the comments shouldn't be a problem here. Given that, the error message doesn't make much sense to me.
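One thing worth checking is the difference between regex matches and capturing groups. The error text refers to groups, and as far as I understand RegexSerDe, it expects the pattern to match a whole line with one capturing group per column. A minimal Python sketch follows. Python's `re` is used purely for illustration, on the assumption that it counts groups the same way as Java's regex engine for this pattern. It compares how many separate matches the pattern above finds on the sample line with how many capturing groups it defines:

```python
import re

# Sample data line from the log above.
line = (
    '2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED '
    '"unavailable" - 407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" '
    '10.0.10.198 529 296 - "unavailable" "unavailable"'
)

# The input.regex from the CREATE TABLE statement, without the HiveQL escaping.
pattern = re.compile(r"""("[^"]*"|'[^']*'|[\S]+)+""")

# Scanning the line repeatedly, as an online regex tester does in global mode.
matches = [m.group(0) for m in pattern.finditer(line)]

print("separate matches:", len(matches))
print("capturing groups:", pattern.groups)
```

On the sample line this reports 27 matches but only 1 capturing group, which would be consistent with the "Number of matching groups doesn't match the number of columns" message for a 27-column table.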
I've tested the RegEx on https://regex101.com/ with positive results. The matches break down the fields as I want them.
I've tried switching the Regex pattern for various other configurations, without any luck.
Any suggestions or hints around what could be going wrong here?
Thanks in advance!
I tried the regular expression below and could get 26 fields; I'm not sure which field is missing.
^(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(".*")\s(\S+)\s(\S+)\s(\S+)\s(\S)\s(".*")\s(\S*)
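In the pattern above, the first group `(\S+\s\S+)` captures the date and the time together, which may be why it yields 26 groups rather than 27. As a sketch of an alternative, the Python snippet below builds a pattern with one capturing group per column and checks it against the sample line. The `field` sub-pattern and the use of Python's `re` are assumptions made for illustration. The resulting pattern has not been run in Hive, and backslashes may need to be doubled inside a HiveQL string literal.

```python
import re

NUM_COLUMNS = 27  # columns declared in the CREATE TABLE statement

# Hypothetical per-field pattern: a double-quoted string, or a run of non-spaces.
field = r'("[^"]*"|\S+)'

# One capturing group per column, separated by whitespace, anchored to the line.
pattern = re.compile("^" + r"\s+".join([field] * NUM_COLUMNS) + "$")

# Sample data line from the log above.
line = (
    '2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED '
    '"unavailable" - 407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" '
    '10.0.10.198 529 296 - "unavailable" "unavailable"'
)

m = pattern.match(line)

print("capturing groups:", pattern.groups)
print("date:", m.group(1), "time:", m.group(2))  # separate groups
print("user agent:", m.group(21))                # quoted field kept whole
```

Building the pattern from a repeated sub-pattern keeps the group count tied to the column count, so the two are less likely to drift apart than in a hand-written 27-group expression.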