
Hadoop Hive SerDe Row Format for String Quoted Space delimited file

I'm trying to create a Hive table for a log file with the following format.

Log file:

#Software: 1
#Version: 1
#Start-Date: xx
#Date: xx
#Fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation
#Remark: 3215330049 "SHPROD24A" "10.0.16.162" "main"
2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED "unavailable" -  407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" 10.0.10.198 529 296 - "unavailable" "unavailable"

Note:

  • The first 6 lines of every log file are comment lines (starting with a '#').
  • Every line that is not a comment has 27 fields. Some fields are plain space-delimited strings. Other fields are space-delimited quoted strings that contain spaces within the field, e.g. "str ing".

Simple space-delimited parsing breaks because of the quoted string fields. For this reason, I'm trying to use a RegEx SerDe pattern in the row format.
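To make the problem concrete, here is a minimal Python sketch (Python is used only for illustration and is not part of the Hive setup) that tokenizes the sample data line two ways. The quote-aware pattern `"[^"]*"|\S+` is an assumption about how the quoted fields should be recognized; it is not taken from Hive.

```python
import re

# The sample data line from the log file above (note the double space before 407).
SAMPLE = (
    '2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED '
    '"unavailable" -  407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" '
    '10.0.10.198 529 296 - "unavailable" "unavailable"'
)

# Naive approach: split on whitespace, which cuts the quoted User-Agent into pieces.
naive_fields = SAMPLE.split()

# Quote-aware approach: a double-quoted run counts as one token, otherwise a run of non-whitespace.
quote_aware_fields = re.findall(r'"[^"]*"|\S+', SAMPLE)

print(len(naive_fields))        # 34
print(len(quote_aware_fields))  # 27
print(quote_aware_fields[20])   # the full quoted User-Agent string
```

Whitespace splitting yields 34 pieces instead of 27 because the quoted User-Agent value alone contains seven internal spaces.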

HiveQL Create Table Query:

CREATE TABLE test (date_field STRING, 
time_field  STRING, 
time_taken  STRING, 
c_ip  STRING, 
cs_username  STRING, 
cs_auth_group  STRING, 
x_exception_id  STRING, 
sc_filter_result  STRING, 
cs_categories  STRING, 
csReferer  STRING, 
sc_status  STRING, 
s_action  STRING, 
cs_method  STRING, 
rsContent_Type  STRING, 
cs_uri_scheme  STRING, 
cs_host  STRING, 
cs_uri_port  STRING, 
cs_uri_path  STRING, 
cs_uri_query  STRING, 
cs_uri_extension  STRING, 
csUser_Agent  STRING, 
s_ip  STRING, 
sc_bytes  STRING, 
cs_bytes  STRING, 
x_virus_id  STRING, 
x_bluecoat_application_name  STRING, 
x_bluecoat_application_operation  STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^\"]*\"|'[^']*'|[\S]+)+"
)
STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES ("skip.header.line.count"="6");

Results: Running

SELECT * FROM test LIMIT 10;

gives me this error: Failed with exception

java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns

I'm confused because I've got 27 fields in my table (verified with DESCRIBE), and I've got 27 matches on the Regex. I've got a table property to ignore the first 6 lines, so the comments shouldn't be a problem here. Given all this, the error message doesn't make much sense to me.
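One possible explanation, offered as a hedged guess rather than a confirmed diagnosis: the error text talks about matching *groups*, while the 27 counted on the regex are repeated *matches*. The pattern contains a single pair of capturing parentheses, so it defines one group no matter how many times it matches. The Python sketch below shows the two counts side by side. It assumes Java's regex engine (which Hive uses) treats this pattern the same way, and the pattern is written without the HiveQL string escaping.

```python
import re

# The pattern from the CREATE TABLE statement, with the \" escapes of the HiveQL string removed.
PATTERN = r"""("[^"]*"|'[^']*'|[\S]+)+"""

# The sample data line from the log file above.
SAMPLE = (
    '2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED '
    '"unavailable" -  407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" '
    '10.0.10.198 529 296 - "unavailable" "unavailable"'
)

compiled = re.compile(PATTERN)

# Number of separate matches found by scanning the line (what a regex tester lists).
match_count = sum(1 for _ in compiled.finditer(SAMPLE))

# Number of capture groups defined by the pattern itself.
group_count = compiled.groups

# Whether the pattern can match the entire line in one go.
whole_line = compiled.fullmatch(SAMPLE)

print(match_count)   # 27
print(group_count)   # 1
print(whole_line)    # None, because nothing in the pattern can consume the spaces
```

If RegexSerDe compares the group count with the number of columns, as the error message suggests, a pattern with 1 group cannot satisfy a table with 27 columns, regardless of how many matches a tester reports.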

I've tested the RegEx on https://regex101.com/ with positive results. The matches break down the fields as I want them:

[image: RegEx output]

I've tried swapping the Regex pattern for various other configurations, without any luck.

Any suggestions or hints about what could be going wrong here?

Thanks in advance!

I tried the regex below and can capture 26 fields; I'm not sure which field is missing:

^(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(".*")\s(\S+)\s(\S+)\s(\S+)\s(\S)\s(".*")\s(\S*)
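A quick way to see where the 27th field went is to run the pattern against the sample line and inspect the groups. The Python sketch below does that. It suggests the first group, `(\S+\s\S+)`, captures the date and the time together, which would account for the one-field shortfall. The `FIXED_27` variant, which splits that group in two, is an illustrative assumption and has not been tested in Hive.

```python
import re

# The 26-group pattern quoted above, unchanged.
PATTERN_26 = r"""^(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(".*")\s(\S+)\s(\S+)\s(\S+)\s(\S)\s(".*")\s(\S*)"""

# Hypothetical variant: capture date and time as two separate groups.
FIXED_27 = PATTERN_26.replace(r"^(\S+\s\S+)", r"^(\S+)\s(\S+)", 1)

# The sample data line from the log file above.
SAMPLE = (
    '2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed DENIED '
    '"unavailable" -  407 TCP_DENIED CONNECT - tcp psoc.ebayc3.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0" '
    '10.0.10.198 529 296 - "unavailable" "unavailable"'
)

original = re.compile(PATTERN_26)
fixed = re.compile(FIXED_27)

m26 = original.fullmatch(SAMPLE)
m27 = fixed.fullmatch(SAMPLE)

print(original.groups)   # 26
print(m26.group(1))      # 2016-08-12 00:35:31  (date and time in one group)
print(fixed.groups)      # 27
print(m27.group(1))      # 2016-08-12
print(m27.group(2))      # 00:35:31
```

Two caveats about the pattern itself: the `(\S)` group accepts only a single character, so a longer x-virus-id value would not match, and the greedy `(".*")` groups rely on backtracking to stop at the right quote. A stricter form such as `("[^"]*")` may be safer, though that is a suggestion rather than something verified against real log data.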
