Regex with double quotes in PIG

I'm writing a Pig script to process an access log from a Sophos proxy.

Each line looks like this:

2015:01:13-00:00:01 AR-BADC-FAST-01 httpproxy[27983]: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="10.20.7.210" dstip="10.24.2.7" user="" ad_domain="" statuscode="302" cached="0" profile="REF_DefaultHTTPProfile (Default Web Filter Profile)" filteraction="REF_DefaultHTTPCFFAction (Default content filter action)" size="0" request="0x9ac68d0" url="http://www.google.com" exceptions="av,auth,content,url,ssl,certcheck,certdate,mime,cache,fileextension" error="" authtime="0" dnstime="1" cattime="0" avscantime="0" fullreqtime="239428" device="0" auth="0"

I managed to do this in Java with MapReduce, using the regex \\"([^\\"]*)\\" to capture the values between the quotes and then process them. Now I want to do the same in Pig, but I can't get the regex to apply to each of the lines.
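For reference, the Java side can be sketched roughly as below. This is not the original MapReduce code, only a minimal illustration: the class name `QuotedValues`, the `extract` helper, and the shortened sample line are made up, and the pattern is written as a plain Java string literal on the assumption that it is meant to match a quote, a run of non-quote characters, and a closing quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not taken from the original job.
class QuotedValues {
    // A quote, any run of non-quote characters (captured), a closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Collects every quoted value in the line, in order of appearance.
    static List<String> extract(String line) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        // find() scans for the next partial match, so it is called in a loop.
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened, illustrative version of the log line above.
        String line = "id=\"0001\" severity=\"info\" user=\"\" url=\"http://www.google.com\"";
        System.out.println(extract(line)); // prints [0001, info, , http://www.google.com]
    }
}
```

The relevant detail is the loop over `find()`: the pattern only ever matches one small piece of the line at a time, never the line as a whole.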

I'm doing:

input = load './http.log' as (line : chararray);
splt = foreach input generate FLATTEN(REGEX_EXTRACT_ALL(line,'(\\"([^\\"]*)\\")'));
dump splt;

And the result of the dump is: ().

Is there something I'm missing in how I use REGEX_EXTRACT_ALL, or do I have to escape some characters of the regex in a different way?

Thanks!

I managed to extract the values with a different approach, since I only wanted some of the fields from each line.

To get the values, I'm doing:

splt = FOREACH A GENERATE
    FLATTEN(REGEX_EXTRACT(line,'.*url="([^"]*)".*',1)) AS url,
    FLATTEN(REGEX_EXTRACT(line,'.*fullreqtime="([^"]*)".*',1)) AS duration,
    FLATTEN(REGEX_EXTRACT(line,'.*size="([^"]*)".*',1)) AS bytes;

And then I can continue with the rest of the script.
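As a rough cross-check outside Pig, the same per-field patterns can be tried in plain Java, since Pig's regex functions take `java.util.regex` patterns. The sketch below is only an illustration: the class `FieldExtract`, the `field` helper, and the shortened sample line are made up. It also hints at a likely reason the first attempt returned nothing, assuming REGEX_EXTRACT_ALL requires the pattern to match the whole input the way `Matcher.matches()` does: the bare quote pattern matches only a fragment of the line, while the patterns wrapped in `.*` cover all of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class FieldExtract {
    // Returns the value of key="..." in the line, or null if the key is absent.
    static String field(String line, String key) {
        // Same shape as the Pig patterns: .*key="([^"]*)".*
        Pattern p = Pattern.compile(".*" + Pattern.quote(key) + "=\"([^\"]*)\".*");
        Matcher m = p.matcher(line);
        // matches() succeeds only if the pattern covers the entire line.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened, illustrative version of the log line.
        String line = "size=\"0\" url=\"http://www.google.com\" fullreqtime=\"239428\"";

        System.out.println(field(line, "url"));
        System.out.println(field(line, "fullreqtime"));
        System.out.println(field(line, "size"));

        // The bare quote pattern never covers a whole line, so a full match fails...
        System.out.println(Pattern.matches("\"([^\"]*)\"", line));
        // ...although a partial match is found without trouble.
        System.out.println(Pattern.compile("\"([^\"]*)\"").matcher(line).find());
    }
}
```

If that assumption about full-match behaviour holds for the Pig version in use, then wrapping the pattern in `.*` on both sides, as in the script above, is what makes the difference, not the escaping of the quotes.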
