
Regexp pattern matching IP and UserAgent in a Huge File

I have a huge log file with a structure like this:

ip=X.X.X.X
userAgent=Firefox
-----
Referer=hxxp://www.bla.org

I want to create custom output of the form ip:userAgent, for example:

X.X.X.X:Firefox

The pattern should ignore lines that don't start with ip= or userAgent=. (These two lines must form a pair, as shown above.)

I am a new administrator and our client needs a sorted file immediately. Any help would be wonderful. Thanks.

^ip=(\d+(?:\.\d+){3})[\r\n]+userAgent=(.+)$

Apply it in global + multiline mode.

Group 1 will contain the IP, and group 2 will contain the user agent string.
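As a minimal sketch of what "global + multiline mode" can look like in practice, here is the expression applied in Python. The choice of Python, the `sample` text, and the variable names are assumptions for illustration and are not part of the original answer. `re.MULTILINE` makes `^` and `$` match at line boundaries, and `finditer` plays the role of the global flag.

```python
import re

# Hypothetical sample mirroring the log layout from the question.
sample = (
    "ip=10.0.0.1\n"
    "userAgent=Firefox\n"
    "-----\n"
    "Referer=hxxp://www.bla.org\n"
    "ip=10.0.0.2\n"
    "userAgent=Opera\n"
)

# The expression from the answer, compiled in multiline mode.
PAIR = re.compile(r"^ip=(\d+(?:\.\d+){3})[\r\n]+userAgent=(.+)$", re.MULTILINE)

# finditer walks over every match, like a "global" flag in other engines.
pairs = [f"{m.group(1)}:{m.group(2)}" for m in PAIR.finditer(sample)]
print("\n".join(pairs))
```

Note that this approach needs the text to be searched as one string, so for a truly huge file a line-by-line approach may be more practical.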

Edit: The above expression can be simplified a bit. We can drop the IP address format check, assuming that there will be nothing but real IP addresses in the log file:

^ip=([\d.]+)[\r\n]+userAgent=(.+)$

You can use:

^ip=((?:[0-9]{1,3}\.){3}[0-9]{1,3})$

And

^userAgent=(.*)$

Take group 1 from both and you will have the desired data.
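Because these two expressions each work on a single line, they lend themselves to streaming a large file instead of loading it whole. A rough sketch in Python follows, assuming the userAgent line must come immediately after its ip line. The function name `iter_pairs` and the sample lines are made up for illustration.

```python
import re
from typing import Iterable, Iterator

IP_RE = re.compile(r"^ip=((?:[0-9]{1,3}\.){3}[0-9]{1,3})$")
UA_RE = re.compile(r"^userAgent=(.*)$")

def iter_pairs(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'ip:userAgent' for each ip line directly followed by a userAgent line."""
    pending_ip = None  # IP captured from the previous line, if any
    for raw in lines:
        line = raw.rstrip("\r\n")
        ua = UA_RE.match(line)
        if pending_ip is not None and ua:
            yield f"{pending_ip}:{ua.group(1)}"
        # Remember the IP only if this line is an ip line; otherwise reset.
        ip = IP_RE.match(line)
        pending_ip = ip.group(1) if ip else None

# Hypothetical input, including an unpaired userAgent line that is skipped.
sample = ["ip=10.0.0.1\n", "userAgent=Firefox\n", "-----\n",
          "userAgent=Orphan\n", "ip=10.0.0.2\n", "userAgent=Opera\n"]
result = list(iter_pairs(sample))
print(result)
```

Passing an open file object in place of `sample` should process the log one line at a time.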

Give this a try (it is in no way robust if your log file contains lines that differ from the example snippet above):

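sed -n -e '/^ip=/ {
# drop the "ip=" prefix (an empty pattern reuses the last regex)
s///
# append the next input line to the pattern space
N
# join the pair and print only if that line was a userAgent line
s/\nuserAgent=/:/p
}' HugeFile > customoutput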

