Regex: extracting website addresses from a log file

Question

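I need help writing a regular expression to extract all website addresses from a log file. Each line of the log file contains a bunch of information (IP address, protocol, bytes, the requested website, and so on).

Specifically, I want to pull out everything that starts with "http://" and ends with ".ENDING", where I specify ENDING = com, biz, net, tv, info. I don't care about the full URL (i.e. for http:// www.google.com/bla/page2=blablabla I only need http://www.google.com). The hardest part of this regex is that I want it to pick up domains that contain .com, .info or .biz as a subdomain (e.g. http:// www.google.com.MaliciousWebsite.com) and, in that case, catch the whole domain name instead of cutting it short at google.com.

I have never written a regular expression before, so I tried using an online cheat sheet (http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/), but I ran into trouble. This is what I have so far: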

"\A[http://]\Z[\.][com,info,biz,tv,net]"

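*Sorry about the spaces in the URLs; Stack Overflow was flagging them, and as a new user I can post at most 2 links.

Thanks for your help.

Update: based on the excellent feedback from everyone so far, I think it would be better to write this rule so that it picks up everything between (http or https) and the first character that is not valid in a URL: ?, !, @, #, $, %, ^, &, *, (, ), [, {, }, ], |, /, ', ", ;, <, >

This would make sure every TLD is caught, and that sites such as google.com.bad.website.com are caught in full. Here is my attempt so far: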

"\A[https?://]'?!(!@#$%^&*()-=[]{}|\'";,<>)"

Thanks again for all the help.

Answer 1

I don't know which regex flavor you are using, so I'll use .NET syntax. How about:

@"^https?://[^?/#\s\r]+"

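It isn't perfect, but the real specification for domain names is a beast, and http:// or https:// should be enough to tell you that a domain name is coming.

The ? and # inside the character class should be fine, but I haven't had a chance to check. You may need to escape them with \.

Also, this will capture the port number as well. If you don't want that, add : to the negated character class.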
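
As a rough illustration, here is a minimal Python sketch of the same idea. It assumes Python's `re` module rather than .NET, drops the leading `^` on the assumption that the URL sits in the middle of a log line rather than at its start, and omits `\r` because `\s` already covers it. The sample log lines and the function name `extract_hosts` are made up for the example.

```python
import re

# Same idea as the pattern above: scheme, then everything up to the first
# '?', '/', '#' or whitespace. No '^' anchor, so it can match mid-line.
WITH_PORT = re.compile(r"https?://[^?/#\s]+")

# Variant with ':' added to the negated class, which stops before a port.
WITHOUT_PORT = re.compile(r"https?://[^?/#\s:]+")

def extract_hosts(line, keep_port=True):
    """Return every scheme://host prefix found in one log line."""
    pattern = WITH_PORT if keep_port else WITHOUT_PORT
    return pattern.findall(line)

# Hypothetical log lines, for illustration only.
line1 = "10.0.0.1 TCP 512 http://www.google.com.MaliciousWebsite.com/bla/page2=x"
line2 = "10.0.0.2 TCP 128 https://example.com:8080/index.html"

print(extract_hosts(line1))  # ['http://www.google.com.MaliciousWebsite.com']
print(extract_hosts(line2))  # ['https://example.com:8080']
print(extract_hosts(line2, keep_port=False))  # ['https://example.com']
```

In Python, neither `?` nor `#` needs escaping inside a character class, and `/` needs no escaping at all because patterns are ordinary strings rather than delimited literals.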

Edit: the PCRE version should look like this:

^https?:\/\/[^?\/#\s\r]+

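I haven't used PCRE recently, though, so you may want to confirm with someone who has. I'm not sure which characters need escaping inside a character class in PCRE.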

Answer 2

You can try the following expression:

\b((?:http://)(?:.)*(?:\.)(?:com|info|biz|tv|net))

Here is an explanation :)

r"""
\b               # Assert position at a word boundary
(                # Match the regular expression below and capture its match into backreference number 1
   (?:              # Match the regular expression below
      http://          # Match the characters “http://” literally
   )
   (?:              # Match the regular expression below
      .                # Match any single character that is not a line break character
   )*               # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:              # Match the regular expression below
      \.               # Match the character “.” literally
   )
   (?:              # Match the regular expression below
                       # Match either the regular expression below (attempting the next alternative only if this one fails)
         com              # Match the characters “com” literally
      |                # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         info             # Match the characters “info” literally
      |                # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
         biz              # Match the characters “biz” literally
      |                # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
         tv               # Match the characters “tv” literally
      |                # Or match regular expression number 5 below (the entire group fails if this one fails to match)
         net              # Match the characters “net” literally
   )
)
"""
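
One thing worth checking with this expression is the greedy `.*`: when a line contains more than one URL, it runs to the last listed TLD on the line. The Python sketch below compares it with a variant that replaces `.` with `[^\s/]`. That variant is an assumption added for illustration, not part of the expression above, and the log line is invented.

```python
import re

# The expression above, with the redundant non-capturing groups removed.
GREEDY = re.compile(r"\b(http://.*\.(?:com|info|biz|tv|net))")

# Illustrative variant: the host part may not contain whitespace or '/'.
HOST_ONLY = re.compile(r"\b(http://[^\s/]*\.(?:com|info|biz|tv|net))")

# Hypothetical log line with two URLs on it.
line = "GET http://a.com/index.html referer=http://b.net/page 200"

# The greedy version spans from the first URL to the last TLD on the line.
print(GREEDY.search(line).group(1))
# http://a.com/index.html referer=http://b.net

# The restricted version stops at the end of each host.
print(HOST_ONLY.findall(line))
# ['http://a.com', 'http://b.net']
```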

Answer 3

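This captures http or https, followed by :// and a domain name that contains no spaces or slashes.
Note that each programming language has its own regex quirks. You may need to escape / as \/, or in Java you have to double every \ into \\.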

https?://[^ /]+\.(?:com|info|biz|tv|net)
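
For what it's worth, here is how that pattern might be applied over several log lines in Python. This is only a sketch: the helper name `unique_sites` and the log lines are invented. In a Python raw string the backslash is written once and `/` is not escaped.

```python
import re

# The pattern above as a Python raw string: no doubled backslashes needed.
PATTERN = re.compile(r"https?://[^ /]+\.(?:com|info|biz|tv|net)")

def unique_sites(lines):
    """Collect each matched site once, in order of first appearance."""
    seen = []
    for line in lines:
        for site in PATTERN.findall(line):
            if site not in seen:
                seen.append(site)
    return seen

# Hypothetical log lines, for illustration only.
log = [
    "1.2.3.4 TCP 512 http://www.google.com/bla/page2=blablabla",
    "1.2.3.5 TCP 128 https://www.google.com.MaliciousWebsite.com/x",
    "1.2.3.6 TCP 64 http://www.google.com/other",
]

print(unique_sites(log))
# ['http://www.google.com', 'https://www.google.com.MaliciousWebsite.com']
```

Because `[^ /]+` is greedy, the engine backtracks from the end of the host, so the last listed TLD before the first slash wins and google.com.MaliciousWebsite.com is kept whole.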

Answer 4

^http\:\/\/(.+)\.(com|info|biz|tv|net)

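This captures every http domain that ends in one of the specified TLDs, but it also captures things like http://test.commercial.ly. I didn't add a trailing slash because I'm not sure there is always one after the domain. If there always is, you can simply add / to the end of the regex. If there isn't always a trailing slash, you may get some false positives. You can also add https support if you need it. Are you sure you want to specify the TLDs, or do you want to capture any TLD?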
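
The false positive mentioned above can be reproduced in Python, and one possible way around it, without requiring a trailing slash, is a lookahead that insists the TLD is followed by a character that ends the host (or by the end of the line). Both patterns below are illustrative variants rather than the exact expression above: they drop `^` and restrict the host characters, and the names `LOOSE` and `STRICT` are made up.

```python
import re

TLDS = r"(?:com|info|biz|tv|net)"

# No check on what follows the TLD, so ".com" inside ".commercial" matches.
LOOSE = re.compile(r"http://[^\s/]+\." + TLDS)

# The lookahead requires '/', ':', '?', '#', whitespace or end of string
# right after the TLD, so the match must end where the host ends.
STRICT = re.compile(r"http://[^\s/:?#]+\." + TLDS + r"(?=[/:?#\s]|$)")

# Hypothetical inputs, for illustration only.
tricky = "GET http://test.commercial.ly/page 200"
wanted = "GET http://www.google.com.MaliciousWebsite.com/bla 200"
bare = "GET http://example.net"

print(LOOSE.findall(tricky))   # ['http://test.com']
print(STRICT.findall(tricky))  # []
print(STRICT.findall(wanted))  # ['http://www.google.com.MaliciousWebsite.com']
print(STRICT.findall(bare))    # ['http://example.net']
```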

Answer 5

\\A[http://]\\Z[\\.][.*][com,info,biz,tv,net]?![\\.]

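I don't know which flavor of regex you are using, but it seems the point is to find addresses that contain ".com", ".net", etc. AND a "/", or more specifically: addresses that end with .com where the .com is not followed by another '.'.

So .com.com would not be valid, but .com/ or .com would be.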

Answer 6

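Well, hello user662772:

OK, I don't mean to be flippant, but have you considered using awk? It splits each line of the log file into fields, and you can then simply print the field you want. As a bonus, awk also does regex pattern matching and substitution.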
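
The same field-splitting idea can be sketched in Python rather than awk. This assumes the log fields are separated by whitespace and that the URL is a field of its own. The function name `hosts_from_fields` and the sample line are hypothetical. `urlsplit` comes from the standard library's `urllib.parse`.

```python
from urllib.parse import urlsplit

def hosts_from_fields(line):
    """Split a log line on whitespace and keep scheme://host of URL fields."""
    hosts = []
    for field in line.split():
        if field.startswith(("http://", "https://")):
            parts = urlsplit(field)
            # netloc keeps the host as written, including any port.
            hosts.append(f"{parts.scheme}://{parts.netloc}")
    return hosts

# Hypothetical log line, for illustration only.
line = "10.0.0.1 TCP 512 http://www.google.com.MaliciousWebsite.com/bla/page2=x"
print(hosts_from_fields(line))
# ['http://www.google.com.MaliciousWebsite.com']
```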

But you asked about regular expressions:

I'm using Perl regular expressions:

http.*(\\.com|\\.org|\\.net)

I had to double-escape the backslashes.

6 answers

Answer 1: 0 2011-03-16 16:12:35
Answer 2: 0 2011-03-16 16:15:32
Answer 3: 0 2011-03-16 16:16:19
Answer 4: 0 2011-03-16 16:17:15
Answer 5: 0 2011-03-16 16:35:05
Answer 6: 0 2011-03-16 16:50:52