Regex: extracting website addresses from a log file

Question

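I need help writing a regex query that extracts all website addresses from a log file. Each line of the log file contains a bunch of information (IP address, protocol, bytes, requested website, and so on).

Specifically, I want to pull out everything that starts with "http://" and ends with ".ENDING", where I specify ENDING = com, biz, net, tv, info. I don't care about the full URL (i.e. for http:// www.google.com/bla/page2=blablabla, just http://www.google.com is enough). The hardest part of this regex query is that I want it to pick up domains that contain .com or .info or .biz as a subdomain (e.g. http:// www.google.com.MaliciousWebsite.com) and, in that case, catch the whole domain name rather than cutting it short at google.com.

I have never written a regex query before, so I tried using an online reference chart (http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/), but I ran into trouble. This is what I have so far: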

"\A[http://]\Z[\.][com,info,biz,tv,net]"

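*Sorry about the spaces in the URLs; stackoverflow was flagging them, and as a new user I can post at most 2.

Thanks for your help.

Update: Based on the excellent feedback from everyone so far, I think it would be better to write this rule so that it picks up everything between (http or https) and (a character that is not valid in a URL: ?, !, @, #, $, %, ^, &, *, (, ), [, {, }, ], |, /, ', ", ;, <, >).

That would make sure all TLDs are caught, and that sites such as google.com.bad.website.com are caught too. Here is my attempt so far: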

"\A[https?://]'?!(!@#$%^&*()-=[]{}|\'";,<>)"

Thanks again for all the help.

Answer 1

I don't know which regex flavor you're using, so I'll use .NET syntax. How about:

@"^https?://[^?/#\s\r]+"

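It isn't perfect, but the real specification for domain names is a beast, and http:// or https:// should be enough to tell you that a domain name is coming.

The ? and # inside the character class should be fine, but I haven't had a chance to check. You may need to escape them with \\.

Also, this will capture the port number as well. If you don't want that, add : to the negated character class.

Edit: the PCRE version should look like this: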

^https?:\/\/[^?\/#\s\r]+

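I haven't used PCRE recently, though, so you may want to confirm this with someone who has. I'm not sure which characters need to be escaped inside a character class in PCRE.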
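As a rough illustration of how this pattern behaves on log lines, here is a minimal Python sketch. It assumes the URL is not the first field on the line, so the leading ^ is dropped; the sample lines and the names HOST_RE and extract_hosts are made up for this example.

```python
import re

# The pattern from this answer, minus the leading ^ (assumption: in a log
# line the URL usually sits somewhere in the middle, not at the start).
HOST_RE = re.compile(r"https?://[^?/#\s]+")

def extract_hosts(lines):
    """Return the scheme-plus-host part of every URL found in the lines."""
    hosts = []
    for line in lines:
        hosts.extend(HOST_RE.findall(line))
    return hosts

# Hypothetical log lines, invented for illustration only.
sample = [
    "10.0.0.1 TCP 512 http://www.google.com/bla/page2=blablabla",
    "10.0.0.2 TCP 204 http://www.google.com.MaliciousWebsite.com/index.html",
    "10.0.0.3 TCP 99 https://example.net:8080/path?q=1",
]

for host in extract_hosts(sample):
    print(host)
```

The third sample line shows the port number being captured along with the host, as noted above.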

Answer 2

You can try this expression:

\b((?:http://)(?:.)*(?:\.)(?:com|info|biz|tv|net))

You can see the explanation here :)

r"""
\b               # Assert position at a word boundary
(                # Match the regular expression below and capture its match into backreference number 1
   (?:              # Match the regular expression below
      http://          # Match the characters “http://” literally
   )
   (?:              # Match the regular expression below
      .                # Match any single character that is not a line break character
   )*               # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:              # Match the regular expression below
      \.               # Match the character “.” literally
   )
   (?:              # Match the regular expression below
                       # Match either the regular expression below (attempting the next alternative only if this one fails)
         com              # Match the characters “com” literally
      |                # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         info             # Match the characters “info” literally
      |                # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
         biz              # Match the characters “biz” literally
      |                # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
         tv               # Match the characters “tv” literally
      |                # Or match regular expression number 5 below (the entire group fails if this one fails to match)
         net              # Match the characters “net” literally
   )
)
"""

Answer 3

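This will capture http or https, followed by :// and a domain name that contains no spaces or slashes.
Note that the various programming languages each have their own regex quirks. You may need to escape / as \\/, or in Java you have to escape \\ as \\\\.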

https?://[^ /]+\.(?:com|info|biz|tv|net)
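A quick way to check this expression against the nested-domain case from the question is a short Python sketch like the one below. The log line and the names URL_RE and line are invented for illustration; Python raw strings are used so that no extra backslash escaping is needed.

```python
import re

# The expression from this answer. In a Python raw string the backslash
# needs no doubling; r"\." and "\\." are the same two characters.
URL_RE = re.compile(r"https?://[^ /]+\.(?:com|info|biz|tv|net)")

# Hypothetical log line, invented for illustration only.
line = "10.0.0.2 TCP 204 http://www.google.com.MaliciousWebsite.com/bla/page2=x"

match = URL_RE.search(line)
found = match.group(0) if match else None
print(found)
```

Because [^ /]+ is greedy and cannot cross a slash, the match extends to the last listed TLD inside the host and stops before the path.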

Answer 4

^http\:\/\/(.+)\.(com|info|biz|tv|net)

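This will catch all http domains that end with the TLDs you specify, but it also catches things like http://test.commercial.ly. I did not add a trailing slash because I'm not sure whether there is always a trailing slash after the domain. If there always is, you can simply add / to the end of the regex; if there isn't always one, the expression as written may give you some false positives. You can also add https support if you need it. Are you sure you want to specify the TLDs? Or do you want to pick up any TLD?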
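The false positive mentioned here, and the effect of appending a slash, can be reproduced with a small Python sketch. The URL and the names TLD_RE and TLD_SLASH_RE are made up for this example, and the second pattern is only one reading of "add / to the end of the regex".

```python
import re

# The expression from this answer, unchanged.
TLD_RE = re.compile(r"^http\:\/\/(.+)\.(com|info|biz|tv|net)")

# The same expression with a trailing slash appended (assumed variant).
TLD_SLASH_RE = re.compile(r"^http\:\/\/(.+)\.(com|info|biz|tv|net)/")

# Hypothetical URL, invented for illustration only.
url = "http://test.commercial.ly/page"

# Without the slash, ".com" is found inside ".commercial".
loose = TLD_RE.search(url)
print(loose.group(0) if loose else None)

# With the slash, ".com" must be followed by "/", so nothing matches.
strict = TLD_SLASH_RE.search(url)
print(strict.group(0) if strict else None)
```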

Answer 5

\\A[http://]\\Z[\\.][.*][com,info,biz,tv,net]?![\\.]

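I'm not sure which flavor of regex you're using, but it looks like you are trying to find the part of the address that contains ".com, net, etc." AND "/", or, more specifically, perhaps: ends with .com and is not followed by another '.'.

So .com.com would not be valid, but .com/ or .com would be valid.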
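The rule described here (a TLD that is not followed by another '.') is usually written with a negative lookahead. The Python sketch below is one possible reading, not the expression from this answer: it assumes https should be allowed too, and it widens the lookahead from '.' alone to any hostname character so that ".com" inside ".commercial" is rejected as well. The name BOUNDED_RE and the helper host_of are hypothetical.

```python
import re

# (?![\w.-]) : after the TLD, the next character must not be a letter,
# digit, underscore, dot or hyphen (assumed definition of "host continues").
BOUNDED_RE = re.compile(
    r"https?://[^\s/]+\.(?:com|info|biz|tv|net)(?![\w.-])"
)

def host_of(text):
    """Return the matched scheme-plus-host, or None if nothing qualifies."""
    match = BOUNDED_RE.search(text)
    return match.group(0) if match else None

# Hypothetical inputs, invented for illustration only.
print(host_of("http://www.google.com.MaliciousWebsite.com/x"))
print(host_of("http://test.commercial.ly/"))
print(host_of("http://www.google.com"))
```

With this variant the intermediate ".com" in the first input is skipped because a '.' follows it, and the match ends at the final ".com" before the slash.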

Answer 6

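Um, hello user662772:

Well, not to dodge the question, but have you considered using awk? It splits the log file into fields, and then you can simply print the fields you want. Bonus: awk also does regex pattern matching and substitution.

But you were asking about regex:

I'm using Perl regex:

http.*(\\.com|\\.org|\\.net)

By the way, I had to double-escape the backslashes.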
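The field-splitting idea can be sketched without awk as well. The Python example below is an illustration under assumptions: fields are taken to be whitespace-separated, the URL field is assumed to start with http:// or https://, and the function name hosts_from_fields and the sample line are invented. It uses the standard library's urllib.parse.urlsplit, whose hostname attribute is lowercased and excludes the port.

```python
from urllib.parse import urlsplit

def hosts_from_fields(line):
    """Split a log line on whitespace and return the hostnames of URL fields."""
    hosts = []
    for field in line.split():
        # Assumption: the requested website appears as a full URL field.
        if field.startswith(("http://", "https://")):
            hosts.append(urlsplit(field).hostname)
    return hosts

# Hypothetical log line, invented for illustration only.
line = "10.0.0.2 TCP 204 http://www.google.com.MaliciousWebsite.com/bla"
print(hosts_from_fields(line))
```

Since the whole host is taken from the parsed URL, no TLD list is needed and nested names such as the one above are returned in full, in lowercase.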
