
regex - extract website address from log file

I am in need of assistance in writing a regex query to extract all the website addresses in a log file. Each line of the log file contains a bunch of info (IP address, protocol, bytes, requested website, etc...).

Specifically, I would like to pull out anything that starts with "http://" and ends in a specific ".ENDING", where I specify ENDING = com, biz, net, tv, info. I do not care about the full URL (i.e. for http : // www.google.com/bla/page2=blablabla I only want http://www.google.com ). The harder part of this regex query is that I want it to pick up on domains that contain .com, .info or .biz as a subdomain (e.g. http : // www.google.com.MaliciousWebsite.com). Is there any way to catch the full domain instead of chopping it short at google.com in this situation?

I have never written a regex query before so I have tried to use an online reference chart (http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/) but am struggling. Here is what I have so far:

"\A[http://]\Z[\.][com,info,biz,tv,net]"

*sorry for the spacing in the URLs but stackoverflow is flagging them and I can only post a max of 2 since I am new.

Thank you for the help.

UPDATED : Based on the excellent feedback from everyone so far I think it would be better to write this rule so that it picks up on everything between (http OR https) and (non-valid URL character: ?,!,@,#,$,%,^,&,*,(,),[,{,},],|,/,',",;,<,>)

This will ensure that all TLDs are grabbed and that websites such as google.com.bad.website.com are also grabbed. Here is my mockup so far:

"\A[https?://]'?!(!@#$%^&*()-=[]{}|\'";,<>)"

Thanks again for all the help.

Not sure what regex language you're using, so I'll go with .NET syntax. How about:

@"^https?://[^?/#\s\r]+"

It's not perfect, but the real spec for domain names is a beast, and the presence of http:// or https:// should be enough to tell you there's a domain name on the way.

The ? and # inside the character class should be fine, but I haven't had a chance to check it. You might need to escape them with a \\ .

Also, this will capture port numbers as well. If you don't want that, add : to the negated character class.
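
As a rough illustration of the negated character class above, here is a minimal Python sketch (Python is just my choice for the example; the sample log line and the names `HOST_WITH_PORT`, `HOST_ONLY` and `sample_line` are made up). It assumes the URL can sit in the middle of a log line, so the leading ^ anchor is left out.

```python
import re

# Same idea as the pattern above, minus the ^ anchor (assumption: the URL
# appears mid-line in the log). \s already covers \r, so \r is omitted.
HOST_WITH_PORT = re.compile(r"https?://[^?/#\s]+")

# Adding ':' to the negated class stops the match before a port number.
HOST_ONLY = re.compile(r"https?://[^?/#:\s]+")

# Hypothetical log line; the real log format is not known.
sample_line = (
    "192.168.1.5 TCP 5120 "
    "http://www.google.com.MaliciousWebsite.com:8080/bla?page=2"
)

with_port = HOST_WITH_PORT.findall(sample_line)
host_only = HOST_ONLY.findall(sample_line)

print(with_port)  # scheme, host and port
print(host_only)  # scheme and host only
```

Because the class simply runs until the first ?, /, # or whitespace, it keeps the whole host, including names like google.com.MaliciousWebsite.com, without needing a list of TLDs.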


Edit: The PCRE version should be something like this:

^https?:\/\/[^?\/#\s\r]+

I haven't used PCRE recently, though, so you might want to check that with someone who has. I'm not sure which characters need to be escaped inside a character class in PCRE.

You can try this expression:

\b((?:http://)(?:.)*(?:\.)(?:com|info|biz|tv|net))

and you can take a look at the description here :)

r"""
\b               # Assert position at a word boundary
(                # Match the regular expression below and capture its match into backreference number 1
   (?:              # Match the regular expression below
      http://          # Match the characters “http://” literally
   )
   (?:              # Match the regular expression below
      .                # Match any single character that is not a line break character
   )*               # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:              # Match the regular expression below
      \.               # Match the character “.” literally
   )
   (?:              # Match the regular expression below
                       # Match either the regular expression below (attempting the next alternative only if this one fails)
         com              # Match the characters “com” literally
      |                # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         info             # Match the characters “info” literally
      |                # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
         biz              # Match the characters “biz” literally
      |                # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
         tv               # Match the characters “tv” literally
      |                # Or match regular expression number 5 below (the entire group fails if this one fails to match)
         net              # Match the characters “net” literally
   )
)
"""

This will catch http or https followed by :// and a domain name not containing a space or slash.
Note that there are several flavors of regex for various programming languages. You may need to escape the / by \\/ or in Java you have to double \\ by \\\\

https?://[^ /]+\.(?:com|info|biz|tv|net)

^http\:\/\/(.+)\.(com|info|biz|tv|net)

will catch all domains in the http realm ending in the specified TLD, but also everything like http://test.commercial.ly as well. I didn't add an ending slash since I'm not sure if you will always have an ending slash on the domain, but if you do always have one, you can simply add a / to the end of the regex. If you don't always have an ending slash, that could give you some false positives. You could also add https support if you wanted. Are you sure you want to specify the TLDs, or would you want to grab any TLD?
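
To make the trade-off concrete, here is a small Python sketch of both variants (the helper `matched_text` and the test URLs are only for illustration; the \: and \/ escapes are dropped because Python does not need them):

```python
import re

# The pattern as given, and the variant with a trailing slash added.
NO_SLASH = re.compile(r"^http://(.+)\.(com|info|biz|tv|net)")
WITH_SLASH = re.compile(r"^http://(.+)\.(com|info|biz|tv|net)/")


def matched_text(pattern, url):
    """Return the text the pattern matches at the start of url, or None."""
    m = pattern.match(url)
    return m.group(0) if m else None


# ".com" is found inside ".commercial", so this is a false positive.
print(matched_text(NO_SLASH, "http://test.commercial.ly"))

# Requiring the slash rejects that URL...
print(matched_text(WITH_SLASH, "http://test.commercial.ly"))

# ...but it also rejects a plain domain that has no trailing slash.
print(matched_text(WITH_SLASH, "http://www.google.com"))
```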

\\A[http://]\\Z[\\.][.*][com,info,biz,tv,net]?![\\.]

Not sure what type of regex you're using, but it would seem that you're trying to find the point of an address that includes BOTH ".com, net, etc." AND "/", or more specifically: one that ends in .com and does NOT precede another '.'

So .com.com isn't valid, but .com/, or .com would be
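
The "not followed by another '.'" idea is usually written as a negative lookahead. A hedged Python sketch of how that might look; the pattern, the name `find_domains` and the log lines are my own assumptions, and the lookahead also rejects a following letter, digit or hyphen so that ".com" inside ".commercial" does not count:

```python
import re

# Lazy host match, then a listed TLD that is NOT followed by '.', a word
# character or '-'. A TLD in the middle of the host is therefore skipped
# and the match keeps going until the real end of the domain.
TLD_NOT_FOLLOWED = re.compile(
    r"https?://[^\s/]+?\.(?:com|info|biz|tv|net)(?![.\w-])"
)


def find_domains(line):
    """Return every scheme+host in line that ends in one of the listed TLDs."""
    return TLD_NOT_FOLLOWED.findall(line)


# Hypothetical log lines.
print(find_domains("GET http://www.google.com.MaliciousWebsite.com/login 200"))
print(find_domains("GET http://test.commercial.ly/ 200"))
```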

Umm hello user662772:

Okay, I'm not trying to be snarky, but have you considered using awk? It will split your log file up into fields and then you can simply print the field you are after. Bonus: awk also does regular expression pattern matching and substitution.

But you were asking about regexes:
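
For comparison, the same field-splitting idea can be sketched in Python. This assumes whitespace-separated fields with the requested URL in the fourth one (what awk would call $4); the real log layout is not shown in the question, so the index and the sample line are placeholders:

```python
def extract_field(line, index):
    """Split a log line on whitespace and return one field, or None if absent."""
    fields = line.split()
    return fields[index] if index < len(fields) else None


# Hypothetical log line: IP address, protocol, bytes, requested URL.
sample_line = "192.168.1.5 TCP 5120 http://www.google.com/bla"

# Index 3 is the fourth field (awk counts from 1, Python from 0).
requested = extract_field(sample_line, 3)
print(requested)
```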

I'm using Perl's regular expressions:

http.*(\\.com|\\.org|\\.net)

Whoops, I had to double escape the backslashes.


 