
How to improve this regular expression to work in other situations?

I can split this string:

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

with this RegEx:

'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)'

How can I improve this regex so that it also splits this kind of string (where there is a host name instead of an IP address):

unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985

and this kind of string (where there is a double quotation mark inside the quoted request and the last number is missing):

frank.mtsu.edu - - [03/Jul/1995:02:41:15 -0400] "GET /images/" HTTP/1.0" 404 -

Thanks!

The | operator, which means "or", is useful in this situation. For your second example you might modify your expression to:

'([(\d\.)]+|[a-z\d\.]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)'

Note that this assumes that all addresses consist solely of lowercase letters, digits and dots. EDIT: After @tripleee's comment I must admit that addresses might contain other characters, so here is a more tolerant solution:

'([(\d\.)]+|[^ ]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)'

This one assumes that an address may contain any character that is not a space. If this is too tolerant, feel free to refine the earlier version. As noted in the comments, the first alternative is now redundant, so the pattern can be replaced with:

'([^ ]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)'

To make it work with the last case, just replace the last (\d+) with (\d+|-), as suggested by @solarc earlier.
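As a quick check, here is a minimal Python sketch that applies the final pattern (with the last group relaxed to accept "-") to the three lines from the question. The names LOG_PATTERN and parse_line are only illustrative, and using fullmatch is my own choice so that trailing text is not silently ignored.

```python
import re

# Final pattern from this answer; the last group accepts digits or a lone "-".
LOG_PATTERN = re.compile(r'([^ ]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+|-)')

SAMPLE_LINES = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
    'frank.mtsu.edu - - [03/Jul/1995:02:41:15 -0400] "GET /images/" HTTP/1.0" 404 -',
]


def parse_line(line):
    """Return (host, timestamp, request, status, size), or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    return match.groups() if match else None


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_line(line))
```

The lazy "(.*?)" keeps extending until a quote followed by " <status> <size>" is found, which is why the stray quotation mark in the third line ends up inside the request group instead of breaking the match.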

I don't know exactly what it is you are trying to do but your regex is not very specific as it stands. Below is a suggested solution of what could be an improvement. It looks complicated but it isn't really too bad once broken down.

^(\b(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b|\w+\.\w+\.(?:net|com|gov|edu))\s-\s-\s(\[[0-9]{2}\/\w{3}\/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}\s-[0-9]{4}\])\s(\"[^\"]+\")\s(.*)$

Check out https://regex101.com/r/ojIGIA/3 to see it in action and for explanations read the right hand side bar.

Edit: I realised I had missed out a ? in the IP address part of the regex. I also forgot to escape a " since I hadn't taken the Python flavour into account. I have fixed and updated the regex and the link.

Now that I have a little more time, I'll explain a bit further what I've done. The above regex can be split up as follows.

^ Start of line

( start capture group 1

\b(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b This captures the IP address. Depending on how precise you want to be, you could get away with something like \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} if you aren't too worried about it. That will match all valid IP addresses too, but it will also match some invalid ones.

| OR operator

\w+\.\w+\.(?:net|com|gov|edu) This is a very basic example of what a host name capture could look like.

) End capture group 1

\s-\s-\s Matches your " - - " separator

(\[[0-9]{2}\/\w{3}\/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}\s-[0-9]{4}\]) This is my suggestion for capturing the date and the other items in the middle. It will need tweaking depending on exactly what you want. This is also capture group 2.

\s A space

(\"[^\"]+\") Matches everything within inverted commas at this point in the match. Capture group 3.

\s A space

(.*) Matches everything else up until the end and puts in capture group 4.

$ End of line
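The breakdown above can be mirrored in Python by building the pattern from named pieces. This is only a sketch: the constant names (OCTET, IP, HOST, TIMESTAMP, REQUEST) and the helper split_line are my own, and I have left / and " unescaped because Python's re module does not require escaping them.

```python
import re

# One octet of a dotted-quad address, 0-255.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
IP = r'\b' + r'\.'.join([OCTET] * 4) + r'\b'
# Very basic host name: two labels followed by one of a few top-level domains.
HOST = r'\w+\.\w+\.(?:net|com|gov|edu)'
TIMESTAMP = r'\[[0-9]{2}/\w{3}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}\s-[0-9]{4}\]'
REQUEST = r'"[^"]+"'

PATTERN = re.compile(''.join([
    r'^',
    r'(' + IP + r'|' + HOST + r')',  # group 1: IP address or host name
    r'\s-\s-\s',
    r'(' + TIMESTAMP + r')',         # group 2: bracketed timestamp
    r'\s',
    r'(' + REQUEST + r')',           # group 3: first quoted section
    r'\s',
    r'(.*)',                         # group 4: everything else
    r'$',
]))


def split_line(line):
    """Return the four captured groups, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None


if __name__ == "__main__":
    print(split_line('frank.mtsu.edu - - [03/Jul/1995:02:41:15 -0400] "GET /images/" HTTP/1.0" 404 -'))
```

Note that with the line containing the stray quotation mark, group 3 stops at the first closing quote, so the remainder of the request ends up in group 4 together with the status code.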

Now these are all just suggestions since I don't know what exactly you are trying to do but hopefully this helps out and gives you some ideas.
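On the IP address trade-off mentioned in the breakdown, here is a small sketch comparing the strict octet pattern with the looser \d{1,3} version. The helper names is_strict_ip and is_loose_ip are made up for illustration, and fullmatch is used so that each test string must be an address and nothing else.

```python
import re

# Strict: each octet must be in the range 0-255.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
STRICT_IP = re.compile(r'\.'.join([OCTET] * 4))

# Loose: any four groups of one to three digits.
LOOSE_IP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')


def is_strict_ip(text):
    """True if text is a dotted quad with every octet in 0-255."""
    return STRICT_IP.fullmatch(text) is not None


def is_loose_ip(text):
    """True if text merely looks like four dot-separated digit groups."""
    return LOOSE_IP.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ['199.72.81.55', '256.1.1.1', '999.999.999.999']:
        print(candidate, is_strict_ip(candidate), is_loose_ip(candidate))
```

For log files written by a web server the loose version is probably enough, since the server is unlikely to record an out-of-range address in the first place.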

One note: I use \s instead of a literal space. There is nothing wrong with using a space; I personally find \s easier to read.
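One small difference worth knowing: \s matches any whitespace character (tabs included), while a literal space matches only a space. A minimal sketch, with pattern names that are purely illustrative and a tab-separated line that I made up for the comparison:

```python
import re

# Separator written with literal spaces versus the \s whitespace class.
LITERAL_SPACE = re.compile(r'^(\S+) - - ')
WHITESPACE_CLASS = re.compile(r'^(\S+)\s-\s-\s')

space_line = 'frank.mtsu.edu - - [03/Jul/1995:02:41:15 -0400]'
# Hypothetical variant of the same line that uses tabs as separators.
tab_line = 'frank.mtsu.edu\t-\t-\t[03/Jul/1995:02:41:15 -0400]'

if __name__ == "__main__":
    for line in (space_line, tab_line):
        print(bool(LITERAL_SPACE.match(line)), bool(WHITESPACE_CLASS.match(line)))
```

For the log lines in the question, which only use single spaces, both forms behave the same.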
