
Python re.findall regex problems

I'm trying to find some very specific data in a string. The problem is I'm not finding all of the data with the current regex I'm using. Here is some sample data:

[img:2gcfa9cc]http://img823.imageshack.us/img823/3295/pokaijumonlogo.jpg[/img:2gcfa9cc]

Making these little guys into Kaiju monsters.  Again, I know nothing about them, other then which ones I thought would make for cool possible Kaiju (of the original 150) so here's Day 01

[b:2gcfa9cc][size=150:2gcfa9cc]BULBASAUR[/size:2gcfa9cc][/b:2gcfa9cc]
[i:2gcfa9cc]Feb 01[/i:2gcfa9cc]
[ddf2k12:2gcfa9cc]http://img853.imageshack.us/img853/2185/dailydrawfeb2012day01.jpg[/ddf2k12:2gcfa9cc]

Setting myself up with the same "parameters" as last year

I may be breaking my own Challenge rules right now but...well I started this last night and I couldn't just leave 'em out in the cold all unfinished 'n' shit.  

Obligatory Skyrim drawing.

[ddf2k12:2ytorpmj]http://4.bp.blogspot.com/-UIUSNXvnHz4/TynYf1BZ9oI/AAAAAAAAAl4/pRLHVP0Ny3U/s1600/01_cheatingcheaterwarmup1.jpg[/ddf2k12:2ytorpmj]

What I'm trying to get is the data between the ddf2k12 tags and the img tags. I've only worked on the ddf2k12 tags thus far (I figure the latter will be the former with img instead of ddf2k12) and out of the 1586 tags I should have found, I'm only getting 5. Here's my regex:

ddf2k12_regex = '(\[[ddf2k12]+\:[A-Za-z0-9]+\])(.*?)(\[[ddf2k12]+\:[A-Za-z0-9]+\])'
ddf2k12_find = re.findall(ddf2k12_regex, post)

Obviously there's something wrong with my regex, but after banging my head against a wall I can't sort it out, so any help is appreciated. Thanks.

You will do yourself a big favor by breaking that big regex into parts and composing them. This seems to work correctly, and it is easier to debug.

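import re

# Opening tag, e.g. [ddf2k12:2gcfa9cc]; {tagname} is filled in by str.format below.
start_tag = r'(\[{tagname}:[^\]]+\])'
# The closing tag is the same pattern with a slash after the first bracket.
end_tag = start_tag.replace(r'\[', r'\[/', 1)
content = r'((?:.|\n)*?)'  # The ?: indicates a non-capturing group.
tag = start_tag + content + end_tag

ddf_tag = tag.format(tagname='ddf2k12')

for match in re.findall(ddf_tag, post):
    print(match)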
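
As a self-contained illustration of the same composition idea, here is a minimal sketch that wraps the pieces in a small helper and runs it against a shortened sample. The helper name build_tag_pattern, the example.com URLs, and the sample string are assumptions made for this example, not part of the original post.

```python
import re

def build_tag_pattern(tagname):
    """Compose a regex for [tagname:id]content[/tagname:id] (hypothetical helper)."""
    name = re.escape(tagname)             # treat the tag name as literal text
    start_tag = r'\[' + name + r':[^\]]+\]'
    end_tag = r'\[/' + name + r':[^\]]+\]'
    content = r'(.*?)'                    # lazy, so it stops at the first closing tag
    # re.DOTALL lets "." match newlines, in case the content spans lines.
    return re.compile(start_tag + content + end_tag, re.DOTALL)

# Shortened stand-in for the forum post (assumed data for illustration).
sample = (
    "[img:2gcfa9cc]http://example.com/logo.jpg[/img:2gcfa9cc]\n"
    "[b:2gcfa9cc]BULBASAUR[/b:2gcfa9cc]\n"
    "[ddf2k12:2gcfa9cc]http://example.com/day01.jpg[/ddf2k12:2gcfa9cc]\n"
    "[ddf2k12:2ytorpmj]http://example.com/warmup.jpg[/ddf2k12:2ytorpmj]\n"
)

ddf_urls = build_tag_pattern('ddf2k12').findall(sample)
img_urls = build_tag_pattern('img').findall(sample)
print(ddf_urls)
print(img_urls)
```

Because only the content is wrapped in a capturing group, findall returns a flat list of strings, one per tag, rather than tuples.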

Two things. First, you're missing the / in the closing ddf2k12 tag.

>>> ddf2k12_regex = '(\[[ddf2k12]+\:[A-Za-z0-9]+\])(.*?)(\[/[ddf2k12]+\:[A-Za-z0-9]+\])'
>>> re.findall(ddf2k12_regex, post)
[('[ddf2k12:2gcfa9cc]', 'http://img853.imageshack.us/img853/2185/dailydrawfeb2012day01.jpg', '[/ddf2k12:2gcfa9cc]')]

So now it works. But you're putting the ddf2k12 characters in square brackets, which turns them into a character class: it will match any tag whose name consists only of the characters 1, 2, d, f, or k.

>>> silly_s = '[dddd:a]a[/ffff:a]'
>>> re.findall(ddf2k12_regex, silly_s)
[('[dddd:a]', 'a', '[/ffff:a]')]

So you need to match the exact tag name instead; to do so, remove the square brackets around ddf2k12, along with the + that follows them:

>>> ddf2k12_regex = '(\[ddf2k12\:[A-Za-z0-9]+\])(.*?)(\[/ddf2k12\:[A-Za-z0-9]+\])'
>>> re.findall(ddf2k12_regex, post)
[('[ddf2k12:2gcfa9cc]', 'http://img853.imageshack.us/img853/2185/dailydrawfeb2012day01.jpg', '[/ddf2k12:2gcfa9cc]')]
>>> re.findall(ddf2k12_regex, silly_s)
[]

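This worked for me:

post = "[the data you want to be searched for using regex]"
# Inside a character class "." is a literal dot, so [\n.] would match only
# newlines and dots; (?:.|\n) matches any character, including newlines.
ddf2k12_regex = re.compile(r"\[ddf2k12:\w+\](?P<data>(?:.|\n)*?)\[/ddf2k12")
ddf2k12_find = ddf2k12_regex.findall(post)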

The problem is that you are using a character set where you shouldn't. Try the following regex instead:

pattern = r'\[ddf2k12:\w+?\](.*?)\[/ddf2k12:\w+?\]'

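\w is equivalent to [a-zA-Z0-9_] when matching ASCII only; in Python 3, \w in a str pattern also matches Unicode word characters unless re.ASCII is set.

Note that the semantics of \w and of the dot, as in (.*?), can be changed by using the DOTALL, LOCALE and UNICODE flags, or by adding (?s), (?L) or (?u) to the regex.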
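
To make the effect of DOTALL concrete, here is a minimal sketch using the pattern above. The two-line sample string is an assumption for illustration; the real post may or may not contain tags whose content spans lines.

```python
import re

# Assumed sample: tag content that spans two lines.
text = "[ddf2k12:abc]line one\nline two[/ddf2k12:abc]"
pattern = r'\[ddf2k12:\w+\](.*?)\[/ddf2k12:\w+\]'

without_flag = re.findall(pattern, text)            # "." does not match "\n"
with_flag = re.findall(pattern, text, re.DOTALL)    # "." also matches "\n"
inline_flag = re.findall('(?s)' + pattern, text)    # inline form of re.DOTALL

print(without_flag)
print(with_flag)
print(inline_flag)
```

Without the flag the pattern cannot cross the newline, so a tag whose content contains a line break is silently skipped.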

Grouping text is done with (sometext), not [sometext]; square brackets define a character class. Since the ddf2k12 tag name appears exactly once inside your [...], drop the + as well, and then you no longer need a (...) around it either.

\[ddf2k12:[a-zA-Z0-9]+\](.*?)\[/ddf2k12:[a-zA-Z0-9]+\]

This does the job. Note that the return value is the text matched by (.*?). If you also want the tag name, wrap ddf2k12 in (...). The combined version that also handles your img tag would then look like this:

\[(ddf2k12|img):[a-zA-Z0-9]+\](.*?)\[/(ddf2k12|img):[a-zA-Z0-9]+\]
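
One caveat with the combined pattern is that the opening and closing tag names are matched independently, so [img:x]...[/ddf2k12:x] would be accepted. A possible variation, sketched below, uses named groups with backreferences so the closing tag must repeat the opening tag's name and id. The group names, the example.com URLs, and the sample string are assumptions for this example.

```python
import re

# (?P<tag>...) captures the tag name and (?P<id>...) the id; (?P=tag) and
# (?P=id) require the closing tag to repeat exactly what the opening tag used.
tag_regex = re.compile(
    r'\[(?P<tag>ddf2k12|img):(?P<id>[a-zA-Z0-9]+)\]'
    r'(?P<data>.*?)'
    r'\[/(?P=tag):(?P=id)\]',
    re.DOTALL,
)

# Assumed sample data; the last line has mismatched open and close tags.
sample = (
    "[img:2gcfa9cc]http://example.com/logo.jpg[/img:2gcfa9cc]\n"
    "[ddf2k12:2ytorpmj]http://example.com/warmup.jpg[/ddf2k12:2ytorpmj]\n"
    "[img:aaaa]mismatched[/ddf2k12:aaaa]\n"
)

found = [(m.group('tag'), m.group('data')) for m in tag_regex.finditer(sample)]
print(found)
```

Using finditer with named groups keeps the tag name next to its content, which makes it easy to split the results by tag type afterwards.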
