
How can I find all repeats of a pattern and capture each sub-pattern in Python?

I am trying to grab some data from a webpage. Some of the lines look like the following:

<td><a href="some_web_site">Mr. Google</a></td>
<td>12.42%</td>
<td>1360</td>
<td><span style="color: #E3170D">49.12%</span></td>
<td><span style="color: #008000">2.513</span></td>
<td><span style="color: #E3170D">0.945</span></td>
<td>5.074</td>
<td>5.371</td>
<td>8.424</td>
</tr>

There is, of course, a \n at the end of each line. I want to grab the name "Mr. Google" together with the data and store them as one row of my data matrix. (Other entries on the same webpage will become the other rows.) It seems hard to match all of them at once. The only way I could come up with is:

pattern = re.compile(r'>([\w\s.]*)</a></td>\n'
                     r'(?:<td>([\d.%]*)</td>\n){2}'
                     r'(?:.*>([\d.%]*)</span></td>\n){3}'
                     r'(?:<td>([\d.]*)</td>\n){3}')

Unfortunately, each repeated group only keeps its last match, i.e. I get "Mr. Google", 1360, 0.945, 8.424 rather than all the data. Should I write the pattern out several times instead of using {2} or {3}? That might fix it, but it is really ugly. :( Can anyone help me with this re pattern?

Another option is to get the name and the data separately with simpler patterns. The problem is that the webpage also contains other, unrelated data, and I don't want to mix these "name-data" rows with it. So I need to get the name and the data in one go to be sure I get the right data.

Thank you.

The regex below captures any text found between a > and a < on the same line (one or more characters other than <, \n and \r):

import re
data = re.findall(r'>\s*([^<\n\r]+)\s*<', html)  # html: the page source as a string
print(data)

This will only work for strings shaped like the sample you provided.
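If the name and its values have to stay together as one row, a two-step variant may be easier than one big pattern. In Python's re module a group inside {2} or {3} keeps only its last match, which is why the original pattern returns a single value per group. The sketch below first matches each name together with the block of cells that follows it, then runs findall on that block. The names row_re and cell_re are made up for illustration, and the sample string is the snippet from the question; a real page may need looser patterns.

```python
import re

# Sample row copied from the question; a real page would hold many such rows.
html = '''<td><a href="some_web_site">Mr. Google</a></td>
<td>12.42%</td>
<td>1360</td>
<td><span style="color: #E3170D">49.12%</span></td>
<td><span style="color: #008000">2.513</span></td>
<td><span style="color: #E3170D">0.945</span></td>
<td>5.074</td>
<td>5.371</td>
<td>8.424</td>
</tr>
'''

# Step 1: grab each name plus the raw block of cells after it, up to </tr>.
row_re = re.compile(r'<a [^>]*>([^<]+)</a></td>\n(.*?)</tr>', re.DOTALL)
# Step 2: pull every cell value out of that block, with or without a <span>.
cell_re = re.compile(r'<td>(?:<span[^>]*>)?([\d.%]+)(?:</span>)?</td>')

rows = [[name] + cell_re.findall(cells) for name, cells in row_re.findall(html)]
print(rows)
```

Each entry of rows is one line of the data matrix: the name first, then its eight values as strings. Cells that do not sit in a row starting with a link are never seen by cell_re, so unrelated data elsewhere on the page should not get mixed in.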

A better approach is to use XPath:

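import requests
from lxml import html

url = 'HTTP'  # placeholder: the address of the page to scrape

page = requests.get(url)
tree = html.fromstring(page.text)

# Names and plain values: text directly inside <td> or <td><a>
a = tree.xpath('//td/a/text()|//td/text()')
# Coloured values: text wrapped in <span>
b = tree.xpath('//td/span/text()')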
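The two queries above return the plain and the coloured values as separate lists, so the original column order is lost. To keep each name next to its own values, one possibility is to walk the table row by row. This is only a sketch: it parses an inline string instead of fetching a page, and it assumes the cells sit inside ordinary <table><tr> markup, which the question does not show in full.

```python
from lxml import html

# Inline sample based on the question; the <table><tr> wrapper is an assumption.
page_source = '''<table><tr>
<td><a href="some_web_site">Mr. Google</a></td>
<td>12.42%</td>
<td>1360</td>
<td><span style="color: #E3170D">49.12%</span></td>
<td><span style="color: #008000">2.513</span></td>
<td><span style="color: #E3170D">0.945</span></td>
<td>5.074</td>
<td>5.371</td>
<td>8.424</td>
</tr></table>'''

tree = html.fromstring(page_source)

rows = []
for tr in tree.xpath('//tr'):
    # text_content() flattens nested <a> and <span> tags into plain text.
    cells = [td.text_content().strip() for td in tr.xpath('./td')]
    if cells:
        rows.append(cells)

print(rows)
```

Each inner list is one row in document order, with the name first. On a real page the '//tr' query would probably need narrowing to the table of interest so that rows from unrelated tables are not picked up.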

