
How to find all repeated patterns and capture the sub-patterns in Python?

I am trying to scrape some data from a web page. Some of the lines look like this:

<td><a href="some_web_site">Mr. Google</a></td>
<td>12.42%</td>
<td>1360</td>
<td><span style="color: #E3170D">49.12%</span></td>
<td><span style="color: #008000">2.513</span></td>
<td><span style="color: #E3170D">0.945</span></td>
<td>5.074</td>
<td>5.371</td>
<td>8.424</td>
</tr>

Each line ends with \n, of course. I want to capture the name "Mr. Google" together with its values as one row of my data matrix (the same page contains other rows like this one). Matching all of it at once seems hard. The only approach I could come up with is:

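import re
# re.VERBOSE lets the pattern span several lines; \n still matches a newline.
pattern = re.compile(r'''>([\w\s.]*)</a></td>\n
                     (?:<td>([\d.%]*)</td>\n){2}
                     (?:.*>([\d.%]*)</span></td>\n){3}
                     (?:<td>([\d.]*)</td>\n){3}''', re.VERBOSE)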

Unfortunately, each repeated group captures only its last match, so I get "Mr. Google", 1360, 0.945, 8.424 rather than all of the data. Should I write the sub-pattern out several times instead of using {2} or {3}? That might work, but it is really ugly. :( Can anyone help me fix this pattern?

Another option is to get the name and the data separately with simpler patterns. The problem is that the page also contains other standalone data, and I don't want to mix those values with this name-plus-data row. So I need to capture the name and its data in one go to be sure I get the right values.

Thank you.

The regex below captures whatever sits between a > and the next <, skipping leading whitespace and never crossing a line break:

import re
# html is assumed to hold the page source as a string
data = re.findall(r'>\s*([^<\n\r]+)\s*<', html)
print(data)

This will work only for the sample strings that you provided.
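As for why the original pattern returns a single value per group: in Python's re module, a capturing group inside a repeated construct such as {2} or {3} keeps only its last match. One way to keep the name and its values together is to match in two steps: first isolate each row, then run findall on that row alone. The following is a minimal sketch, assuming every row starts with a linked name cell and ends with </tr> as in the sample; the names ROW_RE, CELL_RE and parse_rows are made up for illustration.

```python
import re

# Sample row copied from the question (assumed to be representative).
SAMPLE = '''<td><a href="some_web_site">Mr. Google</a></td>
<td>12.42%</td>
<td>1360</td>
<td><span style="color: #E3170D">49.12%</span></td>
<td><span style="color: #008000">2.513</span></td>
<td><span style="color: #E3170D">0.945</span></td>
<td>5.074</td>
<td>5.371</td>
<td>8.424</td>
</tr>
'''

# A repeated group keeps only its last match, so this prints 3.
print(re.match(r'(?:(\d),){3}', '1,2,3,').group(1))

# Step 1: one match per row, from the linked name up to the closing </tr>.
ROW_RE = re.compile(r'<td><a [^>]*>([^<]+)</a></td>(.*?)</tr>', re.DOTALL)
# Step 2: inside a row, a numeric cell is ">digits<", with an optional % sign.
CELL_RE = re.compile(r'>\s*([\d.]+%?)\s*<')


def parse_rows(page_source):
    """Return one [name, value, value, ...] list per table row."""
    rows = []
    for name, rest in ROW_RE.findall(page_source):
        rows.append([name.strip()] + CELL_RE.findall(rest))
    return rows


print(parse_rows(SAMPLE))
```

Because the cell pattern only ever sees the text of one row, standalone numbers elsewhere on the page cannot leak into the result. It still depends on the markup staying this regular.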

Better to use XPath:

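import requests
from lxml import html

url = 'HTTP'  # placeholder: put the page URL here

page = requests.get(url)
tree = html.fromstring(page.text)

# Names and plain cells: text directly inside <td><a> or <td>
a = tree.xpath('//td/a/text()|//td/text()')
# Colored cells: text inside <td><span>
b = tree.xpath('//td/span/text()')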
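
The two queries above return flat lists, so the name and its values still have to be paired up afterwards. A per-row variant keeps them together. This is only a sketch, assuming each record sits in its own <tr> and that the name is the only linked cell; the sample markup and the function name rows_from_html are made up for illustration.

```python
from lxml import html

# Hypothetical page fragment modelled on the sample row in the question.
SAMPLE = '''<table>
<tr>
<td><a href="some_web_site">Mr. Google</a></td>
<td>12.42%</td>
<td><span style="color: #E3170D">49.12%</span></td>
<td>8.424</td>
</tr>
</table>'''


def rows_from_html(page_source):
    """Return one [name, value, ...] list per <tr> that has a linked name."""
    tree = html.fromstring(page_source)
    rows = []
    # Only rows whose cells include a link are treated as name-data rows.
    for tr in tree.xpath('.//tr[td/a]'):
        # text_content() flattens <a> and <span> wrappers to plain text.
        rows.append([td.text_content().strip() for td in tr.xpath('./td')])
    return rows


print(rows_from_html(SAMPLE))
```

Filtering on tr[td/a] is what separates these rows from other standalone data on the page; if the real page marks them differently, that predicate is the part to adjust.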
