How to parse this HTML table with BeautifulSoup and Regex?
My HTML:
<table cellspacing="0" cellpadding="2" rules="all" border="1" id="branchTable" width="100%">
<tr class="TitleTable">
<th scope="col" width="250"><b>Branch Name</b></th><th scope="col" width="35%"><b>Branch Date</b></th><th scope="col" width="35%"><b>Branch Origin</b></th>
</tr><tr class="RowSet">
<td><a class="blue" href="javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=342')">SFO Branch</a></td><td class="red">03/16/2012</td><td class="red"> </td>
</tr><tr class="RowSet">
<td><a class="blue" href="javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=884')">LAX Branch</a></td><td class="red">03/16/2012</td><td class="red">06/16/1985</td>
</tr><tr class="RowSet">
<td><a class="blue" href="javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=83')">DC Branch</a></td><td class="red">03/16/2012</td><td class="red"> </td>
</tr>
</table>
My code so far:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(pageSource)
table = soup.find("table", id = "branchTable")
rows = table.findAll("tr", {"class":"RowSet"})
data = [[td.findChildren(text=True) for td in tr.findAll("td")] for tr in rows]
print data
Output:
SFO Branch 03/16/2012
LAX Branch 03/16/2012 06/16/1985
DC Branch 03/16/2012
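Expected:
I want to get the data contained in the tags as well as the ID (fetchRecord.php?fileID=342). I am not sure how to get that value. BeautifulSoup or regex, either is fine; please help. Thanks!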
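You could use a regular expression to parse the href, but I was too lazy to write one. For a proper way to parse the query string once the URI has been extracted, see href_parse below: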
from urlparse import urlparse, parse_qs  # Python 2; in Python 3 these live in urllib.parse

def href_parse(value):
    prefix = "javascript: OpenWindow('"
    suffix = "')"
    if value.startswith(prefix) and value.endswith(suffix):
        # Strip the JavaScript wrapper to leave the bare URI
        file_location = value[len(prefix):-len(suffix)]
        query_string = urlparse(file_location).query
        query_dict = parse_qs(query_string)
        # parse_qs maps each key to a list of values, e.g. ['342'];
        # the key is case-sensitive and the HTML uses 'fileID'
        return query_dict.get('fileID', None)
    return None

# Only the first cell of each row contains a link, so search the row for anchors
href_data = [[href_parse(a['href'])
              for a in tr.findAll('a', attrs={'class': 'blue'})]
             for tr in rows]
print href_data
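To make the return value concrete, here is a minimal Python 3 sketch of the same idea, run against one of the href strings from the question. The helper name file_id_from_href and the constants PREFIX and SUFFIX are hypothetical names chosen for illustration. It also assumes you want a single string rather than the list that parse_qs produces.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical constants describing the JavaScript wrapper seen in the question
PREFIX = "javascript: OpenWindow('"
SUFFIX = "')"

def file_id_from_href(href):
    """Return the fileID query value from an OpenWindow href, or None."""
    if not (href.startswith(PREFIX) and href.endswith(SUFFIX)):
        return None
    path = href[len(PREFIX):-len(SUFFIX)]
    # parse_qs maps every key to a list of values, so take the first one
    values = parse_qs(urlparse(path).query).get("fileID")
    return values[0] if values else None

sample = "javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=342')"
print(file_id_from_href(sample))
```

For the sample href this prints 342. An href that does not match the wrapper yields None, so the caller can skip that row.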
How about this:
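import re
# Raw string; the single quotes around the URL are matched literally
urlRE = re.compile(r"javascript: OpenWindow\('(.*)'\)")
...
urlMat = urlRE.match(value)
if urlMat:
    # Group 1 is the URL inside OpenWindow('...')
    url = urlMat.group(1)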
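If only the numeric ID is needed, the regex can capture it directly instead of the whole URL. The following is a sketch under that assumption. The pattern name FILE_ID_RE and the hard-coded hrefs list are illustrative, with the href strings copied from the table in the question. It assumes fileID is always a run of digits.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
# Captures the digits that follow "fileID=" inside OpenWindow('...').
FILE_ID_RE = re.compile(r"OpenWindow\('[^']*[?&]fileID=(\d+)[^']*'\)")

# href values copied from the sample table in the question
hrefs = [
    "javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=342')",
    "javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=884')",
    "javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=83')",
]

file_ids = []
for href in hrefs:
    match = FILE_ID_RE.search(href)
    # Keep None for hrefs that do not match, so positions line up with rows
    file_ids.append(match.group(1) if match else None)

print(file_ids)
```

With the three sample hrefs this prints ['342', '884', '83']. In the real script the hrefs would come from the anchor tags found in each row rather than from a literal list.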