Python + BeautifulSoup - Limiting text extraction on a specific table (multiple tables on a webpage)

Hello all. I am trying to use BeautifulSoup to extract the value of "Date of Employment:" from a webpage. The webpage contains 5 tables. The 5 tables are similar and look like the ones below.

    <table class="table1"><thead><tr><th style="width: 140px;" class="CII">Design Team</th><th class="top"><a href="#top">Top</a></th></tr></thead><tbody><tr><td style="width:20px;">Designer:</td><td>Michael Linnen</td></tr>
            <tr><td style="width:20px;">Date of Employment:</td><td>07 Jan 2012</td></tr>
    <tr><td style="width:20px;">No of Works:</td><td>6</td></tr>
    <tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>

<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Operation Team</th><th class="top"><a href="#top">Top</a></th></tr></thead><tbody><tr><td style="width:20px;">Manager:</td><td>Nich Sharmen</td></tr>
            <tr><td style="width:20px;">Date of Employment:</td><td>02 Nov 2005</td></tr>
    <tr><td style="width:20px;">Zones:</td><td>6</td></tr>
    <tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>

The text I want is in the 3rd table, whose header is "Design Team".

I am using the code below:

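import re
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# Take the third "Date of Employment:" label on the page, then the cell after it
aa = soup.find_all(text=re.compile("Date of Employment:"))
bb = aa[2].findNext('td')
print bb.text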

The problem is that the "Date of Employment:" row is sometimes missing from this table. When it is not there, the code picks up the "Date of Employment:" from the next table instead.
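The shift described above can be reproduced on a small made-up page. This is only a sketch written for Python 3 and bs4: the three-table HTML, the "Sales Team" header and its date are hypothetical stand-ins for the real page, and `string=` is used where older BeautifulSoup code writes `text=`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: the "Design Team" table has no "Date of Employment:" row.
HTML = """
<table><thead><tr><th>Sales Team</th></tr></thead><tbody>
<tr><td>Date of Employment:</td><td>01 Mar 2010</td></tr></tbody></table>
<table><thead><tr><th>Design Team</th></tr></thead><tbody>
<tr><td>Designer:</td><td>Michael Linnen</td></tr></tbody></table>
<table><thead><tr><th>Operation Team</th></tr></thead><tbody>
<tr><td>Date of Employment:</td><td>02 Nov 2005</td></tr></tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
labels = soup.find_all(string=re.compile("Date of Employment:"))

# Index 1 was meant to be the second table ("Design Team"), but that row is
# missing, so the second match actually sits in the "Operation Team" table.
cell = labels[1].find_next("td")
owner = cell.find_parent("table").find("th").get_text()
print(owner, "->", cell.get_text())  # Operation Team -> 02 Nov 2005
```

Indexing into a page-wide list of matches only works while every table contains the row, which is why the position cannot be trusted here.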

How do I restrict my code to pick up only the value in the table named "Design Team"? Thanks.

Rather than finding every "Date of Employment:" and then taking the next td, you can directly find the wanted table, given that its th is "Design Team", and search only inside it.

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# Find the header cell of the wanted table, then climb to its enclosing <table>
header = soup.find('th', text="Design Team")
table = header.find_parent('table')

# Look for the label only inside that table
label = table.find('td', text="Date of Employment:")

if label is not None:
     print label.find_next_sibling('td').text
else:
     print "No Date of Employment:"

soup.find('th', text="Design Team") finds the th tag whose text is "Design Team", and header.find_parent('table') returns the table tag that contains it.

table.find('td', text="Date of Employment:") searches only within that table, so a "Date of Employment:" in a later table is never picked up. It returns None when the row is missing, which is what the if checks.

label.find_next_sibling('td') extracts the td tag immediately following the "Date of Employment:" cell.

print label.find_next_sibling('td').text prints the date.
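For readers on Python 3, here is a self-contained sketch of the same idea: locate the table by its "Design Team" header and look for the label only inside it. The function name `date_of_employment`, the `SAMPLE` string (a trimmed, properly closed version of the tables in the question) and the choice of the `html.parser` backend are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's tables, with closing tags added.
SAMPLE = """
<table class="table1"><thead><tr><th class="CII">Design Team</th>
<th class="top"><a href="#top">Top</a></th></tr></thead>
<tbody><tr><td>Designer:</td><td>Michael Linnen</td></tr>
<tr><td>Date of Employment:</td><td>07 Jan 2012</td></tr></tbody></table>
<table class="table1"><thead><tr><th class="CII">Operation Team</th>
<th class="top"><a href="#top">Top</a></th></tr></thead>
<tbody><tr><td>Manager:</td><td>Nich Sharmen</td></tr>
<tr><td>Date of Employment:</td><td>02 Nov 2005</td></tr></tbody></table>
"""


def date_of_employment(html, team="Design Team"):
    """Return the 'Date of Employment:' value from the table headed `team`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Header cell whose only text is the team name
    header = soup.find("th", string=team)
    if header is None:
        return None
    table = header.find_parent("table")
    # Search is limited to this one table
    label = table.find("td", string="Date of Employment:")
    if label is None:
        return None
    value = label.find_next_sibling("td")
    return value.get_text(strip=True) if value is not None else None


print(date_of_employment(SAMPLE))                    # 07 Jan 2012
print(date_of_employment(SAMPLE, "Operation Team"))  # 02 Nov 2005
```

Because the search starts from the table element rather than from the whole page, a missing row yields None instead of a date borrowed from another table. Note that `string=` matches the cell text exactly, so a page with extra whitespace inside the cells would need a regular expression or a stripped comparison instead.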
