
Python + BeautifulSoup - Limiting text extraction on a specific table (multiple tables on a webpage)

Hello all, I am trying to use BeautifulSoup to pick up the content of "Date of Employment:" on a webpage. The webpage contains 5 tables. The tables are similar and look like the ones below.

    <table class="table1"><thead><tr><th style="width: 140px;" class="CII">Design Team</th><th class="top"><a href="#top">Top</a></th></tr></thead><tbody><tr><td style="width:20px;">Designer:</td><td>Michael Linnen</td></tr>
            <tr><td style="width:20px;">Date of Employment:</td><td>07 Jan 2012</td></tr>
    <tr><td style="width:20px;">No of Works:</td><td>6</td></tr>
    <tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>

<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Operation Team</th><th class="top"><a href="#top">Top</a></th></tr></thead><tbody><tr><td style="width:20px;">Manager:</td><td>Nich Sharmen</td></tr>
            <tr><td style="width:20px;">Date of Employment:</td><td>02 Nov 2005</td></tr>
    <tr><td style="width:20px;">Zones:</td><td>6</td></tr>
    <tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>

The text I want is in the 3rd table, whose header is "Design Team".

I am using the code below:

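import re
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# Collect every "Date of Employment:" label on the page, then take the third one
aa = soup.find_all(text=re.compile("Date of Employment:"))
bb = aa[2].findNext('td')
print bb.text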

The problem is that "Date of Employment:" is sometimes missing from this table. When it is not there, the code picks up the "Date of Employment:" from the next table.

How do I restrict my code to pick only the value in the table named "Design Team"? Thanks.

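Rather than finding every "Date of Employment:" and taking the next td, you can find the table directly, given that its th is "Design Team", and search only inside that table:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")

# Locate the header cell, then the table that contains it
header = soup.find("th", text="Design Team")
table = header.find_parent("table")

# Search for the label only inside that table
label = table.find("td", text="Date of Employment:")
if label is not None:
    print label.find_next_sibling("td").text
else:
    print "No Date of Employment:"

header = soup.find("th", text="Design Team") finds the th tag whose text is "Design Team".

table = header.find_parent("table") climbs from that th tag to the table tag that contains it.

label = table.find("td", text="Date of Employment:") looks for the label only within that table, so a match in a later table is never picked up. It returns None when the row is missing.

label.find_next_sibling("td") extracts the td tag immediately following the "Date of Employment:" td.

print label.find_next_sibling("td").text prints the date.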
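To illustrate restricting the lookup to a single table, here is a minimal, self-contained sketch in Python 3. It is only an illustration under some assumptions: the helper name value_in_table and the SAMPLE_HTML string are hypothetical (the sample is modelled on the question's markup, with each table properly closed and the "Date of Employment:" row deliberately left out of the "Design Team" table), and the page is parsed from a string instead of being fetched from a URL.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two well-formed tables modelled on the question's
# markup. The "Design Team" table has no "Date of Employment:" row.
SAMPLE_HTML = """
<table class="table1"><thead><tr><th class="CII">Design Team</th></tr></thead>
<tbody><tr><td>Designer:</td><td>Michael Linnen</td></tr>
<tr><td>No of Works:</td><td>6</td></tr></tbody></table>
<table class="table1"><thead><tr><th class="CII">Operation Team</th></tr></thead>
<tbody><tr><td>Manager:</td><td>Nich Sharmen</td></tr>
<tr><td>Date of Employment:</td><td>02 Nov 2005</td></tr></tbody></table>
"""


def value_in_table(soup, table_title, label):
    """Return the text of the cell next to `label` inside the table whose
    header is `table_title`, or None if the table or the label is missing."""
    # Find the header cell by its exact text, then the table enclosing it.
    header = soup.find("th", string=table_title)
    table = header.find_parent("table") if header is not None else None
    if table is None:
        return None
    # Search only the descendants of that table, never the rest of the page.
    label_cell = table.find("td", string=label)
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling("td")
    return value_cell.get_text(strip=True) if value_cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(value_in_table(soup, "Design Team", "Date of Employment:"))
print(value_in_table(soup, "Operation Team", "Date of Employment:"))
```

With this sample, the first call prints None because the "Design Team" table has no such row, and the second prints 02 Nov 2005. The lookup never falls through to the next table because table.find only searches inside the table it is called on. Exact string matching assumes the cell text has no extra whitespace; on a real page a compiled regular expression, as in the question, may be more forgiving.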
