
Extracting data from a table using Python Beautiful Soup

I am trying to parse the rows of a table (departure board times) from the following response:

buscms_widget_departureboard_ui_displayStop_Callback("
 <div class='\"livetimes\"'>
 <table class='\"busexpress-clientwidgets-departures-departureboard\"'>
  <thead>
   <tr class='\"rowStopName\"'>
    <th colspan='\"3\"' data-bearing='\"SW\"' data-lat='\"51.7505683898926\"' data-lng='\"-1.225102186203\"' title='\"oxfajmwg\"'>
     Divinity Road
    </th>
    <tr>
     <tr class='\"textHeader\"'>
      <th colspan='\"3\"'>
       text 69325694 to 84637 for live times
      </th>
      <tr>
       <tr class='\"rowHeaders\"'>
        <th>
         service
        </th>
        <th>
         destination
        </th>
        <th>
         time
        </th>
        <tr>
        </tr>
       </tr>
      </tr>
     </tr>
    </tr>
   </tr>
  </thead>
  <tbody>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 21:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"5'>
     5 mins
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 22:11:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"27'>
     27 mins
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4  (OBC)
    </td>
    <td class='\"colDestination\"' title='\"Abingdon\"'>
     Abingdon
    </td>
    <td 22:29:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"22:29\"'>
     22:29
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 22:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"65'>
     65 mins
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 23:09:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"23:09\"'>
     23:09
    </td>
   </tr>
  </tbody>
 </table>
</div>
<div class='\"scrollmessage_container\"'>
 <div class='\"scrollmessage\"'>
 </div>
</div>
<div class='\"services\"'>
 <a class='\"service' href='\"#\"' onclick="\&quot;serviceNameClick('');\&quot;" selected\"="">
  all
 </a>
 <a class='\"service\"' href='\"#\"' onclick="\&quot;serviceNameClick('4');\&quot;">
  4
 </a>
</div>
<div class="dptime">
 <span>
  times generated at:
 </span>
 <span>
  21:43
 </span>
</div>
");

Specifically, I want to extract all the departure times, i.e. record the time until each departure, e.g. 12 mins.

I have the following code:

# import libraries
import urllib.request
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'

# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page) 

# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')  

print(soup.prettify())

I'm not sure how to find the minutes for the times above. Is it something like this?

minutes_from_depart = soup.find("tbody", attrs={'td': 'mins'}) 

Can you try this?

import urllib.request
from bs4 import BeautifulSoup
import re

quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'

page = urllib.request.urlopen(quote_page).read()

soup = BeautifulSoup(page, 'lxml')  

print(soup.prettify())

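# the class value in this response carries stray backslashes and quotes,
# so match it by regular expression instead of by exact string
minutes = soup.find_all("td", class_=re.compile(r"colDepartureTime"))

for element in minutes:
    print(element.get_text())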
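
Some background on why the regular expression is needed: judging by the output in the question, the response looks like a JSONP call whose HTML argument has backslash-escaped double quotes, so the parser sees the class value as \"colDepartureTime\" rather than colDepartureTime. Below is a minimal sketch of an alternative, assuming the payload really has the shape callback("...");. The `payload` string and the `jsonp_to_html` helper are made up for illustration and are not part of the original code, and the sketch uses 'html.parser' rather than 'lxml'.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical payload in the same shape as the response in the question:
# a JSONP call whose HTML argument has backslash-escaped double quotes.
payload = (
    r'callback("<table><tr>'
    r'<td class=\"colDepartureTime\">5 mins</td>'
    r'<td class=\"colDepartureTime\">22:29</td>'
    r'</tr></table>");'
)


def jsonp_to_html(text):
    """Strip the callback("...") wrapper and unescape the quotes (assumed format)."""
    start = text.index('("') + 2
    end = text.rindex('")')
    return text[start:end].replace('\\"', '"')


# Parsed as-is, the class value keeps its backslashes and quotes,
# so an exact class match fails and only a regex search finds the cells.
raw_soup = BeautifulSoup(payload, "html.parser")
raw_exact = raw_soup.find_all("td", class_="colDepartureTime")
raw_regex = raw_soup.find_all("td", class_=re.compile("colDepartureTime"))

# After cleaning, a plain class match is enough.
clean_soup = BeautifulSoup(jsonp_to_html(payload), "html.parser")
clean_exact = clean_soup.find_all("td", class_="colDepartureTime")

print(len(raw_exact), len(raw_regex), len(clean_exact))
print([td.get_text(strip=True) for td in clean_exact])
```

Cleaning the payload first should also keep attributes such as title in one piece, but the simple replace is only safe if the payload contains no other escape sequences.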

So I got my answer with the following code. It turned out to be quite easy once I used the soup.find_all function:

import urllib.request
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'

# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page) 

# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')  

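# the class value in the response literally includes \" on both sides,
# so the search string has to include them too
for cell in soup.find_all('td', class_='\\"colDepartureTime\\"'):
    print(cell.get_text())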

I get the following output:

10:40
10 mins
21 mins
30 mins
40 mins
50 mins
60 mins
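
The output mixes clock times (10:40) with relative times (10 mins). If only the minutes are wanted as numbers, the strings can be filtered afterwards. This is a rough sketch using only the standard library. The `minutes_until` helper and the `rows` list are hypothetical, and it assumes relative times always look like "N mins" (or "N min").

```python
import re


def minutes_until(text):
    """Return the minutes as an int for strings like '10 mins', else None."""
    match = re.fullmatch(r"(\d+)\s*mins?", text.strip())
    return int(match.group(1)) if match else None


# Sample strings in the same form as the output above.
rows = ["10:40", "10 mins", "21 mins", "30 mins"]

# Keep only the entries given as minutes; clock times are skipped.
minutes = [m for m in map(minutes_until, rows) if m is not None]
print(minutes)
```

The strip() call is there because get_text() may return surrounding whitespace from the cell.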

