Get href of an <a> element inside of an HTML table
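I have an HTML table, and from it I only want the <tr> elements with class="". I want to download the files later, so I only need the third <td> of each row and the href of the <a> element inside it. How can I read these out directly as strings?
I want all <tr> elements with class="".
For example: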
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
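Inside this <tr> element there are several <td> elements. I want the href of the <a> element in the third <td>. So what I want is the URL http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz
(not only this one :D, I want all the URLs).
Code: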
import requests
from bs4 import BeautifulSoup
from datetime import datetime
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
print(antwerp_table)
# antwerp_table is my html table
HTML example (visit http://insideairbnb.com/get-the-data.html for more information):
<table class="table table-hover table-striped antwerp">
<thead>
<tr>
<th class="col-md-3" data-field="host_id">Date Compiled</th>
<th class="col-md-3" data-field="host_id">Country/City</th>
<th class="col-md-3" data-field="host_id">File Name</th>
<th class="col-md-3" data-align="right" data-field="count">
Description
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
...
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
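There are different ways to get the non-archived hrefs. Given the structure of the table, I suggest using bs4 with a CSS selector that picks every <a> inside a <tr> with an empty class: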
soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')
import requests
from bs4 import BeautifulSoup
from datetime import datetime
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = [url['href'] for url in soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')]
['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/reviews.csv.gz',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/reviews.csv',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.csv',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.geojson']
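To try the selector without hitting the live site, here is a minimal offline sketch. The `SAMPLE_HTML` string, the `example.com` URLs, and the helper name `current_hrefs` are assumptions made for illustration; only the selector pattern comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the table structure from the question;
# the example.com URLs are placeholders, not real dataset links.
SAMPLE_HTML = """
<table class="table table-hover table-striped antwerp">
<thead>
<tr><th>Date Compiled</th><th>Country/City</th><th>File Name</th><th>Description</th></tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td><td>Antwerp</td>
<td><a href="http://example.com/2021-09-29/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td><td>Antwerp</td>
<td><a href="http://example.com/2021-09-29/calendar.csv.gz">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
<tr class="archived">
<td>17 August, 2021</td><td>Antwerp</td>
<td><a href="http://example.com/2021-08-17/calendar.csv.gz">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
</tbody>
</table>
"""


def current_hrefs(html, city):
    """Return the href of every <a> inside a <tr class=""> of the city's table."""
    soup = BeautifulSoup(html, "html.parser")
    # a[href] skips anchors without an href, so a["href"] cannot raise KeyError.
    selector = f'.{city.lower()} tr[class=""] a[href]'
    return [a["href"] for a in soup.select(selector)]


hrefs = current_hrefs(SAMPLE_HTML, "Antwerp")
print(hrefs)
```

The header row has no class attribute at all, so `tr[class=""]` skips it along with the archived row.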
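First you have to get the table on its own. Note that find() returns only the first match. I checked that only one table has that class, so we can use .select_one(). After that you have to select() the <a> elements.
Here is working code. Note that it prints every link in the table, including those in archived rows: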
import requests
from bs4 import BeautifulSoup
from datetime import datetime
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.select_one(f".{DATASET_CITY.lower()}")
for i in antwerp_table.select("a"):
    print(i.get("href"))
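If only the third cell of the non-archived rows is wanted, the same two-step idea (isolate the table, then select inside it) could be narrowed with a more specific selector. This is a sketch under assumptions: the sample markup, the `example.com` URLs, and the helper name `third_column_hrefs` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second cell of the first row also contains a link,
# which the third-column selector below should ignore.
SAMPLE_HTML = """
<table class="antwerp">
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td><a href="http://example.com/antwerp">Antwerp</a></td>
<td><a href="http://example.com/current/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://example.com/archived/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
</tbody>
</table>
"""


def third_column_hrefs(html, table_class):
    """Return hrefs found in the third <td> of rows whose class is empty."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(f"table.{table_class}")
    if table is None:
        # No table with that class on the page.
        return []
    anchors = table.select('tr[class=""] > td:nth-of-type(3) > a[href]')
    return [a["href"] for a in anchors]


print(third_column_hrefs(SAMPLE_HTML, "antwerp"))
```

The `None` check matters because `select_one()` returns `None` when nothing matches, and calling `.select()` on that would raise an `AttributeError`.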
Iterate over the table rows to find the links:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
#print(antwerp_table)
rows = antwerp_table.find_all('tr', class_='')
for tr in rows:
    cols = tr.find_all('td')
    # Header rows contain <th> cells only, so they are skipped here
    if len(cols) >= 4:
        link = cols[2].find('a').get('href')
        print(link)
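The row-by-row approach also makes it easy to keep other cells alongside the link and to skip rows that lack one. The sketch below is one possible variant, not the answer's own code: the generator name `iter_current_files`, the sample markup, and the `example.com` URL are assumptions, and the class check is done by inspecting the attribute directly.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a header row, a current row, a current row
# without a link, and an archived row.
SAMPLE_HTML = """
<table class="antwerp">
<thead>
<tr><th>Date Compiled</th><th>Country/City</th><th>File Name</th><th>Description</th></tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td><td>Antwerp</td>
<td><a href="http://example.com/current/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td><td>Antwerp</td><td>no link</td><td>Row without a link</td>
</tr>
<tr class="archived">
<td>17 August, 2021</td><td>Antwerp</td>
<td><a href="http://example.com/archived/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
</tbody>
</table>
"""


def iter_current_files(table):
    """Yield (date_compiled, href) for each row whose class attribute is empty."""
    for tr in table.find_all("tr"):
        # Skip rows with no class attribute (header) or a non-empty one ("archived").
        if not tr.has_attr("class") or any(tr["class"]):
            continue
        cols = tr.find_all("td")
        if len(cols) < 3:
            continue
        link = cols[2].find("a", href=True)
        if link is None:
            continue  # third cell has no usable link
        yield cols[0].get_text(strip=True), link["href"]


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
files = list(iter_current_files(table))
print(files)
```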
First we get all <tr> elements with class="", then all <a> elements inside them, and finally all the hrefs:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
c = requests.get(DATASET_URL).content
soup = BeautifulSoup(c, "html.parser")
trs = soup.find(class_=DATASET_CITY.lower()).find_all('tr', class_='')
anchors = [a for tr in trs for a in tr.find_all('a')]
links = [a.get('href') for a in anchors]
print(links)
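Since the question mentions downloading the files later, it may help to derive a local file name from each URL first. The sketch below uses only the standard library and performs no network access. The helper name `local_name` and the choice to join the last two path segments are assumptions for illustration; the two URLs are taken from the output shown in the first answer.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_name(url):
    """Build a file name from the last two path segments of a URL."""
    parts = [p for p in PurePosixPath(urlsplit(url).path).parts if p != "/"]
    # e.g. ".../data/listings.csv.gz" becomes "data_listings.csv.gz"
    return "_".join(parts[-2:])


links = [
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz",
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv",
]
names = [local_name(u) for u in links]
print(names)
```

Keeping the parent folder in the name makes it obvious which file came from data/ and which from visualisations/.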