Get href of an <a> element inside of a html table

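HTML website

I have an HTML list, and from this list I only want the <tr> elements that have class="". I want to download the files later, so I only need the third <td> and the href of the <a> element inside it. How can I read these out directly as strings?

I want all <tr> elements with class="".

For example: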

<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>

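Inside this <tr> element there are <td> elements. I want the href of the <a> element in the third <td> element. So what I want is the URL http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz (not only this one :D, I want all the URLs).

Code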

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
        
print(antwerp_table)
# antwerp_table is my html table 

HTML example (visit http://insideairbnb.com/get-the-data.html for more information)

<table class="table table-hover table-striped antwerp">
<thead>
<tr>
<th class="col-md-3" data-field="host_id">Date Compiled</th>
<th class="col-md-3" data-field="host_id">Country/City</th>
<th class="col-md-3" data-field="host_id">File Name</th>
<th class="col-md-3" data-align="right" data-field="count">
                        Description
                    </th>
</tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
...
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>

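There are different ways to get the non-archived hrefs. Given the structure of the table, I would suggest using bs4 CSS selectors. The following selector matches every <a> inside a <tr> that has an empty class: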

soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')

Example

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = [url['href'] for url in soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')]

Output

['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/reviews.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/reviews.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.geojson']
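
To see what the selector does without hitting the network, it can be tried on a small inline snippet. The sketch below assumes two rows copied from the question's table (with the onclick attributes trimmed for brevity): one with class="" and one with class="archived". Only the link from the first row should come back.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two rows from the question's table (onclick attributes removed).
SAMPLE_HTML = """
<table class="table table-hover table-striped antwerp">
<thead>
<tr><th>Date Compiled</th><th>Country/City</th><th>File Name</th><th>Description</th></tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
</tbody>
</table>
"""

DATASET_CITY = "Antwerp"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# .antwerp         -> the table carrying the city class
# tr[class=""]     -> rows whose class attribute is present but empty
# a                -> every link inside those rows
selector = f'.{DATASET_CITY.lower()} tr[class=""] a'
hrefs = [a["href"] for a in soup.select(selector)]
print(hrefs)
```

The header row has no class attribute at all, so tr[class=""] does not match it, and the archived row is skipped because its class is not empty.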

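First you have to get the table on its own.
find() would return only the first matching element; I checked that there is only one table with that class, so we can use .select_one().
After that you have to select() the <a> elements.
Here is working code. Note that it prints every link in the table, including those in archived rows.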

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.select_one(f".{DATASET_CITY.lower()}")
for i in antwerp_table.select("a"):
    print(i.get("href"))
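
The loop above prints every link in the table, archived rows included. If only the links from the third column of the current rows are wanted, the selector can be tightened. This is a minimal sketch on an inline snippet, not the page itself; the link in the second cell is hypothetical and was added only to show that links outside the third column are ignored.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question's table. The "#antwerp" link in the
# second cell is hypothetical; it is not present on the real page.
SAMPLE_HTML = """
<table class="table table-hover table-striped antwerp">
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td><a href="#antwerp">Antwerp</a></td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.select_one(".antwerp")

# Every link in the table, regardless of row class or column.
all_links = [a["href"] for a in table.select("a")]

# Only links in the third <td> of rows whose class attribute is empty.
current_links = [
    a["href"]
    for a in table.select('tr[class=""] > td:nth-of-type(3) > a[href]')
]

print(all_links)
print(current_links)
```

Here td:nth-of-type(3) picks the third cell of each row and a[href] skips anchors that have no href attribute.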

Iterate over the table rows to find the links

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
        
#print(antwerp_table)
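# keep rows with an empty class; rows with class="archived" are skipped
rows = antwerp_table.find_all('tr', class_='')
for tr in rows:
    cols = tr.find_all('td')
    # header rows contain <th>, not <td>, so they fail this check
    if len(cols) >= 4:
        # the download link sits in the third column
        link = cols[2].find('a').get('href')
        print(link)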
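
One thing to watch in a row-by-row loop like this: if a third cell ever lacks an <a>, find('a') returns None and the .get('href') call raises AttributeError. The sketch below wraps the same idea in a function with a guard. The function name third_column_hrefs and the second sample row (a cell without a link) are hypothetical, added only for illustration.

```python
from bs4 import BeautifulSoup

# First row is from the question; the second row is hypothetical and only
# demonstrates a third cell that contains no link.
SAMPLE_HTML = """
<table class="table table-hover table-striped antwerp">
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td>not yet available</td>
<td>Hypothetical row without a download link</td>
</tr>
</tbody>
</table>
"""


def third_column_hrefs(table):
    """Collect the href from the third <td> of every row with class=""."""
    hrefs = []
    for tr in table.select('tr[class=""]'):
        cols = tr.find_all("td", recursive=False)
        if len(cols) < 3:
            continue  # not a data row
        link = cols[2].find("a", href=True)
        if link is not None:  # skip cells that hold no link
            hrefs.append(link["href"])
    return hrefs


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = third_column_hrefs(soup.find(class_="antwerp"))
print(links)
```

Returning a list instead of printing inside the loop also makes it easier to hand the URLs to a later download step.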

First we get all <tr> elements with class="", then all <a> elements inside them, and finally all the hrefs.

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
c = requests.get(DATASET_URL).content
soup = BeautifulSoup(c, "html.parser")
trs = soup.find(class_=DATASET_CITY.lower()).find_all('tr', class_='')
hrefs = [a for tr in trs for a in tr.find_all('a')]
links = [x.get('href') for x in hrefs]
print(links)
