Python Beautiful Soup Get First Href Only

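I am trying to scrape a URL from an href on a web page. Below is an excerpt of one list item from the div I want to scrape.

My question is: how can I narrow the code down so that it scrapes only the first href in the HTML?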

# import only the modules this snippet uses
import bs4 as bs
import re

html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics  </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')


for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    temp = link.get('href')
    print(temp)

You can use find, which returns only the first matching tag:

from bs4 import BeautifulSoup as soup
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics  </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
result = soup(html, 'lxml').find('a')['href']

Output:

'http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html'
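One caveat with the one-liner above: find returns None when nothing matches, so indexing the result with ['href'] would raise a TypeError on a page without links. The following is a minimal sketch of a guarded version. The helper name first_href and the shortened sample markup are hypothetical and not taken from the question, and the built-in 'html.parser' is used here only so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup


def first_href(html):
    """Return the href of the first <a> tag that has one, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True matches only <a> tags that actually carry an href attribute.
    tag = soup.find("a", href=True)
    return tag["href"] if tag is not None else None


# Shortened, hypothetical sample markup (not the page from the question).
sample = (
    '<li><a href="http://example.com/first.html">First</a>'
    '<a href="http://example.com/second.html">Second</a></li>'
)

print(first_href(sample))                    # first link only
print(first_href("<li>no links here</li>"))  # None instead of an exception
```

Returning None leaves it to the caller to decide how to handle a page with no links.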

This is how you do it: take the first element of the list that findAll returns.

import bs4 as bs
import re

html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics  </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')

print(soup.findAll('a', attrs={'href': re.compile("^http://")})[0].get('href'))
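The two answers can also be combined: find accepts the same filters as findAll (find_all in current Beautiful Soup releases), so the regex from the question can be passed to it directly and the search stops at the first match instead of building the whole list. A minimal sketch, again using shortened hypothetical markup rather than the page from the question:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample: one relative link followed by two absolute links.
sample = (
    '<li><a href="/relative/path.html">Relative</a>'
    '<a href="http://example.com/first.html">First</a>'
    '<a href="http://example.com/second.html">Second</a></li>'
)
soup = BeautifulSoup(sample, "html.parser")

# find() takes the same filters as find_all() but returns only the first hit.
first = soup.find("a", href=re.compile(r"^http://"))
first_url = first.get("href") if first is not None else None
print(first_url)

# The same query expressed as a CSS selector; select_one() also returns
# the first matching tag, or None when there is no match.
css_first = soup.select_one('a[href^="http://"]')
print(css_first["href"] if css_first is not None else None)
```

Unlike indexing with [0], which raises an IndexError on an empty result list, both variants yield None when no link matches.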

