繁体   English   中英

Python美丽的汤只获得第一个Href

[英]Python Beautiful Soup Get First Href Only

我正在尝试从网页的href抓取URL,我摘录了我要抓取的div之一的列表项的摘要。

我的问题是如何缩小代码范围以仅刮取HTML的第一个Href?

# import the module
import bs4 as bs
import urllib.request
import re
import PyPDF2
import pypyodbc
from time import sleep

html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics  </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')


for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    temp = link.get('href')
    print(temp)

您可以使用find

from bs4 import BeautifulSoup as soup
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics  </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
result = soup(html, 'lxml').find('a')['href']

输出:

'http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html'

这就是你的做法

import bs4 as bs
import urllib.request
import re
import PyPDF2
import pypyodbc
from time import sleep

html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics  </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')

print soup.findAll('a', attrs={'href': re.compile("^http://")})[0].get('href')

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM