Tried to extract URLs using soup.select and soup.find_all
Here is part of the page's HTML source:
<a href="http://www.abcde.com"> <img style="width:100%" src="/FileUploads/B/763846f.jpg" alt="search" title="search" /></a>
<a id="parts_img01" href="/Result?s=9&type=%E4&name=%E9"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&type=%E4&name=%E4"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>banana</h4></a>
<a id="parts_img03" href="/Result?s=10&type=%E4&name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>cherry</h4></a>
<a id="parts_img07" href="/Result?s=14&type=%E4&name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>melon</h4></a>
I want to extract only the URLs I need, for example the ones that start with /Result. I have just learned that Beautiful Soup offers soup.find_all and soup.select.
soup.find_all:
icon = soup.find_all(id = re.compile("parts_img"))
This works. One of the results it prints is:
<a href="/Result?s=9&type=%E4&name=%E9" id="parts_img01"><h4 style=""><i aria-hidden="true" class="fa f-c"></i>apple</h4></a>
soup.select:
for item in soup.select(".fa f-c"):
    print(item['href'])
This does not work...
Is it possible to extract the URLs directly from the HTML? I only want to print:
/Result?s=9&type=%E4&name=%E9
/Result?s=12&type=%E4&name=%E4
/Result?s=10&type=%E4&name=%E8
/Result?s=14&type=%E4&name=%E8
To get the same output without using a regular expression:
html = """
<a href="http://www.abcde.com"> <img style="width:100%" src="/FileUploads/B/763846f.jpg" alt="search" title="search" /></a>
<a id="parts_img01" href="/Result?s=9&type=%E4&name=%E9"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&type=%E4&name=%E4"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>banana</h4></a>
<a id="parts_img03" href="/Result?s=10&type=%E4&name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>cherry</h4></a>
<a id="parts_img07" href="/Result?s=14&type=%E4&name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>melon</h4></a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# Select every element whose id starts with "parts_img".
for link in soup.select("[id^='parts_img']"):
    print(link['href'])
Result:
/Result?s=9&type=%E4&name=%E9
/Result?s=12&type=%E4&name=%E4
/Result?s=10&type=%E4&name=%E8
/Result?s=14&type=%E4&name=%E8
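As a side note on why the selector in the question returns nothing: in CSS, `.fa f-c` means "an `<f-c>` element inside something with class `fa`", and even the corrected form `i.fa.f-c` matches the `<i>` tags, which carry no `href`. Below is a minimal sketch of two alternatives, assuming markup like the snippet above (shortened here). It uses the built-in `html.parser` so lxml is not required, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://www.abcde.com"><img src="/FileUploads/B/763846f.jpg" alt="search" /></a>
<a id="parts_img01" href="/Result?s=9&type=%E4&name=%E9"><h4><i class="fa f-c" aria-hidden="true"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&type=%E4&name=%E4"><h4><i class="fa f-c" aria-hidden="true"></i>banana</h4></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: match <a> tags whose href starts with "/Result".
by_href = [a["href"] for a in soup.select('a[href^="/Result"]')]

# Option 2: match the <i class="fa f-c"> icons, then climb to the enclosing <a>.
by_icon = [i.find_parent("a")["href"] for i in soup.select("i.fa.f-c")]

print(by_href)
print(by_icon)
```

Option 1 is probably the closest match to "URLs starting with /Result", since it does not depend on the `id` or on the icon markup.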
I think this code illustrates how to extract the href values from the given HTML.
html = """<a href="http://www.abcde.com"> <img style="width:100%" src="/FileUploads/B/763846f.jpg" alt="search" title="search" /></a>
<a id="parts_img01" href="/Result?s=9&type=%E4&name=%E9"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&type=%E4&name=%E4"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>banana</h4></a>
<a id="parts_img03" href="/Result?s=10&type=%E4&name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>cherry</h4></a>
<a id="parts_img07" href="/Result?s=14&type=%E4&name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>melon</h4></a>"""
from bs4 import BeautifulSoup as Soup
import re
from urllib.parse import urljoin
parser = Soup(html, "lxml")
# Build absolute URLs from every <a> whose id contains "parts_img".
href = [urljoin("http://www.abcde.com", a["href"]) for a in parser.find_all("a", id=re.compile("parts_img"))]
print(href)
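Note that, unlike the previous answer, this one prints absolute URLs, because each `href` is passed through `urljoin`. Here is a small sketch of how `urljoin` treats the two kinds of `href`, assuming `http://www.abcde.com` is the page's base URL (as the answer does). The `https://example.org/page` link is a made-up example of an already absolute URL.

```python
from urllib.parse import urljoin

base = "http://www.abcde.com"  # assumed base URL, taken from the answer above

# A root-relative href is resolved against the base's scheme and host.
relative = urljoin(base, "/Result?s=9&type=%E4&name=%E9")

# An href that is already absolute is returned unchanged (hypothetical URL).
absolute = urljoin(base, "https://example.org/page")

print(relative)
print(absolute)
```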
I am using:
#!/usr/bin/python
import requests
from bs4 import BeautifulSoup
import re
top_url = 'https://a-certain.org/item-index'
response = requests.get(top_url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('a[href^="http://a-certain.org/items"]')
for item in items:
    print(item['href'])  # index the loop variable, not the result list
The output is:
http://a-certain.org/items/item1/
http://a-certain.org/items/item2/
http://a-certain.org/items/item3/
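One caveat with a literal prefix such as `a[href^="http://a-certain.org/items"]` is that it will not match links written with `https://` or as relative paths. A possible workaround, sketched below with the standard library only, is to resolve each `href` against the page URL and then compare host and path. The `hrefs` list and the `is_item_link` helper are hypothetical stand-ins, not part of the answer above.

```python
from urllib.parse import urljoin, urlparse

top_url = "https://a-certain.org/item-index"  # page URL from the answer above

# Hypothetical href values, standing in for what a page might contain.
hrefs = [
    "http://a-certain.org/items/item1/",
    "https://a-certain.org/items/item2/",
    "/items/item3/",
    "https://a-certain.org/about",
]

def is_item_link(href):
    """Return True if href, resolved against top_url, is under /items on the same host."""
    parts = urlparse(urljoin(top_url, href))
    return parts.netloc == urlparse(top_url).netloc and parts.path.startswith("/items")

# Keep only item links, normalised to absolute URLs.
item_links = [urljoin(top_url, h) for h in hrefs if is_item_link(h)]
print(item_links)
```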