Trying to extract URLs using soup.select and soup.find_all

Question

This is part of the HTML source code of a web page:

<a href="http://www.abcde.com"> <img style="width:100%" src="/FileUploads/B/763846f.jpg" alt="search" title="search" /></a>
<a id="parts_img01" href="/Result?s=9&amp;type=%E4&amp;name=%E9"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&amp;type=%E4&amp;name=%E4"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>banana</h4></a>
<a id="parts_img03" href="/Result?s=10&amp;type=%E4&amp;name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>cherry</h4></a>
<a id="parts_img07" href="/Result?s=14&amp;type=%E4&amp;name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>melon</h4></a>

I want to extract specific URLs from it, namely the ones that start with /Result?. I just learned that Beautiful Soup offers soup.find_all and soup.select.

soup.find_all:

icon = soup.find_all(id = re.compile("parts_img"))

and one of the results prints successfully:

<a href="/Result?s=9&amp;type=%E4&amp;name=%E9" id="parts_img01"><h4 style=""><i aria-hidden="true" class="fa f-c"></i>apple</h4></a>

soup.select:

for item in soup.select(".fa f-c"):
    print(item['href'])

And this is not working...

Is there a way to extract the URLs directly from the HTML? I just want to print:

/Result?s=9&amp;type=%E4&amp;name=%E9
/Result?s=12&amp;type=%E4&amp;name=%E4
/Result?s=10&amp;type=%E4&amp;name=%E8
/Result?s=14&amp;type=%E4&amp;name=%E8

Answer 1

To get the same output without using regex:

html = """
 <a href="http://www.abcde.com"> <img style="width:100%" src="/FileUploads/B/763846f.jpg" alt="search" title="search" /></a>
<a id="parts_img01" href="/Result?s=9&amp;type=%E4&amp;name=%E9"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&amp;type=%E4&amp;name=%E4"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>banana</h4></a>
<a id="parts_img03" href="/Result?s=10&amp;type=%E4&amp;name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>cherry</h4></a>
<a id="parts_img07" href="/Result?s=14&amp;type=%E4&amp;name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>melon</h4></a>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
for link in soup.select("[id^='parts_img']"):
    print(link['href'])

Result:

/Result?s=9&type=%E4&name=%E9
/Result?s=12&type=%E4&name=%E4
/Result?s=10&type=%E4&name=%E8
/Result?s=14&type=%E4&name=%E8
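
For what it's worth, the selector in the question, ".fa f-c", most likely fails for two reasons. The space turns it into a descendant selector that looks for an <f-c> tag inside an element with class fa, and even the corrected ".fa.f-c" matches the <i> tags, which carry no href. A minimal sketch of one way around that, assuming the same markup (shortened here to two links) and using the built-in html.parser rather than lxml so that nothing extra needs installing, is to select the icons and walk up to the enclosing <a>:

```python
from bs4 import BeautifulSoup

# Two of the links from the question, shortened for illustration.
html = """
<a href="http://www.abcde.com"><img src="/FileUploads/B/763846f.jpg" alt="search" /></a>
<a id="parts_img01" href="/Result?s=9&amp;type=%E4&amp;name=%E9"><h4><i class="fa f-c"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&amp;type=%E4&amp;name=%E4"><h4><i class="fa f-c"></i>banana</h4></a>
"""

soup = BeautifulSoup(html, "html.parser")

# ".fa f-c" reads as "an <f-c> element inside something with class fa": no match.
broken = soup.select(".fa f-c")

# ".fa.f-c" reads as "an element carrying both classes": matches the <i> tags.
icons = soup.select(".fa.f-c")

# The href lives on the enclosing <a>, so walk up from each icon.
hrefs = [icon.find_parent("a")["href"] for icon in icons]

for href in hrefs:
    print(href)
```

The printed values contain & rather than &amp; because the parser decodes HTML entities in attribute values, which is also why the result above differs slightly from the output requested in the question.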

Answer 2

I think this code illustrates extracting the href values from the given HTML.

html = """<a href="http://www.abcde.com"> <img style="width:100%" src="/FileUploads/B/763846f.jpg" alt="search" title="search" /></a>
<a id="parts_img01" href="/Result?s=9&amp;type=%E4&amp;name=%E9"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&amp;type=%E4&amp;name=%E4"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>banana</h4></a>
<a id="parts_img03" href="/Result?s=10&amp;type=%E4&amp;name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>cherry</h4></a>
<a id="parts_img07" href="/Result?s=14&amp;type=%E4&amp;name=%E8"><h4 style=""><i class="fa f-c" aria-hidden="true"></i>melon</h4></a>"""
from bs4 import BeautifulSoup as Soup
import re
from urllib.parse import urljoin
parser = Soup(html, "lxml")
href = [urljoin("http://www.abcde.com", a["href"]) for a in parser.find_all("a", {"id": re.compile("parts_img.*")})]
print(href)
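
The answer does not show its output. As a small sketch of what the urljoin step contributes, assuming the same base URL http://www.abcde.com that the answer picked and two of the already-decoded href values, relative paths are resolved against the base while absolute URLs pass through unchanged. The names base, relative, and untouched are illustrative only:

```python
from urllib.parse import urljoin

base = "http://www.abcde.com"  # base URL assumed by the answer above

# Relative hrefs as Beautiful Soup returns them (entities already decoded).
relative = [
    "/Result?s=9&type=%E4&name=%E9",
    "/Result?s=12&type=%E4&name=%E4",
]

# Resolve each relative path against the base to get an absolute URL.
absolute = [urljoin(base, path) for path in relative]

# An href that is already absolute is returned as-is.
untouched = urljoin(base, "http://www.abcde.com/other")

for url in absolute:
    print(url)
```

This is mainly useful when the extracted links are going to be fetched afterwards, since a bare /Result?... path cannot be requested on its own.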

Answer 3

I'm using

#!/usr/bin/python

import requests
from bs4 import BeautifulSoup
import re

top_url = 'https://a-certain.org/item-index'
response = requests.get(top_url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('a[href^="http://a-certain.org/items"]')
for item in items:
    print(item['href'])

Output is

http://a-certain.org/items/item1/
http://a-certain.org/items/item2/
http://a-certain.org/items/item3/
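
The same attribute-prefix idea can be tried offline against the HTML from the question, which asks for links whose href starts with /Result. This is only a sketch under that assumption: it uses the built-in html.parser, and the sample markup is trimmed to the attributes that matter:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<a href="http://www.abcde.com"><img src="/FileUploads/B/763846f.jpg" alt="search" /></a>
<a id="parts_img01" href="/Result?s=9&amp;type=%E4&amp;name=%E9"><h4>apple</h4></a>
<a id="parts_img02" href="/Result?s=12&amp;type=%E4&amp;name=%E4"><h4>banana</h4></a>
<a id="parts_img03" href="/Result?s=10&amp;type=%E4&amp;name=%E8"><h4>cherry</h4></a>
<a id="parts_img07" href="/Result?s=14&amp;type=%E4&amp;name=%E8"><h4>melon</h4></a>
"""

soup = BeautifulSoup(html, "html.parser")

# [href^="/Result"] keeps only <a> tags whose href begins with that prefix,
# so the first link to http://www.abcde.com is skipped.
links = soup.select('a[href^="/Result"]')
hrefs = [a["href"] for a in links]

for href in hrefs:
    print(href)
```

Selecting on the href prefix does not depend on the id naming scheme, so it should keep working if the page adds links without a parts_img id.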


Answer 1
2 2017-10-13 08:00:28

Answer 2
1 Accepted 2017-10-13 07:07:15

Answer 3
0 2018-03-03 05:20:39