Extracting a deeply nested href in Python with Beautiful Soup

Question

I am trying to extract a deeply nested href. The structure looks like this:

<div id="main">
 <ol>
   <li class>
     <div class>
       <div class>
         <a class>
         <h1 class="title entry-title">
           <a href="http://wwww.link_i_want_to_extract.com">
           <span class>
         </h1>
        </div>
       </div>
     </li>

After that come more <li class> elements, each with its own href. So the parent-to-child order is basically

li - div - div - h1 - a href

I tried the following:

soup.select('li div div h1')

and also

soup.find_all("h1", { "class" : "title entry-title" })

and also

for item in soup.find_all("h1", attrs={"class" : "title entry-title"}):
        for link in item.find_all('a',href=TRUE):

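None of these seem to work: I get either [] or an empty .txt file.

Also, and more troubling: after defining soup and running print(soup), I do not see the nested elements. I only see the top-level one, <div id=main>, and print soup.li does not retrieve the li elements either. It looks as if Beautiful Soup is not recognizing the li elements or anything nested inside them.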

Answer 1

This works for me:

from bs4 import BeautifulSoup

html = '''
<div id="main">
   <ol>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="http://www.link_i_want_to_extract.com">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="https://other_link_i_want_to_extract.net">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
   </ol>
</div>
'''

soup = BeautifulSoup(html, "lxml")
for h1 in soup.find_all('h1', class_="title entry-title"):
    print(h1.find("a")['href'])
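
The soup.select('li div div h1') attempt from the question stops at the <h1>, so it can never return the link itself. A minimal sketch of the same idea carried through to the <a>, assuming a simplified, well-formed stand-in for the page (the markup and the link text below are illustrative, not the asker's real HTML):

```python
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the page in the question (hypothetical markup).
html = """
<div id="main">
  <ol>
    <li><div><div>
      <h1 class="title entry-title"><a href="http://www.link_i_want_to_extract.com"><span>First</span></a></h1>
    </div></div></li>
    <li><div><div>
      <h1 class="title entry-title"><a href="https://other_link_i_want_to_extract.net"><span>Second</span></a></h1>
    </div></div></li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "li div div h1" only reaches the heading; "> a[href]" selects its direct
# child anchors that actually carry an href attribute.
selector = "li div div h1.title.entry-title > a[href]"
links = [a["href"] for a in soup.select(selector)]
print(links)
```

The selector mirrors the li - div - div - h1 - a chain described in the question; in practice h1.title.entry-title > a[href] alone is probably specific enough.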

Answer 2

You have a typo: href=TRUE should be href=True.

s = """
<div id="main">
   <ol>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="http://www.link_i_want_to_extract.com">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="https://other_link_i_want_to_extract.net">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
   </ol>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(s, 'html.parser')

for item in soup.find_all("h1", attrs={"class" : "title entry-title"}):
    for link in item.find_all('a',href=True):
        print('bs link:', link['href'])

Alternatively, you can use pyQuery, which offers a JS/jQuery-like query syntax:

from pyquery import PyQuery as pq
from lxml import etree

d = pq(s)
for link in d('h1.title.entry-title > a'):
    print('pq link:', pq(link).attr('href'))

Output:

bs link: http://www.link_i_want_to_extract.com
bs link: https://other_link_i_want_to_extract.net
pq link: http://www.link_i_want_to_extract.com
pq link: https://other_link_i_want_to_extract.net
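
The href=True filter matters because the markup in the question also has an <a class> with no href next to the heading. A small sketch of the difference, assuming a hypothetical snippet in which such an anchor sits inside the <h1> (the class name "icon" and the link text are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first anchor has no href, the second one does.
snippet = """
<h1 class="title entry-title">
  <a class="icon"></a>
  <a href="http://www.link_i_want_to_extract.com">Post</a>
</h1>
"""

soup = BeautifulSoup(snippet, "html.parser")
h1 = soup.find("h1", class_="title entry-title")

all_anchors = h1.find_all("a")             # both anchors
with_href = h1.find_all("a", href=True)    # only anchors that have an href
hrefs = [a["href"] for a in with_href]
print(len(all_anchors), hrefs)
```

With markup like this, h1.find("a")["href"] would raise a KeyError because the first anchor has no href, which is one reason to prefer the href=True filter.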

Answer 3

Use . (attribute access) to get the first matching descendant:

soup.find('div', id="main").h1.a['href']

Or use the h1 as the anchor point:

soup.find("h1", { "class" : "title entry-title" }).a['href']
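
Both one-liners return only the first match, and they raise an error if any step of the chain is missing, because find() and dot access return None when nothing matches. A guarded sketch, assuming a hypothetical helper name first_title_href and sample strings that are not from the original page:

```python
from bs4 import BeautifulSoup


def first_title_href(markup):
    """Return the href of the first title link, or None if any step is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    h1 = soup.find("h1", class_="entry-title")  # matches any tag carrying this class
    if h1 is None or h1.a is None:
        return None
    return h1.a.get("href")  # .get() returns None instead of raising KeyError


found = first_title_href(
    '<h1 class="title entry-title">'
    '<a href="http://www.link_i_want_to_extract.com">x</a></h1>'
)
missing = first_title_href("<p>no heading here</p>")
print(found, missing)
```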

Answer 4

A simple way:

soup.select('a[href]')

Or:

soup.findAll('a', href=True)
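
Note that both calls return every link in the document, not just the ones under the title headings. A sketch of the difference between the broad selector and one scoped to the container, assuming a hypothetical page with navigation links around the #main block (the /home and /about links are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page: navigation links surround the block we care about.
page = """
<a href="/home">Home</a>
<div id="main">
  <h1 class="title entry-title"><a href="http://www.link_i_want_to_extract.com">Post</a></h1>
</div>
<a href="/about">About</a>
"""

soup = BeautifulSoup(page, "html.parser")

# Broad: every anchor with an href, anywhere in the document.
every_link = [a["href"] for a in soup.select("a[href]")]

# Scoped: only anchors inside an entry-title heading under #main.
title_links = [a["href"] for a in soup.select("#main h1.entry-title a[href]")]

print(every_link)
print(title_links)
```

find_all is the spelling used in the current Beautiful Soup documentation; findAll is the older name for the same method.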

4 answers

Answer 1: 2, accepted, 2017-01-28 10:01:56

Answer 2: 1, 2017-01-28 10:12:56

Answer 3: 0, 2017-01-28 09:59:45

Answer 4: 0, 2017-01-28 10:01:58
