How to extract all nested <option> tags and their content with BeautifulSoup?

I'm trying to pull out all nested <option> tags and their values using BeautifulSoup in Python. The first block of code below produces the desired Unicode output (more than 60 pages of it). Part of the HTML tree is included below. Please note that the desired <option> tags are nested.

Issue: The second block of code below does not produce the expected output, and it raises no error.

from bs4 import BeautifulSoup
import requests

def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.prettify)
        
main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')

from bs4 import BeautifulSoup
import requests

def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    select_id = soup.find_all("select", id="pufnumber")
    print(select_id)
    nested_option = [x.find_all("option") for x in select_id] 
    print(nested_option)
    
main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')

Part of the output from print(soup.prettify):

</table>
<!-- 3/23/06 <img src="../images/bullets/spacer.gif" width="1" height="3" alt="">
            <table role="presentation" width="430" height="15" border="0" cellpadding="6" cellspacing="0">
        <tr>
          <td height="0" bgcolor="#F9F9F9" class="contentStyle"><strong><font color="#006600">Option
                2: </font><font color="#003399"><label for="pufnumber">Select by data file number/title </label></font></strong></td>
        </tr>
      </table>      
      <table role="presentation" width="430" height="25" border="0" cellpadding="5" cellspacing="0" class="BlueBox">
        <tr>
          <td width="430" height="0"> <span class="contentStyle">
     
            <select id="pufnumber" size=1 name="cboPufNumber">
            <option value="All">All data files</option>
        
                    
              <option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option> 
                    
                    
              <option value="HC-224">MEPS HC-224: 2020 Full Year Consolidated Data File</option> 
                    
                    
              <option value="HC-223">MEPS HC-223: 2020 Person Round Plan File</option> 

My goal is to pull out nested <option> tags like this:

<option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option> 

I'm not interested in the following <option> tags:

<option value="All">All available years</option>
<option value="2020">2020</option>
<option value="2019">2019</option>
<option value="2018">2018</option>
<option value="2017">2017</option>
<option value="2016">2016</option>
...

The part of the HTML you want to process is inside a comment block. BeautifulSoup stores a comment as a single Comment string instead of parsing the markup inside it, so find_all("select", id="pufnumber") never sees those tags. The comment opens here:

<!-- 3/23/06 <img src=" -->
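
To make the effect concrete, here is a minimal sketch on a made-up snippet. SAMPLE_HTML is a simplified stand-in for the real page, not a copy of it; the only assumption carried over is that the <select> sits inside an HTML comment.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, simplified stand-in for the page: the <select> is commented out.
SAMPLE_HTML = """
<table><tr><td>visible</td></tr></table>
<!-- 3/23/06
<select id="pufnumber" size=1 name="cboPufNumber">
<option value="All">All data files</option>
<option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option>
</select>
-->
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Markup inside a comment is not part of the tag tree, so nothing is found.
selects = soup.find_all("select", id="pufnumber")
print(selects)

# The whole commented-out region is kept as one Comment string node.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(len(comments))
print("pufnumber" in comments[0])
```

The tag search comes back empty, while the same markup is still reachable as plain text through the Comment node.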

Try the code below to print all the comments in the page:

import requests
from bs4 import BeautifulSoup, Comment

def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Comment nodes are strings, so search by string type rather than by tag name
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for c in comments:
        print(c)
        print("===========")

main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')

Now your problem becomes how to process the comments to extract the data you want.

Here is a working example that uses regular expressions to process the raw comment text. Note that it is written for this specific page structure and might not be useful for other sites.

import re

import requests
from bs4 import BeautifulSoup, Comment

# Return the text between the "All data files" option and the closing </select>
def extractOptions(inputData):
    sub1 = re.escape('<option value="All">All data files</option>')
    sub2 = re.escape('</select>')
    # Non-greedy, so the match stops at the first </select> after the start marker
    result = re.findall(sub1 + "(.*?)" + sub2, inputData, flags=re.S)
    if len(result) > 0:
        return result[0]
    return ''

# Return the text content of a single <option> line, or '' if there is none
def extractData(inputData):
    sub1 = re.escape('>')
    sub2 = re.escape('</option>')
    result = re.findall(sub1 + "(.*)" + sub2, inputData, flags=re.S)
    if len(result) > 0:
        return result[0]
    return ''

def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))

    for c in comments:
        # Only the comment that contains the pufnumber <select> is of interest
        if '<select id="pufnumber" size=1 name="cboPufNumber">' in c:
            options = extractOptions(c)
            ops = options.splitlines()  # one <option> per line
            for op in ops:
                data = extractData(op)
                if data != '':  # skip blank lines and lines without an option
                    print(data)

main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')
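
As an alternative to the regular expressions, the comment text can itself be handed back to BeautifulSoup and queried like any other HTML. The sketch below is a swapped-in technique, not the method used above. The helper name options_in_commented_select, the sample markup, and the id "year" on the second <select> are all made up for illustration, and it assumes the commented-out markup is well-formed enough for html.parser.

```python
from bs4 import BeautifulSoup, Comment

def options_in_commented_select(html, select_id="pufnumber"):
    """Return (value, text) pairs for the <option> tags of a commented-out <select>."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Parse the comment body as HTML in its own right.
        inner = BeautifulSoup(str(comment), "html.parser")
        select = inner.find("select", id=select_id)
        if select is None:
            continue
        return [
            (opt.get("value"), opt.get_text(strip=True))
            for opt in select.find_all("option")
            if opt.get("value") != "All"  # skip the catch-all entry
        ]
    return []

# Hypothetical sample: one commented-out <select> and one live <select>.
SAMPLE_HTML = """
<!-- 3/23/06
<select id="pufnumber" size=1 name="cboPufNumber">
<option value="All">All data files</option>
<option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option>
<option value="HC-224">MEPS HC-224: 2020 Full Year Consolidated Data File</option>
</select>
-->
<select id="year">
<option value="All">All available years</option>
<option value="2020">2020</option>
</select>
"""

pairs = options_in_commented_select(SAMPLE_HTML)
for value, text in pairs:
    print(value, "|", text)
```

Selecting by id inside the re-parsed comment keeps the year options out of the result without depending on the exact text of the first option.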
