
Parsing HTML form input tags with Beautiful Soup

I am trying to scrape a website. There is no problem when the data sits between one opening and one closing tag. But when the page displays the data next to checkboxes, the text ends up in an awkward position in the HTML: it follows the input tag instead of being enclosed by it. Has anybody had the same problem?

Here is a basic example of the page markup I want to extract the data from:

<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" id="ajaxField-76" checked="">
&nbsp;&nbsp;Airport
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_b:3486" checked="" class="forminput" id="ajaxField-77">
&nbsp;&nbsp;Bunkers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput" id="ajaxField-78">
&nbsp;&nbsp;Containers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_l:3486" class="forminput" id="ajaxField-79">
&nbsp;&nbsp;Cruise
<div class="label"></div>
....

I need to fetch the labels (Airport, Bunkers, etc.) whose input tag carries checked="".

1st problem: making sure I only get the checked values.
2nd problem: fetching the text that sits between the tags, as in

<div>..</div><input...> data <div>...</div> 

By using the following code:

import requests
import bs4
from bs4 import BeautifulSoup
import pandas

r = requests.get("http://directories.lloydslist.com/?p=1635")
c = r.content 
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())
all = soup.find_all("div", {"id": "section-1785-body", "class": "sectionbody"})

I get the following format:

<div class="label"></div>
<input checked="" class="forminput" disabled="" id="ajaxField-115"   name="t_pow_ports:f_p_a:5779" type="checkbox"/>  
Airport
<div class="label"></div>
<input checked="" class="forminput" disabled="" id="ajaxField-116" name="t_pow_ports:f_p_b:5779" type="checkbox"/>  
Bunkers
<div class="label"></div>
.....
....
<input checked="" class="forminput" disabled="" id="ajaxField-119"      name="t_pow_ports:f_p_y:5779" type="checkbox"/>  Dry Bulk
<div class="label"></div></div>

So if I use the following code:

abc = all[0].find_all("input", {"class":"forminput"},"checked")

I only get the input tags, not the data:

<input class="forminput" disabled="" id="ajaxField-20"    name="t_pow_ports:f_p_a:595" type="checkbox"/>,
<input class="forminput" disabled="" id="ajaxField-21" name="t_pow_ports:f_p_b:595" type="checkbox"/>,
 <input class="forminput" disabled="" id="ajaxField-22" name="t_pow_ports:f_p_c:595" type="checkbox"/>,
....

Does anyone know a way around this problem?

The label text is not inside the input tag; it is a NavigableString that comes right after it. Use next_sibling on each checked input to get that text.

Try the following:

from bs4 import BeautifulSoup as Soup

html_str = """
<div>
    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" id="ajaxField-76" checked=""/>
    &nbsp;&nbsp;Airport

    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_b:3486" checked="" class="forminput" id="ajaxField-77"/>
    &nbsp;&nbsp;Bunkers

    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput" id="ajaxField-78"/>
    &nbsp;&nbsp;Containers

    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_l:3486" class="forminput" id="ajaxField-79"/>
    &nbsp;&nbsp;Cruise

    <div class="label"></div>
</div>
"""

soup = Soup(html_str, "html.parser")

forminput = soup.find_all("input", {"class":"forminput"})
for item in forminput:
    if item.get('checked') is not None:
        # next_sibling is the NavigableString right after the input;
        # strip() removes the surrounding newlines, spaces and &nbsp; characters
        name = item.next_sibling.strip()
        print(name)

The output of this snippet is:

Airport
Bunkers
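
As a variation on the snippet above, the filtering can probably be pushed into find_all itself: passing True as an attribute value matches tags that have the attribute, whatever its value. The sketch below assumes markup shaped like the question's example; the helper name checked_labels and the shortened sample_html are illustrative and not part of the original page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened, hypothetical sample modelled on the markup in the question.
sample_html = """
<div>
    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" checked=""/>
    &nbsp;&nbsp;Airport
    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput"/>
    &nbsp;&nbsp;Containers
    <div class="label"></div>
</div>
"""

def checked_labels(html):
    """Return the text that follows each checked .forminput checkbox."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    # "checked": True matches any input that has the attribute, even if it is empty.
    for box in soup.find_all("input", attrs={"class": "forminput", "checked": True}):
        sibling = box.next_sibling
        # Guard against a tag (or nothing at all) directly following the input.
        if isinstance(sibling, NavigableString):
            text = sibling.strip()
            if text:
                labels.append(text)
    return labels

print(checked_labels(sample_html))
```

The isinstance check is there because next_sibling can also be another tag or None, in which case calling strip() would fail.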

Just set the flag:

soup.title.find_all(string=True)
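
To relate this call to the question: find_all(string=True) returns text nodes rather than tags, so one option is to collect every text node and keep those whose previous sibling is a checked input. This is only a rough sketch under the assumption that the markup looks like the question's example; sample_html and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample modelled on the markup in the question.
sample_html = """
<div>
    <div class="label"></div>
    <input disabled="" type="checkbox" class="forminput" id="ajaxField-76" checked=""/>
    &nbsp;&nbsp;Airport
    <div class="label"></div>
    <input disabled="" type="checkbox" checked="" class="forminput" id="ajaxField-77"/>
    &nbsp;&nbsp;Bunkers
    <div class="label"></div>
    <input disabled="" type="checkbox" class="forminput" id="ajaxField-78"/>
    &nbsp;&nbsp;Containers
    <div class="label"></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

names = []
# string=True selects every text node in the tree, including whitespace-only ones.
for text_node in soup.find_all(string=True):
    prev = text_node.previous_sibling
    # Keep only text that directly follows an <input> carrying a checked attribute.
    if isinstance(prev, Tag) and prev.name == "input" and prev.has_attr("checked"):
        label = text_node.strip()
        if label:
            names.append(label)

print(names)
```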


 