
How can I get data from a specific class of an HTML tag using BeautifulSoup?

I want to get the data (name, city and address) located in a div tag from an HTML file like this:

<div class="mainInfoWrapper">
    <h4 itemprop="name">name</h4>
    <div>
        <a href="/Wiki/Province/Tehran"></a>
         city
        <a href="/Wiki/City/Tehran"></a>
         Address
    </div>
</div>

I don't know how to get the data I want from that specific tag. I'm using Python with the BeautifulSoup library.

There are several <h4> tags in the source HTML, but only one <h4> with the itemprop="name" attribute, so you can search for that first. Then access the remaining values from there. Note that the following HTML is correctly reproduced from the source page, whereas the HTML in the question was not:

from bs4 import BeautifulSoup

html = '''<div class="mainInfoWrapper">
    <h4 itemprop="name">            
        NAME
        &nbsp;                          

    </h4>                           
    <div>                           
        <a href="/Wiki/Province/Tehran">PROVINCE</a> - <a href="/Wiki/City/Tehran">CITY</a> ADDRESS
    </div>                          
</div>'''

soup = BeautifulSoup(html, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')

name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()

Running the same code against the URL that you provided:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://goo.gl/sCXNp2')
soup = BeautifulSoup(r.content, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')

name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()

>>> print(name)
بیمارستان حضرت فاطمه (س)
>>> print(province)
تهران
>>> print(city)
تهران
>>> print(address)
یوسف آباد، خیابان بیست و یکم، جنب پارک شفق، بیمارستان ترمیمی پلاستیک فک و صورت

I'm not sure that the printed output is displayed correctly on my terminal; however, this code should produce the correct text on a properly configured terminal.
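The lookup above assumes that every tag is present. A more defensive variant can be sketched as follows; the helper name extract_info and the choice to return None when the markup differs are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

SAMPLE = """<div class="mainInfoWrapper">
    <h4 itemprop="name">
        NAME
        &nbsp;
    </h4>
    <div>
        <a href="/Wiki/Province/Tehran">PROVINCE</a> - <a href="/Wiki/City/Tehran">CITY</a> ADDRESS
    </div>
</div>"""


def extract_info(html):
    """Hypothetical helper: return the four fields as a dict, or None if the markup differs."""
    soup = BeautifulSoup(html, "html.parser")
    name_tag = soup.find("h4", itemprop="name")
    if name_tag is None:
        return None
    addr_div = name_tag.find_next_sibling("div")
    links = addr_div.find_all("a") if addr_div is not None else []
    if len(links) < 2:
        return None
    province_tag, city_tag = links[:2]
    # The address is the bare text node that follows the city link.
    tail = city_tag.next_sibling
    address = tail.strip() if isinstance(tail, str) else ""
    return {
        "name": name_tag.get_text(strip=True),
        "province": province_tag.get_text(strip=True),
        "city": city_tag.get_text(strip=True),
        "address": address,
    }


info = extract_info(SAMPLE)
print(info)
```

Returning None instead of raising lets the caller skip pages whose layout does not match; raising an exception would be an equally reasonable choice.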

You can do it with the lxml.html module (lxml is a third-party package, not part of the standard library):

>>> s="""<div class="mainInfoWrapper">
...     <h4 itemprop="name">name</h4>
...     <div>
...         <a href="/Wiki/Province/Tehran"></a>
...          city
...         <a href="/Wiki/City/Tehran"></a>
...          Address
...     </div>
... </div>"""
>>> 
>>> import lxml.html
>>> document = lxml.html.document_fromstring(s)
>>> print(document.text_content().split())
['name', 'city', 'Address']
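Splitting on whitespace breaks as soon as a name or address contains a space. A sketch that targets the elements directly with XPath, assuming the markup exactly as posted in the question (empty links, each followed by a bare text node), might look like this:

```python
import lxml.html

s = """<div class="mainInfoWrapper">
    <h4 itemprop="name">name</h4>
    <div>
        <a href="/Wiki/Province/Tehran"></a>
         city
        <a href="/Wiki/City/Tehran"></a>
         Address
    </div>
</div>"""

document = lxml.html.document_fromstring(s)

# Select the wrapper by its class, then the heading inside it.
wrapper = document.xpath('//div[@class="mainInfoWrapper"]')[0]
name = wrapper.xpath('./h4[@itemprop="name"]')[0].text_content().strip()

# In lxml, the text that follows an element is stored in its .tail attribute.
province_link, city_link = wrapper.xpath('./div/a')
city = (province_link.tail or "").strip()
address = (city_link.tail or "").strip()

print(name, city, address)
```

The `or ""` guards against a link with no trailing text, in which case .tail is None.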

And with BeautifulSoup to get the text between your tags:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> print(soup.text)

And to get the text from a specific tag, just use soup.find_all:

soup = BeautifulSoup(your_HTML_source, 'html.parser')
for line in soup.find_all('div', attrs={"class": "mainInfoWrapper"}):
    print(line.text)
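The same class-based lookup can also be written as a CSS selector with soup.select, which is convenient when a page contains several such blocks. The sample page below is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up page with two matching wrappers and one unrelated div.
page = """
<div class="mainInfoWrapper"><h4 itemprop="name">First</h4><div>a</div></div>
<div class="other"><h4>Ignored</h4></div>
<div class="mainInfoWrapper"><h4 itemprop="name">Second</h4><div>b</div></div>
"""

soup = BeautifulSoup(page, "html.parser")

# Match only <h4 itemprop="name"> tags that are direct children of the wrapper.
selector = 'div.mainInfoWrapper > h4[itemprop="name"]'
names = [h4.get_text(strip=True) for h4 in soup.select(selector)]
print(names)
```

Selector support in Beautiful Soup is provided by the soupsieve package, which is installed alongside recent bs4 releases.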

If h4 is used only once, then you can do this:

name = soup.find('h4', attrs={'itemprop': 'name'})
print(name.text)
parentdiv = name.find_parent('div', class_='mainInfoWrapper')
cityaddressdiv = name.find_next_sibling('div')
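To finish the extraction from cityaddressdiv, one option is to iterate over its stripped_strings. This sketch assumes the markup as posted in the question, where the links are empty and the div holds exactly two text nodes; on a page where the links carry text, more strings would be returned and the unpacking would have to change.

```python
from bs4 import BeautifulSoup

html = """<div class="mainInfoWrapper">
    <h4 itemprop="name">name</h4>
    <div>
        <a href="/Wiki/Province/Tehran"></a>
         city
        <a href="/Wiki/City/Tehran"></a>
         Address
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
name = soup.find("h4", attrs={"itemprop": "name"})
cityaddressdiv = name.find_next_sibling("div")

# stripped_strings skips whitespace-only nodes and strips the rest.
city, address = list(cityaddressdiv.stripped_strings)
print(name.text, city, address)
```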

