
How to get the specific text of this table?

I'm quite familiar with BeautifulSoup but can't build the expression for the following. The HTML is a snippet from the page I want to scrape (and I'm allowed to scrape it, by the way):

import bs4 as BeautifulSoup
    
data= """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
 <dt>
  Ort
 </dt>
 <dd>
  80995
  <a href="https://www.markt.de/suche.htm" title="München">
   München
  </a>
 </dd>
 <dt>
  Anzeigentyp
 </dt>
 <dd>
  Privatangebot
 </dd>
 <dt>
  Anzeigendatum
 </dt>
 <dd>
  04.10.2020
 </dd>
 <dt>
  Anzeigenkennung
 </dt>
 <dd>
  <a href="https://some.link/">
   blabla
  </a>
 </dd>
 <dt>
  Aufrufe dieser Anzeige
 </dt>
 <dd>
  734
 </dd>
</dl>"""
    
soup = BeautifulSoup(data, 'html.parser')

I want to assign the date 04.10.2020 from the HTML to the variable date. My last attempt was this:

date = soup.find('dl',{'class':'markt_expose_deflist markt_expose_deflist_lineless'}).find('dt',{'text':'Anzeigentyp'}).find('dd').text

But it didn't work.

The date is in the third dd tag, so use the find_all method to collect all the dd tags and assign the text of the third one (index 2) to the variable date. Your import statement was also wrong. Another suggestion from my side is to use html5lib instead of html.parser (html5lib has to be installed separately). Here is the final code:

from bs4 import BeautifulSoup

data= """    <dl class="markt_expose_deflist markt_expose_deflist_lineless">
        <dt>
          Ort
        </dt>
        <dd>
          80993&nbsp;<a href="https://www.markt.de/suche.htm" title="München">München</a>
        </dd>
      <dt>
        Anzeigentyp
      </dt>
      <dd>
        Privatangebot
      </dd>
        <dt>
          Anzeigendatum
        </dt>
        <dd>
          04.10.2020
        </dd>
        <dt>
          Anzeigenkennung
        </dt>
        <dd>
          <a href="https://some.link/">f2e7ae76</a>
        </dd>
        <dt>
          Aufrufe dieser Anzeige
        </dt>
        <dd>
          689
        </dd>
    </dl>"""

soup = BeautifulSoup(data, 'html5lib')

# Locate the definition list first, then take the third <dd> inside it.
dl = soup.find('dl',{'class':'markt_expose_deflist markt_expose_deflist_lineless'})

date = dl.find_all('dd')[2].text.strip()

print(date)

Output:

04.10.2020
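Indexing with [2] depends on the order of the entries staying the same. As a minimal sketch of an alternative, the labels and values can be paired into a dictionary and looked up by name. The helper name deflist_to_dict and the shortened markup are illustrative assumptions, and the pairing assumes every dt is followed by exactly one dd:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
# Assumption: every <dt> is followed by exactly one <dd>.
html = """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
 <dt>Anzeigentyp</dt>
 <dd>Privatangebot</dd>
 <dt>
  Anzeigendatum
 </dt>
 <dd>
  04.10.2020
 </dd>
</dl>"""


def deflist_to_dict(dl):
    """Pair each <dt> label with the <dd> value at the same position."""
    labels = [dt.get_text(strip=True) for dt in dl.find_all('dt')]
    values = [dd.get_text(" ", strip=True) for dd in dl.find_all('dd')]
    return dict(zip(labels, values))


soup = BeautifulSoup(html, 'html.parser')
fields = deflist_to_dict(soup.find('dl'))
date = fields['Anzeigendatum']
print(date)
```

If a label is missing, fields.get('Anzeigendatum') returns None instead of raising KeyError.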

You can use this if all the dates are in the same format. It loops over all the dd tags and keeps those whose text contains more than one ".".

soup = BeautifulSoup(data, 'html.parser')


for tag in soup.find_all('dd'):
    if tag.text.count('.') > 1:
        date = tag.text.strip()  # strip() removes the trailing newline as well
        print(date)

Output:

04.10.2020
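Counting dots would also accept text such as a version number or a price like 1.234.567. A stricter check, sketched here under the assumption that the dates always use the DD.MM.YYYY format, is to let datetime.strptime decide. The function name find_first_date and the sample strings are made up for illustration:

```python
from datetime import datetime


def find_first_date(texts, fmt="%d.%m.%Y"):
    """Return the first entry that parses with `fmt` as a date object, or None."""
    for text in texts:
        candidate = text.strip()
        try:
            return datetime.strptime(candidate, fmt).date()
        except ValueError:
            # Not a date in the expected format; try the next entry.
            continue
    return None


# Sample strings standing in for the text of each <dd> tag (illustrative only).
dd_texts = ["\n  80995 München\n ", "\n  Privatangebot\n ", "\n  1.234.567\n ", "\n  04.10.2020\n "]
date = find_first_date(dd_texts)
print(date.strftime("%d.%m.%Y"))
```

With a parsed soup, the same function could be fed with (tag.text for tag in soup.find_all('dd')).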

First of all, your import statement is wrong. It just binds the bs4 module to the name BeautifulSoup. What I believe you wanted was to import the BeautifulSoup class from the module bs4.

In Python, that is written as:

from module import the_class_you_want_to_import

so in your case that would be:

from bs4 import BeautifulSoup
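The difference between the two import forms can be checked directly. This is only a small illustration; the alias misleading_alias is a made-up name standing in for what the question called BeautifulSoup:

```python
import bs4                      # binds the package object to the name `bs4`
from bs4 import BeautifulSoup   # binds the class to the name `BeautifulSoup`

# The form used in the question gives the whole package a new name:
import bs4 as misleading_alias

print(callable(misleading_alias))   # False: a module cannot be called
print(callable(BeautifulSoup))      # True: the class can be instantiated
print(bs4.BeautifulSoup is BeautifulSoup)
```

That is why calling the alias like BeautifulSoup(data, 'html.parser') fails with a TypeError when the name refers to the module.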

Now that we have the import sorted out let's move onto the actual code.

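The <dt> element you are trying to find has no child elements, so there is no <dd> inside it to find: each <dd> is a sibling of its <dt>, not a descendant. In addition, {'text':'Anzeigentyp'} is treated as a filter on an HTML attribute named text, so that find call returns None, and the label next to the date is Anzeigendatum, not Anzeigentyp.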

What I did instead was this:

soup = BeautifulSoup(data, 'html.parser')
dd_tags = soup.find_all('dd')  # find_all is the current name for findAll
print(dd_tags[2].text.strip())
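Since the original attempt was trying to reach the value through its label, a label-based lookup may be closer to the intent. The following is a sketch, not the only way to do it: the helper value_for_label is a hypothetical name, and the markup is a shortened copy of the snippet from the question.

```python
from bs4 import BeautifulSoup

html = """<dl>
 <dt>
  Anzeigentyp
 </dt>
 <dd>
  Privatangebot
 </dd>
 <dt>
  Anzeigendatum
 </dt>
 <dd>
  04.10.2020
 </dd>
</dl>"""


def value_for_label(soup, label):
    """Return the stripped text of the <dd> that follows the <dt> named `label`, or None."""
    # Compare the stripped text, because the label is surrounded by whitespace.
    dt = soup.find(lambda tag: tag.name == 'dt' and tag.get_text(strip=True) == label)
    if dt is None:
        return None
    # The value lives in the next <dd> sibling, not inside the <dt>.
    dd = dt.find_next_sibling('dd')
    return dd.get_text(" ", strip=True) if dd is not None else None


soup = BeautifulSoup(html, 'html.parser')
date = value_for_label(soup, 'Anzeigendatum')
print(date)
```

Returning None for an unknown label avoids the AttributeError that chained find calls raise when one of them finds nothing.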

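You can select the <dd> sibling that immediately follows the <dt> containing the string "Anzeigendatum". Recent soupsieve versions spell the pseudo-class :-soup-contains and deprecate the bare :contains, which older versions still require:

print( soup.select_one('dt:-soup-contains("Anzeigendatum") + dd').text )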

Prints:

  04.10.2020
