How to get the specific text of this table?

Question

I'm quite familiar with BeautifulSoup but can't work out the right expression for the following. The HTML is a snippet from the page I want to scrape (and I'm allowed to scrape it, by the way):

import bs4 as BeautifulSoup
    
data= """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
 <dt>
  Ort
 </dt>
 <dd>
  80995
  <a href="https://www.markt.de/suche.htm" title="München">
   München
  </a>
 </dd>
 <dt>
  Anzeigentyp
 </dt>
 <dd>
  Privatangebot
 </dd>
 <dt>
  Anzeigendatum
 </dt>
 <dd>
  04.10.2020
 </dd>
 <dt>
  Anzeigenkennung
 </dt>
 <dd>
  <a href="https://some.link/">
   blabla
  </a>
 </dd>
 <dt>
  Aufrufe dieser Anzeige
 </dt>
 <dd>
  734
 </dd>
</dl>"""
    
soup = BeautifulSoup(data, 'html.parser')

I want to assign the date 04.10.2020 from the HTML to the variable date. My last attempt was this:

date = soup.find('dl',{'class':'markt_expose_deflist markt_expose_deflist_lineless'}).find('dt',{'text':'Anzeigentyp'}).find('dd').text

But it didn't work.

Answer 1

The date is in the third dd tag, so use the find_all method to collect all the dd tags and assign the text of the third one (index 2) to the variable date. Your import statement was also wrong. I'd additionally suggest using html5lib instead of html.parser (note that html5lib is a separate package that has to be installed). Here is the final code:

from bs4 import BeautifulSoup

data= """    <dl class="markt_expose_deflist markt_expose_deflist_lineless">
        <dt>
          Ort
        </dt>
        <dd>
          80993&nbsp;<a href="https://www.markt.de/suche.htm" title="München">München</a>
        </dd>
      <dt>
        Anzeigentyp
      </dt>
      <dd>
        Privatangebot
      </dd>
        <dt>
          Anzeigendatum
        </dt>
        <dd>
          04.10.2020
        </dd>
        <dt>
          Anzeigenkennung
        </dt>
        <dd>
          <a href="https://some.link/">f2e7ae76</a>
        </dd>
        <dt>
          Aufrufe dieser Anzeige
        </dt>
        <dd>
          689
        </dd>
    </dl>"""

soup = BeautifulSoup(data, 'html5lib')

date = soup.find('dl',{'class':'markt_expose_deflist markt_expose_deflist_lineless'})

date = date.find_all('dd')[2].text.strip()

print(date)

Output:

04.10.2020
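
Picking the value by position works as long as the order of the entries never changes. As a rough sketch of a lookup by label instead (assuming every dt is followed by its dd, as in the snippet from the question; the names dl and fields are just illustrative, and the data is a shortened copy of the original):

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question
data = """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
 <dt>
  Ort
 </dt>
 <dd>
  80995
  <a href="https://www.markt.de/suche.htm" title="München">
   München
  </a>
 </dd>
 <dt>
  Anzeigendatum
 </dt>
 <dd>
  04.10.2020
 </dd>
</dl>"""

soup = BeautifulSoup(data, "html.parser")

# class_ matches a tag if any one of its classes equals the given value
dl = soup.find("dl", class_="markt_expose_deflist")

# Pair every label (dt) with the dd that follows it
fields = {
    dt.get_text(strip=True): dt.find_next_sibling("dd").get_text(" ", strip=True)
    for dt in dl.find_all("dt")
}

date = fields["Anzeigendatum"]
print(date)
```

With this, date no longer depends on the date being the third entry, and the other fields are available from the same dictionary.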

Answer 2

You can use this if all the dates are in the same format. It extracts all the dd tags and keeps those whose text contains more than one ".".

soup = BeautifulSoup(data, 'html.parser')


for tag in soup.find_all('dd'):
    if tag.text.count('.') > 1:
        date = tag.text.strip()
        print(date)

Output:

04.10.2020
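
Counting dots would also match values such as a version number. A somewhat stricter sketch of the same idea, assuming the dates always look like DD.MM.YYYY (the "Version" entry below is made up purely to show the difference, it is not in the original page):

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical data: the "Version" row is invented for illustration
data = """<dl>
 <dt>Version</dt>
 <dd>1.2.3</dd>
 <dt>Anzeigendatum</dt>
 <dd>
  04.10.2020
 </dd>
</dl>"""

soup = BeautifulSoup(data, "html.parser")
date_pattern = re.compile(r"\d{2}\.\d{2}\.\d{4}")

dates = []
for tag in soup.find_all("dd"):
    text = tag.get_text(strip=True)
    # Keep only values that consist of exactly DD.MM.YYYY
    if date_pattern.fullmatch(text):
        # strptime raises ValueError for impossible dates such as 31.02.2020
        dates.append(datetime.strptime(text, "%d.%m.%Y").date())

print(dates)
```

Here "1.2.3" is skipped even though it contains two dots, and the match is converted to a real date object rather than kept as a string.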

Answer 3

First of all, your import statement is wrong. All it does is bind the bs4 module to the name BeautifulSoup. What I believe you wanted to do was import the class BeautifulSoup from the module bs4.

In Python, that is done like this:

from module import the_class_you_want_to_import

So in your case that would be:

from bs4 import BeautifulSoup
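
To make the difference concrete, here is a small sketch (the variable names error and text are only illustrative). With the aliased import the name refers to a module, and calling a module fails:

```python
import bs4 as BeautifulSoup  # binds the *module* bs4 to the name BeautifulSoup

error = None
try:
    BeautifulSoup("<p>hi</p>", "html.parser")  # a module cannot be called
except TypeError as exc:
    error = str(exc)

from bs4 import BeautifulSoup  # rebinds the name to the *class*

soup = BeautifulSoup("<p>hi</p>", "html.parser")
text = soup.p.text

print(error)
print(text)
```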

Now that we have the import sorted out, let's move on to the actual code.

The <dt> element you are trying to find has no child elements, so no <dd> can be found inside it. Each <dd> is a sibling that follows its <dt>, not a descendant of it.

What I did instead was this:

soup = BeautifulSoup(data, 'html.parser')
date = soup.find_all('dd')
print(date[2].text.strip())
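
For reference, a short sketch of why the attempt from the question comes back empty, using a cut-down copy of the HTML. As far as I can tell there are two separate problems: a dict passed as the second argument is treated as a filter on HTML attributes, and the <dd> is not inside the <dt>. The variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# Cut-down copy of the snippet from the question
data = """<dl>
 <dt>
  Anzeigendatum
 </dt>
 <dd>
  04.10.2020
 </dd>
</dl>"""

soup = BeautifulSoup(data, "html.parser")

# The dict filters on attributes, so this searches for <dt text="Anzeigendatum">
by_attr = soup.find("dt", {"text": "Anzeigendatum"})

dt = soup.find("dt")
child_dd = dt.find("dd")                 # searches inside <dt>, where there is no <dd>
sibling_dd = dt.find_next_sibling("dd")  # the <dd> that follows the <dt>

date = sibling_dd.text.strip()
print(by_attr, child_dd, date)
```

Both by_attr and child_dd are None here, which is why chaining another .find(...) onto them fails.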

Answer 4

You can select the <dd> that immediately follows the <dt> containing the string "Anzeigendatum":

print( soup.select_one('dt:contains("Anzeigendatum") + dd').text )

Prints:

  04.10.2020
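
A variant of the same selector, as a sketch: this assumes a reasonably recent soupsieve (2.1 or newer, the CSS selector library used by bs4), where the pseudo-class is spelled :-soup-contains and the plain :contains spelling is deprecated. get_text(strip=True) also removes the surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Cut-down copy of the snippet from the question
data = """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
 <dt>
  Anzeigentyp
 </dt>
 <dd>
  Privatangebot
 </dd>
 <dt>
  Anzeigendatum
 </dt>
 <dd>
  04.10.2020
 </dd>
</dl>"""

soup = BeautifulSoup(data, "html.parser")

# "+" selects the dd that directly follows the matching dt
dd = soup.select_one('dt:-soup-contains("Anzeigendatum") + dd')
date = dd.get_text(strip=True)
print(date)
```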

4 answers

solution1
1 ACCEPTED 2020-10-19 07:01:05

solution2
0 2020-10-19 07:33:51

solution3
0 2020-10-19 07:38:19

solution4
0 2020-10-19 11:35:01
