I'm quite familiar with BeautifulSoup but can't build the right lookup for the following. The HTML is a snippet from the page I want to scrape (and I'm allowed to scrape it, by the way):
import bs4 as BeautifulSoup
data= """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
<dt>
Ort
</dt>
<dd>
80995
<a href="https://www.markt.de/suche.htm" title="München">
München
</a>
</dd>
<dt>
Anzeigentyp
</dt>
<dd>
Privatangebot
</dd>
<dt>
Anzeigendatum
</dt>
<dd>
04.10.2020
</dd>
<dt>
Anzeigenkennung
</dt>
<dd>
<a href="https://some.link/">
blabla
</a>
</dd>
<dt>
Aufrufe dieser Anzeige
</dt>
<dd>
734
</dd>
</dl>"""
soup = BeautifulSoup(data, 'html.parser')
I want to assign the date 04.10.2020 from the HTML to the variable date. My last attempt was this:
date = soup.find('dl',{'class':'markt_expose_deflist markt_expose_deflist_lineless'}).find('dt',{'text':'Anzeigentyp'}).find('dd').text
But it didn't work.
The date is inside the third dd tag, so just use the find_all method to collect all the dd tags and assign the text of the third one (which has an index of 2) to the variable date. Your import statement was also wrong. Another suggestion from my side is to use html5lib (which has to be installed separately) instead of html.parser. Here is the final code:
from bs4 import BeautifulSoup
data= """ <dl class="markt_expose_deflist markt_expose_deflist_lineless">
<dt>
Ort
</dt>
<dd>
80993 <a href="https://www.markt.de/suche.htm" title="München">München</a>
</dd>
<dt>
Anzeigentyp
</dt>
<dd>
Privatangebot
</dd>
<dt>
Anzeigendatum
</dt>
<dd>
04.10.2020
</dd>
<dt>
Anzeigenkennung
</dt>
<dd>
<a href="https://some.link/">f2e7ae76</a>
</dd>
<dt>
Aufrufe dieser Anzeige
</dt>
<dd>
689
</dd>
</dl>"""
soup = BeautifulSoup(data, 'html5lib')
date = soup.find('dl',{'class':'markt_expose_deflist markt_expose_deflist_lineless'})
date = date.find_all('dd')[2].text.strip()
print(date)
Output:
04.10.2020
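Indexing by position breaks as soon as the page adds or drops a field. As a rough sketch of an alternative, the dt labels can be paired with their dd values and looked up by name. This assumes that every dt is followed by exactly one dd, which holds for the snippet in the question but is not guaranteed for the whole site; the names html and fields are only illustrative, and the markup is shortened.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
<dt>
Ort
</dt>
<dd>
80995
<a href="https://www.markt.de/suche.htm" title="München">
München
</a>
</dd>
<dt>
Anzeigendatum
</dt>
<dd>
04.10.2020
</dd>
</dl>"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag that carries this class among others.
dl = soup.find("dl", class_="markt_expose_deflist")

# Pair each <dt> label with its <dd> value.
# Assumption: labels and values strictly alternate, one <dd> per <dt>.
fields = {
    dt.get_text(strip=True): dd.get_text(" ", strip=True)
    for dt, dd in zip(dl.find_all("dt"), dl.find_all("dd"))
}

date = fields["Anzeigendatum"]
print(date)
```

The same dictionary also gives the other fields, for example fields["Ort"], without any further parsing.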
You can use this if all the dates are in the same format. It goes through all dd tags and keeps the one whose text contains more than one "." character.
soup = BeautifulSoup(data, 'html.parser')
for tag in soup.find_all('dd'):
    # A date such as 04.10.2020 contains two dots
    if tag.text.count('.') > 1:
        date = tag.text.strip()
        print(date)
Output:
04.10.2020
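Counting dots would also accept text such as a version number or an IP address. A slightly stricter variant, sketched here under the assumption that the site always uses the DD.MM.YYYY format, lets datetime.strptime decide whether a dd really holds a date. The helper name first_date and the shortened markup are made up for this example.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """<dl>
<dt>Anzeigentyp</dt>
<dd>
Privatangebot
</dd>
<dt>Anzeigendatum</dt>
<dd>
04.10.2020
</dd>
<dt>Aufrufe dieser Anzeige</dt>
<dd>
734
</dd>
</dl>"""


def first_date(soup, fmt="%d.%m.%Y"):
    """Return the text of the first <dd> that parses as a date in fmt, or None."""
    for tag in soup.find_all("dd"):
        text = tag.get_text(strip=True)
        try:
            datetime.strptime(text, fmt)  # raises ValueError if text is not a date
        except ValueError:
            continue
        return text
    return None


soup = BeautifulSoup(html, "html.parser")
date = first_date(soup)
print(date)
```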
So first of all, your import statement is wrong. What you are doing is just renaming the module bs4 to BeautifulSoup, so calling BeautifulSoup(...) fails because a module is not callable. What I believe you wanted to do was to import the class BeautifulSoup from the module bs4.
To do this in Python we write:
from module import the_class_you_want_to_import
So in your case that would be:
from bs4 import BeautifulSoup
Now that we have the import sorted out, let's move on to the actual code.
The <dt> element you are trying to find has no child elements, so there is no <dd> inside it to find. Each <dd> is a sibling that follows its <dt>, not a child of it.
What I did instead was this:
soup = BeautifulSoup(data, 'html.parser')
date = soup.find_all('dd')
print(date[2].text.strip())
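For completeness, here is one way to keep the original idea of finding the label first. In the attempt from the question, {'text': 'Anzeigentyp'} is treated as an HTML attribute filter, so no dt matches and find returns None; it also names Anzeigentyp rather than Anzeigendatum. The sketch below matches the dt by its stripped text and then steps to the following dd with find_next_sibling. The helper name value_for is invented for this example, and the markup is shortened.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
<dt>
Anzeigentyp
</dt>
<dd>
Privatangebot
</dd>
<dt>
Anzeigendatum
</dt>
<dd>
04.10.2020
</dd>
</dl>"""


def value_for(soup, label):
    """Return the text of the <dd> following the <dt> whose text equals label, or None."""
    # Match on the stripped text because the <dt> content is surrounded by newlines.
    dt = soup.find(lambda tag: tag.name == "dt" and tag.get_text(strip=True) == label)
    if dt is None:
        return None
    # The value lives in the next <dd> sibling, not inside the <dt>.
    dd = dt.find_next_sibling("dd")
    return dd.get_text(strip=True) if dd is not None else None


soup = BeautifulSoup(html, "html.parser")
date = value_for(soup, "Anzeigendatum")
print(date)
```

Returning None for a missing label avoids the AttributeError that chained find calls raise when one of them finds nothing.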
You can select the <dd> that immediately follows the <dt> containing the string "Anzeigendatum":
print( soup.select_one('dt:contains("Anzeigendatum") + dd').text )
Prints:
04.10.2020
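A note on versions: in newer releases of soupsieve (the selector library behind select and select_one), :contains is deprecated in favor of :-soup-contains. A minimal sketch of the same selector, assuming soupsieve 2.1 or later is installed and using shortened markup:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """<dl class="markt_expose_deflist markt_expose_deflist_lineless">
<dt>
Anzeigentyp
</dt>
<dd>
Privatangebot
</dd>
<dt>
Anzeigendatum
</dt>
<dd>
04.10.2020
</dd>
</dl>"""

soup = BeautifulSoup(html, "html.parser")

# "+" selects the element sibling directly after the matching <dt>.
# Assumption: soupsieve >= 2.1, where :-soup-contains is available.
dd = soup.select_one('dt:-soup-contains("Anzeigendatum") + dd')

# Strip the newlines around the value and guard against a missing match.
date = dd.get_text(strip=True) if dd is not None else None
print(date)
```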