Following on from Beautifulsoup - trouble scraping datalist with links in it
This is an example of the HTML I'm scraping with Python/Beautifulsoup:
<dl>
    <dd>
        <strong>
            <a name="45933" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45933">TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS</a>
        </strong>
        <br>
        Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.
    </dd>
    <dd></dd>
    <dd></dd>
    <dd></dd>
</dl>
This is my program:
import bs4 as bs
import urllib.request

def chinaJobs():
    sauce = urllib.request.urlopen('http://www.eslcafe.com/jobs/china/').read()
    soup = bs.BeautifulSoup(sauce, 'html.parser')
    ads = []
    for dd in soup.find_all('dd'):
        link = dd.a.get('href')
        link_text = dd.a.text
        link_text = link_text.lower()
        *_, dd_text = dd.stripped_strings
        if 'university' in link_text:
            ads.append([link, link_text, dd_text])
    for ad in ads:
        for job in ad:
            print(job)
        print("")

chinaJobs()
I can get the information after the <br> tag, but it's the wrong information. This is what the information on the website looks like:
TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS
Chongqing University -- Tuesday, 14 March 2017, at 6:58 am
This is what I would like my output to look like:
http://www.eslcafe.com/jobs/china/index.cgi?read=45933
top ranking university seeks professional lecturers
Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.
This is what my output looks like:
http://www.eslcafe.com/jobs/china/index.cgi?read=45933
top ranking university seeks professional lecturers
EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 p.m.
This same line is printed with every result:
EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 pm
Why do you think it is doing this, and what can I do to fix it?
The string you are looking for is wrapped in the <br> tag in the parsed tree, so one option is to simply use br to extract it:
soup.find("dd").a.text
# u'TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS'
soup.find('dd').a.get("href")
# u'http://www.eslcafe.com/jobs/china/index.cgi?read=45933'
soup.find('dd').br.text.strip()
# u'Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.'
You can try changing the dd_text line to dd_text = dd.br.text.strip().
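Whether the text lands inside <br> or next to it can depend on the parser and the bs4 version, so here is a minimal sketch that checks both places. It runs against the sample markup from the question rather than the live page, which may be structured differently. The names SAMPLE_HTML, text_after_br and extract_ads are made up for this illustration and are not part of the original program.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (an assumption: the live page may differ).
SAMPLE_HTML = """
<dl>
<dd>
<strong>
<a name="45933" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45933">TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS</a>
</strong>
<br>
Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.
</dd>
<dd></dd>
<dd></dd>
<dd></dd>
</dl>
"""


def text_after_br(dd):
    """Hypothetical helper: return the text that follows the first <br> in dd."""
    br = dd.find('br')
    if br is None:
        return ''
    # Case 1: the parser put the following text inside the <br> element.
    inside = next(br.stripped_strings, None)
    if inside:
        return inside
    # Case 2: <br> is empty, so the text is the node right after it.
    sibling = br.next_sibling
    return sibling.strip() if isinstance(sibling, str) else ''


def extract_ads(html, keyword='university'):
    """Hypothetical helper: collect [link, lowercased title, detail line] per ad."""
    soup = BeautifulSoup(html, 'html.parser')
    ads = []
    for dd in soup.find_all('dd'):
        if dd.a is None:  # skip empty <dd> entries with no link
            continue
        link = dd.a.get('href')
        link_text = dd.a.get_text(strip=True).lower()
        if keyword in link_text:
            ads.append([link, link_text, text_after_br(dd)])
    return ads


for ad in extract_ads(SAMPLE_HTML):
    print(*ad, sep='\n')
    print()
```

Reading only the first string tied to the <br> keeps each ad's detail line with its own link, instead of taking the last string found anywhere under the <dd>.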