In Python I had:

from bs4 import BeautifulSoup  # s is a requests.Session set up earlier

response = s.get(url, allow_redirects=False, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
reg_cart = soup.find('form', attrs={"name": "regCart"})
registered_courses = [i.a.text for i in reg_cart.find_all('div', attrs={"class": "course-number"})]
Now I want to replace BeautifulSoup with lxml. Following this guide:
https://timber.io/blog/an-intro-to-web-scraping-with-lxml-and-python/
I tried to implement the same approach and got:
import lxml.html
doc = lxml.html.fromstring(response.content)
registered_courses = doc.xpath('//div[@class="course-number"]/text()')
But for some reason my output is:
['\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t']
while previously it correctly showed the course numbers.
How can I fix this? Also, how can I change my XPath to return only the div tags under the regCart form, rather than from the whole response?
For example the html code looks something like:
<form name="regCart" ....>
</div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
<a href="https://university.com/rishum/course/236756">236756</a>
</div>
<div class="course-name">
מבוא למערכות לומדות
</div>
<div class="course-points">
3.0 נק'
</div>
<div class="entry-group">
קבוצה 13
</div>
From this I want to return 236756.
The issue is in your XPath expression: //div[@class="course-number"]/text()
Given markup like:
<div class="course-number">
<a href="https://university.com/rishum/course/236756">236756</a>
</div>
the /text() step selects only the text nodes that are direct children of the div, and here those are nothing but the whitespace (newlines and tabs) surrounding the <a> element. That is exactly the list of '\n\t\t...' strings you saw. The course number you want is the text of the <a> element itself, so the correct XPath is: //div[@class="course-number"]/a/text()
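To also restrict the search to the regCart form (the second part of the question), you can anchor the expression on the form itself with a standard XPath predicate. A minimal sketch, using a stripped-down stand-in for the real page (the decoy div outside the form is added here just to show the scoping works):

```python
import lxml.html

# Simplified HTML resembling the page in question; the outer
# course-number div is a decoy that should NOT be matched.
html = '''
<html><body>
<div class="course-number"><a href="#">999999</a></div>
<form name="regCart">
  <div class="cart-entry">
    <div class="course-number">
      <a href="https://university.com/rishum/course/236756">236756</a>
    </div>
  </div>
</form>
</body></html>
'''

doc = lxml.html.fromstring(html)

# Scope to the regCart form, then take the text of the <a> element
# inside each course-number div anywhere below it (// = any depth).
registered_courses = doc.xpath(
    '//form[@name="regCart"]//div[@class="course-number"]/a/text()'
)
print(registered_courses)  # ['236756']
```

With response.content instead of the literal html string, this mirrors the BeautifulSoup version: the form predicate plays the role of soup.find('form', attrs={"name": "regCart"}), and /a/text() replaces i.a.text.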