
Python, How to use lxml XPath?

In Python I had:

response = s.get(url, allow_redirects=False, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
reg_cart = soup.find('form', attrs={"name": "regCart"})
registered_courses = [i.a.text for i in reg_cart.find_all('div', attrs={"class": "course-number"})]

Now I want to replace BeautifulSoup with lxml, after reading this:

https://timber.io/blog/an-intro-to-web-scraping-with-lxml-and-python/

I tried to implement what they used there and got:

import lxml.html
doc = lxml.html.fromstring(response.content)
registered_courses = doc.xpath('//div[@class="course-number"]/text()')

But for some reason my output is:

['\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t']

Previously it correctly showed the course numbers.

How can I fix this? Also, how can I change my XPath to return only the div tags under the regCart form rather than those in the whole response?

For example, the HTML looks something like:

        <form name="regCart" ....>
        </div><div class="entry-spacer"></div><div class="cart-entry">
                <div class="course-number">
                <a href="https://university.com/rishum/course/236756">236756</a>
            </div>
            <div class="course-name">
                מבוא למערכות לומדות              
            </div>
            <div class="course-points">
                3.0 נק'
            </div>
            <div class="entry-group">
                קבוצה 13
            </div>

From this I want to return 236756.

The issue is in your XPath expression: //div[@class="course-number"]/text()

<div class="course-number">
  <a href="https://university.com/rishum/course/236756">236756</a>
</div>

This selects the text nodes that are direct children of the matching div. The only such nodes here are the whitespace (newlines and tabs) around the a tag, which is exactly what your output shows. The course number itself is inside the a tag, so the XPath should be: //div[@class="course-number"]/a/text()
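
To restrict the search to the regCart form, one option is to prefix the expression with a predicate on the form's name attribute. The following is a minimal sketch, not a drop-in replacement: SAMPLE_HTML is a hypothetical, well-formed stand-in for the page in the question (the second course-number div outside the form is invented to show the scoping), and registered_courses is an illustrative helper name. In the real code you would pass response.content instead.

```python
import lxml.html

# Hypothetical sample modeled on the question's snippet; the real page may differ.
SAMPLE_HTML = """
<html><body>
<form name="regCart">
  <div class="cart-entry">
    <div class="course-number">
      <a href="https://university.com/rishum/course/236756">236756</a>
    </div>
    <div class="course-points">3.0</div>
  </div>
</form>
<div class="course-number">
  <a href="https://university.com/rishum/course/000000">000000</a>
</div>
</body></html>
"""


def registered_courses(html):
    """Return course numbers found inside the form named regCart."""
    doc = lxml.html.fromstring(html)
    # Scope to the form first, then descend to the link text inside each
    # course-number div. The trailing /a/text() reads the link's text
    # rather than the whitespace text nodes of the div itself.
    texts = doc.xpath(
        '//form[@name="regCart"]//div[@class="course-number"]/a/text()'
    )
    # Strip surrounding whitespace in case the link text is padded.
    return [t.strip() for t in texts]


print(registered_courses(SAMPLE_HTML))
```

The @class="course-number" test is an exact string match, so it would miss a div carrying additional classes; that is fine for the markup shown in the question.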
