Extracting data from webpage using lxml XPath in Python

Question

I am having trouble using XPath with the lxml library to retrieve text from an HTML page.

The page url is www.mangapanda.com/one-piece/1/1

I want to extract the selected chapter name from the drop-down select tag. For now I only want the first option, so the XPath to find it is simple:

.//*[@id='chapterMenu']/option[1]/text()

I verified the above using FirePath and it returns the correct data, but when I try the same thing with lxml I get no data at all.

from lxml import html
import requests

r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)

name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")

But nothing is stored in name. I also tried other XPaths such as:

//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()

The above were also verified using FirePath. I cannot figure out what the problem is and would appreciate some help.

Not every XPath fails, though. One that does work with lxml is:

.//img[@id='img']/@src

Thank you.

Answer 1

I've had a look at the HTML source of that page, and the element with the id chapterMenu is empty. I think your problem is that it is filled in by JavaScript, and JavaScript is not evaluated when you simply parse the HTML with lxml.html.
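To see the effect without hitting the site, here is a minimal sketch that parses a small hand-written stand-in for the static source. The empty select mirrors what the page source contains; the img src value is a made-up placeholder, not the real image URL.

```python
from lxml import html

# Hand-written stand-in for the static page source (an assumption, not the
# real page). The <select> is empty because its options would normally be
# added later by JavaScript, which lxml never runs.
static_html = """
<html><body>
<select id="chapterMenu" name="chapterMenu"></select>
<img id="img" src="http://example.com/page1.jpg" />
</body></html>
"""

page = html.fromstring(static_html)

# No <option> children exist in the static markup, so this is an empty list.
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")

# The src attribute is present in the static markup, so this one matches.
src = page.xpath(".//img[@id='img']/@src")

print(name)
print(src)
```

This would explain why the img XPath works while the option XPath returns nothing: only markup present in the downloaded HTML can be matched.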

You might want to have a look at this: Evaluate javascript on a local html file (without browser)

You may be able to work around it, though. In the end, the JavaScript also has to fetch the information with a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919

The response is JSON and can easily be turned into a Python dict or list using the json library. To automate this, though, you have to find out how to get the id and which parameters.

The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter has to be 0; when it is 0 you will be redirected to the proper URL.
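A rough sketch of how this could be automated follows. The helper names (extract_manga_id, build_selector_url, fetch_chapters) are hypothetical, the exact form of the document['mangaid'] assignment in the script tag is an assumption, and the structure of the returned JSON is not known here, so the result is returned as-is. The standard library's urllib is used only to keep the sketch dependency-free; requests.get would do the same job.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

SELECTOR_URL = "http://www.mangapanda.com/actions/selector/"


def extract_manga_id(page_source):
    """Find the manga id in the page source.

    Assumes the script contains something like: document['mangaid'] = 103;
    Returns None when no such assignment is found.
    """
    match = re.search(r"document\['mangaid'\]\s*=\s*(\d+)", page_source)
    return int(match.group(1)) if match else None


def build_selector_url(manga_id, which=0):
    """Build the selector URL, using which=0 as suggested above."""
    return SELECTOR_URL + "?" + urlencode({"id": manga_id, "which": which})


def fetch_chapters(manga_id):
    """Request the selector URL and decode the JSON body (needs network)."""
    with urlopen(build_selector_url(manga_id)) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline check with a made-up script snippet.
sample_source = "<script>document['mangaid'] = 103;</script>"
manga_id = extract_manga_id(sample_source)
print(build_selector_url(manga_id))
```

Calling fetch_chapters(manga_id) would then perform the actual request; inspect the decoded object once to see which keys hold the chapter names before relying on them.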

So there you go ;)

Answer 2

The source document of the page you are requesting is in a default namespace:

<html xmlns="http://www.w3.org/1999/xhtml">

even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.

name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})

The part of the document that this path expression targets is:

<select id="chapterMenu" name="chapterMenu"></select>

As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
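As an illustration of the prefix technique, here is a small sketch on a hand-written XHTML fragment, assuming the document is parsed as XML with lxml.etree (the HTML parser behind lxml.html may treat xmlns differently). The "Chapter 1" option is invented for the example; the real page has no option in its static source.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hand-written XHTML sample (an assumption); the option text is made up.
doc = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <select id="chapterMenu" name="chapterMenu">
      <option>Chapter 1</option>
    </select>
  </body>
</html>"""

root = etree.fromstring(doc)

# Unprefixed names refer to elements in no namespace, so nothing matches.
unprefixed = root.xpath("//select[@id='chapterMenu']/option[1]/text()")

# Binding a prefix to the XHTML namespace makes the element names match.
prefixed = root.xpath(
    "//xhtml:select[@id='chapterMenu']/xhtml:option[1]/text()",
    namespaces={"xhtml": XHTML_NS},
)

print(unprefixed)
print(prefixed)
```

The prefix itself is arbitrary; only the namespace URI it is bound to has to match the one declared in the document.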

2 answers

solution1
1 ACCEPTED 2015-03-12 17:23:47

solution2
0 2015-03-12 17:16:50