Python - parse an HTML form with lxml.html using XPath syntax

Here is the form. The same exact form appears twice in the source.

<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>

I am getting the "action" attribute with this Python code:

import lxml.html
tree = lxml.html.fromstring(pagesource)
# Select the action attribute of every <form> element
print(tree.xpath('//form/@action'))
input()  # keep the console window open

Since there are two forms, it prints both attributes:

['/login/?session=sess', '/login/?session=sess']

How can I get it to print just one? I only need one, since they're the same exact form.

I also have a second question: how can I get the value of the token? I am talking about this line:

 <input type="hidden" name="ses_token" value="token"/>

I tried similar code:

import lxml.html
tree = lxml.html.fromstring(pagesource)
# Select the value attribute of every <input> element
print(tree.xpath('//input/@value'))
input()  # keep the console window open

However, since there is more than one attribute named value, it prints:

['', 'token', 'Log In', '', 'token', 'Log In'] # or something close to that

How can I get just the token? And just one?

Is there a better way to do this?

Use find() instead of xpath(), since find() returns only the first match.

Here's an example based on the code you've provided:

import lxml.html


pagesource = """<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>
<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>
"""

tree = lxml.html.fromstring(pagesource)
form = tree.find('.//form')

print("Action:", form.action)
print("Token:", form.find('.//input[@name="ses_token"]').value)

Prints:

Action: /login/?session=sess
Token: token
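
If you would rather stay with xpath(), here is a minimal sketch of the same result, written for Python 3. It assumes the page contains the two identical forms from the question; the names FORM, actions, tokens, first_action and first_token are only illustrative. Since xpath() returns a list here, you can either index the result or narrow the expression to the first form with (//form)[1].

```python
import lxml.html

# Same markup as in the question: the login form appears twice.
FORM = """<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>
"""
pagesource = FORM * 2

tree = lxml.html.fromstring(pagesource)

# xpath() returns a list of all matches, so index it to get a single result.
actions = tree.xpath('//form/@action')
first_action = actions[0] if actions else None

# Alternatively, narrow the expression itself:
# (//form)[1] selects only the first form in document order,
# and the predicate picks the input by its name attribute.
tokens = tree.xpath('(//form)[1]//input[@name="ses_token"]/@value')
first_token = tokens[0] if tokens else None

print("Action:", first_action)
print("Token:", first_token)
```

The guards on empty lists are there because indexing [0] raises IndexError when nothing matches, for example if the page layout changes.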

Hope that helps.
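
Another option, sketched below for Python 3, is to lean on the form helpers that lxml.html provides instead of writing paths by hand: the forms property lists the form elements, and a form's fields attribute can be read like a dict keyed by input name. The sample markup and the variable names (FORM, form, token) are assumptions for illustration.

```python
import lxml.html

# Same markup as in the question: the login form appears twice.
FORM = """<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>
"""

tree = lxml.html.fromstring(FORM * 2)

# tree.forms holds every <form> element in document order; take the first.
form = tree.forms[0]

# form.fields maps each input's name attribute to its current value.
token = form.fields['ses_token']

print("Method:", form.method)
print("Action:", form.action)
print("Token:", token)
```

This keeps the lookup tied to the field name, so it still works if the inputs are reordered inside the form.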
