
Cannot get text with XPath (lxml / Python)

I thought I had the right XPath to get all the professors' emails from the website, but the output list is [], meaning nothing matches the path. I really don't know why :(

import requests
from lxml import etree
headers = {"User-Agent":""}
res = requests.get("https://www.csie.ntu.edu.tw/members/teacher.php?mclass1=110",headers=headers)
content = res.content.decode()
html = etree.HTML(content)

email = html.xpath('//li[@class="mail"]/a/text')
for e in email:
    print(e)

I'd really appreciate your help. Thanks so much to the community.

I spent some time investigating the problem and finally found it. If you look at the downloaded HTML, you will notice there are no emails in it at all! Instead there is a bunch of JS scripts like this one:

var l=new Array();
    l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|119';l[5]='|116';l[6]='|46';l[7]='|117';l[8]='|100';l[9]='|101';l[10]='|46';l[11]='|117';l[12]='|116';l[13]='|110';l[14]='|46';l[15]='|101';l[16]='|105';l[17]='|115';l[18]='|99';l[19]='|64';l[20]='|104';l[21]='|115';l[22]='|103';l[23]='|110';l[24]='|117';l[25]='|104';l[26]='>';l[27]='"';l[28]='|119';l[29]='|116';l[30]='|46';l[31]='|117';l[32]='|100';l[33]='|101';l[34]='|46';l[35]='|117';l[36]='|116';l[37]='|110';l[38]='|46';l[39]='|101';l[40]='|105';l[41]='|115';l[42]='|99';l[43]='|64';l[44]='|104';l[45]='|115';l[46]='|103';l[47]='|110';l[48]='|117';l[49]='|104';l[50]=':';l[51]='o';l[52]='t';l[53]='l';l[54]='i';l[55]='a';l[56]='m';l[57]='"';l[58]='=';l[59]='f';l[60]='e';l[61]='r';l[62]='h';l[63]=' ';l[64]='a';l[65]='<';
    for (var i = l.length-1; i >= 0; i=i-1){
    if (l[i].substring(0, 1) == '|') document.write("&#"+decodeURIComponent(l[i].substring(1))+";");
    else document.write(decodeURIComponent(l[i]));}

If you run it in the browser's console, you can see that it writes the email into the document. The loop walks the array from the end: if an element starts with '|', it writes the character whose code follows the '|' (as an HTML entity such as &#104;); otherwise it writes the element itself. My guess is that this is done to keep the email from being indexed by Google and to make it hard to scrape.
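To make the decoding rule concrete, here is a minimal Python sketch of the same loop. The function name `decode_tokens` and the short `sample` array are made up for illustration (the array is not taken from the site), and `urllib.parse.unquote` is assumed to be a close enough stand-in for JavaScript's `decodeURIComponent` for these inputs.

```python
from urllib.parse import unquote

def decode_tokens(tokens):
    """Replay the page's loop: walk the array backwards and rebuild the string."""
    out = []
    for tok in reversed(tokens):
        if tok.startswith("|"):
            # '|97' stands for the character with code 97 ('a')
            out.append(chr(int(unquote(tok[1:]))))
        else:
            # Plain tokens are written out as they are
            out.append(unquote(tok))
    return "".join(out)

# A shortened, made-up array in the same format as the one on the page
sample = [">", "a", "/", "<", "|109", "|111", "|99", "|46",
          "|98", "|64", "|97", ">", "a", "<"]
print(decode_tokens(sample))  # <a>a@b.com</a>
```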

So what you need to do is evaluate the scripts in the HTML and then search for the emails. This topic looks like exactly what you need.
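If running a JavaScript engine is more than you want, one alternative (not the approach suggested above, just a sketch) is to re-implement the loop in Python: pull the `l[n]='...'` assignments out of each script with a regular expression, rebuild the markup that `document.write` would have produced, and read the `mailto:` address from it. The names `decode_script`, `extract_emails` and `sample_page` are hypothetical, the sample page is made up, and the regex assumes every script uses exactly the format shown above (single-quoted values with no embedded single quotes).

```python
import html
import re
from urllib.parse import unquote

# Matches assignments such as l[4]='|119'; and captures index and value
ASSIGN_RE = re.compile(r"l\[(\d+)\]='([^']*)';")
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.S | re.I)

def decode_script(js_source):
    """Rebuild the markup that one obfuscation script would document.write()."""
    tokens = {int(i): v for i, v in ASSIGN_RE.findall(js_source)}
    parts = []
    for i in sorted(tokens, reverse=True):  # the page's loop runs from the end
        tok = tokens[i]
        if tok.startswith("|"):
            parts.append("&#" + unquote(tok[1:]) + ";")  # numeric HTML entity
        else:
            parts.append(unquote(tok))
    # Resolve the entities the way the browser would when parsing the output
    return html.unescape("".join(parts))

def extract_emails(page_source):
    """Decode every inline script and collect the mailto: addresses."""
    emails = []
    for script in SCRIPT_RE.findall(page_source):
        emails += re.findall(r'href="mailto:([^"]+)"', decode_script(script))
    return emails

# Made-up page fragment in the same format as the real one
sample_page = """
<li class="mail"><script type="text/javascript">
var l=new Array();
l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|119';l[5]='|116';l[6]='|46';
l[7]='|121';l[8]='|64';l[9]='|120';l[10]='>';l[11]='"';l[12]='|119';
l[13]='|116';l[14]='|46';l[15]='|121';l[16]='|64';l[17]='|120';l[18]=':';
l[19]='o';l[20]='t';l[21]='l';l[22]='i';l[23]='a';l[24]='m';l[25]='"';
l[26]='=';l[27]='f';l[28]='e';l[29]='r';l[30]='h';l[31]=' ';l[32]='a';l[33]='<';
</script></li>
"""
print(extract_emails(sample_page))  # ['x@y.tw']
```

For the real page you would pass the downloaded HTML (`res.text` in the question's code) to `extract_emails`. Separately, the XPath in the question ends in `/text` rather than `/text()`, so it selects child elements named `text` and would return an empty list even on static markup.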
