Cannot get text in XPath (lxml/Python)

I thought I had the right XPath to get all the professors' emails on the website, but the output list is [], meaning the path matches nothing. I really don't know why :(

import requests
from lxml import etree
headers = {"User-Agent":""}
res = requests.get("https://www.csie.ntu.edu.tw/members/teacher.php?mclass1=110",headers=headers)
content = res.content.decode()
html = etree.HTML(content)

email = html.xpath('//li[@class="mail"]/a/text')
for e in email:
    print(e)

I'd really appreciate your help. Thanks so much to the community.

I spent some time investigating the problem and finally found it. If you look at the downloaded HTML, you'll notice there are no emails in it at all! Instead there is a bunch of JavaScript snippets like this one:

var l=new Array();
    l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|119';l[5]='|116';l[6]='|46';l[7]='|117';l[8]='|100';l[9]='|101';l[10]='|46';l[11]='|117';l[12]='|116';l[13]='|110';l[14]='|46';l[15]='|101';l[16]='|105';l[17]='|115';l[18]='|99';l[19]='|64';l[20]='|104';l[21]='|115';l[22]='|103';l[23]='|110';l[24]='|117';l[25]='|104';l[26]='>';l[27]='"';l[28]='|119';l[29]='|116';l[30]='|46';l[31]='|117';l[32]='|100';l[33]='|101';l[34]='|46';l[35]='|117';l[36]='|116';l[37]='|110';l[38]='|46';l[39]='|101';l[40]='|105';l[41]='|115';l[42]='|99';l[43]='|64';l[44]='|104';l[45]='|115';l[46]='|103';l[47]='|110';l[48]='|117';l[49]='|104';l[50]=':';l[51]='o';l[52]='t';l[53]='l';l[54]='i';l[55]='a';l[56]='m';l[57]='"';l[58]='=';l[59]='f';l[60]='e';l[61]='r';l[62]='h';l[63]=' ';l[64]='a';l[65]='<';
    for (var i = l.length-1; i >= 0; i=i-1){
    if (l[i].substring(0, 1) == '|') document.write("&#"+decodeURIComponent(l[i].substring(1))+";");
    else document.write(decodeURIComponent(l[i]));}

If you run it in the browser's console, you can see that it adds the email to the document. The script walks the array from the end: if an element starts with '|', it writes the character with that ASCII code (as an HTML entity such as &#119;); otherwise it writes the element itself. My guess is that this is done to keep the emails out of search engines and to make them hard to scrape.

So what you need to do is evaluate the scripts in the HTML and then search for emails. This topic looks like exactly what you need.
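If running a JavaScript engine feels like overkill, another option is to re-implement the decoding in Python. This is only a rough sketch, not something checked against the live page: it assumes every obfuscated email uses the same `l[i]='...'` array pattern shown above and that the decoded markup contains a `mailto:` link. The names `decode_script` and `extract_emails` are made up for illustration.

```python
import re
from urllib.parse import unquote

# Assumed pattern: assignments such as l[4]='|119' inside an inline <script>.
ENTRY_RE = re.compile(r"l\[(\d+)\]\s*=\s*'([^']*)'")
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
MAILTO_RE = re.compile(r"mailto:([^\"'>\s]+)")


def decode_script(script_text):
    """Rebuild the markup that the script would pass to document.write()."""
    entries = sorted((int(i), v) for i, v in ENTRY_RE.findall(script_text))
    parts = []
    # The JavaScript loop runs from the last array element down to the first.
    for _, value in reversed(entries):
        if value.startswith("|"):
            # '|119' stands for the character with code 119.
            parts.append(chr(int(value[1:])))
        else:
            # Plain entries are written as-is (after URI decoding).
            parts.append(unquote(value))
    return "".join(parts)


def extract_emails(page_html):
    """Decode every inline script and collect the mailto: addresses found."""
    emails = []
    for script in SCRIPT_RE.findall(page_html):
        emails.extend(MAILTO_RE.findall(decode_script(script)))
    return emails


# Tiny made-up sample in the same format as the page's scripts.
sample = "l[0]='>';l[1]='b';l[2]='/';l[3]='<';l[4]='|105';l[5]='|104';l[6]='>';l[7]='b';l[8]='<';"
print(decode_script(sample))  # <b>hi</b>
```

With the code from the question, that would be something like `extract_emails(content)`. Separately, note that XPath selects text nodes with `text()`, so a path ending in `/a/text` looks for child elements named `text` and would return nothing even on fully rendered HTML.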
