Parsing HTML using Jsoup - returned document with robots meta tag

I have been using the jsoup library to parse a specific URL, and it worked well until one day the parsing broke. The returned document contained only a few tags and looked nothing like the document I used to get. It also included a meta tag named "ROBOTS".

An example of the <head> section in the response:

<head>
  <meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
  <meta name="format-detection" content="telephone=no" />
  <meta name="viewport" content="initial-scale=1.0" />
</head>

My question is: how can I get past this block? I tried several other libraries that also handle JavaScript, but they did not help and produced the same result; maybe I did not use them correctly.

(I have learned that the robots meta tag was designed to keep bots away, originally search engines. How can I bypass this behavior? How can I act like a regular browser client?)

You didn't explicitly state this in your question, but I'm assuming Jsoup is being sent different HTML than what your browser sees. In that case, you probably need to set the User-Agent header so that Jsoup looks like your browser.
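
A minimal sketch of that suggestion, assuming the site only looks at the User-Agent header (it may also check cookies or require JavaScript, in which case this alone will not be enough). The class name `BrowserLikeFetch`, both helper methods, the user agent string, and the example.com URL are placeholders for illustration, not part of the original question. The `hasNoindexRobotsMeta` check is only a rough heuristic for spotting the stripped-down page shown in the question, since ordinary pages can carry the same tag.

```java
import java.io.IOException;
import java.util.Locale;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BrowserLikeFetch {
    // Example value only (assumption): copy the User-Agent string your own browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds, but does not execute, a request that carries a browser-style User-Agent.
    static Connection browserLikeConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10_000); // timeout in milliseconds
    }

    // Rough heuristic: true if the document has a robots meta tag whose content includes "noindex".
    static boolean hasNoindexRobotsMeta(Document doc) {
        for (Element meta : doc.select("meta[name]")) {
            boolean isRobots = meta.attr("name").equalsIgnoreCase("robots");
            boolean isNoindex = meta.attr("content").toLowerCase(Locale.ROOT).contains("noindex");
            if (isRobots && isNoindex) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL (assumption): pass the real page as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        Document doc = browserLikeConnection(url).get();
        if (hasNoindexRobotsMeta(doc)) {
            System.out.println("Still looks like the blocked page; the User-Agent alone may not be enough.");
        } else {
            System.out.println("Fetched: " + doc.title());
        }
    }
}
```

If the robots meta tag still appears after this change, comparing the remaining request headers and cookies that the browser sends (visible in its developer tools) with what Jsoup sends is a reasonable next step.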

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this post, please credit this site's URL or the original source.

 