
Python Scrapy - Issues with scraping data that is commented out

After hours of troubleshooting, I finally determined why I couldn't scrape this data: the most vital part of it is commented out in the HTML, and JavaScript must be loading it. Printing the response does show it, but Scrapy will not pull that data out.

Scrapy issue

XPath has comment() to select comment nodes.

But it returns the comment as plain text, so you have to remove the <!-- and --> delimiters and then parse the result before you can search inside that HTML. In Scrapy you can use the Selector() class to parse it.


Minimal working code:

from scrapy.selector import Selector

sel = Selector(text='''
<div>
<!--
<div class="outer">
<div class="inner">Hello World</div>
</div>
-->
</div>''')

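# comment() selects comment nodes; .get() returns the first one as a string
comment = sel.xpath('//comment()').get()
print(comment)

# strip the leading '<!--' (4 characters) and the trailing '-->' (3 characters)
#html = comment.replace('<!--', '').replace('-->', '')
html = comment[4:-3]
print(html)

# parse the uncommented markup as a new document
sel = Selector(text=html)

divs = sel.xpath('//div').getall()
print(divs)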

Result:

<!--
<div class="outer">
<div class="inner">Hello World</div>
</div>
-->

<div class="outer">
<div class="inner">Hello World</div>
</div>

['<div class="outer">\n<div class="inner">Hello World</div>\n</div>', '<div class="inner">Hello World</div>']
