
Python Scrapy - Issues with scraping data that is commented out

After hours of troubleshooting, I finally determined that the reason I couldn't scrape this data is that the most vital part of it is commented out in the HTML, and JavaScript must be loading it. Printing the response does show the data, but my Scrapy selectors will not extract it.


XPath has the comment() node test for selecting comments.

However, it returns the comment as plain text, so you have to remove the <!-- and --> delimiters and then parse the remaining text as HTML before you can search inside it. In Scrapy you can use the Selector class to parse it.


Minimal working code

from scrapy.selector import Selector

sel = Selector(text='''
<div>
<!--
<div class="outer">
<div class="inner">Hello World</div>
</div>
-->
</div>''')

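# comment() matches comment nodes; get() returns the first one as text,
# including the '<!--' and '-->' delimiters
comment = sel.xpath('//comment()').get()
print(comment)

# strip the leading '<!--' (4 characters) and the trailing '-->' (3 characters)
# alternative: comment.replace('<!--', '').replace('-->', '')
html = comment[4:-3]
print(html)

# parse the uncommented HTML with a new Selector
sel = Selector(text=html)

divs = sel.xpath('//div').getall()
print(divs)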

Result:

<!--
<div class="outer">
<div class="inner">Hello World</div>
</div>
-->

<div class="outer">
<div class="inner">Hello World</div>
</div>

['<div class="outer">\n<div class="inner">Hello World</div>\n</div>', '<div class="inner">Hello World</div>']
