I am working on scrapy , i am scraping a site and using xpath
to scrape items. But some of the div
contains javascript
, so when i used xpath until the div id
that contains javascript code is returning an empty list,and without including that div element(which contains javascript) can able to fetch HTML data
HTML code
<div class="subContent2">
<div id="contentDetails">
<div class="eventDetails">
<h2>
<a href="javascript:;" onclick="jdevents.getEvent(117032)">Some data</a>
</h2>
</div>
</div>
</div>
Spider Code
class ExampleSpider(BaseSpider):
name = "example"
domain_name = "www.example.com"
start_urls = ["http://www.example.com/jkl/index.php"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
required_data = hxs.select('//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]')
So how can i get text(Some data)
from the anchor tag
inside the h2 element
as mentioned above, is there any alternate way for fetching data from the elements that contains javascript in scrapy
<div class="subContent2">
<div id="contentDetails">
<div class="eventDetails">
<h2>
<a href="javascript:;" onclick="jdevents.getEvent(117032)">Some data</a>
</h2>
</div>
</div>
</div>
The problem is not the javascript code in this case to get 'Some data' string.
You need either to get the subnode:
required_data = hxs.select('//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]/h2/a/text()')
or use string
function:
required_data = hxs.select('string(//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"])')
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.