I am trying to scrape an XML file with the following format.
file_sample.xml:
<rss version="2.0">
<channel>
<item>
<title>SENIOR BUDGET ANALYST (new)</title>
<link>https://hr.example.org/psp/hrapp&SeqId=1</link>
<pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
<category>All Open Jobs</category>
</item>
<item>
<title>BUDGET ANALYST (healthcare)</title>
<link>https://hr.example.org/psp/hrapp&SeqId=2</link>
<pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
<category>All category</category>
</item>
</channel>
</rss>
Below is my spider.py code:
from scrapy.contrib.spiders import XMLFeedSpider
from testproject.items import TestprojectItem

class TestSpider(XMLFeedSpider):
    name = "testproject"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "https://www.example.com/hrapp/rss/careers_jo_rss.xml"
    ]
    iterator = 'iternodes'
    itertag = 'channel'

    def parse_node(self, response, node):
        title = node.select('item/title/text()').extract()
        link = node.select('item/link/text()').extract()
        pubdate = node.select('item/pubDate/text()').extract()
        category = node.select('item/category/text()').extract()
        item = TestprojectItem()
        item['title'] = title
        item['link'] = link
        item['pubdate'] = pubdate
        item['category'] = category
        return item
Result:
2012-07-25 13:24:14+0530 [testproject] DEBUG: Scraped from <200 https://www.example.com/hrapp/rss/careers_jo_rss.xml>
    {'title': [u'SENIOR BUDGET ANALYST (new)',
               u'BUDGET ANALYST (healthcare)'],
     'link': [u'https://hr.example.org/psp/hrapp&SeqId=1',
              u'https://hr.example.org/psp/hrapp&SeqId=2'],
     'pubdate': [u'Wed, 18 Jul 2012 04:00:00 GMT',
                 u'Wed, 18 Jul 2012 04:00:00 GMT'],
     'category': [u'All Open Jobs',
                  u'All category']}
As you can observe from the result above, the values from each tag are combined into a single list. Instead, I want the values mapped per individual item tag, like below, the way we do it for HTML scraping:
{'title': u'SENIOR BUDGET ANALYST (new)',
 'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
 'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
 'category': u'All Open Jobs'}
{'title': u'BUDGET ANALYST (healthcare)',
 'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
 'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
 'category': u'All category'}
How can I scrape XML tag data grouped under each separate parent tag, like the item tag above?
Thanks in advance.
Just change itertag = 'item'.
If you refer to the documentation of the parse_node method, it states that the method is called for the nodes matching the provided tag name (itertag). In your case that should be 'item' (a child node of the 'channel' root node), so that parse_node is called once per job posting. Note that the selectors then become relative to the item node, e.g. 'title/text()' instead of 'item/title/text()'.
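To see why iterating per item node produces one record per job, here is a standard-library sketch of the same iteration using xml.etree.ElementTree instead of Scrapy's selectors (the field names mirror the spider above; in the embedded sample the '&' in the links is escaped as '&amp;', since a strict XML parser rejects a bare ampersand):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<rss version="2.0">
<channel>
<item>
<title>SENIOR BUDGET ANALYST (new)</title>
<link>https://hr.example.org/psp/hrapp&amp;SeqId=1</link>
<pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
<category>All Open Jobs</category>
</item>
<item>
<title>BUDGET ANALYST (healthcare)</title>
<link>https://hr.example.org/psp/hrapp&amp;SeqId=2</link>
<pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
<category>All category</category>
</item>
</channel>
</rss>"""

# Visiting each <item> node (what itertag = 'item' does in XMLFeedSpider)
# and extracting with paths relative to that node yields one dict per job.
items = []
for node in ET.fromstring(SAMPLE).iter('item'):
    items.append({
        'title': node.findtext('title'),
        'link': node.findtext('link'),
        'pubdate': node.findtext('pubDate'),
        'category': node.findtext('category'),
    })

for item in items:
    print(item)
```

This prints two separate dicts, one per item, instead of one dict of combined lists.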
I recommend using feedparser:
feedparser.parse(url)
results in
{'bozo': 1,
'bozo_exception': xml.sax._exceptions.SAXParseException("EntityRef: expecting ';'\n"),
'encoding': u'utf-8',
'entries': [{'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=1',
'rel': u'alternate',
'type': u'text/html'}],
'tags': [{'label': None, 'scheme': None, 'term': u'All Open Jobs'}],
'title': u'SENIOR BUDGET ANALYST (new)',
'title_detail': {'base': u'',
'language': None,
'type': u'text/plain',
'value': u'SENIOR BUDGET ANALYST (new)'},
'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)},
{'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=2',
'rel': u'alternate',
'type': u'text/html'}],
'tags': [{'label': None, 'scheme': None, 'term': u'All category'}],
'title': u'BUDGET ANALYST (healthcare)',
'title_detail': {'base': u'',
'language': None,
'type': u'text/plain',
'value': u'BUDGET ANALYST (healthcare)'},
'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)}],
'feed': {},
'namespaces': {},
'version': u'rss20'}
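Note the 'bozo' flag in the output above: it is set because the raw '&' in the link URLs is not well-formed XML (it should be '&amp;'). feedparser recovers and still returns the entries, but if you need to hand the feed to a stricter XML parser, bare ampersands can be escaped first. The helper below is an illustration with a hand-rolled regex, not part of the feedparser API:

```python
import re

def escape_bare_ampersands(xml_text):
    """Escape '&' characters that do not start an entity reference.

    The negative lookahead skips named entities like '&amp;' and numeric
    character references like '&#38;' or '&#x26;', so already-valid text
    is left unchanged.
    """
    return re.sub(
        r'&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)',
        '&amp;',
        xml_text,
    )

print(escape_bare_ampersands('<link>https://hr.example.org/psp/hrapp&SeqId=1</link>'))
# <link>https://hr.example.org/psp/hrapp&amp;SeqId=1</link>
```

With the links escaped this way, 'bozo' is no longer triggered by the ampersands and standard XML tooling can parse the feed as-is.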