Like the title says, my Scrapy code seems to be running correctly, except it is pulling only the first row of the table and repeating it for the number of rows in the table.
import scrapy

class FightersSpider(scrapy.Spider):
    name = "fighters"
    start_urls = [
        'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
    ]

    def start_requests(self):
        urls = [
            'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        for fighter in response.xpath('//*[@class="b-statistics__table"]//tbody/tr'):
            yield {
                'first': fighter.xpath('//td[1]/a//text()').extract_first(),
                'last': fighter.xpath('//td[2]/a//text()').extract_first(),
                'nickname': fighter.xpath('//td[3]/a//text()').extract_first(),
                'height': fighter.xpath('//td[4]//text()').extract_first().strip(),
                'weight': fighter.xpath('//td[5]//text()').extract_first().strip(),
                'reach': fighter.xpath('//td[6]//text()').extract_first().strip(),
                'stance': fighter.xpath('//td[7]//text()').extract_first().strip(),
                'wins': fighter.xpath('//td[8]//text()').extract_first().strip(),
                'losses': fighter.xpath('//td[9]//text()').extract_first().strip(),
                'draws': fighter.xpath('//td[10]//text()').extract_first().strip(),
            }
If I take out the `_first` (using `extract()` instead of `extract_first()`), it pulls all of the data but puts it in the same cell and repeats the same way:
first  last   nickname      height  weight    reach  stance  wins  losses
Tom    Aaron  The Assassin  --      155 lbs.  --             5     3
Tom    Aaron  The Assassin  --      155 lbs.  --             5     3
Tom    Aaron  The Assassin  --      155 lbs.  --             5     3
...    (the same row repeated for every row in the table)
You have to use a relative XPath to search only inside `fighter`. It has to start with a dot:

    fighter.xpath('.//td[1]/a//text()')

Without the dot it is an absolute XPath, which searches the whole HTML document and therefore always finds the first row.
But then you will have another problem: you get all rows in the table, even the header row, which doesn't have any `td` cells, and you have to skip it. You can slice it off with `[1:]`:

    for fighter in response.xpath(...)[1:]:
Minimal working code. You can copy it all into a file and run it as a normal script (`python script.py`) without creating a Scrapy project.
import scrapy

class FightersSpider(scrapy.Spider):
    name = "fighters"
    start_urls = [
        'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
    ]

    # start_requests overrides start_urls, so keeping one of the two is enough
    def start_requests(self):
        urls = [
            'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        # [1:] skips the header row, which has no <td> cells
        for fighter in response.xpath('//*[@class="b-statistics__table"]//tbody/tr')[1:]:
            yield {
                'first': fighter.xpath('.//td[1]/a//text()').extract_first(),
                'last': fighter.xpath('.//td[2]/a//text()').extract_first(),
                'nickname': fighter.xpath('.//td[3]/a//text()').extract_first(),
                # default '' so .strip() doesn't fail on an empty cell
                'height': fighter.xpath('.//td[4]//text()').extract_first('').strip(),
                'weight': fighter.xpath('.//td[5]//text()').extract_first('').strip(),
                'reach': fighter.xpath('.//td[6]//text()').extract_first('').strip(),
                'stance': fighter.xpath('.//td[7]//text()').extract_first('').strip(),
                'wins': fighter.xpath('.//td[8]//text()').extract_first('').strip(),
                'losses': fighter.xpath('.//td[9]//text()').extract_first('').strip(),
                'draws': fighter.xpath('.//td[10]//text()').extract_first('').strip(),
            }

# --- run without a project and save to `output.csv` ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save to a CSV, JSON or XML file
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in Scrapy 2.1
})
c.crawl(FightersSpider)
c.start()