
Python Scrapy pulls only the first row but repeats it for the correct number of rows in the table

As the title says, my Scrapy code seems to run correctly, except that it pulls only the first row of the table and repeats it once for every row in the table.

import scrapy


class FightersSpider(scrapy.Spider):
    name = "fighters"

    start_urls = [
        'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
    ]

    def start_requests(self):
        urls = [
            'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        for fighter in response.xpath('//*[@class="b-statistics__table"]//tbody/tr'):
            yield {
                'first': fighter.xpath('//td[1]/a//text()').extract_first(),
                'last': fighter.xpath('//td[2]/a//text()').extract_first(),
                'nickname': fighter.xpath('//td[3]/a//text()').extract_first(),
                'height': fighter.xpath('//td[4]//text()').extract_first().strip(),
                'weight': fighter.xpath('//td[5]//text()').extract_first().strip(),
                'reach': fighter.xpath('//td[6]//text()').extract_first().strip(),
                'stance': fighter.xpath('//td[7]//text()').extract_first().strip(),
                'wins': fighter.xpath('//td[8]//text()').extract_first().strip(),
                'losses': fighter.xpath('//td[9]//text()').extract_first().strip(),
                'draws': fighter.xpath('//td[10]//text()').extract_first().strip(),
            }

If I take out the _first, it pulls all of the data but puts it all in the same cell, and it repeats in the same way.

first last nickname height weight reach stance wins losses
Tom Aaron The Assassin -- 155 lbs. -- 5 3
Tom Aaron The Assassin -- 155 lbs. -- 5 3
Tom Aaron The Assassin -- 155 lbs. -- 5 3
Tom Aaron The Assassin -- 155 lbs. -- 5 3
Tom Aaron The Assassin -- 155 lbs. -- 5 3
Tom Aaron The Assassin -- 155 lbs. -- 5 3
Tom Aaron The Assassin -- 155 lbs. -- 5 3
Tom Aaron The Assassin -- 155 lbs. -- 5 3
....

You have to use a relative XPath to search only inside fighter. It has to start with a dot:

fighter.xpath('.//td[1]/a//text()')

Without the dot, it is an absolute XPath: it searches the whole HTML document, so it always finds the first row.


But then you will run into another problem.

You get all rows in the table, including the header row, which has no td, and you have to skip it. You can slice it off with [1:]:

for fighter in response.xpath(...)[1:]:

Minimal working code.

You can copy everything into one file and run it as a normal script, python script.py, without creating a Scrapy project.

import scrapy


class FightersSpider(scrapy.Spider):
    name = "fighters"

    start_urls = [
        'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
    ]

    def start_requests(self):
        urls = [
            'http://www.ufcstats.com/statistics/fighters?char=a&page=all'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response, **kwargs):
        # [1:] skips the first row, which has no <td> cells
        for fighter in response.xpath('//*[@class="b-statistics__table"]//tbody/tr')[1:]:
            # the leading dot makes every XPath relative to the current row
            yield {
                'first': fighter.xpath('.//td[1]/a//text()').extract_first(),
                'last': fighter.xpath('.//td[2]/a//text()').extract_first(),
                'nickname': fighter.xpath('.//td[3]/a//text()').extract_first(),
                'height': fighter.xpath('.//td[4]//text()').extract_first().strip(),
                'weight': fighter.xpath('.//td[5]//text()').extract_first().strip(),
                'reach': fighter.xpath('.//td[6]//text()').extract_first().strip(),
                'stance': fighter.xpath('.//td[7]//text()').extract_first().strip(),
                'wins': fighter.xpath('.//td[8]//text()').extract_first().strip(),
                'losses': fighter.xpath('.//td[9]//text()').extract_first().strip(),
                'draws': fighter.xpath('.//td[10]//text()').extract_first().strip(),
            }
            
# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(FightersSpider)
c.start() 
