
Scrapy XPath returns an empty list while the response is not empty

I'm using Scrapy to scrape information from two tables on a website.

First I select the rows of each table. It turns out that staffs and students are both empty even though the response is not. I can also find the table tags in the page source. Can anyone see what the problem is?

import scrapy
from universities.items import UniversitiesItem


class UniversityOfSouthCarolinaColumbia(scrapy.Spider):
    name = 'uscc'
    allowed_domains = ['sc.edu']
    start_urls = ['http://www.sc.edu/about/directory/?name=']

    def parse(self, response):    
        for ln in ['Zhao']:
            query = response.url + ln
            yield scrapy.Request(query, callback=self.parse_item)

    @staticmethod
    def parse_item(response):
        staffs = response.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')
        students = response.xpath('//table[@id="directorystudent"]/tbody/tr[@role="row"]')

        print('--------------------------')
        print('staffs', staffs)
        print('==========================')
        print('students', students)

This is a really interesting question. I investigated it and concluded that the response does not contain the attribute your XPath filters on. I think the browser modifies the page source after loading: a script adds attributes to the tags.

In the response, the tr tags do not have a role attribute.

See for yourself:

             <table class="display" id="directorystaff" width="100%">
                <thead>
                <tr>
                  <th style="text-align: left">Name</th>
                  <th style="text-align: left">Email</th>
                  <th style="text-align: left">Phone</th>
                  <th style="text-align: left">Department</th>
                  <th style="text-align: left">Office Address</th>
                </tr>
              </thead>  
              <tbody>

                    <tr>
                        <td style="text-align: left">Zhao, Xia  &nbsp;</td>
                        <td style="text-align: left">&nbsp;</td>
                        <td style="text-align: left">(803) 777-8436&nbsp;</td>
                        <td style="text-align: left">Chemistry&nbsp;</td>
                        <td style="text-align: left">537&nbsp;</td>
                    </tr>

                    <tr>
                        <td style="text-align: left">Zhao, Xing  &nbsp;</td>
                        <td style="text-align: left">&nbsp;</td>
                        <td style="text-align: left">&nbsp;</td>
                        <td style="text-align: left">Mechanical Engineering&nbsp;</td>
                        <td style="text-align: left">&nbsp;</td>
                    </tr>
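
To make the difference concrete, here is a minimal sketch that runs both XPath variants against a trimmed copy of the raw markup above. The RAW_HTML string is a shortened stand-in written for this example, not the full page, and it uses lxml directly rather than a Scrapy response.

```python
from lxml import html

# Trimmed stand-in for the raw markup shown above (no JavaScript has run on it).
RAW_HTML = """
<html><body>
<table class="display" id="directorystaff" width="100%">
  <tbody>
    <tr><td>Zhao, Xia &nbsp;</td><td>Chemistry&nbsp;</td></tr>
    <tr><td>Zhao, Xing &nbsp;</td><td>Mechanical Engineering&nbsp;</td></tr>
  </tbody>
</table>
</body></html>
"""

tree = html.fromstring(RAW_HTML)

# The predicate from the question: no tr carries role="row" in the raw HTML.
with_role = tree.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')

# The same path without the predicate matches every body row.
without_role = tree.xpath('//table[@id="directorystaff"]/tbody/tr')

print(len(with_role))     # 0
print(len(without_role))  # 2
```

The same two expressions should behave the same way when passed to response.xpath in the spider, since Scrapy also sees only the raw HTML.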

In this picture we see the page as returned in the response to the query request.

And in this picture we see the page as rendered in the browser.

So, if you want the list of staff, I recommend the following XPath:

//table[@id="directorystaff"]/tbody/tr/td


And for students, I recommend the following XPath:

//table[@id="directorystudent"]/tbody/tr/td

If you want something else, you can modify this XPath query.

Here is an example:

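import requests
from lxml import html

# Fetch the raw HTML; no JavaScript runs, so this matches what Scrapy receives.
page = requests.get("https://www.sc.edu/about/directory/?name=Zhao")
tree = html.fromstring(page.text)

# Select every cell of the staff table without relying on the role attribute.
cells = tree.xpath('//table[@id="directorystaff"]/tbody/tr/td')
for cell in cells:
    print(cell.text)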

And the output:

>>Zhao, Xia   
>> 
>>(803) 777-8436 
>>Chemistry 
>>537 
>>Zhao, Xing   
>> 
>> 
>>Mechanical Engineering 
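
The flat list of cells above loses the row boundaries. If you want one record per person, a possible sketch is to select the tr elements first and then read the td cells relative to each row. The SAMPLE markup is a shortened copy of the table from the response, and the FIELDS names and the extract_rows helper are hypothetical names chosen for this example; they are not part of the site or of any library.

```python
from lxml import html

# Shortened copy of the staff table from the raw response.
SAMPLE = """
<html><body>
<table id="directorystaff">
  <tbody>
    <tr><td>Zhao, Xia  &nbsp;</td><td>&nbsp;</td><td>(803) 777-8436&nbsp;</td><td>Chemistry&nbsp;</td><td>537&nbsp;</td></tr>
    <tr><td>Zhao, Xing  &nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>Mechanical Engineering&nbsp;</td><td>&nbsp;</td></tr>
  </tbody>
</table>
</body></html>
"""

# Hypothetical field names, in the column order of the table header.
FIELDS = ["name", "email", "phone", "department", "office"]


def extract_rows(tree, table_id):
    """Return one dict per body row of the table with the given id."""
    rows = []
    for tr in tree.xpath('//table[@id=$tid]/tbody/tr', tid=table_id):
        # Replace non-breaking spaces (from &nbsp;) and trim each cell.
        cells = [
            td.text_content().replace("\xa0", " ").strip()
            for td in tr.xpath("./td")
        ]
        rows.append(dict(zip(FIELDS, cells)))
    return rows


staff = extract_rows(html.fromstring(SAMPLE), "directorystaff")
for row in staff:
    print(row)
```

Calling extract_rows with "directorystudent" would work the same way for the students table, assuming it has the same column layout.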
