I'm using Scrapy to scrape information from two tables on a website.
I start by selecting the tables. It turns out that staffs and students are empty even though the response is not. I can also find the table tags in the page source. Can anyone spot the problem?
import scrapy
from universities.items import UniversitiesItem


class UniversityOfSouthCarolinaColumbia(scrapy.Spider):
    name = 'uscc'
    allowed_domains = ['sc.edu']
    start_urls = ['http://www.sc.edu/about/directory/?name=']

    def parse(self, response):
        for ln in ['Zhao']:
            query = response.url + ln
            yield scrapy.Request(query, callback=self.parse_item)

    @staticmethod
    def parse_item(response):
        staffs = response.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')
        students = response.xpath('//table[@id="directorystudent"]/tbody/tr[@role="row"]')
        print('--------------------------')
        print('staffs', staffs)
        print('==========================')
        print('students', students)
That's a really cool question. I investigated it and concluded that the response does not contain the attribute your XPath filters on. I think the browser modifies the page source through a script that adds the attribute to the tags.
In the response, the tr tags do not have a 'role' attribute.
See for yourself:
<table class="display" id="directorystaff" width="100%">
<thead>
<tr>
<th style="text-align: left">Name</th>
<th style="text-align: left">Email</th>
<th style="text-align: left">Phone</th>
<th style="text-align: left">Department</th>
<th style="text-align: left">Office Address</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Zhao, Xia </td>
<td style="text-align: left"> </td>
<td style="text-align: left">(803) 777-8436 </td>
<td style="text-align: left">Chemistry </td>
<td style="text-align: left">537 </td>
</tr>
<tr>
<td style="text-align: left">Zhao, Xing </td>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: left">Mechanical Engineering </td>
<td style="text-align: left"> </td>
</tr>
[Screenshot: the page as returned in the response]
[Screenshot: the same page as shown in the browser]
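To check this without hitting the site, the sketch below runs both predicates over a trimmed copy of the markup from the response. It uses the standard-library ElementTree parser rather than Scrapy, so two things are assumptions: the closing tags were added by hand (the excerpt above is cut off), and the paths are written relative to the table element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the raw response markup. The closing </tbody> and </table>
# tags are added by hand (assumption) so the XML parser accepts the snippet.
RAW = """
<table class="display" id="directorystaff" width="100%">
  <tbody>
    <tr>
      <td>Zhao, Xia </td>
      <td>Chemistry </td>
    </tr>
    <tr>
      <td>Zhao, Xing </td>
      <td>Mechanical Engineering </td>
    </tr>
  </tbody>
</table>
"""

table = ET.fromstring(RAW)  # the root element is the <table> itself

# Same filter as in the question: only rows carrying role="row".
with_role = table.findall("./tbody/tr[@role='row']")
# Same path without the attribute filter.
without_role = table.findall("./tbody/tr")

print("rows with role filter:", len(with_role))
print("rows without filter:", len(without_role))
```

The filtered query finds nothing because no tr in the raw markup carries that attribute, while the unfiltered path finds both rows.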
So, if you want the list of staff, I recommend this XPath:
//table[@id="directorystaff"]/tbody/tr/td
And for students, I recommend this XPath:
//table[@id="directorystudent"]/tbody/tr/td
If you need something else, you can adapt these XPath queries.
Here is an example:
import requests
from lxml import html

x = requests.get("https://www.sc.edu/about/directory/?name=Zhao")
ht = html.fromstring(x.text)
element = ht.xpath('//table[@id="directorystaff"]/tbody/tr/td')
for el in element:
    print(el.text)
And the output:
>>Zhao, Xia
>>
>>(803) 777-8436
>>Chemistry
>>537
>>Zhao, Xing
>>
>>
>>Mechanical Engineering
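The query above returns every cell as one flat list, so the cells of one person are not kept together. One possible way to adapt it is to select the rows first and then the cells inside each row. The sketch below does that on an inline copy of the two rows shown earlier, using the standard-library ElementTree parser so it runs offline. The field names in FIELDS and the helper rows_to_records are hypothetical, chosen here to mirror the table's column headers.

```python
import xml.etree.ElementTree as ET

# Inline copy of the two staff rows from the response, closed by hand
# (assumption) so that it parses as well-formed XML.
SAMPLE = """<table id="directorystaff"><tbody>
<tr><td>Zhao, Xia </td><td> </td><td>(803) 777-8436 </td><td>Chemistry </td><td>537 </td></tr>
<tr><td>Zhao, Xing </td><td> </td><td> </td><td>Mechanical Engineering </td><td> </td></tr>
</tbody></table>"""

# Hypothetical field names, in the order of the table's header columns.
FIELDS = ["name", "email", "phone", "department", "office"]


def rows_to_records(table):
    """Return one dict per <tr>, pairing each cell with its column name."""
    records = []
    for tr in table.findall("./tbody/tr"):
        # td.text can be None for an empty cell, so fall back to "".
        cells = [(td.text or "").strip() for td in tr.findall("./td")]
        records.append(dict(zip(FIELDS, cells)))
    return records


records = rows_to_records(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

The same row-then-cell pattern should carry over to a Scrapy callback: select the tr elements without the role filter, then query the td elements relative to each row.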