I need to extract data from the script tag of multiple URLs with regex. I've managed to write code that does half of the job. I have a CSV file ('links.csv') that contains all the URLs I need to scrape. I managed to read the CSV and store all the URLs in a variable named 'start_urls'. My problem is that I need a way to read the URLs from 'start_urls' one at a time and run the next part of my code for each of them. When I run my code in the terminal I get 2 errors:
1. ERROR: Error while obtaining start requests
2. TypeError: Request url must be str or unicode, got list
How can I fix my code? I am a beginner in Scrapy, but I really need this script to work... Thank you in advance!
Here are some examples of the URLs I stored in the initial CSV ('links.csv'):
"https://www.samsung.com/uk/smartphones/galaxy-note8/"
"https://www.samsung.com/uk/smartphones/galaxy-s8/"
"https://www.samsung.com/uk/smartphones/galaxy-s9/"
Here is my code:
import scrapy
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('links.csv','r') as csvf:
            for url in csvf:
                yield scrapy.Request(url.strip())

    def parse(self, response):
        source = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()[0]

        def get_values(parameter, script):
            return re.findall('%s = "(.*)"' % parameter, script)[0]

        with open('baza.csv', 'w') as csvfile:
            fieldnames = ['Category', 'Type', 'SK']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for pvi_subtype_name,pathIndicator.depth_5,model_name in zip(source):
                writer.writerow({'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)})
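Add the following method to your spider:

    def start_requests(self):
        with open('links.csv','r') as csvf:
            for url in csvf:
                yield scrapy.Request(url.strip())

Then remove the previous with... block from your code. scrapy.Request expects a single URL string, so yielding one request per line avoids the "got list" TypeError.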
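The two steps the spider depends on, reading one URL per line and pulling a quoted value out of the script text, can be sketched with the standard library alone. This is only an illustration under a few assumptions: `read_urls` and `get_value` are hypothetical helper names, the CSV is assumed to hold one URL per line (optionally wrapped in double quotes, as in the examples above), and the regex uses a non-greedy match and `re.escape`, which differs slightly from the pattern in the question.

```python
import re


def read_urls(path):
    """Yield one URL string per non-empty line of the file at `path`."""
    with open(path, "r") as f:
        for line in f:
            # Assumption: a URL may be wrapped in double quotes in the file.
            url = line.strip().strip('"')
            if url:
                yield url


def get_value(parameter, script):
    """Return the first quoted value assigned to `parameter`, or None."""
    # re.escape keeps the dot in names like pathIndicator.depth_5 literal;
    # the non-greedy (.*?) stops at the first closing quote.
    match = re.search(r'%s\s*=\s*"(.*?)"' % re.escape(parameter), script)
    return match.group(1) if match else None
```

Inside `start_requests`, each string from `read_urls('links.csv')` could then be passed to `scrapy.Request(url)` one at a time. Returning `None` for a missing parameter, rather than indexing `[0]` on an empty list, is a design choice that lets `parse` skip pages where the script does not contain the expected variable.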