I have a dataset that contains imageid and imageurl columns. I need to download all the images from the URLs and zip them into a single file, and I also need to count the repetitions, because some image URLs appear several times. How do I do this in Python?
The approach I was thinking of was a for loop, where image_id becomes the file name.
Edit 1: Added my code. How do I combine both for loops?
import urllib.request

list1 = []
for key1 in csv_file['imageid']:
    list1 = str(key1) + ".jpg"
for key in csv_file['imageurl']:
    urllib.request.urlretrieve(key, list1)
Edit 2: CSV file
Edit 3: Error using the library
unknown url type: '430'
430
2020-03-02 22:08:26 ('430',)
2020-03-02 22:08:26 (ValueError("unknown url type: '430'"), '430')
2020-03-02 22:08:26 ('error url:', {'url': '430', '_concurrency': 1,
'_startTm': 1583167106.29, '_endTm': 1583167106.292}, None)
This is the error I am facing with this library.
Based on the CSV file you provided, here is an example. But I suggest you don't use this library directly, because it is troublesome; try other higher-level libraries instead, such as requests or simplified_scrapy.
import csv
import urllib.request
# from simplified_scrapy import req, utils

list1 = []
with open('test.csv') as f:
    f_csv = csv.reader(f)
    # Deduplicate the URLs and skip the header row
    list1 = set([row[1] for row in f_csv if row][1:])
for url in list1:
    urllib.request.urlretrieve(url, filename=url.split('/')[-1])
    # utils.saveResponseAsFile(req.get(url), url.split('/')[-1])
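The question also asks to count how often each URL repeats; the standard-library collections.Counter does that directly on the URL column. A minimal sketch, assuming a CSV with an imageid,imageurl header (the io.StringIO stand-in and its sample rows are just for illustration; use open('test.csv') for the real file):

```python
import csv
import io
from collections import Counter

# Stand-in for open('test.csv'); hypothetical sample data
csv_text = """imageid,imageurl
1,http://example.com/a.jpg
2,http://example.com/b.jpg
3,http://example.com/a.jpg
"""

with io.StringIO(csv_text) as f:
    rows = [row for row in csv.reader(f) if row][1:]  # skip header

# Count how many times each URL appears
counts = Counter(row[1] for row in rows)
```

Here counts['http://example.com/a.jpg'] is 2, and iterating counts.items() gives every URL with its repetition count.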
Here is an example of using simplified_scrapy to download the images:
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils
import csv

class ImageSpider(Spider):
    name = 'ImageSpider'

    def __init__(self):
        with open('test.csv') as f:
            f_csv = csv.reader(f)
            self.start_urls = [row[1] for row in f_csv if row][1:]
        Spider.__init__(self, self.name)  # The framework will help you eliminate duplicate data

    def afterResponse(self, response, url, error=None, extra=None):
        try:
            # Build the file name from the URL, dropping any query string
            end = url.find('?') if url.find('?') > 0 else len(url)
            name = 'data' + url[url.rindex('/', 0, end):end]
            # Save the image
            utils.saveResponseAsFile(response, name, 'image')
            return None
        except Exception as err:
            print(err)

SimplifiedMain.startThread(ImageSpider())  # Start the spider
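The question also asks for a zip file containing all the downloaded images. A minimal sketch using the standard-library zipfile module, assuming the images ended up in a local data directory (the name used in the spider's afterResponse; the dummy file written below is only there to make the sketch self-contained):

```python
import os
import zipfile

def zip_images(directory, archive_name):
    """Add every file in `directory` to a new zip archive."""
    with zipfile.ZipFile(archive_name, 'w', zipfile.ZIP_DEFLATED) as zf:
        for name in os.listdir(directory):
            zf.write(os.path.join(directory, name), arcname=name)

# Illustration only: create a dummy 'data' directory with one fake image
os.makedirs('data', exist_ok=True)
with open(os.path.join('data', '1.jpg'), 'wb') as f:
    f.write(b'fake image bytes')

zip_images('data', 'images.zip')
```

Run this after the spider finishes so the archive picks up every downloaded file.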
Here are more examples of the SimplifiedDoc library: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/