Download images from a website by URL and sort them by description

Question

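I am trying to download images from a website and then sort them into folders based on their descriptions. My script already parses the HTML tags and collects the information I need: the URL of each image and its description. I have also added two more columns in this script: the name and full path of each file, and the name and folder of the downloaded file. I am now stuck on the next part. I want to check whether a folder already exists and, in the same if statement, check whether the file name already exists. If both are true, the script should move on to the next link. If the file does not exist, it should create the folder and download the file at that point. The next part would be an elif for the case where the folder does not exist: it should create the folder and download the file. I have outlined below what I want this section to do. My problem is that I don't know how to download the files or how to check for them. I also don't know how this would work when I am pulling information from multiple lists. For each link, if the file is downloaded, the script has to take the full path and name from another column of the CSV, which is a different list, and I don't understand how to set that up. Can anyone help?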

My code up to the point where I am stuck is below this section, which outlines what I want the next part of the script to do.

for elem in full_links
        if full_path  exists
                run test for if file name exists
                if file name exists = true
                        move onto the next file
                        if last file in list
                                break
                elif  file name exists = false
                        download image to location with name in list

        elif full_path does not exist
                download image with file path and name

The code I have so far:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import requests
import csv
import time
import urllib.request
import pandas as pd 
import wget



URL = 'https://www.baps.org/Vicharan'
content = requests.get(URL)

soup = BeautifulSoup(content.text, 'html.parser')

#create a csv
f=csv.writer(open('crawl3.csv' , 'w'))
f.writerow(['description' , 'full_link', 'name','full_path' , 'full_path_with_jpg_name'])



# Use the 'fullview' class 
panelrow = soup.find('div' , {'id' : 'fullview'})

main_class =  panelrow.find_all('div' , {'class' : 'col-xl-3 col-lg-3 col-md-3 col-sm-12 col-xs-12 padding5'})

# Look for 'highslide-- img-flag' links
individual_classes = panelrow.find_all('a' , {'class' : 'highslide-- img-flag'})

# Get the img tags, each <a> tag contains one
images = [i.img for i in individual_classes]

for image in images:
    src=image.get('src')
    full_link = 'https://www.baps.org' + src
    description = image.get('alt')
    name = full_link.split('/')[-1]
    full_path = '/home/pi/image_downloader_test/' + description + '/'
    full_path_with_jpg_name = full_path + name 
    f.writerow([description , full_link , name, full_path , full_path_with_jpg_name])

print('-----------------------------------------------------------------------')
print('-----------------------------------------------------------------------')
print('finished with search  and csv created. Now moving onto download portion')
print('-----------------------------------------------------------------------')
print('-----------------------------------------------------------------------')



f = open('crawl3.csv')
csv_f = csv.reader(f)

descriptions = []
full_links = []
names = []
full_path = []
full_path_with_jpg_name = []

for row in csv_f:
    descriptions.append(row[0])
    full_links.append(row[1])
    names.append(row[2])
    full_path.append(row[3])
    full_path_with_jpg_name.append(row[4])

Answer 1

To answer each part of your question:

To check whether a folder or file exists, use the os module:

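import os

# Create the folder if it does not exist yet
if not os.path.exists(path_to_folder):
    os.makedirs(path_to_folder)

# Act only when the file is missing
if not os.path.exists(path_to_file):
    pass  # do something, e.g. download the file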
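The two branches in the question (folder exists, folder does not exist) can arguably be collapsed into one, because `os.makedirs` accepts `exist_ok=True` and then does nothing when the folder is already there. The following is a minimal sketch of that idea; the helper name `needs_download` is hypothetical and not part of the original script.

```python
import os


def needs_download(path_to_file):
    """Make sure the parent folder exists; return True if the file is missing.

    `needs_download` is a hypothetical helper name used for illustration.
    """
    folder = os.path.dirname(path_to_file)
    if folder:
        # exist_ok=True turns this into a no-op when the folder already exists,
        # so no separate "does the folder exist?" branch is needed.
        os.makedirs(folder, exist_ok=True)
    return not os.path.exists(path_to_file)
```

With a helper like this, the loop only has to ask one question per image: is the file missing or not.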

Downloading a file
If you have the image's src and the file name to save it under, you can download the file with the urllib.request module:
```
urllib.request.urlretrieve(image_src, path_to_file)
```
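A single failed download would stop the whole script if the call above raises, so it may be worth wrapping it. The sketch below assumes you want to report the failure and carry on; `download_image` is a hypothetical name, and returning a boolean is just one possible design.

```python
import urllib.error
import urllib.request


def download_image(image_src, path_to_file):
    """Download image_src to path_to_file; return True on success, False on failure.

    `download_image` is a hypothetical wrapper around urllib.request.urlretrieve.
    """
    try:
        urllib.request.urlretrieve(image_src, path_to_file)
    except (urllib.error.URLError, OSError) as exc:
        # URLError also covers HTTP errors, since HTTPError is a subclass of it.
        print(f"could not download {image_src}: {exc}")
        return False
    return True
```

Note that the destination folder must already exist before `urlretrieve` is called, otherwise opening `path_to_file` for writing fails.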
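Iterating over multiple lists at once
Finally, if you want to pull information from several lists, you can use the built-in zip function. For example, to iterate over full_links and full_path at the same time: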
```
for link, path in zip(full_links, full_path):
    ...  # do something with link and path
```
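Putting the three pieces together, the outline from the question could look roughly like the sketch below. It is one possible arrangement, not the only one: `download_missing` and its `fetch` parameter are hypothetical names, and the function assumes each destination is a full path that includes the file name (the `full_path_with_jpg_name` column in the question).

```python
import os
import urllib.request


def download_missing(links, destinations, fetch=urllib.request.urlretrieve):
    """Download each link to its destination unless that file already exists.

    `download_missing` and `fetch` are hypothetical names for illustration.
    Returns the list of destinations that were actually downloaded.
    """
    downloaded = []
    for link, destination in zip(links, destinations):
        if os.path.exists(destination):
            continue  # file is already there, move on to the next link
        folder = os.path.dirname(destination)
        if folder:
            # Creates the folder when missing, does nothing when it exists.
            os.makedirs(folder, exist_ok=True)
        fetch(link, destination)
        downloaded.append(destination)
    return downloaded
```

With the lists built in the question, a call might look like `download_missing(full_links[1:], full_path_with_jpg_name[1:])`. The `[1:]` slices skip the first entry of each list, because the question's reader loop also appends the CSV header row.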

Hope this helps!

Solution 1
1 Accepted 2020-05-24 07:17:24
