I am scraping and downloading links from a website, and the website is updated with new links each day. I would like it so that each time my code runs, it only scrapes/downloads the updated links since the last time the program ran, rather than running through the entire code again.
I have tried adding previously-scraped links to a list and only executing the rest of the code (which downloads and renames the file) if the scraped link isn't in that list. But it doesn't work as hoped: each time I run the code, it starts from scratch and overwrites the previously downloaded files.
Is there a different approach I should try?
Here is my code (I'm also open to general suggestions on how to clean this up and make it better):
import praw
import requests
from bs4 import BeautifulSoup
import urllib.request
from difflib import get_close_matches
import os

period = '2018 Q4'
url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

# set soup
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

# create list of desired file names from existing directory names
candidates = os.listdir('/Users/test/Desktop/Test')
# set directory to download scraped files to
downloads_folder = '/Users/test/Desktop/Python/project/downloaded_files/'
# create empty list of names
scraped_name_list = []

# scrape site for names and links
for anchor in table.findAll('a'):
    try:
        if not anchor:
            continue
        name = anchor.text
        letter_link = anchor['href']
        # if name doesn't exist in list of names: append it to the list,
        # download it, and rename it
        if name not in scraped_name_list:
            # append it to name list
            scraped_name_list.append(name)
            # download it
            urllib.request.urlretrieve(letter_link, downloads_folder + period + " " + name + '.pdf')
            # rename it
            best_options = get_close_matches(name, candidates, n=1, cutoff=.33)
            try:
                if best_options:
                    name = (downloads_folder + period + " " + name + ".pdf")
                    os.rename(name, downloads_folder + period + " " + best_options[0] + ".pdf")
            except:
                pass
    except:
        pass
    # else skip it
    else:
        pass
Every time you run this, it recreates scraped_name_list as a new empty list. What you need to do is save the list at the end of the run, and then load it back in at the start of the next run. The pickle library is great for this.

Instead of defining scraped_name_list = [], try something like this (you'll also need to add import pickle at the top):
try:
    with open('/path/to/your/stuff/scraped_name_list.lst', 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []
This will attempt to open your saved list, but if it's the first run (meaning the list doesn't exist yet) it will start with an empty list. Then, at the end of your code, you just need to save the file so it can be used on subsequent runs:
with open('/path/to/your/stuff/scraped_name_list.lst', 'wb') as f:
    pickle.dump(scraped_name_list, f)
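Putting the load and save steps together, the whole pattern can be sketched like this. The file name, the helper function names, and the in-memory list of scraped names are all my own illustrative choices, not from your code; you'd call process_new() with the names pulled from the table and download only what it returns:

```python
import pickle
import os

STATE_FILE = 'scraped_name_list.lst'  # hypothetical path; use your own

def load_seen_names(path=STATE_FILE):
    """Load names recorded by previous runs, or start fresh on the first run."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except IOError:
        return []

def save_seen_names(names, path=STATE_FILE):
    """Persist the full list so the next run can skip these names."""
    with open(path, 'wb') as f:
        pickle.dump(names, f)

def process_new(names_found):
    """Return only names not seen on any previous run, and record them."""
    seen = load_seen_names()
    new = [n for n in names_found if n not in seen]
    save_seen_names(seen + new)
    return new
```

On the first run, process_new(['a', 'b']) returns both names; on a second run with ['a', 'b', 'c'], only 'c' comes back, so only the new link gets downloaded.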