
How to Scrape Only New Links (After Previous Scrape) Using Python

I am scraping and downloading links from a website, and the website is updated with new links each day. I would like it so that each time my code runs, it only scrapes/downloads the updated links since the last time the program ran, rather than running through the entire code again.

I have tried appending previously scraped links to an empty list, and only executing the rest of the code (which downloads and renames the file) if the scraped link isn't found in the list. But it doesn't work as hoped: each time I run the code, it starts from zero and overwrites the previously downloaded files.

Is there a different approach I should try?

Here is my code (I'm also open to general suggestions on how to clean this up and make it better):

import requests
from bs4 import BeautifulSoup
import urllib.request
from difflib import get_close_matches
import os

period = '2018 Q4'
url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

# set soup
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

# create list of desired file names from existing directory names
candidates = os.listdir('/Users/test/Desktop/Test')
# set directory to download scraped files to
downloads_folder = '/Users/test/Desktop/Python/project/downloaded_files/'

# create empty list of names
scraped_name_list = []

# scrape site for names and links
for anchor in table.find_all('a'):
    name = anchor.text
    letter_link = anchor.get('href')
    if not letter_link:
        continue
    # if the name hasn't been seen yet: record it, download it, and rename it
    if name not in scraped_name_list:
        scraped_name_list.append(name)
        # download it
        download_path = downloads_folder + period + ' ' + name + '.pdf'
        urllib.request.urlretrieve(letter_link, download_path)
        # rename it after the closest-matching existing directory name
        best_options = get_close_matches(name, candidates, n=1, cutoff=0.33)
        if best_options:
            os.rename(download_path, downloads_folder + period + ' ' + best_options[0] + '.pdf')

Every time you run this, it recreates scraped_name_list as a new empty list. What you need to do is save the list at the end of the run, and then load it on any later run. The pickle library is great for this.

Instead of defining scraped_name_list = [], try something like this:

import pickle

try:
    with open('/path/to/your/stuff/scraped_name_list.lst', 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []

This will attempt to load your list, but if it's the first run (meaning the file doesn't exist yet) it will start with an empty list. Then, at the end of your code, you just need to save the list so it can be used the next time the program runs:

with open('/path/to/your/stuff/scraped_name_list.lst', 'wb') as f:
    pickle.dump(scraped_name_list, f)
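Putting the two snippets together, the overall flow looks roughly like this. A minimal sketch: STATE_FILE and the hard-coded sample names stand in for your real path and the names scraped from the table, and the download/rename step is elided.

```python
import os
import pickle

STATE_FILE = 'scraped_name_list.lst'  # placeholder path for the saved list

def load_seen():
    """Return the list of names saved by a previous run, or [] on the first run."""
    try:
        with open(STATE_FILE, 'rb') as f:
            return pickle.load(f)
    except (IOError, EOFError):
        return []

def save_seen(names):
    """Persist the list so the next run can skip these names."""
    with open(STATE_FILE, 'wb') as f:
        pickle.dump(names, f)

scraped_name_list = load_seen()
for name in ['report_a', 'report_b']:  # stand-ins for the scraped anchor texts
    if name not in scraped_name_list:
        scraped_name_list.append(name)
        # ... download and rename the file here ...
save_seen(scraped_name_list)
```

On the first run both names are new, so both are processed and saved; on a second run load_seen() returns them and the loop skips the downloads entirely.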
