
How can I run a web scraper script automatically every hour?

I am extracting data from booking.com. My script uses Selenium to gather the data, creates a provisional CSV with the appropriate timestamp, and then appends it to the final database, which is also a CSV. I would like to get new data every hour, even when I'm offline, and append it to the final database, but I don't know how to do it. I am new to web scraping. Currently my script runs in Jupyter. Any help would be greatly appreciated.

I'm using macOS Big Sur.

This is my code:

 
from datetime import date

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def prepare_driver(url):
    '''Returns a Firefox Webdriver pointed at the given url.'''
    options = Options()
    # options.add_argument('-headless')
    driver = Firefox(executable_path='/Users/andreazavala/Downloads/geckodriver',
                     options=options)
    driver.get(url)
    # Wait until the destination search box (id 'ss') has loaded
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.ID, 'ss')))
    return driver

def fill_form(driver, search_argument):
    '''Fills in the search form with the destination and today's date, then submits it.'''
    search_field = driver.find_element_by_id('ss')
    search_field.send_keys(search_argument)

    # Open the date picker and select today's date
    driver.find_element_by_class_name('xp__dates-inner').click()
    slcpath = "td[data-date='" + str(date.today()) + "']"
    driver.find_element_by_css_selector(slcpath).click()

    # Find the search button and click it
    driver.find_element_by_class_name('sb-searchbox__button').click()

    # Wait until the result titles have loaded
    WebDriverWait(driver, timeout=10).until(
        EC.presence_of_all_elements_located(
            (By.CLASS_NAME, 'sr-hotel__title')))

# 'domain' was not defined in the snippet; it is assumed to be Booking.com's start page
domain = 'https://www.booking.com'
driver = prepare_driver(domain)
fill_form(driver, 'City Name')

# Save the results-page URL so the requests-based scraper below can reuse it
url_iter = driver.current_url
accommodation_urls = list()
accommodation_urls.append(url_iter)

with open('urls.txt', 'w') as f:
    for item in accommodation_urls:
        f.write("%s\n" % item)
from selectorlib import Extractor
import requests 
from time import sleep
import csv

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('booking.yml')

def scrape(url):    
    headers = {
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'DNT': '1',
        'Upgrade-Insecure-Requests': '1',
        # You may want to change the user agent if you get blocked
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',

        'Referer': 'https://www.booking.com/index.en-gb.html',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Pass the page HTML to the Extractor and return the structured data
    return e.extract(r.text, base_url=url)


with open("urls.txt",'r') as urllist, open('data.csv','w') as outfile:
    fieldnames = [
        "name",
        "location",
        "price",
        "price_for",
        "room_type",
        "beds",
        "rating",
        "rating_title",
        "number_of_ratings",
        "url"
    ]
    writer = csv.DictWriter(outfile, fieldnames=fieldnames,quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for url in urllist.readlines():
        data = scrape(url.strip())  # strip the trailing newline before requesting
        if data:
            for h in data['hotels']:
                writer.writerow(h)
import os
import pandas as pd

# Stamp every row of the hourly snapshot with the current time
data = pd.read_csv("data.csv")
data.insert(0, 'TimeStamp', pd.to_datetime('today').replace(microsecond=0))

# Append to the final database; write the header only if the file is new
data.to_csv('Tarifa.csv', mode='a', header=not os.path.exists('Tarifa.csv'))

# reset_index(inplace=True) would return None, so keep the returned frame
df_results = pd.read_csv('Tarifa.csv', index_col=0).reset_index(drop=True)
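
For reference, e.extract(r.text)['hotels'] only yields rows if booking.yml defines a hotels element with multiple: true and children matching the CSV fieldnames. A minimal sketch of that shape, with illustrative placeholder selectors rather than Booking.com's real markup:

# booking.yml (sketch; the CSS selectors are placeholders)
hotels:
    css: 'div.sr_item'
    multiple: true
    type: Text
    children:
        name:
            css: 'span.sr-hotel__name'
            type: Text
        price:
            css: 'div.price'
            type: Text
        url:
            css: 'a.hotel_name_link'
            type: Link

The real file would list all ten fieldnames from the CSV header in the same way.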

Here is an approach you could use!

Import schedule and time, then wrap your script in a main function that is called once per hour.

import time
import schedule

def runs_my_script():
    function1()
    function2()
    and_so_on()

Then at the bottom add this:

if __name__ == "__main__":
    schedule.every().hour.do(runs_my_script) # sets the function to run once per hour
  
    while True:  # loops and runs the scheduled job indefinitely 
        schedule.run_pending()
        time.sleep(1)
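
Concretely, runs_my_script would wrap the three stages of the script in the question. A minimal runnable sketch, with stub functions standing in for those blocks (the names here are hypothetical):

import time
import schedule

def collect_urls():
    # Stand-in for the Selenium prepare_driver / fill_form block
    print("collecting result URLs")

def scrape_urls_to_csv():
    # Stand-in for the urls.txt -> data.csv selectorlib loop
    print("scraping listings")

def append_to_database():
    # Stand-in for the pandas append to Tarifa.csv
    print("appending to Tarifa.csv")

def runs_my_script():
    collect_urls()
    scrape_urls_to_csv()
    append_to_database()

if __name__ == "__main__":
    schedule.every().hour.do(runs_my_script)  # run once per hour
    while True:
        schedule.run_pending()
        time.sleep(1)

Note that this only fires while the process (and the machine) stays up, so it won't cover hours when the computer is off.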

It's not elegant, but it gets the base job done and can be expanded on to fit your needs :)

A system-level approach would be to rely on crontab.

Type crontab -e in the console. Inside, add the line 0 0-23 * * * /path/to/script/app.py. That runs every hour, every day.

Save it by pressing escape (esc), then type :wq. That saves the new cron job and quits the editor.
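
One caveat: cron jobs run with a minimal environment and no implicit interpreter, so the entry usually needs an absolute path to Python, or app.py needs a shebang line and execute permission. A sketch, with the interpreter and log paths as assumptions to adjust:

# Minute 0 of every hour, every day; log output to simplify debugging
0 0-23 * * * /usr/bin/python3 /path/to/script/app.py >> /tmp/scraper.log 2>&1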
