Using WGET or Python to download and rename attachments from CSV requiring basic authentication

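I scraped a ticketing website that we were using, and I now have a CSV file that looks like this: ID, Attachment_URL, Ticket_URL. What I need to do now is download each attachment and rename the file using the Ticket_URL. My main problem is that navigating to the Attachment_URL requires basic authentication, after which you are redirected to an AWS S3 link. I have been able to download individual files with wget, but I have not been able to iterate over the whole list (about 35k rows), and I am not sure how to name each file after its ticket_id. Any advice would be appreciated.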

Figured it out.

To open an authenticated session and scrape the links:

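# -*- coding: utf-8 -*-
import csv
import re
import time  # only needed if the sleep below is re-enabled

import requests
from bs4 import BeautifulSoup

# One session, so the login cookie is reused for every page request.
s = requests.session()

payload = {
    'user': '',
    'pw': ''
}

s.post('login.url.here', data=payload)

# Open the output file once instead of reopening it for every row.
with open('fd.csv', 'a', newline='') as f:
    writer = csv.writer(f)

    for i in range(1, 6000):
        testURL = s.get(
            'https://urlhere.com/efw/stuff&page={}'.format(i))

        # Name the parser explicitly so the result does not depend on
        # which parsers happen to be installed.
        soup = BeautifulSoup(testURL.content, 'html.parser')
        table = soup.find("table", {"class": "table-striped"})
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')[1:]  # skip the first row
        print("The current page is: " + str(i))

        for row in rows:
            # Every link in the row whose href starts with /helpdesk/
            cols = row.find_all('a', attrs={'href': re.compile("^/helpdesk/")})
            # time.sleep(1)
            # csv.writer stringifies each tag, so the cells hold raw <a> HTML.
            writer.writerow(cols)
            print(cols)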
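
The loop above writes each matched link to fd.csv as a whole `<a>` tag, so the links still need cleaning before they can be downloaded. The author did that step in R; as a rough Python alternative, the sketch below pulls the `href` out of such a cell using only the standard library. The names `HrefCollector` and `extract_hrefs`, the example path `/helpdesk/tickets/42`, and the base URL passed to `urljoin` are illustrative assumptions, not taken from the original post.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(cell, prefix="/helpdesk/"):
    """Return the hrefs found in an HTML fragment that start with `prefix`."""
    parser = HrefCollector()
    parser.feed(cell)
    parser.close()
    return [href for href in parser.hrefs if href.startswith(prefix)]


# Hypothetical cell content, shaped like what the scraper writes to fd.csv.
cell = '<a href="/helpdesk/tickets/42">Ticket 42</a>'
paths = extract_hrefs(cell)

# Relative paths must be made absolute before downloading; the base URL
# here is the placeholder host used in the scraper above.
urls = [urljoin("https://urlhere.com", path) for path in paths]
```

Using a real HTML parser rather than a regular expression keeps the extraction robust to attribute order and quoting, at the cost of a few more lines.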

Then I cleaned up the links in R and downloaded the files.

#!/usr/bin/env python
import csv
import os
import threading
from queue import Empty, Queue
from time import gmtime, strftime

import requests

# Shared authenticated session; the login cookie is sent with every download.
s = requests.session()

payload = {
    'user': '',
    'pw': ''
}
s.post('login', data=payload)


class log:

    def info(self, message):
        self.__message("info", message)

    def error(self, message):
        self.__message("error", message)

    def debug(self, message):
        self.__message("debug", message)

    def __message(self, log_level, message):
        date = strftime("%Y-%m-%d %H:%M:%S", gmtime())
        print("%s [%s] %s" % (date, log_level, message))


class fetch:
    def __init__(self):
        self.temp_dir = "/tmp"

    def run_fetcher(self, queue):
        while True:
            # get_nowait() avoids blocking forever when another worker
            # takes the last item between an empty() check and get().
            try:
                url, ticketid = queue.get_nowait()
            except Empty:
                break

            # Rows without a ticket ("NA") are named after the attachment URL.
            if ticketid.endswith("NA"):
                fileName = url.split("/")[-1] + 'NoTicket'
            else:
                fileName = ticketid.split("/")[-1]

            try:
                response = s.get(url)
                with open(os.path.join('/Users/Desktop/FolderHere', fileName + '.mp3'), 'wb') as f:
                    f.write(response.content)
                print(fileName)
            except (requests.RequestException, OSError) as exc:
                # Report the failure and keep the worker alive.
                print("failed: %s (%s)" % (url, exc))
            finally:
                # Always mark the item done so q.join() cannot hang.
                queue.task_done()


if __name__ == '__main__':

    # load in classes
    q = Queue()
    logger = log()  # named so the instance does not shadow the class
    fe = fetch()

    # Read in input file: one "id,url,ticket" row per attachment
    with open('/Users/name/csvfilehere.csv', 'r', newline='') as csvfile:
        for _id, url, ticket in csv.reader(csvfile):
            q.put([url.strip(), ticket.strip()])

    # spin up fetcher workers
    threads = []
    for i in range(8):
        t = threading.Thread(target=fe.run_fetcher, args=(q,))
        t.daemon = True
        threads.append(t)
        t.start()

    # wait for the workers to finish
    for t in threads:
        t.join()

    # wait until every queued item has been marked done
    q.join()
    logger.info("End")
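
The script above logs in with a form post, while the question asks about HTTP basic authentication. The sketch below is one possible variant, not the author's method: it keeps the same naming rule (last path segment of the ticket URL, or the attachment ID plus `NoTicket` for `NA` rows) but takes any `requests.Session`-like object, streams each file to disk, and uses a thread pool instead of a hand-rolled queue. The function names, the 60-second timeout, and the check that cleaned URLs start with `http` are assumptions made for illustration.

```python
import csv
import os
from concurrent.futures import ThreadPoolExecutor


def file_name_for(attachment_url, ticket_url):
    """Name a download after the ticket ID, or the attachment ID if no ticket."""
    if ticket_url.endswith("NA"):
        return attachment_url.rstrip("/").split("/")[-1] + "NoTicket"
    return ticket_url.rstrip("/").split("/")[-1]


def read_jobs(csvfile):
    """Yield (attachment_url, file_name) from ID,Attachment_URL,Ticket_URL rows."""
    for row in csv.reader(csvfile):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        _id, url, ticket = (cell.strip() for cell in row)
        if not url.startswith("http"):
            continue  # assumption: cleaned URLs are absolute; this skips a header row
        yield url, file_name_for(url, ticket)


def download(session, url, path):
    """Stream one attachment to disk; `session` is a requests.Session-like object."""
    response = session.get(url, stream=True, timeout=60)
    response.raise_for_status()
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=65536):
            f.write(chunk)
    return path


def download_all(session, jobs, out_dir, workers=8):
    """Download every job concurrently; return (file_name, error_or_None) pairs."""
    jobs = list(jobs)
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(download, session, url, os.path.join(out_dir, name + ".mp3"))
            for url, name in jobs
        ]
    # The pool has finished here, so exception() returns immediately.
    return [(name, future.exception()) for (_url, name), future in zip(jobs, futures)]
```

To use it with basic authentication, create a `requests.Session()`, set `session.auth = (user, password)`, and pass the session to `download_all` together with `read_jobs(open_csv_file)`. Requests follows the redirect to S3 by default and, as far as I know, drops the `Authorization` header when the redirect goes to a different host, which is usually what a pre-signed S3 link needs. Collecting errors per file rather than raising on the first one makes it easier to retry only the failures in a list of about 35k rows.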

