Using WGET or Python to download and rename attachments from CSV requiring basic authentication

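I scraped a ticketing website we were using and now have a CSV file that looks like this: ID,Attachment_URL,Ticket_URL. What I need to do now is download each attachment and rename the file using the Ticket_URL. My main problem is that navigating to the Attachment_URL requires basic authentication, after which you are redirected to an AWS S3 link. I have been able to download individual files with wget, but I have not been able to iterate through the whole list (about 35k rows), and I am not sure how to name each file after its ticket_id. Any advice would be appreciated.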

Got it.

To open an authenticated session:

# -*- coding: utf-8 -*-
import csv
import re
import time

import requests
from bs4 import BeautifulSoup

# Session reused for the login and for every later page request
s = requests.session()

payload = {
    'user': '',
    'pw': ''
}

s.post('login.url.here', data=payload)
for i in range(1, 6000):
    testURL = s.get(
        'https://urlhere.com/efw/stuff&page={}'.format(i))

    # Parser named explicitly so bs4 does not have to pick one
    soup = BeautifulSoup(testURL.content, 'html.parser')
    table = soup.find("table", {"class": "table-striped"})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')[1:]
    print("The current page is: " + str(i))

    for row in rows:
        # Links in this row whose href starts with /helpdesk/
        cols = row.find_all('a', attrs={'href': re.compile("^/helpdesk/")})
        # time.sleep(1)
        # Append the row's links to the CSV (written as raw <a> tags)
        with open('fd.csv', 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(cols)
        print(cols)
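
The rows written to fd.csv appear to hold the matched anchor tags as raw HTML, so the links still have to be extracted before anything can be downloaded. The following is only a minimal sketch of that extraction in Python, not the cleanup that was actually used. The function name `extract_hrefs` and the sample cell are hypothetical, and the regular expression assumes double-quoted href attributes.

```python
import re

# Matches the href value of an anchor tag, e.g. <a href="/helpdesk/...">
# Assumption: href values are wrapped in double quotes.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref="([^"]*)"', re.IGNORECASE)


def extract_hrefs(cell):
    """Return every href found in a string of anchor-tag HTML."""
    return HREF_RE.findall(cell)


# Hypothetical cell, shaped like what the scraper may have written to fd.csv
sample = '<a href="/helpdesk/tickets/123">Ticket 123</a>'
print(extract_hrefs(sample))
```

A real HTML parser would be more robust if the markup varies; the regular expression is used here only to keep the sketch short.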

Then I cleaned up the links in R and downloaded the files:

#!/usr/bin/env python3
import os
import threading
from queue import Empty, Queue
from time import gmtime, strftime

import requests

# Authenticated session used by all worker threads
s = requests.session()

payload = {
    'user': '',
    'pw': ''
}
s.post('login', data=payload)


class Log:
    """Minimal timestamped logger."""

    def info(self, message):
        self.__message("info", message)

    def error(self, message):
        self.__message("error", message)

    def debug(self, message):
        self.__message("debug", message)

    def __message(self, log_level, message):
        date = strftime("%Y-%m-%d %H:%M:%S", gmtime())
        print("%s [%s] %s" % (date, log_level, message))


class Fetch:
    def __init__(self):
        self.temp_dir = "/tmp"

    def run_fetcher(self, queue):
        while True:
            # get_nowait() avoids blocking forever when another
            # worker takes the last item after an empty() check
            try:
                url, ticketid = queue.get_nowait()
            except Empty:
                break

            # Name the file after the ticket ID, or after the
            # attachment itself when the row has no ticket
            if ticketid.endswith("NA"):
                fileName = url.split("/")[-1] + 'NoTicket'
            else:
                fileName = ticketid.split("/")[-1]

            try:
                response = s.get(url)
                with open(os.path.join('/Users/Desktop/FolderHere', fileName + '.mp3'), 'wb') as f:
                    f.write(response.content)
                print(fileName)
            finally:
                # Always mark the item done so q.join() can return
                queue.task_done()


if __name__ == '__main__':
    q = Queue()
    log = Log()
    fe = Fetch()

    # Read in input file: one ID,Attachment_URL,Ticket_URL row per line
    with open('/Users/name/csvfilehere.csv', 'r') as csvfile:
        for line in csvfile:
            _id, url, ticket = line.split(",")
            q.put([url.strip(), ticket.strip()])

    # spin up fetcher workers
    threads = []
    for i in range(8):
        t = threading.Thread(target=fe.run_fetcher, args=(q,))
        t.daemon = True
        threads.append(t)
        t.start()

    # wait for the workers to finish
    for t in threads:
        t.join()

    # wait until every queued item has been marked done
    q.join()
    log.info("End")
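
The script above logs in by posting a form, whereas the question describes HTTP basic authentication. As a hedged variation, the sketch below shows how the same naming rule could be combined with the `auth=` parameter of `requests.get`. The function names `target_name`, `read_jobs` and `download` are hypothetical, the CSV is assumed to have exactly three columns and no header row, and the resulting names carry no file extension. Whether the site accepts basic credentials this way is an assumption that would need checking against the real URLs.

```python
import csv


def target_name(attachment_url, ticket_url):
    """Last path segment of the ticket URL, or of the attachment URL
    (plus 'NoTicket') when the ticket column ends with 'NA'."""
    if ticket_url.endswith("NA"):
        return attachment_url.rstrip("/").split("/")[-1] + "NoTicket"
    return ticket_url.rstrip("/").split("/")[-1]


def read_jobs(lines):
    """Yield (attachment_url, file_name) pairs from
    ID,Attachment_URL,Ticket_URL rows."""
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        _id, url, ticket = (field.strip() for field in row)
        yield url, target_name(url, ticket)


def download(url, path, user, password):
    """Stream one attachment to disk using HTTP basic authentication."""
    # Third-party dependency; imported here so the helpers above
    # can be used and tested without requests installed.
    import requests

    with requests.get(url, auth=(user, password), stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(path, "wb") as f:
            # Write in chunks so large files are not held in memory
            for chunk in response.iter_content(chunk_size=65536):
                f.write(chunk)
```

In use, the CSV would be opened with `newline=""`, passed to `read_jobs`, and each resulting pair handed to `download` together with a destination path and the credentials. Streaming in chunks is a design choice for large attachments; the original script reads each response fully into memory before writing it.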
