
Python: break loop into several sections

I want to fetch data from 7000 URLs and save the information as CSV. Instead of looping over all 7000 URLs at once, how can I split the output into CSV files of 1000 URLs each?

Below is an example of my current code. For the example I have scaled it down: the total of 7000 URLs becomes 10, and each CSV holds 2 URLs.

import os
import time
from random import randint

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['www.1.com', 'www.2.com', 'www.3.com', 'www.4.com', 'www.5.com', 'www.6.com', 'www.7.com', 'www.8.com', 'www.9.com', 'www.10.com']
ranks = []
names = []
prices = []
count = 0
rows_count = 0

total_index = 10
i = 1

while i < total_index:
    for url in urls[rows_count+0:rows_count+2]:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        count += 1
        print('Loop', count, f'started for {url}')
        rank = []
        name = []
        price = []
        # loop for watchlist
        for item in soup.find('div', class_ = 'sc-16r8icm-0 bILTHz'):
            item = item.text
            rank.append(item)
        ranks.append(rank)
        # loop for ticker name
        for ticker in soup.find('h2', class_ = 'sc-1q9q90x-0 jCInrl h1'):
            ticker = ticker.text
            name.append(ticker)
        names.append(name)
        # loop for price
        for price_tag in soup.find('div', class_ = 'sc-16r8icm-0 kjciSH priceTitle'):
            price_tag = price_tag.text
            price.append(price_tag)
        prices.append(price)
        sleep_interval = randint(1, 2)
        print('Sleep interval ', sleep_interval)
        time.sleep(sleep_interval)
        
    rows_count += 2
    df = pd.DataFrame(ranks)
    df2 = pd.DataFrame(names)
    df3 = pd.DataFrame(prices)
    final_table = pd.concat([df, df2, df3], axis=1)
    final_table.columns=['rank', 'type', 'watchlist', 'name', 'symbol', 'price', 'changes']
    final_table.to_csv(os.path.join(path,fr'summary_{rows_count}.csv'))
    i += 2

Looking for some expert help with my problem.

Or is there another way to do this?

As I understand it, you will get one row of data from scraping each URL. A general solution for scraping in chunks and writing each chunk to its own CSV would look like this:

import os

import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrape_in_chunks(urls, scraping_function, chunk_size, filename_template):
    """ Apply a scraping function to a list of URLs and save a series of CSVs with data from
        one URL on each row and chunk_size URLs in each CSV file.
    """
    for i in range(0, len(urls), chunk_size):
        df = pd.DataFrame([scraping_function(url) for url in urls[i:i+chunk_size]])
        df.to_csv(filename_template.format(start=i, end=i+chunk_size-1))

def my_scraper(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    print(f'Scrape started for {url}')
    keys = ['rank', 'type', 'watchlist', 'name', 'symbol', 'price', 'changes']
    data = ([item.text for item in soup.find('div', class_ = 'sc-16r8icm-0 bILTHz')] +
            [item.text for item in soup.find('h2', class_ = 'sc-1q9q90x-0 jCInrl h1')] +
            [item.text for item in soup.find('div', class_ = 'sc-16r8icm-0 kjciSH priceTitle')])
    return dict(zip(keys, data))  # You could alternatively return a dataframe or series here but dict seems simpler

scrape_in_chunks(urls, my_scraper, 1000, os.path.join(path, "summary {start}-{end}.csv"))
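
For reference, here is a minimal sketch of how the chunk boundaries and file names fall for 7000 URLs in chunks of 1000 (the placeholder URLs below are made up purely to illustrate the arithmetic, they are not from the question):

# Hypothetical placeholder URLs, only to show how range(0, len(urls), chunk_size) slices the list.
urls = [f'https://www.example{n}.com' for n in range(1, 7001)]

chunk_size = 1000
for i in range(0, len(urls), chunk_size):
    # Each iteration covers urls[i:i+chunk_size] and would produce one CSV file.
    print(f'summary {i}-{i + chunk_size - 1}.csv  <-  urls[{i}:{i + chunk_size}]')
# Prints: summary 0-999.csv <- urls[0:1000], ..., summary 6000-6999.csv <- urls[6000:7000]

Returning a dict from my_scraper keeps the DataFrame construction simple: each dict becomes one labelled row, so the column names come straight from the keys.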
   
