
Python: break a loop into several sections

I am trying to fetch data from 7000 URLs and save the scraped info into a CSV. Rather than going through all 7000 URLs in one pass, how can I break the output into, say, 1000 URLs per CSV?

Below is an example of my current code. For testing I have scaled it down: the total index is 10 instead of 7000, and each CSV holds 2 URLs instead of 1000.

import os
import time
from random import randint

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = ['www.1.com', 'www.2.com', 'www.3.com', 'www.4.com', 'www.5.com', 'www.6.com', 'www.7.com', 'www.8.com', 'www.9.com', 'www.10.com']
ranks = []
names = []
prices = []
count = 0
rows_count = 0

total_index = 10
i = 1

while i < total_index:
    for url in urls[rows_count+0:rows_count+2]:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        count += 1
        print('Loop', count, f'started for {url}')
        rank = []
        name = []
        price = []
        # loop for watchlist
        for item in soup.find('div', class_ = 'sc-16r8icm-0 bILTHz'):
            item = item.text
            rank.append(item)
        ranks.append(rank)
        # loop for ticker name
        for ticker in soup.find('h2', class_ = 'sc-1q9q90x-0 jCInrl h1'):
            ticker = ticker.text
            name.append(ticker)
        names.append(name)
        # loop for price
        for price_tag in soup.find('div', class_ = 'sc-16r8icm-0 kjciSH priceTitle'):
            price_tag = price_tag.text
            price.append(price_tag)
        prices.append(price)
        sleep_interval = randint(1, 2)
        print('Sleep interval ', sleep_interval)
        time.sleep(sleep_interval)
        
    rows_count += 2
    df = pd.DataFrame(ranks)
    df2 = pd.DataFrame(names)
    df3 = pd.DataFrame(prices)
    final_table = pd.concat([df, df2, df3], axis=1)
    final_table.columns=['rank', 'type', 'watchlist', 'name', 'symbol', 'price', 'changes']
    final_table.to_csv(os.path.join(path,fr'summary_{rows_count}.csv'))
    i += 2

I am seeking help from someone more experienced with this problem.

Or is there another way to do it?

As I understand it, you are getting one row of data from scraping each URL. A generic solution for scraping in chunks and writing to CSVs would look something like this:

def scrape_in_chunks(urls, scraping_function, chunk_size, filename_template):
    """ Apply a scraping function to a list of URLs and save a series of CSVs, with data from
        one URL on each row and chunk_size URLs in each CSV file.
    """
    for i in range(0, len(urls), chunk_size):
        df = pd.DataFrame([scraping_function(url) for url in urls[i:i+chunk_size]])
        df.to_csv(filename_template.format(start=i, end=i+chunk_size-1))
  
def my_scraper(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    print(f'Scraping started for {url}')
    keys = ['rank', 'type', 'watchlist', 'name', 'symbol', 'price', 'changes']
    data = ([item.text for item in soup.find('div', class_='sc-16r8icm-0 bILTHz')] +
            [item.text for item in soup.find('h2', class_='sc-1q9q90x-0 jCInrl h1')] +
            [item.text for item in soup.find('div', class_='sc-16r8icm-0 kjciSH priceTitle')])
    return dict(zip(keys, data))  # You could alternatively return a dataframe or series here, but a dict seems simpler

scrape_in_chunks(urls, my_scraper, 1000, os.path.join(path, "summary {start}-{end}.csv"))
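
One thing worth adding when running this over 7000 URLs: a single failed request or a missing page element would raise an exception inside the list comprehension and lose the whole chunk. Below is a minimal sketch of one way to guard against that, reusing my_scraper, pd, os, path and urls from above; the safe_scraper wrapper and the scrape_in_chunks_safe name are illustrative additions of mine, not part of the original answer.

def safe_scraper(url):
    """ Hypothetical wrapper: return my_scraper(url) on success, or None if the
        request fails or an expected element is missing, so one bad URL does not
        abort the whole chunk.
    """
    try:
        return my_scraper(url)
    except (requests.RequestException, TypeError, AttributeError) as exc:
        print(f'Skipping {url}: {exc}')
        return None

def scrape_in_chunks_safe(urls, scraping_function, chunk_size, filename_template):
    """ Same chunking idea as scrape_in_chunks, but drops URLs that returned None. """
    for i in range(0, len(urls), chunk_size):
        rows = [scraping_function(url) for url in urls[i:i+chunk_size]]
        df = pd.DataFrame([row for row in rows if row is not None])
        df.to_csv(filename_template.format(start=i, end=i+chunk_size-1))

scrape_in_chunks_safe(urls, safe_scraper, 1000, os.path.join(path, "summary {start}-{end}.csv"))

With 7000 URLs and a chunk size of 1000, either version produces seven files, from summary 0-999.csv through summary 6000-6999.csv.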
   
