
How can you show a progress bar while iterating over a pandas DataFrame?

I am trying to iterate over a pandas DataFrame with close to a million entries, using a for loop. Consider the following code as an example:

import pandas as pd 
import os 
from requests_html import HTMLSession
from tqdm import tqdm
import time


df = pd.read_csv(os.getcwd()+'/test-urls.csv')
df = df.drop('Unnamed: 0', axis=1 )

new_df = pd.DataFrame(columns = ['pid', 'orig_url', 'hosted_url'])
refused_df = pd.DataFrame(columns = ['pid', 'refused_url'])

tic = time.time()

for idx, row in df.iterrows():

    img_id = row['pid']
    url = row['image_url']

    # Let's do the scraping
    session = HTMLSession()
    r  = session.get(url)
    r.html.render(sleep=1, keep_page=True, scrolldown=1)

    count = 0 
    link_vals =  r.html.find('.zoomable')

    if len(link_vals) != 0 : 
        attrs = link_vals[0].attrs
        # print(attrs['src'])  
        embed_link = attrs['src']

    else: 
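        # Retry the lookup up to 7 times before giving up on this URL.
        while count < 7 and len(link_vals) == 0:
            link_vals = r.html.find('.zoomable')
            count += 1
        if len(link_vals) == 0:
            print('Link refused connection for 7 tries. Adding URL to Refused URLs Data Frame')
            ref_val = [img_id, url]
            len_ref = len(refused_df)
            refused_df.loc[len_ref] = ref_val
            print('Refused URL added')
            continue
        # A retry succeeded, so pick up the link as in the first branch.
        embed_link = link_vals[0].attrs['src']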
    print('Got 1 link')

    # Append scraped data to new_df
    len_df = len(new_df)
    append_value = [img_id,url, embed_link]
    new_df.loc[len_df] = append_value

I would like to know how I could use a progress bar to show how far along the loop is. I will appreciate any help. Please let me know if you need any clarification.

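You could try tqdm:

from tqdm import tqdm

# iterrows() is a generator with no length, so pass total= to get a percentage and ETA
for idx, row in tqdm(df.iterrows(), total=len(df)):
    pass  # do something with row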

This is primarily for a command-line progress bar. There are other solutions if you're looking for more of a GUI. PySimpleGUI comes to mind, but is definitely a little more complicated.
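As a minimal, self-contained sketch of the tqdm approach: the small DataFrame below is made up and only stands in for the real CSV, and `seen` is a hypothetical stand-in for the per-row work. Since df.iterrows() returns a generator, tqdm cannot work out the length on its own, so `total=len(df)` is passed explicitly.

```python
import pandas as pd
from tqdm import tqdm

# Made-up data standing in for the real CSV (assumption, not from the question).
df = pd.DataFrame({"pid": [1, 2, 3], "image_url": ["a", "b", "c"]})

seen = []
# total= tells tqdm how many rows to expect, because a generator has no len().
for idx, row in tqdm(df.iterrows(), total=len(df)):
    seen.append(row["pid"])  # placeholder for the real per-row work
```

Without `total`, tqdm still counts iterations and shows the rate, but it cannot draw a bar or estimate the remaining time.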

I would have posted this as a comment, but: the reason you want a progress bar is probably that the loop takes a long time, and it takes a long time because iterrows() is a slow way to do operations in pandas.

I would suggest using apply and avoiding iterrows().
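If you switch to apply, tqdm can still report progress through its pandas integration. The following is only a sketch under assumptions: the DataFrame and the `add_columns` function are invented for illustration. Calling `tqdm.pandas()` registers a `progress_apply` method that behaves like `apply` but draws a bar.

```python
import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects; desc is just a label for the bar.
tqdm.pandas(desc="rows")

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def add_columns(row):
    # Hypothetical per-row work; replace with the real computation.
    return row["a"] + row["b"]

# Same call shape as df.apply(add_columns, axis=1), with a progress bar.
df["c"] = df.progress_apply(add_columns, axis=1)
```

Note that apply with axis=1 still calls a Python function once per row, so for network-bound work like the scraping loop in the question, most of the time is likely spent waiting on requests either way.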

If you want to keep using iterrows(), just include a counter that counts up to the number of rows, df.shape[0].
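A rough sketch of the counter idea, with no extra libraries beyond pandas. The DataFrame and the `report_every` interval are assumptions chosen for the example; printing every row would slow the loop down, so the counter is only reported periodically.

```python
import sys
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"pid": range(250)})

total = df.shape[0]   # number of rows to process
report_every = 100    # hypothetical reporting interval

processed = 0
for idx, row in df.iterrows():
    # ... per-row work would go here ...
    processed += 1
    if processed % report_every == 0 or processed == total:
        # \r returns to the start of the line so the counter updates in place.
        sys.stdout.write(f"\rprocessed {processed}/{total} rows ({processed / total:.0%})")
        sys.stdout.flush()
print()  # finish the line once the loop is done
```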

PySimpleGUI makes this about as simple a problem to solve as possible, assuming you know ahead of time how many items you have in your list. Indeterminate progress meters are possible, but a little more complicated.

There is no setup required before your loop. You don't need to make a special iterator. The only thing you have to do is add one line of code inside your loop.

Inside your loop, add a call to one_line_progress_meter. The name sums up what it is. Add this call at the top of your loop or at the bottom, it doesn't matter; just add it somewhere inside the loop.

The 4 parameters you pass are:

  • A title to put on the meter (any string will do)
  • Where you are now - current counter
  • What the max counter value is
  • A "key" - a unique string, number, anything you want.

Here's a loop that iterates through a list of integers to demonstrate.

import PySimpleGUI as sg

items = list(range(1000))
total_items = len(items)
for index, item in enumerate(items):

    sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter' )

The list iteration code will be whatever your loop code is. The line of code to focus on that you'll be adding is this one:

sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter' )

This line of code will show you the window below. It contains statistics such as how long the loop has been running and an estimate of how much longer it has to go.

[Image: screenshot of the progress meter window]

How do you do that with pandas apply? This is what I do:

def some_func(a, b):
    global index
    c = a + b  # placeholder for some computation involving a and b
    index += 1
    sg.one_line_progress_meter('My meter', index, len(df), 'my meter')
    return c

index = 0
df['c'] = df[['a', 'b']].apply(lambda x: some_func(*x), axis=1)
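One possible way to avoid the global counter is to wrap the function in a closure that counts its own calls. This is only a sketch: `with_progress` and `report` are hypothetical names, not part of pandas or PySimpleGUI, and the data is made up. Here `report` just records the updates in a list so the example runs without a GUI.

```python
import pandas as pd

def with_progress(func, total, report):
    """Wrap func so that report(current, total) runs after every call."""
    count = 0

    def wrapper(*args, **kwargs):
        nonlocal count
        result = func(*args, **kwargs)
        count += 1
        report(count, total)
        return result

    return wrapper

def some_func(a, b):
    return a + b  # placeholder for the real computation

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

calls = []  # collects (current, total) pairs instead of drawing a meter
tracked = with_progress(some_func, len(df), lambda cur, tot: calls.append((cur, tot)))

# Unpack each row's values into some_func's positional arguments.
df["c"] = df[["a", "b"]].apply(lambda row: tracked(*row), axis=1)
```

To drive the PySimpleGUI meter instead, `report` could presumably be something like `lambda cur, tot: sg.one_line_progress_meter('My meter', cur, tot, 'my meter')`, mirroring the call shown earlier.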
