I am trying to iterate over a pandas DataFrame with close to a million entries, using a for loop. Consider the following code as an example:
import pandas as pd
import os
from requests_html import HTMLSession
from tqdm import tqdm
import time

df = pd.read_csv(os.getcwd()+'/test-urls.csv')
df = df.drop('Unnamed: 0', axis=1)
new_df = pd.DataFrame(columns=['pid', 'orig_url', 'hosted_url'])
refused_df = pd.DataFrame(columns=['pid', 'refused_url'])
tic = time.time()
for idx, row in df.iterrows():
    img_id = row['pid']
    url = row['image_url']
    # Let's do the scraping
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1, keep_page=True, scrolldown=1)
    count = 0
    link_vals = r.html.find('.zoomable')
    if len(link_vals) != 0:
        attrs = link_vals[0].attrs
        # print(attrs['src'])
        embed_link = attrs['src']
    else:
        while count <= 7:
            link_vals = r.html.find('.zoomable')
            count += 1
        else:
            print('Link refused connection for 7 tries. Adding URL to Refused URLs Data Frame')
            ref_val = [img_id, url]  # was `URL`, which is undefined
            len_ref = len(refused_df)
            refused_df.loc[len_ref] = ref_val
            print('Refused URL added')
            continue
    print('Got 1 link')
    # Append scraped data to new_df
    len_df = len(new_df)
    append_value = [img_id, url, embed_link]
    new_df.loc[len_df] = append_value
I would like to know how I could add a progress bar to show how far along the loop is. I would appreciate any help. Please let me know if you need any clarification.
You could try tqdm:
from tqdm import tqdm
for idx, row in tqdm(df.iterrows()):
    ...  # do something with row
This is primarily for a command-line progress bar. There are other solutions if you're looking for more of a GUI. PySimpleGUI comes to mind, but is definitely a little more complicated.
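One detail worth knowing: iterrows() returns a generator, so tqdm cannot work out its length and will only show a running count. A minimal sketch, assuming a small stand-in DataFrame in place of the real CSV, that passes the row count through tqdm's total argument so the bar can show a percentage and a time estimate:

```python
import pandas as pd
from tqdm import tqdm

# Hypothetical stand-in for the real CSV of URLs.
df = pd.DataFrame({"pid": [1, 2, 3], "image_url": ["a", "b", "c"]})

seen = []
# total=len(df) tells tqdm how many rows to expect from the generator.
for idx, row in tqdm(df.iterrows(), total=len(df)):
    seen.append(row["pid"])  # placeholder for the real per-row work
```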
I would have posted this as a comment, but: the reason you want a progress bar is probably that the loop takes a long time, and iterrows() is a slow way to do operations in pandas.
I would suggest using apply() and avoiding iterrows().
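If you do switch to apply(), tqdm can still report progress through its pandas integration. A minimal sketch, assuming hypothetical columns a and b and a made-up add_columns function standing in for the real per-row work:

```python
import pandas as pd
from tqdm import tqdm

# Registers progress_apply on DataFrame and Series objects.
tqdm.pandas()

# Hypothetical data; the real frame would come from the CSV.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def add_columns(row):
    # Placeholder for the real per-row computation.
    return row["a"] + row["b"]

# Same call shape as df.apply(...), but with a progress bar.
df["c"] = df.progress_apply(add_columns, axis=1)
```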
If you want to keep using iterrows(), just include a counter that counts up to the number of rows, df.shape[0].
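The counter idea can be sketched without any extra library. The helper name format_progress and the plain list used as a stand-in for df.iterrows() are assumptions made for illustration:

```python
def format_progress(done, total, width=30):
    """Return a one-line text bar for `done` out of `total` items."""
    filled = int(width * done / total) if total else width
    percent = int(100 * done / total) if total else 100
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {percent}% ({done}/{total})"

rows = ["u1", "u2", "u3", "u4"]  # stand-in for df.iterrows()
total = len(rows)                # df.shape[0] for a DataFrame
for count, row in enumerate(rows, start=1):
    # ... per-row work goes here ...
    # "\r" returns to the start of the line so the bar redraws in place.
    print(format_progress(count, total), end="\r", flush=True)
print()
```

With a DataFrame, the loop header would become something like `for count, (idx, row) in enumerate(df.iterrows(), start=1)`.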
PySimpleGUI makes this about as simple a problem to solve as possible, assuming you know ahead of time how many items you have in your list. Indeterminate progress meters are possible, but a little more complicated.
There is no setup required before your loop. You don't need to make a special iterator. The only thing you need to do is add 1 line of code inside your loop.
Inside your loop, add a call to one_line_progress_meter. The name sums up what it is. Add this call to the top of your loop, the bottom, it doesn't matter... just add it somewhere that's looped.
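The 4 arguments passed in the example below are the window title, the current count, the total count, and a final string ('my meter').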
Here's a loop that iterates through a list of integers to demonstrate.
import PySimpleGUI as sg
items = list(range(1000))
total_items = len(items)
for index, item in enumerate(items):
    sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter')
The list iteration code will be whatever your loop code is. The line of code to focus on that you'll be adding is this one:
sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter' )
This line of code will show a progress meter window. It contains statistics such as how long the loop has been running and an estimate of how much longer it has to go.
How do I do that with pandas apply? This is what I do:
def some_func(a, b):
    global index
    c = ...  # some function involving a and b
    index += 1
    sg.one_line_progress_meter('My meter', index, len(df), 'my meter')
    return c
index = 0
df['c'] = df[['a','b']].apply(lambda x: some_func(*x), axis=1)
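One way to avoid the global counter is to keep the count inside a wrapper function. This is only a sketch: with_progress and its report callback are hypothetical names, and a plain list of pairs stands in for the rows of df[['a','b']] so the example runs without pandas or a GUI.

```python
def with_progress(func, total, report):
    """Wrap func so that report(done, total) runs after every call."""
    done = 0

    def wrapper(*args, **kwargs):
        nonlocal done
        result = func(*args, **kwargs)
        done += 1
        report(done, total)
        return result

    return wrapper

calls = []                           # records (done, total) pairs for the demo
pairs = [(1, 10), (2, 20), (3, 30)]  # stand-in for the rows of df[['a', 'b']]

tracked_add = with_progress(
    lambda a, b: a + b,                  # placeholder for the real function
    len(pairs),
    lambda d, t: calls.append((d, t)),   # placeholder reporter
)
results = [tracked_add(*pair) for pair in pairs]
```

With pandas, the wrapped function would be passed to apply in the same way as before, for example `df[['a','b']].apply(lambda x: tracked_add(*x), axis=1)`, and the reporter could instead call sg.one_line_progress_meter with the same arguments used earlier in this thread.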