I have been working on some Python code that reads a CSV file with about 800 rows and around 17,000 columns. I would like to check each entry in the CSV file and see whether the number is bigger or smaller than a given value; if it is, I assign a default value. I used pandas and worked with DataFrames, apply and lambda functions. It takes 172 minutes to go through all the entries in the CSV file. Is that normal? Is there a faster way to do this? I am using Python 2.7. I don't know if it helps, but I am running it on a Windows 10 machine with 32 GB of RAM. Thanks in advance for the help.
The code is attached below.
def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    modified_dataframe = pd.DataFrame()
    for k in col:
        temp_data = some_dataframe.apply(lambda x: check_for_range(x[k]), axis=1).tolist()
        dictionary = {}
        dictionary[str(k)] = temp_data
        temp_frame = pd.DataFrame(dictionary)
        modified_dataframe = pd.concat([modified_dataframe, temp_frame], axis=1)
    return modified_dataframe

def check_for_range(var):
    var = int(var)
    try:
        if var == 0:
            return 0
        if var == 1 or var == 4:
            return 1
        if var == 2 or var == 3 or var == 5 or var == 6:
            return 2
    except:
        print('error')

def get_req_colm(df):
    col = list(df)
    try:
        col.remove('index/Sample count')
        col.remove('index / Sample')
        col.remove('index')
        col.remove('count')
    except:
        pass
    return col

df_after_doing_something = do_something(some_dataframe)
df_after_doing_something.to_csv(output_folder + '\\df_after_doing_something.csv', index=False)
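pandas is a good choice for CSV data, but your code is inefficient: apply with axis=1 calls a Python function once per cell, and a new DataFrame is concatenated for every column. It should be much faster if you try the code given below, which works on whole NumPy arrays at once.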
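def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    # Pull the required columns into one 2-D integer NumPy array.
    values = some_dataframe[col].to_numpy().astype(int)
    # Start from zeros, so 0 maps to 0. Note that any value outside 0-6
    # also ends up as 0 here, whereas check_for_range returns None for it.
    np_array = np.zeros_like(values)
    # Boolean masks replace the per-cell apply/lambda calls.
    # Plain `or` does not work element-wise on arrays, hence np.isin.
    np_array[np.isin(values, [1, 4])] = 1
    np_array[np.isin(values, [2, 3, 5, 6])] = 2
    modified_dataframe = pd.DataFrame(np_array, columns=[str(k) for k in col])
    return modified_dataframe
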
def get_req_colm(df):
    col = list(df)
    try:
        col.remove('index/Sample count')
        col.remove('index / Sample')
        col.remove('index')
        col.remove('count')
    except:
        pass
    return col
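To make the masking idea concrete, here is a minimal, self-contained sketch on a tiny made-up frame (the column names "a" and "b" and the data are hypothetical, not from your file). It assumes the cells are already integers and uses np.isin to build element-wise masks.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the real 800 x 17000 data.
df = pd.DataFrame({"index": [0, 1, 2], "a": [0, 1, 4], "b": [2, 6, 5]})

values = df[["a", "b"]].values  # plain 2-D integer array

# Element-wise masks: np.isin (or `|` between comparisons) works on
# whole arrays, whereas `values == 1 or values == 4` raises ValueError.
result = np.zeros_like(values)
result[np.isin(values, [1, 4])] = 1
result[np.isin(values, [2, 3, 5, 6])] = 2

out = pd.DataFrame(result, columns=["a", "b"])
print(out)
```

Each assignment touches every matching cell in one call, so the number of Python-level operations no longer grows with the number of cells.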
This should run far faster than the apply-based version. Don't forget to import numpy:
import numpy as np
If this is unclear, a NumPy tutorial on boolean indexing will help. Otherwise, the linked question below covers the same technique:
Replacing elements in a numpy array when there are multiple conditions
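As a possible alternative to masks, the same mapping can be written as a lookup table and applied with integer-array indexing. This is only a sketch under one assumption: every cell is an integer from 0 to 6 (a larger value would raise IndexError and a negative one would silently index from the end). The helper name remap, its skip parameter and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Lookup table: position i holds the replacement for input value i.
# 0 -> 0, 1 and 4 -> 1, 2/3/5/6 -> 2.
LOOKUP = np.array([0, 1, 2, 2, 1, 2, 2])

def remap(frame, skip=("index", "count")):
    """Remap all columns except those in `skip` (hypothetical helper)."""
    cols = [c for c in frame.columns if c not in skip]
    values = frame[cols].values.astype(int)
    # Integer-array indexing applies the table to every cell at once.
    return pd.DataFrame(LOOKUP[values], columns=cols, index=frame.index)

sample = pd.DataFrame({"index": [10, 11], "x": [0, 4], "y": [3, 6]})
remapped = remap(sample)
print(remapped)
```

The table keeps the whole mapping in one place, which is convenient if the groups of values change later.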