What is the fastest way to manipulate large csv files in Python?

I have been working on Python code that reads a CSV file with about 800 rows and around 17,000 columns. I want to check each entry in the file and see whether the number is larger or smaller than a given value; if it is, I assign a default value. I used pandas and worked with DataFrames, apply, and lambda functions. It takes 172 minutes to go through all the entries in the CSV file. Is that normal? Is there a faster way to do this? I am using Python 2.7. I don't know if it helps, but I am running it on a Windows 10 machine with 32 GB of RAM. Thanks in advance for the help.

The code is attached below.


def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    modified_dataframe = pd.DataFrame()
    for k in col:
        temp_data = some_dataframe.apply(lambda x: check_for_range(x[k]), axis=1).tolist()
        dictionary = {}
        dictionary[str(k)] = temp_data
        temp_frame = pd.DataFrame(dictionary)
        modified_dataframe = pd.concat([modified_dataframe, temp_frame], axis=1)
    return modified_dataframe

def check_for_range(var):
    var = int(var)
    try:
        if var == 0:
            return 0
        if var == 1 or var == 4:
            return 1
        if var == 2 or var == 3 or var == 5 or var == 6:
            return 2
    except:
        print('error')

def get_req_colm(df):
    col = list(df)
    try:
        col.remove('index/Sample count')
        col.remove('index / Sample')
        col.remove('index')
        col.remove('count')
    except:
        pass
    return col

df_after_doing_something = do_something(some_dataframe)
df_after_doing_something.to_csv(output_folder + '\\df_after_doing_something.csv', index=False)

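pandas is a good fit for CSV data, but this code uses it inefficiently: it calls apply with a Python lambda row by row for every column and concatenates a new DataFrame on each iteration. It should be much faster if you work on the underlying NumPy array instead, as in the code below.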

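def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    # Pull the selected columns into a single integer NumPy array
    values = some_dataframe[col].to_numpy().astype(int)
    # Start from zeros, so entries equal to 0 are already mapped to 0.
    # Note: entries outside 0..6 also stay 0 here.
    np_array = np.zeros_like(values)
    # Boolean masks replace the per-element if-chain in check_for_range
    np_array[np.isin(values, [1, 4])] = 1
    np_array[np.isin(values, [2, 3, 5, 6])] = 2
    modified_dataframe = pd.DataFrame(np_array, columns=[str(k) for k in col])
    return modified_dataframe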

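def get_req_colm(df):
    col = list(df)
    # Remove each bookkeeping column independently; a single try block
    # would stop at the first missing name and skip the remaining ones
    for name in ('index/Sample count', 'index / Sample', 'index', 'count'):
        if name in col:
            col.remove(name)
    return col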

Don't forget to import NumPy:

import numpy as np
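
As a variation on the same idea, here is a minimal sketch that does the remapping with a lookup table and a single indexing step. It assumes every entry is an integer between 0 and 6, which the question only implies; the names LOOKUP, remap_with_lookup and sample are made up for illustration and the sample frame stands in for the real CSV.

```python
import numpy as np
import pandas as pd

# Lookup table: position i holds the replacement for the value i.
# 0 -> 0, 1 and 4 -> 1, 2, 3, 5 and 6 -> 2
LOOKUP = np.array([0, 1, 2, 2, 1, 2, 2])


def remap_with_lookup(frame):
    """Remap every entry of `frame` in one vectorized indexing step."""
    values = frame.to_numpy().astype(int)
    # Assumption: all entries lie in 0..6; fail loudly otherwise
    if values.min() < 0 or values.max() >= len(LOOKUP):
        raise ValueError("entries must lie in the range 0..6")
    return pd.DataFrame(LOOKUP[values], index=frame.index, columns=frame.columns)


# Hypothetical sample data standing in for the real CSV
sample = pd.DataFrame({"a": [0, 1, 2, 3], "b": [4, 5, 6, 0]})
remapped = remap_with_lookup(sample)
print(remapped)
```

Because the whole table is remapped in one array operation, there is no Python-level loop over rows or columns, which is where the original code spends its time.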

If this is unclear, it is worth working through a NumPy tutorial on boolean indexing first. The following question may also help:

Replacing elements in a numpy array when there are multiple conditions
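
For several conditions at once, numpy.select is one option. The sketch below is only an illustration, not the linked answer's code: the function name remap_with_select and the sentinel -1 for values that match no condition are assumptions.

```python
import numpy as np


def remap_with_select(values):
    """Map 0 -> 0, {1, 4} -> 1, {2, 3, 5, 6} -> 2, anything else -> -1."""
    values = np.asarray(values)
    conditions = [
        values == 0,
        np.isin(values, [1, 4]),
        np.isin(values, [2, 3, 5, 6]),
    ]
    choices = [0, 1, 2]
    # Entries matching no condition get the assumed sentinel -1
    return np.select(conditions, choices, default=-1)


example = remap_with_select([[0, 1, 4], [2, 7, 6]])
print(example)
```

Unlike starting from an array of zeros, the explicit default makes out-of-range values visible instead of silently turning them into 0.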
