Comparing two large CSV files takes too long

Question

I have two large CSV files with data that I want to compare. I loaded them with pandas so that I have two data frames that are easier to work with, but the program takes too long to compare all the data.

I am comparing the data sent with the data received in order to compute the latency. For that I use a nested loop, and the program works correctly. But I want to know whether there is a faster way to do this, because for my largest files the program takes days to finish.

The files are large: the first has 68001 rows and the second has 837190 rows. Thanks in advance.

Explanation of how my code works

I am running performance tests on an MQTT broker. I created paho clients that send and receive messages, and the data sent was stored in a CSV so that the latency could be calculated later. The publishers' CSV (users who publish) contains the publisher's client ID, the count, the timestamp, and the topic the message was sent to.

The subscribers' CSV (users who receive the messages) contains the timestamp at which the message was received, the message received, the count (a counter for the number of messages), the publisher's client ID, and the topic.

To calculate the latency, I evaluate row by row with a loop that starts at 0. I use one loop over df1 (the publishers' dataframe) and, for each of its rows, a second loop over df2 (the subscribers' dataframe), starting at its first row.

With the "if" conditional I compare the client ID, the count, and the topic of the current row of df1 (starting with the first row, which corresponds to the first client to send a message) against the current row of df2, to see whether the client ID, count, and topic match.

If they match, I subtract the times in the corresponding time columns to calculate the latency, and then store the result in another CSV, which is almost the last line of the code.

If the condition is not met, the code continues to the next iteration of the df2 loop. So df1 stays on the same row until the df2 loop has checked every row of its dataframe for matches. I hope I have explained myself well.

    #/////////////////////////////////////////////////////////////////
    #!/usr/bin/env python
    # encoding: utf-8

    import pandas as pd
    import numpy as np
    import re
    import datetime as dt
    #from datetime import datetime
    #import matplotlib.pyplot as plt

    file1='/home/carmen/Desktop/Pruebas/QoS2/50-1/latency_users/publicadores4.csv'
    file2='/home/carmen/Desktop/Pruebas/QoS2/50-1/latency_users/suscriptor_hilos4.csv'

    Resultados='/home/carmen/Desktop/Resultados/QoS2/50-1/Latencia4.csv'
    f=open(Resultados,'w')
    f.write("Client,Count,Latency,Topic \n")
    f.close()

    # Load the data from the CSV files
    data1=pd.read_csv(file1)
    data2=pd.read_csv(file2)

    # Convert the data into DataFrames
    df1=pd.DataFrame(data1)
    df2=pd.DataFrame(data2)

    # Show general information about the files
    data1.info()
    data2.info()

    pd.set_option('display.max_colwidth',1000)

    # Extract only hour:min:seconds.microseconds from Timestamp
    df1['Time1']=df1['Timestamp'].str.extract('(..:..:.........)',expand=True)
    df2['Time2']=df2['Timestamp'].str.extract('(..:..:.........)',expand=True)

    # Convert the extracted time columns to datetime format
    df1['Time1']=pd.to_datetime(df1['Time1'])
    df2['Time2']=pd.to_datetime(df2['Time2'])


    with open(Resultados, "a") as f:
        for i in range(len(df1)):
            for j in range(len(df2)):
                if( (df1.loc[i,'Client']== df2.loc[j,'Cliente']) and (df1.loc[i,'Count']==df2.loc[j,'Count'])\
                 and (df1.loc[i,'Topic']==df2.loc[j,'Topic'])):

                    print(df1.loc[i,'Client'],df2.loc[j,'Cliente'],df1.loc[i,'Count'],df2.loc[j,'Count'],df1.loc[i,'Time1'],df1.loc[i,'Topic'],df2.loc[j,'Topic'])


                    df1.loc[i,'Latencia']=(df2.loc[j,'Time2']-df1.loc[i,'Time1']) # compute the difference
                    df1['Latencia'] = df1['Latencia'].astype(str).str.split('0 days ').str[-1] # strip "0 days" from the result
            
                    f.write(str(df2.loc[j,'Cliente'])+","+str(df2.loc[j,'Count'])+","+str(df1.loc[i,'Latencia'])+","+str(df1.loc[i,'Topic'])+","+str(df2.loc[j,'Topic'])+'\n')

                else:
                    continue

Example of the data to compare:

df1:

    Client,Count,Timestamp,Topic
    Client_0000,000000,2021-04-22 09:01:43.627250,topic/2
    Client_0001,000000,2021-04-22 09:01:43.628319,topic/2
    Client_0002,000000,2021-04-22 09:01:43.629341,topic/3
    Client_0003,000000,2021-04-22 09:01:43.630497,topic/2
    Client_0004,000000,2021-04-22 09:01:43.631836,topic/1
    Client_0005,000000,2021-04-22 09:01:43.633540,topic/3
    Client_0006,000000,2021-04-22 09:01:43.635005,topic/2
    CONTINUE....

df2:

    Timestamp,Message,Count,Cliente,Topic
    2021-04-22 09:01:43.639642,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.642274,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.641392,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.643774,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.637687,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.639910,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.643982,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.653039,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
    2021-04-22 09:01:43.659924,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2

Answer 1

You can use a merge, which should be much faster than running nested loops:

    df2 = df2.rename(columns={'Cliente': 'Client'})
    df = df1.merge(df2, 'inner', on=['Client', 'Topic', 'Count'])
    df['latency'] = df['Timestamp_y'] - df['Timestamp_x']

Note that I have renamed the Cliente column of df2 to Client; you can rename it back after the merge if needed.
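To try the merge on data shaped like the samples in the question, here is a minimal, self-contained sketch. The two CSV snippets are made-up stand-ins for the real files (loaded from strings, with a shortened message), and the choices to parse the timestamps with parse_dates and to read Count as text are assumptions of this sketch, not something the question confirms.

```python
import io

import pandas as pd

# Made-up stand-ins for the publisher and subscriber CSV files.
pub_csv = io.StringIO(
    "Client,Count,Timestamp,Topic\n"
    "Client_0000,000000,2021-04-22 09:01:43.627250,topic/2\n"
    "Client_0001,000000,2021-04-22 09:01:43.628319,topic/2\n"
)
sub_csv = io.StringIO(
    "Timestamp,Message,Count,Cliente,Topic\n"
    "2021-04-22 09:01:43.639642,Hello,000000,Client_0000,topic/2\n"
    "2021-04-22 09:01:43.642274,Hello,000000,Client_0000,topic/2\n"
    "2021-04-22 09:01:43.641392,Hello,000000,Client_0001,topic/2\n"
)

# parse_dates turns Timestamp into datetime64 so it can be subtracted;
# dtype=str keeps the zero padding of Count (assumption: padding matters).
df1 = pd.read_csv(pub_csv, parse_dates=["Timestamp"], dtype={"Count": str})
df2 = pd.read_csv(sub_csv, parse_dates=["Timestamp"], dtype={"Count": str})

# Same steps as in the answer: align the key name, merge, subtract.
df2 = df2.rename(columns={"Cliente": "Client"})
df = df1.merge(df2, how="inner", on=["Client", "Topic", "Count"])
df["latency"] = df["Timestamp_y"] - df["Timestamp_x"]

print(df[["Client", "Count", "Topic", "latency"]])
```

A published message that was received by several subscribers produces one output row per match, which is the same set of pairs the nested loop would find.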

Alternatively, you can use left_on and right_on:

    df = df1.merge(df2, 'inner', left_on=['Client', 'Topic', 'Count'], right_on=['Cliente', 'Topic', 'Count'])
    df['latency'] = df['Timestamp_y'] - df['Timestamp_x']

This assumes that your Timestamp columns are in a valid datetime format so that you can subtract them to get the latency. Here is an example of what the output df looks like:

        Client          Count           Timestamp_x          Topic          Timestamp_y            Message   Cliente          latency
    0   Client_0000     000005  2021-05-09 13:05:36.316499  topic/2     2021-05-09 13:05:36.353677  Hello   Client_0000     0 days 00:00:00.037178
    1   Client_0063     000005  2021-05-09 13:05:39.505920  topic/6     2021-05-09 13:05:39.532888  Hello   Client_0063     0 days 00:00:00.026968
    2   Client_0071     000008  2021-05-09 13:05:39.913016  topic/5     2021-05-09 13:05:39.931340  Hello   Client_0071     0 days 00:00:00.018324
    3   Client_0082     000009  2021-05-09 13:05:40.487390  topic/9     2021-05-09 13:05:40.521418  Hello   Client_0082     0 days 00:00:00.034028
    4   Client_0097     000006  2021-05-09 13:05:41.248995  topic/10    2021-05-09 13:05:41.264659  Hello   Client_0097     0 days 00:00:00.015664
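If the Timestamp columns are still plain text after read_csv, the subtraction will fail, so they need converting first. The following is a rough sketch of one way to do that and to produce a Client,Count,Latency,Topic result like the one the question writes out. The one-row frames reuse the first row of the output above; the Latency column name and the choice to express latency in seconds are assumptions made for illustration.

```python
import pandas as pd

# Tiny frames with Timestamp stored as text, as read_csv returns it
# when parse_dates is not given.
df1 = pd.DataFrame({
    "Client": ["Client_0000"],
    "Count": ["000005"],
    "Timestamp": ["2021-05-09 13:05:36.316499"],
    "Topic": ["topic/2"],
})
df2 = pd.DataFrame({
    "Timestamp": ["2021-05-09 13:05:36.353677"],
    "Message": ["Hello"],
    "Count": ["000005"],
    "Cliente": ["Client_0000"],
    "Topic": ["topic/2"],
})

# Convert the text columns to datetime64 before merging.
df1["Timestamp"] = pd.to_datetime(df1["Timestamp"])
df2["Timestamp"] = pd.to_datetime(df2["Timestamp"])

df = df1.merge(
    df2,
    how="inner",
    left_on=["Client", "Topic", "Count"],
    right_on=["Cliente", "Topic", "Count"],
)

# Latency as a float number of seconds instead of a "0 days ..." Timedelta.
df["Latency"] = (df["Timestamp_y"] - df["Timestamp_x"]).dt.total_seconds()

# With no path, to_csv returns the CSV text; pass a file path to write to disk.
csv_text = df[["Client", "Count", "Latency", "Topic"]].to_csv(index=False)
print(csv_text)
```

Writing the whole result with a single to_csv call also avoids the per-row f.write calls of the original loop.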

solution1
1 ACCEPTED 2021-05-09 17:11:39
