I have two large CSV files whose data I want to compare. I loaded them with pandas so that I have two DataFrames to work with, but the program takes too long to compare all the data.
I am comparing the data sent with the data received in order to get the latency. For that I use a double loop, and the program works fine, but I want to know if there is a faster way to do this, because for my heaviest files it takes days to finish.
The files are large: the first has 68001 rows and the second has 837190 rows. Thanks in advance.
Explanation of how my code works
I am running performance tests on an MQTT broker. For these I created paho clients that send and receive messages, and the data sent was stored in a CSV so that the latency could be calculated later. The publishers' CSV (users who publish) contains the publisher's client ID, the count, the timestamp, and the topic the message is published to.
The subscribers' CSV (users who receive the messages) contains the timestamp at which the message was received, the message itself, the count (a counter for the number of messages), the publisher's client ID, and the topic.
To calculate the latency, I evaluate the data row by row with loops that start at 0: an outer loop over df1 (the publishers' DataFrame) and an inner loop over df2 (the subscribers' DataFrame).
With the "if" condition I compare the client ID, the count, and the topic of the current df1 row (starting with the first client that sent a message) with those of the current df2 row, to see whether all three match.
If they match, I subtract the two times to get the latency and store it in another CSV, which happens near the end of the code below.
If the condition is not met, the inner loop continues with the next row of df2. The df1 row therefore stays the same until the df2 loop has checked every one of its rows for matches. I hope I have explained myself well.
#/////////////////////////////////////////////////////////////////
#!/usr/bin/env python
# encoding: utf-8
import pandas as pd
import numpy as np
import re
import datetime as dt
#from datetime import datetime
#import matplotlib.pyplot as plt
file1='/home/carmen/Desktop/Pruebas/QoS2/50-1/latency_users/publicadores4.csv'
file2='/home/carmen/Desktop/Pruebas/QoS2/50-1/latency_users/suscriptor_hilos4.csv'
Resultados='/home/carmen/Desktop/Resultados/QoS2/50-1/Latencia4.csv'
f=open(Resultados,'w')
f.write("Client,Count,Latency,Topic \n")
f.close()
# Load the data from the CSV files
data1=pd.read_csv(file1)
data2=pd.read_csv(file2)
# Convert the data to DataFrames
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
# Show general information about the files
data1.info()
data2.info()
pd.set_option('display.max_colwidth',1000)
# Extract only hour:min:seconds.microseconds from Timestamp
df1['Time1']=df1['Timestamp'].str.extract('(..:..:.........)',expand=True)
df2['Time2']=df2['Timestamp'].str.extract('(..:..:.........)',expand=True)
# Convert the extracted time columns to datetime
df1['Time1']=pd.to_datetime(df1['Time1'])
df2['Time2']=pd.to_datetime(df2['Time2'])
with open(Resultados, "a") as f:
    for i in range(len(df1)):
        for j in range(len(df2)):
            if( (df1.loc[i,'Client']== df2.loc[j,'Cliente']) and (df1.loc[i,'Count']==df2.loc[j,'Count'])\
            and (df1.loc[i,'Topic']==df2.loc[j,'Topic'])):
                print(df1.loc[i,'Client'],df2.loc[j,'Cliente'],df1.loc[i,'Count'],df2.loc[j,'Count'],df1.loc[i,'Time1'],df1.loc[i,'Topic'],df2.loc[j,'Topic'])
                df1.loc[i,'Latencia']=(df2.loc[j,'Time2']-df1.loc[i,'Time1']) # compute the difference
                df1['Latencia'] = df1['Latencia'].astype(str).str.split('0 days ').str[-1] # remove "0 days" from the result
                f.write(str(df2.loc[j,'Cliente'])+","+str(df2.loc[j,'Count'])+","+str(df1.loc[i,'Latencia'])+","+str(df1.loc[i,'Topic'])+","+str(df2.loc[j,'Topic'])+'\n')
            else:
                continue
Example of Data to compare:
df1:
Client,Count,Timestamp,Topic
Client_0000,000000,2021-04-22 09:01:43.627250,topic/2
Client_0001,000000,2021-04-22 09:01:43.628319,topic/2
Client_0002,000000,2021-04-22 09:01:43.629341,topic/3
Client_0003,000000,2021-04-22 09:01:43.630497,topic/2
Client_0004,000000,2021-04-22 09:01:43.631836,topic/1
Client_0005,000000,2021-04-22 09:01:43.633540,topic/3
Client_0006,000000,2021-04-22 09:01:43.635005,topic/2
CONTINUE....
df2:
Timestamp,Message,Count,Cliente,Topic
2021-04-22 09:01:43.639642,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.642274,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.641392,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.643774,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.637687,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.639910,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.643982,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.653039,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
2021-04-22 09:01:43.659924,¡Hello World! ¡Welcome! The temperature is:91 °C,000000,Client_0000,topic/2
You can use a merge, which should be much faster than running nested loops:
df2 = df2.rename(columns={'Cliente': 'Client'})
df = df1.merge(df2, how='inner', on=['Client', 'Topic', 'Count'])
df['latency'] = df['Timestamp_y'] - df['Timestamp_x']
Note that I renamed the Cliente column of df2 to Client; you can rename it back after the merge if needed.
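To try the rename-and-merge approach on a small scale, here is a minimal sketch. It assumes the column names from the question; the in-memory CSV text (via io.StringIO), the shortened message text, and the use of parse_dates to get datetime columns are illustrative choices, not part of the original code.

```python
import io
import pandas as pd

# Tiny stand-ins for the two CSV files (rows adapted from the question's sample data).
publishers_csv = io.StringIO(
    "Client,Count,Timestamp,Topic\n"
    "Client_0000,000000,2021-04-22 09:01:43.627250,topic/2\n"
    "Client_0001,000000,2021-04-22 09:01:43.628319,topic/2\n"
)
subscribers_csv = io.StringIO(
    "Timestamp,Message,Count,Cliente,Topic\n"
    "2021-04-22 09:01:43.639642,Hello,000000,Client_0000,topic/2\n"
    "2021-04-22 09:01:43.642274,Hello,000000,Client_0000,topic/2\n"
)

# parse_dates makes read_csv return datetime64 columns instead of strings,
# so the two Timestamp columns can be subtracted after the merge.
df1 = pd.read_csv(publishers_csv, parse_dates=["Timestamp"])
df2 = pd.read_csv(subscribers_csv, parse_dates=["Timestamp"])

# Align the key column names, then join on the three matching keys.
df2 = df2.rename(columns={"Cliente": "Client"})
df = df1.merge(df2, how="inner", on=["Client", "Topic", "Count"])

# Timestamp_x comes from df1 (sent), Timestamp_y from df2 (received).
df["latency"] = df["Timestamp_y"] - df["Timestamp_x"]
print(df[["Client", "Count", "Topic", "latency"]])
```

Because one published message can be received by several subscribers, the merge yields one row per matching subscriber row, and publisher rows without any match (Client_0001 here) are dropped by the inner join.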
Alternatively, you can use left_on and right_on:
df = df1.merge(df2, how='inner', left_on=['Client', 'Topic', 'Count'], right_on=['Cliente', 'Topic', 'Count'])
df['latency'] = df['Timestamp_y'] - df['Timestamp_x']
This assumes that your Timestamp columns are in a valid datetime format, so that they can be subtracted to get the latency. Here is an example of what the output df looks like:
Client Count Timestamp_x Topic Timestamp_y Message Cliente latency
0 Client_0000 000005 2021-05-09 13:05:36.316499 topic/2 2021-05-09 13:05:36.353677 Hello Client_0000 0 days 00:00:00.037178
1 Client_0063 000005 2021-05-09 13:05:39.505920 topic/6 2021-05-09 13:05:39.532888 Hello Client_0063 0 days 00:00:00.026968
2 Client_0071 000008 2021-05-09 13:05:39.913016 topic/5 2021-05-09 13:05:39.931340 Hello Client_0071 0 days 00:00:00.018324
3 Client_0082 000009 2021-05-09 13:05:40.487390 topic/9 2021-05-09 13:05:40.521418 Hello Client_0082 0 days 00:00:00.034028
4 Client_0097 000006 2021-05-09 13:05:41.248995 topic/10 2021-05-09 13:05:41.264659 Hello Client_0097 0 days 00:00:00.015664
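The latency column is a Timedelta, which is why it prints with a "0 days" prefix. If a plain number is easier to work with, one option is to convert it to seconds and write the selected columns with to_csv. A minimal sketch, assuming a merged frame shaped like the output above; the two sample rows, the latency_s column name, and the in-memory buffer used in place of a file path are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical merged frame with the same columns as the output above (two rows only).
df = pd.DataFrame({
    "Client": ["Client_0000", "Client_0063"],
    "Count": [5, 5],
    "Topic": ["topic/2", "topic/6"],
    "Timestamp_x": pd.to_datetime(
        ["2021-05-09 13:05:36.316499", "2021-05-09 13:05:39.505920"]
    ),
    "Timestamp_y": pd.to_datetime(
        ["2021-05-09 13:05:36.353677", "2021-05-09 13:05:39.532888"]
    ),
})
df["latency"] = df["Timestamp_y"] - df["Timestamp_x"]

# Timedelta -> float seconds, which avoids the "0 days" prefix entirely.
df["latency_s"] = df["latency"].dt.total_seconds()

# Write only the columns of interest in one call instead of row by row.
# Pass a file path instead of the buffer to save the result to disk.
out = io.StringIO()
df[["Client", "Count", "latency_s", "Topic"]].to_csv(out, index=False)
csv_text = out.getvalue()
print(csv_text)
```

Writing the whole frame at once replaces the per-row f.write calls of the loop version, so the output step no longer depends on the number of comparisons.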