Pandas DataFrame takes too long

Question

I am running the code below on a file with close to 300k lines. I know my code is not very efficient, as it takes forever to finish. Can anyone advise me on how to speed it up?

import sys
import numpy as np
import pandas as pd


file = sys.argv[1]

df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]

orig_bytes = np.array(df['orig_bytes'])
resp_bytes = np.array(df['resp_bytes'])


size = np.array([])
ts = np.array([])
for i in range(len(df)):
    if orig_bytes[i] > resp_bytes[i]:
        size = np.append(size, orig_bytes[i])
        ts = np.append(ts, df['ts'][i])
    else:
        size = np.append(size, resp_bytes[i])
        ts = np.append(ts, df['ts'][i])

The aim is to record only the larger of the two values (orig_bytes or resp_bytes) for each row, together with its timestamp.

Thank you all for your help.

Answer 1

I can't guarantee that this will run faster than what you have, but it is a more direct route to where you want to go. Also, I'm assuming, based on your example, that you don't want to keep instances where the two byte values are equal, and that you want a separate DataFrame in the end, not a new column in the existing df.

After you've created your DataFrame and renamed the columns, you can use query to drop all the instances where orig_bytes and resp_bytes are the same, create a new column with the max value of the two, and then narrow the DataFrame down to just the two columns you want:

df = pd.read_csv(file, delimiter=' ', header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]

# Drop ties; .copy() avoids a SettingWithCopyWarning on the assignment below
df_new = df.query("orig_bytes != resp_bytes").copy()
# Row-wise maximum of the two byte counts
df_new['biggest_bytes'] = df_new[['orig_bytes', 'resp_bytes']].max(axis=1)
df_new = df_new[['ts', 'biggest_bytes']]

If you do want to include the entries where they are equal to each other, then just skip the query step.
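For comparison, here is a minimal sketch of both variants (ties kept and ties dropped) on a small in-memory frame. The sample values are made up for illustration and stand in for the parsed file, and using np.maximum is one possible alternative to the column-wise max, not necessarily the fastest option for your data. Note that the loop in the question takes the else branch when the two values are equal, so it appears to keep every row, which matches the first variant.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the parsed file.
df = pd.DataFrame({
    "ts": [1.0, 2.0, 3.0, 4.0],
    "orig_bytes": [100, 50, 70, 0],
    "resp_bytes": [40, 90, 70, 10],
})

# Variant 1: keep every row, ties included.
# np.maximum computes the element-wise maximum of the two columns in one call,
# so no Python-level loop or repeated np.append is needed.
with_ties = pd.DataFrame({
    "ts": df["ts"],
    "biggest_bytes": np.maximum(df["orig_bytes"], df["resp_bytes"]),
})

# Variant 2: drop the rows where the two values are equal,
# equivalent to the query step in the answer.
mask = df["orig_bytes"] != df["resp_bytes"]
without_ties = with_ties[mask].reset_index(drop=True)

print(with_ties)
print(without_ties)
```

Either result can be converted back to NumPy arrays with .to_numpy() if the rest of the script expects the size and ts arrays from the question.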
