
Creating new columns with Pandas df.apply

I have a huge NetFlow database (it contains a timestamp, source IP, destination IP, protocol, source and destination port numbers, packets exchanged, bytes, and more). I want to create custom attributes based on the current and previous rows.

I want to calculate new columns based on the source IP and timestamp of the current row. This is what I want to do, logically:

  • Get the source IP for the current row.
  • Get the timestamp for the current row.
  • Based on the source IP and timestamp, get all previous rows in the entire dataframe that match the source IP and whose communication happened in the last half hour. This is very important.
  • For the rows (flows, in my example) that match these criteria (same source IP, within the last half hour), compute the sum and mean of the packets and of the bytes.

One row from the dataset

Snippets of relevant code:

import pandas as pd

df = pd.read_csv(path, header=None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])

# Parse the timestamp column so rows can be compared by time
df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(ts, sa, size):
    global joined
    for (x, y) in zip(df['sa'], df['ts']):
        ...
    return sum

# Called once per row (axis=1)
df['prev30ipsumpkt'] = df.apply(lambda x: prev_30_ip_sum(x['ts'], x['sa'], x['pkt']), axis=1)

I know that there's probably a better, more efficient way to do this, but I'm sadly not the best programmer.

Thanks.

The steps are documented inline:

from datetime import timedelta

def fun(df, i):
  # Timestamp of the current row
  current = df.loc[i, 'ts']
  # Start of the 30-minute window before the current row
  last = current - timedelta(minutes=30)
  # Source IP of the current row
  ip = df.loc[i, 'sa']

  # Rows matching the criteria: same source IP, timestamp in [last, current)
  adf = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == ip)]

  # Return sum and mean
  return adf['pkt'].sum(), adf['pkt'].mean()

# Apply the fun over each row
result = [fun(df, i) for i in df.index]

# Create new columns
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]

Applied to the dataframe from the question:

import pandas as pd
from datetime import timedelta

df = pd.read_csv(path, header=None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])

df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(df, i):
  # Timestamp of the current row
  current = df.loc[i, 'ts']
  # Start of the 30-minute window before the current row
  last = current - timedelta(minutes=30)

  # Source address of the current row
  sa = df.loc[i, 'sa']

  # Rows from the previous 30 minutes with the same source address as the current row
  new_df = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == sa)]

  # Return sum and mean
  return new_df['pkt'].sum(), new_df['pkt'].mean()


# Take sa and timestamp of each row and create new dataframe
result = [prev_30_ip_sum(df, i) for i in df.index]

# Create new columns in the current dataframe
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]
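
The list comprehension above filters the whole dataframe once for every row, so the cost grows roughly quadratically with the number of flows. A possible faster variant, given here only as a minimal sketch, uses a time-based rolling window per source IP. The function name add_prev_30min_stats, the output column names, and the toy data are made up for illustration. closed='left' is meant to mirror the last <= ts < current filter; how rows with identical timestamps are handled should be checked on your pandas version.

```python
import pandas as pd


def add_prev_30min_stats(df, window="30min"):
    """Add the sum and mean of 'pkt' over the previous `window` per source IP.

    Hypothetical helper: expects columns 'sa', 'ts' (datetime64) and 'pkt'.
    """
    # Each group must be sorted by time for a time-based rolling window.
    out = df.sort_values(["sa", "ts"])

    # closed='left' gives the window [ts - window, ts), so the current row
    # itself is not counted.
    roller = out.groupby("sa").rolling(window, on="ts", closed="left")["pkt"]

    # The rolling result is ordered by group and then by time, which matches
    # the row order of `out`, so the values can be assigned by position.
    # An empty window yields NaN; the sum is set to 0 to match .sum() on an
    # empty selection, while the mean stays NaN.
    out["prev30_pkt_sum"] = roller.sum().fillna(0).to_numpy()
    out["prev30_pkt_mean"] = roller.mean().to_numpy()

    # Restore the original row order.
    return out.sort_index()


# Toy data for illustration only (not taken from the question's dataset).
demo = pd.DataFrame({
    "ts": pd.to_datetime([
        "2020-01-01 10:00", "2020-01-01 10:10", "2020-01-01 10:45",
        "2020-01-01 10:05", "2020-01-01 10:20",
    ]),
    "sa": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "pkt": [1, 2, 4, 10, 3],
})

out = add_prev_30min_stats(demo)
print(out)
```

The same pattern should work for the bytes column by selecting 'byt' instead of 'pkt'.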

See the documentation for timedelta to understand how it works.
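
As a quick illustration of what timedelta does in the code above (the timestamp value is made up): subtracting a timedelta from a timestamp gives the start of the 30-minute window.

```python
from datetime import timedelta

import pandas as pd

# Example timestamp, chosen only for illustration
current = pd.Timestamp("2020-01-01 10:00:00")

# Subtracting a timedelta shifts the timestamp back by that duration
last = current - timedelta(minutes=30)
print(last)  # 2020-01-01 09:30:00

# pandas has its own equivalent type, which compares equal
print(pd.Timedelta(minutes=30) == timedelta(minutes=30))  # True
```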
