Could you please advise how the following lines should be re-written based on http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.drop('PACKETS', axis=1, inplace=True)
produces
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.drop('PACKETS', axis=1, inplace=True)
/home/app/ip-spotlight/code/app/ipacc/plugin/ix.py:74: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
df.replace(numpy.nan, "", inplace=True)
produces
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.replace(numpy.nan, "", inplace=True)
/home/app/ip-spotlight/code/app/ipacc/plugin/ix.py:68: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
On the other hand, the following is an example of how it was re-written based on the above principle
df.loc[:, ('SRC_PREFIX')] = df[ ['SRC_NET', 'SRC_MASK'] ].apply(lambda x: "/".join(x), axis=1)
But i am unable to figure out how to re-write the cases 1 and 2 ?
EDIT : the code so far it looks like this ( df
is the dataframe of interest). So initially the is some kind of casting:
df = pandas.DataFrame(data['payload'], columns=sorted(data['header'], key=data['header'].get))
df = df.astype({
'SRC_AS' : "object",
'DST_AS' : "object",
'COMMS' : "object",
'SRC_COMMS' : "object",
'AS_PATH' : "object",
'SRC_AS_PATH' : "object",
'PREF' : "object",
'SRC_PREF' : "object",
'MED' : "object",
'SRC_MED' : "object",
'PEER_SRC_AS' : "object",
'PEER_DST_AS' : "object",
'PEER_SRC_IP' : "object",
'PEER_DST_IP' : "object",
'IN_IFACE' : "object",
'OUT_IFACE' : "object",
'SRC_NET' : "object",
'DST_NET' : "object",
'SRC_MASK' : "object",
'DST_MASK' : "object",
'PROTOCOL' : "object",
'TOS' : "object",
'SAMPLING_RATE' : "uint64",
'EXPORT_PROTO_VERSION' : "object",
'PACKETS' : "object",
'BYTES' : "uint64",
})
Then the calculate
function of a module is called:
mod.calculate(data['identifier'], data['timestamp'], df)
And the calculate
function is defined like this:
def calculate(identifier, timestamp, df):
try:
# Filter based on AORTA IX.
lut_ipaddr = lookup_ipaddr()
df = df[ (df.PEER_SRC_IP.isin( lut_ipaddr )) ]
if df.shape[0] > 0:
logger.info('analyzing message `{}`'.format(identifier))
# Preparing for input.
df.replace("", numpy.nan, inplace=True)
# Data wrangling. Calculate traffic rate. Reduce.
df.loc[:, ('BPS')] = 8*df['BYTES']*df['SAMPLING_RATE']/300
df.drop(columns=['SAMPLING_RATE', 'EXPORT_PROTO_VERSION', 'PACKETS', 'BYTES'], inplace=True)
# Data wrangling. Formulate prefixes using CIDR notation. Reduce.
df.loc[:, ('SRC_PREFIX')] = df[ ['SRC_NET', 'SRC_MASK'] ].apply(lambda x: "/".join(x), axis=1)
df.loc[:, ('DST_PREFIX')] = df[ ['DST_NET', 'DST_MASK'] ].apply(lambda x: "/".join(x), axis=1)
df.drop(columns=['SRC_NET', 'SRC_MASK', 'DST_NET' ,'DST_MASK'], inplace=True)
# Populate using lookup tables.
df.loc[:, ('NETELEMENT')] = df['PEER_SRC_IP'].apply(lookup_netelement)
df.loc[:, ('IN_IFNAME')] = df.apply(lambda x: lookup_iface(x['NETELEMENT'], x['IN_IFACE']), axis=1)
df.loc[:, ('OUT_IFNAME')] = df.apply(lambda x: lookup_iface(x['NETELEMENT'], x['OUT_IFACE']), axis=1)
# df.loc[:, ('SRC_ASNAME')] = df.apply(lambda x: lookup_asn(x['SRC_AS']), axis=1)
# Add a timestamp.
df.loc[:, ('METERED_ON')] = arrow.get(timestamp, "YYYYMMDDHHmm").format("YYYY-MM-DD HH:mm:ss")
# Preparing for input.
df.replace(numpy.nan, "", inplace=True)
# Finalize !
return identifier, timestamp, df.to_dict(orient="records")
else:
logger.info('going through message `{}` no IX bgp/netflow data were found'.format(identifier))
except Exception as e:
logger.error('processing message `{}` at `{}` caused `{}`'.format(identifier,timestamp,repr(e)), exc_info=True)
return identifier, timestamp, None
Ok. I don't really know what is going on under the hood of pandas. But still, I've tried to come up with some minimal examples to show you where the problem can be and what you can do about it. First, creating dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(x=[0, 1, 2],
y=[0, 0, 5]))
Then, as you pass your dataframe to a function, I will do the same but for 2 almost identical functions:
def func(dfx):
# Analog of your df = df[df.PEER_SRC_IP.isin(lut_ipaddr)]
dfx = dfx[dfx['x'] > 1.5]
# Analog of your df.replace("", numpy.nan, inplace=True)
dfx.replace(5, np.nan, inplace=True)
def func_with_copy(dfx):
dfx = dfx[dfx['x'] > 1.5].copy() # explicitly making a copy
dfx.replace(5, np.nan, inplace=True)
Now let's call them for initial df:
func_with_copy(df)
print(df)
gives
x y
0 0 0
1 1 0
2 2 5
and no warning. And calling this:
func(df)
print(df)
gives the same output:
x y
0 0 0
1 1 0
2 2 5
but with the warning:
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
So this looks like a 'false positive'. Here is a good comment on false positives: link
Strange thing here is that if you do exactly the same manipulations with your dataframe but without passing it to a function, then you won't see this warning. ¯\\_(ツ)_/¯
My advice is to use .copy()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.