How to optimize the following code?

Question

I'm writing a Python program to replace some values in a DataFrame. I have a file called file.txt that looks like this:

A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
D:f:C:0.1:0.2:0.1:0.1:0.1
F:f:C:0.1:-0.1:-0.1:0.1:0.1
G:f:C:0.0:-0.1:0.1:0.3:0.4
H:M:D:0.1:0.4:0.1:0.0:0.4

I want to use ':' as the separator and replace the values from the fourth column onward with strings, following these rules:

All values that belong to range1 are replaced with 'N':

range1=[-0.2,-0.1,0,0.1,0.2] -> 'N'

All values that belong to range2 are replaced with 'L':

range2=[-0.5,-0.4,-0.3] -> 'L'

All values that belong to range3 are replaced with 'H':

range3=[0.3,0.4,0.5] -> 'H'

In order to achieve this I tried the following:

import pandas as pd

# Read the colon-separated file; columns are numbered 0..7
df = pd.read_csv('file.txt', sep=':', header=None)

labels = df[3]  # not used below

range1 = [-0.2, -0.1, 0, 0.1, 0.2]
range2 = [-0.5, -0.4, -0.3]
range3 = [0.3, 0.4, 0.5]

# Map each label to the list of values it replaces
lookup = {'N': range1, 'L': range2, 'H': range3}

# One loop per numeric column (3 to 7)
for k, v in lookup.items():
    df.loc[df[3].isin(v), 3] = k

for k, v in lookup.items():
    df.loc[df[4].isin(v), 4] = k

for k, v in lookup.items():
    df.loc[df[5].isin(v), 5] = k

for k, v in lookup.items():
    df.loc[df[6].isin(v), 6] = k

for k, v in lookup.items():
    df.loc[df[7].isin(v), 7] = k

print(df)

This works well, but I want to avoid using so many for loops. I would appreciate any suggestions on how to achieve this.

Answer 1

You can use where instead:

for k, v in lookup.items():
    df = df.where(~df.isin(v), k)

This says to retain the values of df when those values are not contained in v. Otherwise, replace them with the value k. The assignment overwrites df at each iteration to accumulate the categorical labels.
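As a rough end-to-end sketch, the loop above can be run against the sample rows from the question. Reading the rows from an in-memory string through io.StringIO instead of file.txt is an assumption made only to keep the example self-contained; everything else follows the question and this answer.

```python
import io

import pandas as pd

# Sample rows copied from the question's file.txt (held in memory here
# so the sketch does not depend on a file on disk).
data = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
D:f:C:0.1:0.2:0.1:0.1:0.1
F:f:C:0.1:-0.1:-0.1:0.1:0.1
G:f:C:0.0:-0.1:0.1:0.3:0.4
H:M:D:0.1:0.4:0.1:0.0:0.4
"""

df = pd.read_csv(io.StringIO(data), sep=":", header=None)

# Label -> list of values that should be replaced by that label
lookup = {
    "N": [-0.2, -0.1, 0, 0.1, 0.2],
    "L": [-0.5, -0.4, -0.3],
    "H": [0.3, 0.4, 0.5],
}

# Keep each cell unless its value is in v; otherwise put the label k there.
# Reassigning df accumulates the labels across iterations.
for k, v in lookup.items():
    df = df.where(~df.isin(v), k)

print(df)
```

Note that the numeric columns end up holding strings, so they no longer have a float dtype afterwards.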

This method works on all columns in one operation, so it only works if you want to replace every instance of a given numeric value with its categorical coded letter.
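If only some columns should be relabelled, one possible variation is to apply the same where loop to a column subset and assign the result back. This is a sketch rather than part of the original answer: the choice of columns 3 and 4 (the name cols) is hypothetical, and the data is a shortened in-memory copy of the question's file.

```python
import io

import pandas as pd

# First three sample rows from the question, held in memory
data = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
"""

df = pd.read_csv(io.StringIO(data), sep=":", header=None)

lookup = {
    "N": [-0.2, -0.1, 0, 0.1, 0.2],
    "L": [-0.5, -0.4, -0.3],
    "H": [0.3, 0.4, 0.5],
}

# Hypothetical choice: relabel only columns 3 and 4
cols = [3, 4]

# Run the where loop on a copy of just those columns
subset = df[cols]
for k, v in lookup.items():
    subset = subset.where(~subset.isin(v), k)

# Write the relabelled columns back; columns 5 to 7 stay numeric
df[cols] = subset

print(df)
```

Columns outside cols are never passed to isin, so their numeric values are left alone.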

There is another option for where that specifies in-place modification, but unfortunately it cannot be used with DataFrames that have mixed column types. In your example, columns 0, 1, and 2 have type object while the rest have type float. Thus, pandas conservatively (and inefficiently) assumes it would have to convert everything to object to do the in-place overwrite, and raises a TypeError rather than checking further to see if only same-typed columns are actually affected by the mutation.

For example, this:

df.where(~df.isin(v), k, inplace=True)

will raise TypeError.

This limitation in pandas is fairly frustrating. For example, you cannot use regular pandas assignment to work around it either, as the following gives the same TypeError:

for k, v in lookup.items():
    df.where(~df.isin(v), inplace=True)
    df[df.isnull()] = k # <-- same TypeError

Surprisingly, setting the try_cast keyword argument to True and/or the raise_on_error keyword argument to False does not affect whether the TypeError is raised, so you cannot disable this type-safety check when using where.
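A different technique that avoids in-place where altogether, offered here only as a sketch and not as part of the original answer, is to flatten lookup into a value-to-label dict and map each numeric column through it with Series.map. The names flat and cols are hypothetical, and the data is a shortened in-memory copy of the question's file. Any value missing from the dict would become NaN, which does not happen with these sample rows.

```python
import io

import pandas as pd

# First three sample rows from the question, held in memory
data = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
"""

df = pd.read_csv(io.StringIO(data), sep=":", header=None)

lookup = {
    "N": [-0.2, -0.1, 0, 0.1, 0.2],
    "L": [-0.5, -0.4, -0.3],
    "H": [0.3, 0.4, 0.5],
}

# Invert lookup into a flat dict: numeric value -> label
flat = {float(value): label
        for label, values in lookup.items()
        for value in values}

# The numeric columns in the question's file
cols = [3, 4, 5, 6, 7]

# Map every value in each numeric column to its label in one pass;
# values that are not keys of flat would come out as NaN.
df[cols] = df[cols].apply(lambda s: s.map(flat))

print(df)
```

This relies on exact float equality between the parsed values and the dict keys, just as isin does in the original loop.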

solution1
4 ACCEPTED 2016-04-19 16:16:52
