How to optimize the following code?

I'm writing a Python program to replace some values in a data frame. I have a file called file.txt that looks like this:

A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
D:f:C:0.1:0.2:0.1:0.1:0.1
F:f:C:0.1:-0.1:-0.1:0.1:0.1
G:f:C:0.0:-0.1:0.1:0.3:0.4
H:M:D:0.1:0.4:0.1:0.0:0.4

I want to use ':' as the separator and replace the values in the numeric columns (columns 3 to 7 in the code below) with strings, following these rules:

All values that belong to range1 are replaced with 'N':

range1=[-0.2,-0.1,0,0.1,0.2] -> 'N'

All values that belong to range2 are replaced with 'L':

range2=[-0.5,-0.4,-0.3] -> 'L'

All values that belong to range3 are replaced with 'H':

range3=[0.3,0.4,0.5] -> 'H'

To achieve this I tried the following:

import pandas as pd

df = pd.read_csv('file.txt', sep=':', header=None)

labels = df[3]

range1 = [-0.2, -0.1, 0, 0.1, 0.2]
range2 = [-0.5, -0.4, -0.3]
range3 = [0.3, 0.4, 0.5]

lookup = {'N': range1, 'L': range2, 'H': range3}

for k, v in lookup.items():
    df.loc[df[3].isin(v), 3] = k

for k, v in lookup.items():
    df.loc[df[4].isin(v), 4] = k

for k, v in lookup.items():
    df.loc[df[5].isin(v), 5] = k

for k, v in lookup.items():
    df.loc[df[6].isin(v), 6] = k

for k, v in lookup.items():
    df.loc[df[7].isin(v), 7] = k

print(df)

It works well, but I want to avoid using so many for loops. I would appreciate any suggestions on how to achieve this.

You can use where instead:

for k, v in lookup.items():
    df = df.where(~df.isin(v), k)

This retains the values of df wherever they are not contained in v and otherwise replaces them with the value k. The assignment overwrites df at each iteration, so the categorical labels accumulate.
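To make the behaviour concrete, here is a minimal runnable sketch of that loop. It assumes the first three rows of the question's file are read from an in-memory string instead of file.txt (the `sample` variable is introduced here for illustration only); the lookup table is the one from the question.

```python
import io

import pandas as pd

# First three rows of the question's file.txt, held in memory for the demo
sample = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
"""
df = pd.read_csv(io.StringIO(sample), sep=':', header=None)

lookup = {
    'N': [-0.2, -0.1, 0, 0.1, 0.2],
    'L': [-0.5, -0.4, -0.3],
    'H': [0.3, 0.4, 0.5],
}

for k, v in lookup.items():
    # Keep cells whose value is NOT in v; every other cell becomes the label k
    df = df.where(~df.isin(v), k)

print(df)
```

After the loop, every numeric cell in columns 3 to 7 holds 'N', 'L' or 'H', while the string columns 0 to 2 are unchanged because none of their values appear in the lookup lists.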

This method operates on all columns at once, so it is only suitable if you want to replace every instance of a given numeric value, in any column, with its categorical letter.

where also accepts an inplace option for in-place modification, but unfortunately it cannot be used with DataFrames that have mixed column types. In your example, columns 0, 1, and 2 have type object while the rest have type float. pandas therefore conservatively (and inefficiently) assumes it would have to convert everything to object to do the in-place overwrite, and raises a TypeError rather than checking whether only same-typed columns are actually affected by the mutation.
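The mixed column types can be checked directly. A small sketch, again assuming a few sample rows loaded from an in-memory string; `numeric_cols` and `other_cols` are names made up for this illustration, and the exact dtype reported for the string columns may differ between pandas versions:

```python
import io

import pandas as pd

sample = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
"""
df = pd.read_csv(io.StringIO(sample), sep=':', header=None)

# One dtype per column: string columns first, then the float columns
print(df.dtypes)

# Split the column labels by kind using select_dtypes
numeric_cols = df.select_dtypes(include='number').columns.tolist()
other_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```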

For example, this:

df.where(~df.isin(v), k, inplace=True)

will raise TypeError.

This limitation in pandas is fairly frustrating. You cannot use regular pandas assignment to work around it either, as the following gives the same TypeError:

for k, v in lookup.items():
    df.where(~df.isin(v), inplace=True)
    df[df.isnull()] = k # <-- same TypeError  

Surprisingly, setting the try_cast keyword argument to True and/or the raise_on_error keyword argument to False does not affect whether the TypeError is raised, so you cannot disable this type safety check when using where.
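One possible way around this, offered as a sketch rather than a definitive fix, is to skip inplace altogether: apply the non-in-place where to the numeric columns only and assign the result back. This assumes only the numeric columns need labelling; `numeric_cols` and `labelled` are illustrative names, and the sample rows are read from an in-memory string.

```python
import io

import pandas as pd

sample = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
"""
df = pd.read_csv(io.StringIO(sample), sep=':', header=None)

lookup = {
    'N': [-0.2, -0.1, 0, 0.1, 0.2],
    'L': [-0.5, -0.4, -0.3],
    'H': [0.3, 0.4, 0.5],
}

# Work on the numeric columns only, so the string columns are never touched
numeric_cols = df.select_dtypes(include='number').columns.tolist()
labelled = df[numeric_cols]

for k, v in lookup.items():
    # Non-in-place where: returns a new frame on each pass
    labelled = labelled.where(~labelled.isin(v), k)

# Replace the original numeric columns with their labelled versions
df[numeric_cols] = labelled

print(df)
```

Because the string columns are left out of the slice, this also avoids replacing a value that happens to match a lookup entry in a column you did not mean to label.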
