How to optimize the following code?
I'm writing a Python program to replace some values in a data frame. I have a file called file.txt that looks like this:
A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
D:f:C:0.1:0.2:0.1:0.1:0.1
F:f:C:0.1:-0.1:-0.1:0.1:0.1
G:f:C:0.0:-0.1:0.1:0.3:0.4
H:M:D:0.1:0.4:0.1:0.0:0.4
I want to use ':' as the separator, and I want to replace the numeric values (from the fourth column onward) with strings according to the following rules:
All values that belong to range1 are replaced with 'N':
range1=[-0.2,-0.1,0,0.1,0.2] -> 'N'
All values that belong to range2 are replaced with 'L':
range2=[-0.5,-0.4,-0.3] -> 'L'
All values that belong to range3 are replaced with 'H':
range3=[0.3,0.4,0.5] -> 'H'
To achieve this I tried the following:
import pandas as pd
df= pd.read_csv('file.txt', sep=':',header=None)
labels=df[3]
range1=[-0.2,-0.1,0,0.1,0.2]
range2=[-0.5,-0.4,-0.3]
range3=[0.3,0.4,0.5]
lookup = {'N': range1, 'L': range2, 'H': range3}
for k, v in lookup.items():
    df.loc[df[3].isin(v), 3] = k
for k, v in lookup.items():
    df.loc[df[4].isin(v), 4] = k
for k, v in lookup.items():
    df.loc[df[5].isin(v), 5] = k
for k, v in lookup.items():
    df.loc[df[6].isin(v), 6] = k
for k, v in lookup.items():
    df.loc[df[7].isin(v), 7] = k
print(df)
It works well, but I want to avoid using so many for loops. I would appreciate any suggestions on how to achieve this.
You can use where instead:
for k, v in lookup.items():
    df = df.where(~df.isin(v), k)
This says to retain the values of df when those values are not contained in v. Otherwise, replace them with the value k. The assignment overwrites df at each iteration to accumulate the categorical labels.
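To see the loop end to end, here is a minimal sketch that runs it on the first three rows of the sample file. Reading from an in-memory string through io.StringIO instead of file.txt, and the loop variable names label and values, are choices made for this illustration only.

```python
import io

import pandas as pd

# First three rows of the sample file, kept in memory for the illustration.
raw = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
"""
df = pd.read_csv(io.StringIO(raw), sep=":", header=None)

lookup = {
    "N": [-0.2, -0.1, 0, 0.1, 0.2],
    "L": [-0.5, -0.4, -0.3],
    "H": [0.3, 0.4, 0.5],
}

# Keep each cell unless its value is listed for the current label;
# cells that match are replaced by the label.
for label, values in lookup.items():
    df = df.where(~df.isin(values), label)

print(df)
```

Columns 0 to 2 hold strings that never match any of the numbers, so they pass through unchanged.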
This method works on all columns in one operation, so it is only suitable if you want to replace every instance of a given numeric value, in any column, with its categorical letter.
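If only some columns should be recoded, the same call can be applied to a column subset and the result assigned back. This is a sketch rather than part of the original answer: the name cols and the choice of columns 3 and 4 are assumptions made for the illustration.

```python
import io

import pandas as pd

raw = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
"""
df = pd.read_csv(io.StringIO(raw), sep=":", header=None)

lookup = {
    "N": [-0.2, -0.1, 0, 0.1, 0.2],
    "L": [-0.5, -0.4, -0.3],
    "H": [0.3, 0.4, 0.5],
}

# Hypothetical choice: recode only columns 3 and 4, leave 5 to 7 numeric.
cols = [3, 4]
for label, values in lookup.items():
    subset = df[cols]
    df[cols] = subset.where(~subset.isin(values), label)

print(df)
```

Columns outside cols keep their float values, so a number such as 0.1 in column 5 is not touched.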
There is another option for where that specifies in-place modification, but unfortunately it cannot be used with DataFrames that have mixed column types. In your example, columns 0, 1, and 2 have type object while the rest have type float. Thus, pandas conservatively (and inefficiently) assumes it would have to convert everything to object to do the in-place overwrite, and raises a TypeError rather than checking further to see if only same-typed columns are actually affected by the mutation.
For example, this:
df.where(~df.isin(v), k, inplace=True)
will raise TypeError.
This limitation in pandas is fairly frustrating. You cannot use regular pandas assignment to work around it either, as the following gives the same TypeError:
for k, v in lookup.items():
    df.where(~df.isin(v), inplace=True)
    df[df.isnull()] = k  # <-- same TypeError
Surprisingly, setting the try_cast keyword argument to True and/or setting the raise_on_error keyword argument to False does not affect whether the TypeError is raised, so you cannot disable this type safety check when using where.
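One way to sidestep the mixed-type question altogether is to leave the object columns out and recode only the numeric ones. The sketch below is a different technique from the where loop above: it inverts lookup into a value-to-label dictionary and applies Series.map column by column. The names value_to_label and num_cols are introduced here for illustration, and the approach assumes every numeric value appears in one of the ranges, because map turns unmatched values into NaN.

```python
import io

import pandas as pd

raw = """A:s:Y:0.1:0.1:0.1:0.2:0.1
B:r:D:0.3:0.5:0.1:0.2:0.2
C:f:C:0.3:0.4:0.2:-0.1:0.4
"""
df = pd.read_csv(io.StringIO(raw), sep=":", header=None)

lookup = {
    "N": [-0.2, -0.1, 0, 0.1, 0.2],
    "L": [-0.5, -0.4, -0.3],
    "H": [0.3, 0.4, 0.5],
}

# Invert the lookup: one entry per numeric value, pointing at its label.
value_to_label = {
    float(value): label
    for label, values in lookup.items()
    for value in values
}

# Select only the numeric columns (3 to 7 in the sample data).
num_cols = df.select_dtypes(include="number").columns

# Map every numeric cell to its label; values missing from the
# dictionary would become NaN.
df[num_cols] = df[num_cols].apply(lambda col: col.map(value_to_label))

print(df)
```

This version needs no loop over the labels, at the cost of building the inverted dictionary first.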