Pandas DataFrame change a value based on column, index values comparison
Suppose that you have a pandas DataFrame which has some kind of data in the body and numbers in the column and index names.
>>> import numpy as np
>>> import pandas as pd
>>> data = np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
>>> columns = [2, 4, 8]
>>> index = [10, 4, 2]
>>> df = pd.DataFrame(data, columns=columns, index=index)
>>> df
    2  4  8
10  a  b  c
4   d  e  f
2   g  h  i
Now suppose we want to manipulate our data frame in some way based on comparing the index and columns. Consider the following.
Where the index is greater than the column, replace the letter with 'k':
    2  4  8
10  k  k  k
4   k  e  f
2   g  h  i
Where the index is equal to the column, replace the letter with 'U':
    2  4  8
10  k  k  k
4   k  U  f
2   U  h  i
Where the column is greater than the index, replace the letter with 'Y':
    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y
To keep the question useful to all:
What is a fast way to do this replacement?
What is the simplest way to do this replacement?
Speed results from the minimal example:
jezrael: 556 µs ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
user3471881: 329 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
thunderwood: 4.65 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is this a duplicate? I searched Google for "pandas replace compare index column" and the top results are:
Pandas - Compare two dataframes and replace values matching condition
Python pandas: replace values based on location not index value
Pandas DataFrame: replace all values in a column, based on condition
However, I don't feel that any of these touch on a) whether this is possible or b) how to compare in such a way.
I think you need numpy.select with broadcasting:
m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values
df = pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
print(df)
    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y
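To make the broadcasting step concrete, here is a minimal sketch on the 3x3 frame from the question. The names `idx`, `cols`, `greater`, `equal` and `result` are illustrative only and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]),
    columns=[2, 4, 8],
    index=[10, 4, 2],
)

# Index labels as a column vector, shape (3, 1).
idx = df.index.values[:, None]
# Column labels as a flat vector, shape (3,).
cols = df.columns.values

# Broadcasting compares every index label with every column label,
# giving boolean masks of shape (3, 3).
greater = idx > cols
equal = idx == cols

# np.select picks the choice of the first condition that is True per cell
# and falls back to the default ('Y') where neither condition holds.
labels = np.select([greater, equal], ['k', 'U'], default='Y')
result = pd.DataFrame(labels, index=df.index, columns=df.columns)

print(greater)
print(result)
```

Only two masks are needed because any cell that is neither "greater" nor "equal" must be "less", so the default value covers the third case.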
Performance:
np.random.seed(1000)
N = 1000
a = np.random.randint(100, size=N)
b = np.random.randint(100, size=N)
df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)), columns=a, index=b)
#print (df)
def us(df):
    values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
    greater = values > 0
    less = values < 0
    same = values == 0
    values[greater] = 'k'
    values[less] = 'Y'
    values[same] = 'U'
    return pd.DataFrame(values, columns=df.columns, index=df.index)

def jez(df):
    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
In [236]: %timeit us(df)
107 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [237]: %timeit jez(df)
64 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
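The timings above come from IPython's %timeit. As a rough sketch of how a similar measurement could be taken in a plain script, the standard-library timeit module can be used. The smaller N, the `default_rng` generator and the `best` variable are assumptions made for this illustration, so the numbers it prints are not comparable to those above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(1000)
N = 200  # assumption: smaller than N = 1000 above so the sketch runs quickly

df = pd.DataFrame(
    rng.choice(list('abcdefgh'), size=(N, N)),
    columns=rng.integers(100, size=N),
    index=rng.integers(100, size=N),
)

def jez(df):
    # Same broadcasting + np.select approach as in the answer.
    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k', 'U'], 'Y'),
                        columns=df.columns, index=df.index)

# Five repeats of ten calls each; take the best repeat and
# divide by the number of calls to get seconds per call.
best = min(timeit.repeat(lambda: jez(df), number=10, repeat=5)) / 10
print(f"jez: {best * 1e3:.2f} ms per call")
```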
Not sure about the fastest way to accomplish this, but an incredibly simple way would be to just iterate over the dataframe like so:
for i in df.index:
    for j in df.columns:
        if i > j:
            df.loc[i, j] = 'k'
        elif j > i:
            df.loc[i, j] = 'Y'
        else:
            df.loc[i, j] = 'U'
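The loop above modifies df in place and looks cells up by label, which assumes the index and column labels are unique. A possible variant, sketched here under the assumption that a new frame is preferred, writes by position with `.iat` and uses the letters from the question. The helper name `relabel_loop` is hypothetical.

```python
import numpy as np
import pandas as pd

def relabel_loop(df):
    """Hypothetical helper: loop version that returns a new frame."""
    out = df.copy()
    # enumerate gives the integer position alongside each label, so
    # repeated labels in the index or columns do not cause trouble.
    for r, i in enumerate(out.index):
        for c, j in enumerate(out.columns):
            if i > j:
                out.iat[r, c] = 'k'
            elif i < j:
                out.iat[r, c] = 'Y'
            else:
                out.iat[r, c] = 'U'
    return out

df = pd.DataFrame(
    np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]),
    columns=[2, 4, 8],
    index=[10, 4, 2],
)
result = relabel_loop(df)
print(result)
```

This remains a cell-by-cell Python loop, so it is still expected to be much slower than the vectorized answers on large frames.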
1. Using np.arrays + np.select:
values = np.array(np.array([df.index]).transpose() - np.array([df.columns]))
greater = values > 0
same = values == 0
df = pd.DataFrame(np.select([greater, same], ['k', 'U'], 'Y'), columns=df.columns, index=df.index)
2. Using np.arrays and manual masking:
values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
greater = values > 0
less = values < 0
same = values == 0
values[greater] = 'k'
values[less] = 'Y'
values[same] = 'U'
df = pd.DataFrame(values, columns=df.columns, index=df.index)
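As a quick sanity check, the two approaches can be wrapped in functions and compared on a frame with random integer labels. This is only a sketch: the function names `with_select` and `with_masks`, the frame size and the random generator are assumptions for illustration. Both versions subtract the labels, so they assume numeric index and column labels.

```python
import numpy as np
import pandas as pd

def with_select(df):
    # Difference of every index label with every column label.
    diff = df.index.values[:, None] - df.columns.values
    labels = np.select([diff > 0, diff == 0], ['k', 'U'], 'Y')
    return pd.DataFrame(labels, columns=df.columns, index=df.index)

def with_masks(df):
    diff = df.index.values[:, None] - df.columns.values
    # Separate object array for the output, so the masks are computed
    # on the integer differences and never on partly overwritten values.
    values = np.empty(diff.shape, dtype=object)
    values[diff > 0] = 'k'
    values[diff < 0] = 'Y'
    values[diff == 0] = 'U'
    return pd.DataFrame(values, columns=df.columns, index=df.index)

rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame(
    rng.choice(list('abcdefgh'), size=(n, n)),
    columns=rng.integers(100, size=n),
    index=rng.integers(100, size=n),
)

a = with_select(df)
b = with_masks(df)
same = a.to_numpy().tolist() == b.to_numpy().tolist()
print(same)
```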