Pandas DataFrame: compare all column values against one indexed row without referencing column names

Question

I have an indexed DataFrame with many columns, for example:

    Feature1
    Feature2
    Feature3
    Feature4
....

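I just want to implement a function that creates a new DataFrame (or another kind of data structure) by comparing the values of one test sample row against the values of every other row (including the test sample itself). Where the values are equal the result should be "1", otherwise "0". Since I have 91 columns, I do not want to reference column names; most examples I have seen pass column names to some pandas function.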

Sample data for the classified_data object (NaN means null):

_product Feature1 Feature2 Feature3 Feature4
SRI3012  1        yes         IN    NaN
SRI3015  1        yes         IN    NaN
SRS3012  1        no          OUT   Val1

Here is what I tried:

##Choose sample
    test_sample = classified_data.sample();
#Find index of random sample
    test_product_code = list(test_sample.index.values)[0]
##Find location of random product in data-set
    test_index = classified_data.index.get_loc(test_product_code)
    #print(test_sample);
    #print(classified_data[(test_index):(test_index+1)])
    enum_similarity_data = pandas.DataFrame(calculate_similarity_for_categorical(classified_data[(test_index):(test_index+1)],classified_data).T,index=classified_data.index)


def calculate_similarity_for_categorical(value1,value2):
    if(value1 == value2):
        return 1;
    else:
        return 0;

Desired output for SRI3012 (assuming it was the randomly chosen sample), as a DataFrame or another object with column names and values:

_product Feature1 Feature2 Feature3 Feature4
SRI3012  1        1        1        1
SRI3015  1        1        1        1
SRS3012  1        0        0        0

Answer 1

`DataFrame.eq`

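You can check one row for equality against all other rows by specifying axis=1. The comparison here is DataFrame.eq(Series, axis=1). If you want NaN == NaN to count as True (which is not the standard behavior), that case has to be handled separately.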

import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 'A', 'B', 'C', np.nan], ['A', 'A', 'B', 'C', np.nan],
                   ['A', 'X', 'Z', 'C', np.nan], [6, 'foo', 'bar', 12, 1231.1]])
#   0    1    2   3       4
#0  A    A    B   C     NaN
#1  A    A    B   C     NaN
#2  A    X    Z   C     NaN
#3  6  foo  bar  12  1231.1

s = df.iloc[0]  # or df.iloc[np.random.choice(range(df.shape[0]))]
(df.eq(s, axis=1) | (s.isnull() & df.isnull())).astype(int)
                     # so NaN == NaN is True

#   0  1  2  3  4
#0  1  1  1  1  1
#1  1  1  1  1  1
#2  1  0  0  1  1
#3  0  0  0  0  0
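
Applied to data shaped like the question's, the same idea can be wrapped in a small helper that picks the reference row by its index label, so no column names are ever mentioned. This is only a minimal sketch: the helper name `compare_to_row` and the three-row frame are assumptions for illustration (rebuilt from the sample shown in the question), and it assumes the index labels are unique.

```python
import numpy as np
import pandas as pd


def compare_to_row(df, label):
    """Return a 0/1 frame: 1 where a cell equals the same column of row `label`.

    Hypothetical helper; assumes `label` occurs exactly once in df.index.
    """
    ref = df.loc[label]                 # reference row as a Series indexed by column name
    equal = df.eq(ref, axis=1)          # element-wise comparison, aligned on columns
    both_nan = df.isna() & ref.isna()   # count NaN vs NaN as a match
    return (equal | both_nan).astype(int)


# Illustrative data modelled on the question's sample
classified_data = pd.DataFrame(
    {
        "Feature1": [1, 1, 1],
        "Feature2": ["yes", "yes", "no"],
        "Feature3": ["IN", "IN", "OUT"],
        "Feature4": [np.nan, np.nan, "Val1"],
    },
    index=pd.Index(["SRI3012", "SRI3015", "SRS3012"], name="_product"),
)

enum_similarity_data = compare_to_row(classified_data, "SRI3012")
print(enum_similarity_data)
```

To use a random reference row instead, the label could be taken from `classified_data.sample().index[0]`, as in the question's own attempt.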

Answer 2

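I cannot comment, so I will put my comment here. As Quang Hoang commented, you should not use screenshots; use simple, well-formatted data that anyone spending their valuable time helping you can copy. Also, all that complicated information is unnecessary. You can reproduce the concept of your problem with a simple dummy DataFrame that has simple values and names. That way you will get better and faster answers.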

Try this:

import numpy as np
import pandas as pd


df = pd.DataFrame({'Feature1':[    1 ,     1 ,    1 ],
                   'Feature2':[ 'yes',  'yes',  'no'], 
                   'Feature3':[ 'IN' ,  'IN' , 'OUT'],
                   'Feature4':[np.nan, np.nan,    5 ]
                  },
                  index=['SR12', 'SR13', 'SR14']
)
df.index.name = '_product'

def compare_against_series(x, reference):
    """compares a Series against a reference Series"""
    # apply .astype(int) to convert boolean to 0-1
    return np.logical_or(x == reference, x.isnull() & reference.isnull()).astype(int)

# take the 1st row as sample 
sample = df.iloc[0]

# apply compare_against_series row-wise, using the sample
# note axis=1 means row-wise and axis=0 column-wise
result = df.apply(compare_against_series, axis=1, reference=sample)

df:

          Feature1 Feature2 Feature3 Feature4
_product                            
SR12             1      yes       IN      NaN
SR13             1      yes       IN      NaN
SR14             1       no      OUT      5.0

sample:

Feature1      1
Feature2    yes
Feature3     IN
Feature4    NaN
Name: SR12, dtype: object

result:

          Feature1  Feature2  Feature3  Feature4
_product                              
SR12             1         1         1         1
SR13             1         1         1         1
SR14             1         0         0         0
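
The row-wise `apply` calls the Python function once per row, which may be slow on large frames. As a rough cross-check, the sketch below rebuilds the same dummy frame and confirms that a vectorized `DataFrame.eq` comparison produces the same 0/1 table. The names `by_apply` and `by_eq` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Feature1": [1, 1, 1],
        "Feature2": ["yes", "yes", "no"],
        "Feature3": ["IN", "IN", "OUT"],
        "Feature4": [np.nan, np.nan, 5],
    },
    index=pd.Index(["SR12", "SR13", "SR14"], name="_product"),
)
sample = df.iloc[0]  # reference row


def compare_against_series(x, reference):
    """Compare one row against the reference row, counting NaN vs NaN as equal."""
    return ((x == reference) | (x.isna() & reference.isna())).astype(int)


# Row-by-row version: one Python call per row
by_apply = df.apply(compare_against_series, axis=1, reference=sample)

# Vectorized version: one comparison over the whole frame
by_eq = (df.eq(sample, axis=1) | (df.isna() & sample.isna())).astype(int)

# True when every cell of the two results agrees
print((by_apply == by_eq).all().all())
```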

Answer 1
1 accepted 2019-11-12 20:44:14

Answer 2
0 2019-11-12 20:55:42
