Check the similarity of two dataframe columns with different value systems

Question

I have two columns from two different dataframes. The following chunks are the first 5 rows, but each column is much longer:

A = pd.DataFrame(['30-34', '20-24', '20-24', '15-19', '00-04'])

and

B = pd.DataFrame(['6','4', '4', '3', '0'])

I want to check whether both columns coincide, considering that

0 represents 00-04; 
1 represents 05-09; 
2 represents 10-14;
3 represents 15-19;
4 represents 20-24;
5 represents 25-29;
and 6 represents 30-34.

The desired output is the number of non-matching elements. In the sample given, the desired output is 0, because the first 5 values of both columns match. I would show an attempt, but I have absolutely no idea how to approach this.

Answer 1

IIUC, your ranges come in steps of 5, and you want to match each range against the integer division of its bounds by 5.

(B.astype(int).values == A[0].str.split('-', expand=True).astype(int)//5).all(axis=1)

output:

0    True
1    True
2    True
3    True
4    True

Check if the columns coincide:

(B.astype(int).values ==
 A[0].str.split('-', expand=True).astype(int)//5
).all(axis=1).all()

output: True

Intermediate steps:

# split on "-"
>>> A[0].str.split('-', expand=True)
    0   1
0  30  34
1  20  24
2  20  24
3  15  19
4  00  04

# get integer division
>>> A[0].str.split('-', expand=True).astype(int)//5
   0  1
0  6  6
1  4  4
2  4  4
3  3  3
4  0  0

# check if equals B
>>> B.astype(int).values == A[0].str.split('-', expand=True).astype(int)//5
      0     1
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
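
The question asks for the number of non-matching elements rather than a single True/False. Here is a minimal sketch of that count using the same integer-division idea. The helper name count_mismatches is made up for illustration, and the sketch assumes both columns share the same index and that every range spans exactly five values:

```python
import pandas as pd

def count_mismatches(a: pd.Series, b: pd.Series) -> int:
    """Count rows where the 'lo-hi' range in `a` does not map to the code in `b`."""
    # split 'lo-hi' into two integer columns, then integer-divide by 5
    bounds = a.str.split('-', expand=True).astype(int) // 5
    # a row matches only if both bounds give the code stored in b
    matches = bounds.eq(b.astype(int), axis=0).all(axis=1)
    return int((~matches).sum())

A = pd.DataFrame(['30-34', '20-24', '20-24', '15-19', '00-04'])
B = pd.DataFrame(['6', '4', '4', '3', '0'])

print(count_mismatches(A[0], B[0]))
```

For the sample data this prints 0. Requiring both bounds to agree also flags a malformed range such as '20-26', which a check on the lower bound alone would miss.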

Answer 2

You can map the strings to integers with map, then compare with the other dataframe and count the number of non-matching items:

import pandas as pd
import io

data ='''0      30-34
1      20-24
2      20-24
3      15-19
4      00-04'''

data1 = '''0      6
1      4
2      4
3      3
4      0'''

# raw strings keep '\s' from being treated as an invalid escape sequence
df = pd.read_csv(io.StringIO(data), sep=r'\s+', names=['idx', 'string'])
df1 = pd.read_csv(io.StringIO(data1), sep=r'\s+', names=['idx', 'value'])

# translate each range label to its integer code (5 represents 25-29)
df['value'] = df['string'].map({'00-04': 0, '05-09': 1, '10-14': 2, '15-19': 3, '20-24': 4, '25-29': 5, '30-34': 6})

sum(df['value'] != df1['value'])
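
Typing the lookup table by hand invites typos in the range labels. As a sketch, the same dictionary can be generated instead, assuming the labels always follow the zero-padded 'lo-hi' pattern in steps of 5 and the codes run from 0 to 6. The names label_to_code, strings and values are illustrative, not part of the answer above:

```python
import pandas as pd

# build {'00-04': 0, '05-09': 1, ..., '30-34': 6}
label_to_code = {f'{5 * i:02d}-{5 * i + 4:02d}': i for i in range(7)}

strings = pd.Series(['30-34', '20-24', '20-24', '15-19', '00-04'])
values = pd.Series([6, 4, 4, 3, 0])

# labels missing from the dictionary become NaN and therefore count as mismatches
n_mismatch = int((strings.map(label_to_code) != values).sum())
print(n_mismatch)
```

A label that is not in the dictionary maps to NaN, which never equals a code, so unexpected labels are counted as mismatches rather than silently ignored.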
