Use column combinations to find data mismatches between rows in pandas
What's the best way to get all cell values based on a combination of column values?
Sample dataframe one:
   Stock                         Name  Price
0    AMD       Advanced Micro Devices    100
1     GE     General Electric Company    200
2    BAC  Bank of America Corporation    300
3   AAPL                   Apple Inc.    500
4   MSFT        Microsoft Corporation   1000
5  GOOGL                Alphabet Inc.   2000
Sample dataframe two:
   Stock                           Name  Price
0    AMD         Advanced Micro Devices    100
1     GE       General Electric Company    200
2    BAC  Branch of America Corporation    300
3   AAPL                     Apple Inc.    500
4   MSFT          Microsoft Corporation   1000
5  GOOGL                  Alphabet Inc.   2000
For example: I want to use (Stock and Name) as key columns and then compare the datasets. The goal is to print the mismatched entries between the two datasets, with the Stock + Name columns used as a combination key.
I'm using pandas with Python 3.7.
Sample output:
BAC Bank of America Corporation 300 --- BAC Branch of America Corporation 300
Perhaps an inner join on Stock (the default for merge), followed by query?
df1.merge(df2, on='Stock').query('Name_x != Name_y')
   Stock                       Name_x  Price_x                         Name_y  Price_y
2    BAC  Bank of America Corporation      300  Branch of America Corporation      300
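Since the question asks for Stock + Name to act as a combined key, another option is to merge on both columns at once. The following is a minimal sketch rather than part of the original answer: it rebuilds the sample frames, and the suffixes and the name mismatches are illustrative choices. It relies on merge's how="outer" and indicator=True options, which add a _merge column saying which frame each key came from.

```python
import pandas as pd

# Rebuild the sample data from the question.
df1 = pd.DataFrame({
    "Stock": ["AMD", "GE", "BAC", "AAPL", "MSFT", "GOOGL"],
    "Name": ["Advanced Micro Devices", "General Electric Company",
             "Bank of America Corporation", "Apple Inc.",
             "Microsoft Corporation", "Alphabet Inc."],
    "Price": [100, 200, 300, 500, 1000, 2000],
})
# df2 is identical except for the BAC name.
df2 = df1.copy()
df2.loc[df2["Stock"] == "BAC", "Name"] = "Branch of America Corporation"

# Outer join on the composite key; _merge records which side each key came from.
merged = df1.merge(df2, on=["Stock", "Name"], how="outer",
                   suffixes=("_df1", "_df2"), indicator=True)

# Keys found in only one of the two frames are the mismatches.
mismatches = merged[merged["_merge"] != "both"]
print(mismatches)
```

With the sample data this leaves two rows, the two BAC entries, one tagged left_only and one tagged right_only.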
Or, a slightly different solution with map, which you can use to get the stock symbols:
m = df1.Stock.map(df2.set_index('Stock').Name).ne(df1.Name)
symbols = df1.loc[m, 'Stock']
print(symbols)
2 BAC
Name: Stock, dtype: object
And then access each DataFrame row by stock symbol:
df1[df1.Stock.isin(symbols)]
   Stock                         Name  Price
2    BAC  Bank of America Corporation    300
df2[df2.Stock.isin(symbols)]
   Stock                           Name  Price
2    BAC  Branch of America Corporation    300
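To get from these lookups to the "left --- right" line shown in the question, the pieces can be stitched together as in the sketch below. It assumes every Stock value appears exactly once in both frames (map against a non-unique index would fail, and a symbol missing from df2 would break the lookup); the names left, right and lines are illustrative only.

```python
import pandas as pd

# Rebuild the sample data from the question.
df1 = pd.DataFrame({
    "Stock": ["AMD", "GE", "BAC", "AAPL", "MSFT", "GOOGL"],
    "Name": ["Advanced Micro Devices", "General Electric Company",
             "Bank of America Corporation", "Apple Inc.",
             "Microsoft Corporation", "Alphabet Inc."],
    "Price": [100, 200, 300, 500, 1000, 2000],
})
df2 = df1.copy()
df2.loc[df2["Stock"] == "BAC", "Name"] = "Branch of America Corporation"

# Look up each df1 stock's name in df2 and flag the rows where it differs.
name_in_df2 = df1["Stock"].map(df2.set_index("Stock")["Name"])
symbols = df1.loc[name_in_df2.ne(df1["Name"]), "Stock"]

# Index both frames by Stock so each symbol can be looked up directly.
left = df1[df1["Stock"].isin(symbols)].set_index("Stock")
right = df2[df2["Stock"].isin(symbols)].set_index("Stock")

# One "df1 row --- df2 row" line per mismatching symbol.
lines = [
    f"{s} {left.at[s, 'Name']} {left.at[s, 'Price']} --- "
    f"{s} {right.at[s, 'Name']} {right.at[s, 'Price']}"
    for s in symbols
]
print("\n".join(lines))
```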
If they are in two dataframes, merging them without a condition is pretty straightforward with pd.concat. Once they are joined, here's one way to get the mismatch:
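import pandas as pd

# Stand-in for the joined frame: *_y columns come from one dataset, *_x from the other.
df1 = pd.DataFrame({
    "Ticker_y": list("qwerty"),
    "Name_y": list("asdfgh"),
    "Ticker_x": list("qw3r7y"),
    "Name_x": list("as6f8h")
})
# Keep rows where either part of the (Ticker, Name) key differs.
mismatch = df1[(df1["Ticker_y"] != df1["Ticker_x"]) | (df1["Name_y"] != df1["Name_x"])]
The last line just says "the df only where these conditions are met." The two comparisons are combined with | rather than &, so a row counts as a mismatch when either key column differs; with &, a row such as BAC, where only the name differs, would be missed.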
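For the frames in the question, the join step itself might look like the sketch below. It assumes the two frames have the same length and that rows are meant to be paired by index label; add_suffix is used here only to keep the column names apart, and joined is an illustrative name.

```python
import pandas as pd

# Rebuild the sample data from the question.
df1 = pd.DataFrame({
    "Stock": ["AMD", "GE", "BAC", "AAPL", "MSFT", "GOOGL"],
    "Name": ["Advanced Micro Devices", "General Electric Company",
             "Bank of America Corporation", "Apple Inc.",
             "Microsoft Corporation", "Alphabet Inc."],
    "Price": [100, 200, 300, 500, 1000, 2000],
})
df2 = df1.copy()
df2.loc[df2["Stock"] == "BAC", "Name"] = "Branch of America Corporation"

# Place the frames side by side; rows are paired by index label.
joined = pd.concat([df1.add_suffix("_x"), df2.add_suffix("_y")], axis=1)

# A row is a mismatch when either part of the (Stock, Name) key differs.
mismatch = joined[(joined["Stock_x"] != joined["Stock_y"])
                  | (joined["Name_x"] != joined["Name_y"])]
print(mismatch)
```

This positional pairing only makes sense when both frames list the same stocks in the same order; otherwise a key-based merge is the safer choice.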
We can use isin, which tests whether each element of a column is contained in a given sequence of values.
First DataFrame
>>> df1
   Stock                         Name  Price
0    AMD       Advanced Micro Devices    100
1     GE     General Electric Company    200
2    BAC  Bank of America Corporation    300
3   AAPL                   Apple Inc.    500
4   MSFT        Microsoft Corporation   1000
5  GOOGL                Alphabet Inc.   2000
Second DataFrame
>>> df2
   Stock                           Name  Price
0    AMD         Advanced Micro Devices    100
1     GE       General Electric Company    200
2    BAC  Branch of America Corporation    300
3   AAPL                     Apple Inc.    500
4   MSFT          Microsoft Corporation   1000
5  GOOGL                  Alphabet Inc.   2000
Here you go:
>>> df2[~df2.Name.isin(df1.Name.values)]
   Stock                           Name  Price
2    BAC  Branch of America Corporation    300
Or:
>>> df1[~df1.Name.isin(df2.Name.values)]
   Stock                         Name  Price
2    BAC  Bank of America Corporation    300
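The checks above compare Name alone, so a name listed under a different stock in the other frame would still count as a match. To test the Stock + Name pair together, one possibility is to build a MultiIndex from the two key columns and call isin on that. This is a sketch under that assumption, not part of the original answer, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the sample data from the question.
df1 = pd.DataFrame({
    "Stock": ["AMD", "GE", "BAC", "AAPL", "MSFT", "GOOGL"],
    "Name": ["Advanced Micro Devices", "General Electric Company",
             "Bank of America Corporation", "Apple Inc.",
             "Microsoft Corporation", "Alphabet Inc."],
    "Price": [100, 200, 300, 500, 1000, 2000],
})
df2 = df1.copy()
df2.loc[df2["Stock"] == "BAC", "Name"] = "Branch of America Corporation"

# Treat each (Stock, Name) pair as a single composite key.
keys1 = pd.MultiIndex.from_frame(df1[["Stock", "Name"]])
keys2 = pd.MultiIndex.from_frame(df2[["Stock", "Name"]])

# Rows whose key pair does not occur in the other frame.
only_in_df1 = df1[~keys1.isin(keys2)]
only_in_df2 = df2[~keys2.isin(keys1)]
print(only_in_df1)
print(only_in_df2)
```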