
Compare two date columns in pandas DataFrame to validate third column

Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match, so the rows had to be matched on player names instead. An example of two name columns matched and merged from the two databases is the following:

long_name       name
L. Messi        Lionel Andrés Messi Cuccittini

As part of the validation process for an 18,000-row database, I want to check the two date of birth columns in the merged DataFrame, df, ensuring that the columns match as in the example below:

dob             birth_date
1987-06-24      1987-06-24

Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.

df['birth_date'] = pd.to_datetime(df['birth_date'])
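As a minimal sketch of that conversion step on made-up rows (the sample values and the errors="coerce" option are assumptions, not part of the original code), both columns can be converted in the same way, with unparseable strings becoming NaT:

```python
import pandas as pd

# Hypothetical sample rows; the real df has roughly 18,000 rows.
df = pd.DataFrame({
    "dob": ["1987-06-24", "1985-02-05", "not a date"],
    "birth_date": ["1987-06-24", "1985-02-06", "1992-02-05"],
})

# errors="coerce" (an assumption here) turns unparseable strings
# into NaT instead of raising an exception.
for col in ["dob", "birth_date"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

print(df.dtypes)
print(df)
```

Rows where either date is NaT will never compare as equal, so they would be treated as mismatches by any later equality check.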

My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged. However, if the two date columns don't match, I want the data in this value column to be changed to null. This is something I can do quite easily in Excel with a date_diff calculation, but I'm unsure how to do it in pandas.

My current code is the following:

df.loc[(df['birth_date'] != df['dob']), 'value'] = np.nan
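For reference, a small self-contained sketch of a mask-based .loc assignment of this kind, run on made-up rows (the second player and all numbers in 'value' are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical merged rows; "Player B" and the values are invented.
df = pd.DataFrame({
    "long_name": ["L. Messi", "Player B"],
    "dob": pd.to_datetime(["1987-06-24", "1990-01-01"]),
    "birth_date": pd.to_datetime(["1987-06-24", "1990-01-02"]),
    "value": [100.0, 50.0],
})

# Boolean mask: True where the two dates of birth disagree.
mismatch = df["birth_date"] != df["dob"]

# Null out 'value' only on the mismatching rows.
df.loc[mismatch, "value"] = np.nan

print(df)
```

The mask can also be reused to list the suspect fuzzy matches, for example with df[mismatch].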

Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx. 10% of the total database) and allow me to fix those quickly.

Ideally I also need to work on the matching algorithm to ensure a perfect date match. However, my current algorithm works quite well in its current state and the project is nearly complete. If this is something you know about, I'd be happy to hear any advice on it.

Many thanks in advance!

IIUC, please try np.where. It works as follows:

np.where(if condition, assign x, else assign y)

Here the condition is df['birth_date'] != df['dob'], x = np.nan and y = the prevailing df['value'].

df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
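A runnable sketch of the np.where idea on made-up rows, assuming the condition is the boolean comparison of the two date columns (np.where expects a boolean array as its first argument, not a slice of 'value'). Series.where is shown alongside as an equivalent pandas-native alternative; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the third row has both dates missing (NaT).
df = pd.DataFrame({
    "dob": pd.to_datetime(["1987-06-24", "1990-01-01", None]),
    "birth_date": pd.to_datetime(["1987-06-24", "1990-01-02", None]),
    "value": [100.0, 50.0, 75.0],
})

# True only where both dates are present and equal (NaT == NaT is False).
dates_match = df["birth_date"] == df["dob"]

# np.where(condition, x, y): keep 'value' where dates match, else NaN.
via_np_where = np.where(dates_match, df["value"], np.nan)

# Series.where keeps entries where the condition is True, NaN elsewhere.
via_series_where = df["value"].where(dates_match)

df["value"] = via_series_where
print(df)
```

Note that rows with a missing date in either column are nulled as well, since NaT never compares equal to anything, including another NaT.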
