
Cannot match two values in two different CSVs

I am parsing two separate CSV files with the goal of finding matching customer IDs and dates so that I can update balances.

In my for loop there should be a match at some point, because I intentionally put duplicate IDs and dates in my CSV. However, when I parse the data and try to match it, the matches don't work even though the values are the same.

main.py:

transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
    columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])

for index, row in transactions.iterrows():
    customer_id = row['customerID']
    date = formatter.convert_date(row['date'])

    minBalance = 0
    maxBalance = 0
    endingBalance = 0

    dict = {
        "customerID": customer_id,
        "MM/YYYY": date,
        "minBalance": minBalance,
        "maxBalance": maxBalance,
        "endingBalance": endingBalance
    }

    print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
    # Returns False

    if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
        # This section never runs
        print("hello")

    else:
        print("world")
        accounts.loc[index] = dict
        accounts.to_csv(OUTPUT_PATH, index=False)

Transactions CSV:

customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700

Accounts CSV:

customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0

Expected Accounts CSV:

customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0

It is not clear from the question what the formatter.convert_date function does, but from the example CSVs you added it seems like it should do something like:

def convert_date(mmddyyyy):
    # Split "MM/DD/YYYY" and keep only month and year, e.g. "12/21/2022" -> "12/2022"
    mm, dd, yyyy = mmddyyyy.split('/')
    return mm + '/' + yyyy

In addition, make sure the data types match too: both date values should be strings, and the customer ID values being compared should share a type as well.
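As a rough illustration of the type point, here is a minimal sketch. The inline CSV text and the variable names are made up for the example, and it assumes the IDs are read with the default type inference of read_csv:

```python
import io
import pandas as pd

# Hypothetical two-row accounts file, inlined so the example is self-contained.
csv_text = "customerID,MM/YYYY\n1,12/2022\n2,01/2023\n"
accounts = pd.read_csv(io.StringIO(csv_text))

# With default inference, customerID is read as an integer column.
id_is_integer = pd.api.types.is_integer_dtype(accounts["customerID"])
print(id_is_integer)

# The string "1" never equals the integer 1, so nothing matches.
string_match = bool((accounts["customerID"] == "1").any())
print(string_match)

# Converting the lookup value to the column's type makes the comparison work.
int_match = bool((accounts["customerID"] == int("1")).any())
print(int_match)
```

If the IDs come from different sources, converting both sides to one type before comparing avoids this kind of silent mismatch.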

Where the problem comes from

The problem comes from the way you compare against a pandas Series. To keep it simple, when you do:

customer_id in accounts['customerID']

You're checking whether customer_id is in the index of the Series accounts['customerID'], whereas you want to check the values of the Series.
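A small sketch of that difference, using a made-up Series with the default integer index (the name ids is only for illustration):

```python
import pandas as pd

# Hypothetical data: three values with default index labels 0, 1, 2.
ids = pd.Series([10, 20, 30])

value_in_series = 10 in ids          # tests the index labels, not the values
label_in_series = 0 in ids           # 0 is an index label
value_in_values = 10 in ids.values   # tests the underlying values

print(value_in_series)   # False
print(label_in_series)   # True
print(value_in_values)   # True
```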

And in your if statement you're using the pd.Series.equals method. Here is how the documentation describes it:

This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

So equals compares whole DataFrames or Series with one another, which is different from what you're trying to do. When it is passed a scalar such as customer_id it returns False, which is why the if branch never runs.
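To make the behaviour concrete, here is a short sketch with a made-up Series (account_ids is a hypothetical name):

```python
import pandas as pd

account_ids = pd.Series([1, 1, 2])

# A scalar is not a Series, so equals() reports False.
scalar_result = account_ids.equals(1)

# equals() is meant for comparing one whole object with another.
series_result = account_ids.equals(pd.Series([1, 1, 2]))

# An element-wise test uses == and is then reduced with any().
any_match = bool((account_ids == 1).any())

print(scalar_result)   # False
print(series_result)   # True
print(any_match)       # True
```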

One of many solutions

There are multiple ways to achieve what you're trying to do. The easiest is to get the values from the Series before doing the comparison:

customer_id in accounts['customerID'].values

Note that accounts['customerID'].values returns a NumPy array of the values of your Series.

So your comparison should be something like this:

print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)

And use the same thing in your if statement:

if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
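One caveat with the check above: it tests the ID and the date independently, so it is also True when the customer appears on one row and the date on a different row. With your sample data that would treat customer 2 with 01/2023 as already present, because customer 1 already has that month. A possible row-wise variant is sketched below; the helper name has_account_row and the sample table are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical accounts table with two rows already recorded.
accounts = pd.DataFrame({
    "customerID": [1, 2],
    "MM/YYYY": ["01/2023", "12/2022"],
})

def has_account_row(accounts, customer_id, date):
    """Return True if a single row holds both customer_id and date."""
    same_row = (accounts["customerID"] == customer_id) & (accounts["MM/YYYY"] == date)
    return bool(same_row.any())

# Customer 2 exists and "01/2023" exists, but never on the same row.
independent = (2 in accounts["customerID"].values
               and "01/2023" in accounts["MM/YYYY"].values)

print(independent)                              # True
print(has_account_row(accounts, 2, "01/2023"))  # False
print(has_account_row(accounts, 2, "12/2022"))  # True
```

In the loop, the condition would then read if has_account_row(accounts, customer_id, date):.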

Alternative solutions

You can also use pandas.Series.isin, which takes a list-like of values and returns a boolean Series showing whether each element of the Series is among them. You then only need to check whether that boolean Series contains at least one True value.

Documentation of isin: https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
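A minimal sketch of the isin approach, with a made-up Series of months (the name months is hypothetical). Note that a single value has to be wrapped in a list:

```python
import pandas as pd

months = pd.Series(["12/2022", "01/2023", "01/2022"])

# isin expects a list-like, so the single value goes inside a list.
mask = months.isin(["01/2023"])
found = bool(mask.any())

print(mask.tolist())   # [False, True, False]
print(found)           # True
```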
