
Checking if one dataframe column is a subset of another column

I have a dataframe with columns Enrolled_Months and Eligible_Months, described as follows:

import pandas as pd

month_list1 = [
    [(1, 2018), (2, 2018), (3, 2019)],
    [(7, 2018), (8, 2018), (10, 2018)],
    [(4, 2018), (5, 2018), (7, 2018)],
    [(1, 2019), (2, 2019), (4, 2019)]
]

month_list2 = [
    [(2, 2018), (3, 2019)],
    [(7, 2018), (8, 2018)],
    [(2, 2018), (3, 2019)],
    [(10, 2018), (11, 2019)]
]

EID = [1, 2, 3, 4]

df = pd.DataFrame({
    'EID': EID,
    'Enrolled_Months': month_list1,
    'Eligible_Months': month_list2
})
df

Out[6]: 
   EID                     Enrolled_Months           Eligible_Months
0    1   [(1, 2018), (2, 2018), (3, 2019)]    [(2, 2018), (3, 2019)]
1    2  [(7, 2018), (8, 2018), (10, 2018)]    [(7, 2018), (8, 2018)]
2    3   [(4, 2018), (5, 2018), (7, 2018)]    [(2, 2018), (3, 2019)]
3    4   [(1, 2019), (2, 2019), (4, 2019)]  [(10, 2018), (11, 2019)]

I want to create a new column called Check that is True if Enrolled_Months contains ALL elements of Eligible_Months. My desired output is below:

Out[8]: 
   EID                     Enrolled_Months           Eligible_Months  Check
0    1   [(1, 2018), (2, 2018), (3, 2019)]    [(2, 2018), (3, 2019)]   True
1    2  [(7, 2018), (8, 2018), (10, 2018)]    [(7, 2018), (8, 2018)]   True
2    3   [(4, 2018), (5, 2018), (7, 2018)]    [(2, 2018), (3, 2019)]  False
3    4   [(1, 2019), (2, 2019), (4, 2019)]  [(10, 2018), (11, 2019)]  False

I've tried the following:

df['Check'] = set(df['Eligible_Months']).issubset(df['Enrolled_Months'])

But I end up getting the error TypeError: unhashable type: 'list'.

Any thoughts on how I can achieve this?

Side note: the Enrolled_Months data was originally in a much different format, with each month having its own binary column and a separate Year column specifying the year (really bad design IMO). I created the list columns because I thought they would be easier to work with, but let me know if the original format is better for what I want to achieve.

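You can chain two explode calls, then use eval and a grouped any. Note that any marks a row True as soon as a single eligible month matches an enrolled month, so this reproduces the desired output for the sample data but is not a strict "contains all" test: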

df['Check'] = df.explode('Eligible_Months').explode('Enrolled_Months').eval('Enrolled_Months == Eligible_Months').groupby(level=0).any()

Output:

>>> df
   EID                     Enrolled_Months           Eligible_Months  Check
0    1   [(1, 2018), (2, 2018), (3, 2019)]    [(2, 2018), (3, 2019)]   True
1    2  [(7, 2018), (8, 2018), (10, 2018)]    [(7, 2018), (8, 2018)]   True
2    3   [(4, 2018), (5, 2018), (7, 2018)]    [(2, 2018), (3, 2019)]  False
3    4   [(1, 2019), (2, 2019), (4, 2019)]  [(10, 2018), (11, 2019)]  False
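To see where the any-based reduction and a strict subset check diverge, here is a small sketch on made-up data. The two-row frame and the names pairs, loose_check, found, and strict_check are illustrative assumptions, not part of the answer above. It swaps eval for a plain == comparison and uses a membership test for the strict variant:

```python
import pandas as pd

# Row 0 has one eligible month that is enrolled and one that is not.
df = pd.DataFrame({
    "Enrolled_Months": [[(1, 2018), (2, 2018)], [(7, 2018)]],
    "Eligible_Months": [[(2, 2018), (9, 2018)], [(7, 2018)]],
})

# Loose check: every (eligible, enrolled) pair per row, True if ANY pair matches.
pairs = df.explode("Eligible_Months").explode("Enrolled_Months")
loose_check = (pairs["Enrolled_Months"] == pairs["Eligible_Months"]).groupby(level=0).any()

# Strict check: one row per eligible month, True only if ALL of them are enrolled.
eligible = df.explode("Eligible_Months")
found = pd.Series(
    [month in enrolled
     for month, enrolled in zip(eligible["Eligible_Months"], eligible["Enrolled_Months"])],
    index=eligible.index,
)
strict_check = found.groupby(level=0).all()

print(loose_check.tolist())   # [True, True]
print(strict_check.tolist())  # [False, True]
```

Only the strict variant rejects row 0, where (9, 2018) is eligible but never enrolled.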

You can use df.apply() with axis=1 to run the subset check row by row and create the new column:

df['Check'] = df.apply(
    lambda row: set(row['Eligible_Months']).issubset(row['Enrolled_Months']), axis=1
)

This outputs:

   EID                     Enrolled_Months           Eligible_Months  Check
0    1   [(1, 2018), (2, 2018), (3, 2019)]    [(2, 2018), (3, 2019)]   True
1    2  [(7, 2018), (8, 2018), (10, 2018)]    [(7, 2018), (8, 2018)]   True
2    3   [(4, 2018), (5, 2018), (7, 2018)]    [(2, 2018), (3, 2019)]  False
3    4   [(1, 2019), (2, 2019), (4, 2019)]  [(10, 2018), (11, 2019)]  False
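For context on why the row-wise version succeeds where set(df['Eligible_Months']) failed: calling set on a whole column tries to hash each cell, and each cell here is a list. A minimal sketch, using a small stand-in Series (col, message, and row_sets are illustrative names):

```python
import pandas as pd

# Stand-in for the Eligible_Months column: each cell is a list of (month, year) tuples.
col = pd.Series([[(2, 2018), (3, 2019)], [(7, 2018), (8, 2018)]])

# set(col) iterates over the cells and tries to hash each list, which fails.
try:
    set(col)
    message = ""
except TypeError as exc:
    message = str(exc)

# Building one set per cell works, because the tuples inside each list are hashable.
row_sets = [set(months) for months in col]

print(message)
print(row_sets[0] == {(2, 2018), (3, 2019)})  # True
```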

A list comprehension works fine:

df.assign(check = [set(l).issuperset(r) 
                   for l, r in 
                   zip(df.Enrolled_Months, df.Eligible_Months)])

   EID                     Enrolled_Months           Eligible_Months  check
0    1   [(1, 2018), (2, 2018), (3, 2019)]    [(2, 2018), (3, 2019)]   True
1    2  [(7, 2018), (8, 2018), (10, 2018)]    [(7, 2018), (8, 2018)]   True
2    3   [(4, 2018), (5, 2018), (7, 2018)]    [(2, 2018), (3, 2019)]  False
3    4   [(1, 2019), (2, 2019), (4, 2019)]  [(10, 2018), (11, 2019)]  False
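Note that assign returns a new DataFrame rather than modifying df in place, so the result has to be kept. A brief sketch, assuming the same column layout as the question but with a shortened two-row frame (result is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    "EID": [1, 2],
    "Enrolled_Months": [[(1, 2018), (2, 2018), (3, 2019)], [(4, 2018), (5, 2018)]],
    "Eligible_Months": [[(2, 2018), (3, 2019)], [(2, 2018)]],
})

# assign builds a copy with the extra column; df itself is left untouched.
result = df.assign(check=[
    set(enrolled).issuperset(eligible)
    for enrolled, eligible in zip(df["Enrolled_Months"], df["Eligible_Months"])
])

print("check" in df.columns)     # False
print(result["check"].tolist())  # [True, False]
```

Writing df = df.assign(...) or df['check'] = [...] would keep the column on the original name instead.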

