
Finding an Intersection between two lists or dataframes while enforcing an ordering condition

I have two lists (columns from two separate pandas dataframes) and want to find the intersection of both lists while preserving the order, or ordering based on a condition. Consider the following example:

x = ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']
y = ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
       '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']

answer = set(x).intersection(y)

The variable answer yields the overlapping columns, yet the order is not preserved. Is there a way of sorting the solution such that the answer yields:

answer = ['3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
          '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR',
          '30 YR']

i.e., first sorting the intersected list by month ("MO") and its integers, and then by year ("YR") and its integers?

Alternatively, is there a pandas method to obtain the same result with two dataframes of overlapping columns (preserving or stating order)?

You could simply use list comprehensions. To keep the order of x:

[this_name for this_name in x if this_name in y]

and, to keep the order of y:

[this_name for this_name in y if this_name in x]
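
To make the difference between the two comprehensions concrete, here is a minimal sketch using the lists from the question. The names order_of_x and order_of_y are illustrative only and not part of the original answer.

```python
x = ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']
y = ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
     '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']

# Shared items in the order they appear in x.
order_of_x = [name for name in x if name in y]

# Shared items in the order they appear in y. Because y is already listed
# by months and then years, this happens to match the requested answer.
order_of_y = [name for name in y if name in x]

print(order_of_x)
print(order_of_y)
```

Note that the second form only gives the desired result because y is already in the wanted order; it does not sort anything itself.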

I don't know exactly what you are trying to do, so this answer covers only the use case you described. If you want to work with pandas, I think the following code does what you want: it splits each label into a numeric value and a unit, sorts by unit and then by value, and merges the two frames on both columns. For more complex data, you might need to convert the column to a timedelta type for more flexibility. The sort works in this case only because "MO" comes alphabetically before "YR".

import pandas as pd
df1 = pd.DataFrame({'x': ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']})
df2 = pd.DataFrame({'y': ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
       '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']})

# drop 'non-standard' data 
df1["x"] = df1["x"].apply(lambda x: x if x[0].isdigit() else None)
df2["y"] = df2["y"].apply(lambda x: x if x[0].isdigit() else None)
df1.dropna(inplace=True)
df2.dropna(inplace=True)

# make two columns to sort 
df1["value"] = df1["x"].apply(lambda x: int(x[:-2]))
df1["unit"] = df1["x"].apply(lambda x: x[-2:])

df2["value"] = df2["y"].apply(lambda x: int(x[:-2]))
df2["unit"] = df2["y"].apply(lambda x: x[-2:])

# sort by unit and value
df1 = df1.sort_values(by=["unit", "value"]).drop("x", axis=1)
df2 = df2.sort_values(by=["unit", "value"]).drop("y", axis=1)

# merge 
df = pd.merge(df1, df2, on=["unit", "value"])
df["result"] = df.apply(lambda x: str(x["value"]) + " " + x["unit"], axis=1)
df.drop(["unit", "value"], axis=1, inplace=True)
df
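
If the labels are column names of two dataframes rather than cell values, as the question's last paragraph suggests, a shorter route may be enough. The sketch below assumes two hypothetical empty frames, df_a and df_b, whose columns are the lists from the question; it keeps the column order of df_b. Index.intersection is shown only for comparison, and its result order is not relied on.

```python
import pandas as pd

# Hypothetical frames whose column labels are the lists from the question.
df_a = pd.DataFrame(columns=['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR',
                             '3 MO', '3 YR', '30 YR', '4 YR', '5 YR', '6 MO',
                             '7 YR', '9 MO', 'Country'])
df_b = pd.DataFrame(columns=['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR',
                             '3 YR', '4 YR', '5 YR', '7 YR', '10 YR', '15 YR',
                             '20 YR', '30 YR'])

# Column labels of df_b that also occur in df_a, in df_b's column order.
shared = [col for col in df_b.columns if col in df_a.columns]

# Index.intersection finds the same labels; sort=False asks pandas not to
# sort them, but the order of this result is not relied on here.
shared_index = df_a.columns.intersection(df_b.columns, sort=False)

# Select the shared columns from both frames in the same order.
aligned_a = df_a[shared]
aligned_b = df_b[shared]
print(list(aligned_a.columns))
```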



Use a list comprehension to check whether each item in x also exists in a set built from y. This preserves the order in which items appear in x, and the set makes each membership test fast:

y_set = set(y)
answer = [item for item in x if item in y_set]

or use filter to do essentially the same job:

answer = list(filter(lambda i: i in y_set, x))

Output:

['1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR', '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO']
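
This output follows the order of x, not the months-then-years order the question asks for. One way to get that order is to sort the intersection with a key function. The sketch below is an assumption-laden illustration, not part of the original answer: tenor_key is a hypothetical helper, and it assumes every shared label has the form "<integer> <unit>" with the unit being MO or YR.

```python
def tenor_key(label):
    """Sort key: months before years, then by the numeric part.

    Assumes labels look like '3 MO' or '10 YR' (hypothetical helper).
    """
    number, unit = label.split()
    unit_rank = {"MO": 0, "YR": 1}
    return (unit_rank[unit], int(number))

x = ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']
y = ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
     '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']

# 'Country' and 'Industry' drop out of the intersection, so every
# remaining label fits the '<integer> <unit>' pattern tenor_key expects.
answer = sorted(set(x) & set(y), key=tenor_key)
print(answer)
```

Sorting on the tuple (unit rank, integer) avoids the string-comparison problem where '10 YR' would sort before '2 YR'.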


 