
How to check if string elements of lists are in dataframe/other list (python)

I have the following problem. I have two lists/dataframes. One is a list/dataframe of customers, where every row is a customer and the columns are synonyms for that customer, i.e. other verbal expressions.

import pandas as pd

customer_list = {'A': ['AA', 'AA', 'AAA'], 'B': ['B', 'BB', 'BBB'], 'C': ['C', 'CC', 'CCC']}
customer_df = pd.DataFrame.from_dict(customer_list, orient='index')

Then I have another dataframe with the following structure:

text = [['A', 'Hello i am AA', 'Hello i am BB', 'Hello i am A'], ['B', 'Hello i am B', 'Hello i am BBB','Hello i am BB'], ['C', 'Hello i am AAA','Hello i am CC','Hello i am CCC']]
text_df = pd.DataFrame(text)
text_df = text_df.set_index(0)
text_df = text_df.rename_axis("customer")

How (with which types and functions) can I check every row of text_df (e.g. every element of row "A") for "wrong entries", i.e. for synonyms belonging to other customers (so check every entry against all customers except its own)? Do I have to create multiple dataframes in a for loop? Is one loop enough?

Thanks for any advice, even just a hint concerning methods. For my example, a result like

Wrong texts: A: Hello i am BB, C: Hello i am AAA

or some corresponding indices would be great.


First, I would use pd.melt to transform this DataFrame into an "index" of (customer, column, value) triples, like so:

df = pd.melt(text_df.reset_index(), id_vars="customer", var_name="columns")
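To see what the melt produces, you can reconstruct the question's setup and print the intermediate frame (a minimal sketch; the exact row order may differ on your end):

```python
import pandas as pd

text = [['A', 'Hello i am AA', 'Hello i am BB', 'Hello i am A'],
        ['B', 'Hello i am B', 'Hello i am BBB', 'Hello i am BB'],
        ['C', 'Hello i am AAA', 'Hello i am CC', 'Hello i am CCC']]
text_df = pd.DataFrame(text).set_index(0).rename_axis("customer")

# Each (customer, column) cell becomes one row of (customer, columns, value)
df = pd.melt(text_df.reset_index(), id_vars="customer", var_name="columns")
print(df.head(3))
```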

Now, we have a way of "efficiently" operating over the entire data without needing to figure out the "right" columns and the like. So let's solve the "correctness" problem.

def correctness(melted_row: pd.Series, customer_df: pd.DataFrame) -> bool:
    # Look up the synonyms registered for this row's customer...
    customer = customer_df.loc[melted_row.customer]
    cust_ids = customer.values.tolist()
    # ...and accept the text if it ends with any of them
    return any(melted_row.value.endswith(cust_id) for cust_id in cust_ids)

Note: You could swap out .endswith for a variety of other str methods to match your needs. Take a look at the Python string method docs.
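For instance, if a customer ID can appear anywhere in the text rather than only at the end, a containment check works the same way (this variant function, contains_own_id, is my own illustration, not part of the original answer):

```python
import pandas as pd

customer_df = pd.DataFrame.from_dict(
    {'A': ['AA', 'AA', 'AAA'], 'B': ['B', 'BB', 'BBB'], 'C': ['C', 'CC', 'CCC']},
    orient='index')

def contains_own_id(melted_row: pd.Series, customer_df: pd.DataFrame) -> bool:
    # Same shape as correctness(), but matches the ID anywhere in the string
    cust_ids = customer_df.loc[melted_row.customer].tolist()
    return any(cust_id in melted_row.value for cust_id in cust_ids)

row = pd.Series({'customer': 'A', 'columns': 1, 'value': 'Hello i am AA'})
print(contains_own_id(row, customer_df))  # True: the ID 'AA' occurs in the text
```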

Lastly, you can generate a mask by using the apply method across rows, like so:

df["correct"] = df.apply(correctness, axis=1, args=(customer_df, ))

You'll then have an output that looks like this:

  customer columns           value  correct
0        A       1   Hello i am AA     True
1        B       1    Hello i am B     True
2        C       1  Hello i am AAA    False
3        A       2   Hello i am BB    False
4        B       2  Hello i am BBB     True
5        C       2   Hello i am CC     True
6        A       3    Hello i am A    False
7        B       3   Hello i am BB     True
8        C       3  Hello i am CCC     True

I imagine you have other things you want to do before "un-melting" your data, so I'll point you to this SO question on how to "un-melt" it.
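One way to un-melt is DataFrame.pivot, which inverts melt as long as each (index, columns) pair occurs exactly once, as it does here (a sketch on an abbreviated two-customer frame):

```python
import pandas as pd

# A melted frame in the shape produced above, abbreviated to two customers
df = pd.DataFrame({
    'customer': ['A', 'B', 'A', 'B'],
    'columns':  [1, 1, 2, 2],
    'value':    ['Hello i am AA', 'Hello i am B', 'Hello i am BB', 'Hello i am BBB'],
})

# pivot spreads the 'value' entries back out into one column per original label
wide = df.pivot(index='customer', columns='columns', values='value')
print(wide)
```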


By "efficient", I really mean that you have a way of leveraging built-in functions of pandas, not that it's "computationally efficient". My memory is foggy on this, but using .apply(...) is generally something to do as a last resort. I imagine there are multiple ways to crack this problem that use built-ins, but I find this solution to be the most readable.
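As one example of a built-in approach (my own sketch, not the answer's method): melt the synonym table as well, merge it onto the melted texts so every text is paired with every synonym of its own customer, test all pairs at once, then reduce back to one flag per row with a groupby:

```python
import pandas as pd

customer_df = pd.DataFrame.from_dict(
    {'A': ['AA', 'AA', 'AAA'], 'B': ['B', 'BB', 'BBB'], 'C': ['C', 'CC', 'CCC']},
    orient='index')
text = [['A', 'Hello i am AA', 'Hello i am BB', 'Hello i am A'],
        ['B', 'Hello i am B', 'Hello i am BBB', 'Hello i am BB'],
        ['C', 'Hello i am AAA', 'Hello i am CC', 'Hello i am CCC']]
text_df = pd.DataFrame(text).set_index(0).rename_axis("customer")

df = pd.melt(text_df.reset_index(), id_vars="customer", var_name="columns")
syn = pd.melt(customer_df.reset_index(), id_vars="index",
              value_name="synonym").rename(columns={"index": "customer"})

# Pair every text with every synonym of its own customer, test them all,
# then reduce back to one flag per original (customer, columns) row.
pairs = df.merge(syn[["customer", "synonym"]], on="customer")
pairs["hit"] = [v.endswith(s) for v, s in zip(pairs["value"], pairs["synonym"])]
df["correct"] = pairs.groupby(["customer", "columns"])["hit"].any().reindex(
    pd.MultiIndex.from_frame(df[["customer", "columns"]])).to_numpy()
print(df)
```

The per-pair endswith check is still Python-level, but the lookup, pairing, and reduction all stay inside pandas, and no row-wise apply is needed.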

