
How to create a pandas dataframe of unique values fetched from a column, with no duplicates

I have a pandas dataframe df :

import pandas as pd

df = pd.DataFrame({"ID": [2,3,4,5,6,7,8,9,10],
              "type" :["A", "B", "B", "A", "A", "B", "A", "A", "A"],
              "F_ID" :["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"]})

which looks like:

      F_ID  ID type
0        0   2    A
1  [7 8 9]   3    B
2     [10]   4    B
3        0   5    A
4      [2]   6    A
5        0   7    B
6        0   8    A
7        0   9    A
8        0  10    A

Here, F_ID is a column which tells which records match that particular record, based on a certain calculation: it gives the matching ID values. So ID 3 matches IDs 7, 8 and 9.

I want a list of all type-B IDs and their associated records, with the matching IDs from column F_ID placed in separate columns. The number of such columns can vary according to the values, as shown below:

ID  type F_ID_1  F_ID_2 
3    B    8      9
4    B    10      
7    B

I don't want to list F_IDs that are themselves of type B. For example, ID 3 has 7, 8 and 9 as matching IDs, but since ID 7 is of type B it should not be listed as an F_ID; only 8 and 9 should appear.

How can I do this with pandas in Python?

If I understand your intent, F_ID is a string representation of a list?

If so, let's convert it to actual lists:

import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [2,3,4,5,6,7,8,9,10],
      "type" :["A", "B", "B", "A", "A", "B", "A", "A", "A"],
      "F_ID" :["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"]})

# convert the string representations of lists to actual lists
# (regex=False so "[" and "]" are treated literally, not as regex syntax)
F_ID_as_series_of_lists = (
    df["F_ID"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.split(" ")
)

# type(F_ID_as_series_of_lists) is pd.Series; make it a list for pd.DataFrame.from_records
F_ID_as_records = list(F_ID_as_series_of_lists)

f_id_df = pd.DataFrame.from_records(F_ID_as_records).fillna(np.nan)
f_id_df
f_id_df

(output: the F_ID column as lists)

Now let's join the split F_IDs to the original DataFrame:

combined_df = df.merge(f_id_df, left_index = True, right_index = True, how = "inner")
combined_df = combined_df.drop("F_ID", axis = 1).sort_values(["type", "ID"])
combined_df

(output: the F_ID columns joined on)

However, we need to omit the F_IDs that appear as an ID within the same type. That is, since 7 is an ID in type == "B", we want to exclude it where ID == 3 and type == "B", even though it is in the list of F_IDs.

To achieve this, let's create a mapping of ID / type to F_ID.

mapping_df = pd.DataFrame(combined_df.set_index(["ID", "type"]).stack()).reset_index().drop("level_2", axis = 1)
mapping_df.columns = ["ID", "type", "F_ID"]
mapping_df

(output: the mapping DataFrame)

Now for the filtering: we could probably do some impressive joining, but for this example a query is easier to read should we have to come back to it:

def is_fid_of_same_type(row, df):
    query = "ID == {row_fid} & type == '{row_type}'".format(
        row_fid = row["F_ID"],
        row_type = row["type"]
    )

    matches_df = df.query(query)

    row["fid_in_type_id"] = len(matches_df) > 0
    return row

Now apply this function to each row, and drop the rows whose F_ID appears as an ID within the same type.

df = mapping_df.apply(lambda row: is_fid_of_same_type(row, mapping_df), axis = 1)
df = df[df["fid_in_type_id"] == False].drop("fid_in_type_id", axis = 1)
df

(output: the unwanted rows are excluded)
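For the record, the "impressive joining" alternative mentioned above can be done without a row-wise apply: within each type group, drop rows whose F_ID also appears as an ID of that group. A vectorized sketch, assuming `mapping_df` has the shape built above (shown here on the type-B rows only):

```python
import pandas as pd

# mapping_df as produced in the previous step: one row per (ID, type, F_ID) pair
mapping_df = pd.DataFrame({
    "ID":   [3, 3, 3, 4, 7],
    "type": ["B", "B", "B", "B", "B"],
    "F_ID": ["7", "8", "9", "10", "0"],
})

# within each type, drop rows whose F_ID also appears as an ID of that type
def drop_same_type_fids(g):
    return g[~g["F_ID"].astype(int).isin(g["ID"])]

filtered = mapping_df.groupby("type", group_keys=False).apply(drop_same_type_fids)
```

This drops the (ID 3, F_ID 7) row, since 7 is itself a type-B ID, and keeps the rest.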

Then, to have the F_IDs as a list rather than individual rows, use DataFrame.groupby() followed by apply(list).

group_columns = ['type', 'ID']
df = df.groupby(group_columns)['F_ID'].apply(list).reset_index()
df = df.sort_values(group_columns).set_index(group_columns)
df

Which results in:

(output: the final result)
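To match the exact layout asked for in the question (F_ID_1, F_ID_2, … as separate columns rather than one list column), the list column can be expanded afterwards. A sketch, reconstructing the final `df` from the previous step by hand:

```python
import pandas as pd

# df at this point: index (type, ID), column F_ID holding lists
df = pd.DataFrame(
    {"F_ID": [["8", "9"], ["10"], ["0"]]},
    index=pd.MultiIndex.from_tuples(
        [("B", 3), ("B", 4), ("B", 7)], names=["type", "ID"]
    ),
)

# expand each list into its own column; shorter lists are padded with None
wide = pd.DataFrame(df["F_ID"].tolist(), index=df.index)
wide.columns = ["F_ID_" + str(i + 1) for i in wide.columns]
wide
```

The number of F_ID_n columns is determined by the longest list, so it varies with the data, as the question requires.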
