I have a pandas DataFrame df:
import pandas as pd
df = pd.DataFrame({"ID": [2,3,4,5,6,7,8,9,10],
"type" :["A", "B", "B", "A", "A", "B", "A", "A", "A"],
"F_ID" :["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"]})
which looks like:
F_ID ID type
0 0 2 A
1 [7 8 9] 3 B
2 [10] 4 B
3 0 5 A
4 [2] 6 A
5 0 7 B
6 0 8 A
7 0 9 A
8 0 10 A
Here, F_ID is a column that tells which records match that particular record, based on a certain calculation. It gives the matching ID values. So ID 3 matches IDs 7, 8 and 9.
I want a list of all type B IDs and their associated records, with each matching ID from column F_ID in its own column. The number of such columns can vary according to the values, as shown below:
ID type F_ID_1 F_ID_2
3 B 8 9
4 B 10
7 B
I don't need the F_ID values that are themselves of type B. For example, ID 3 has 7, 8 and 9 as matching IDs, but since ID 7 is of type B it should not be listed as an F_ID; only 8 and 9 should be listed.
How can I do this with pandas in Python?
If I understand your intent, F_ID is a string representation of a list?
If so, let's convert it to actual lists:
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID": [2,3,4,5,6,7,8,9,10],
"type" :["A", "B", "B", "A", "A", "B", "A", "A", "A"],
"F_ID" :["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"]})
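# convert the string representations of list structures to actual lists
# regex=False makes "[" and "]" literal characters regardless of the pandas default
F_ID_as_series_of_lists = df["F_ID"].str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.split(" ")
# type(F_ID_as_series_of_lists) is pd.Series, make it a list for pd.DataFrame.from_records
F_ID_as_records = list(F_ID_as_series_of_lists)
# rows with fewer F_IDs than the longest list get missing values in the extra columns
f_id_df = pd.DataFrame.from_records(F_ID_as_records).fillna(np.nan)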
f_id_df
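As an alternative to chaining str.replace and str.split, the digits could probably be pulled out with a regular expression instead. This is only a sketch: the name f_id_lists and the decision to treat "0" as "no match" are assumptions, not something stated in the question.

```python
import pandas as pd

df = pd.DataFrame({"ID": [2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "type": ["A", "B", "B", "A", "A", "B", "A", "A", "A"],
                   "F_ID": ["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"]})

# Pull every run of digits out of each string, e.g. "[7 8 9]" -> ["7", "8", "9"]
f_id_lists = df["F_ID"].str.findall(r"\d+")

# Assumption: "0" is a placeholder meaning "no match", so drop it and cast to int
f_id_lists = f_id_lists.apply(lambda ids: [int(i) for i in ids if i != "0"])

print(f_id_lists.tolist())
```

Working with integers from the start also avoids comparing string F_IDs against the numeric ID column later on.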
Now let's join the split F_IDs to the original DataFrame:
combined_df = df.merge(f_id_df, left_index = True, right_index = True, how = "inner")
combined_df = combined_df.drop("F_ID", axis = 1).sort_values(["type", "ID"])
combined_df
However, we need to omit the F_IDs that appear as an ID within the same type, i.e. as 7 is an ID in type == "B", we want to exclude it where ID == 3 and type == "B", even though it is in the list of F_IDs.
To achieve this, let's create a mapping of ID / type to F_ID.
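# stack() yields one row per (ID, type, F_ID); dropna() removes the padding explicitly in case stack() keeps missing values
mapping_df = pd.DataFrame(combined_df.set_index(["ID", "type"]).stack().dropna()).reset_index().drop("level_2", axis = 1)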
mapping_df.columns = ["ID", "type", "F_ID"]
mapping_df
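On pandas 0.25 or later, the same long-format mapping can probably be built more directly with DataFrame.explode. A sketch, assuming F_ID has already been parsed into lists of strings (the parsed_df frame below is hypothetical and written out by hand):

```python
import pandas as pd

# Hypothetical frame in which F_ID already holds lists instead of strings
parsed_df = pd.DataFrame({
    "ID": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "type": ["A", "B", "B", "A", "A", "B", "A", "A", "A"],
    "F_ID": [["0"], ["7", "8", "9"], ["10"], ["0"], ["2"],
             ["0"], ["0"], ["0"], ["0"]],
})

# explode() emits one row per list element and repeats ID and type alongside it
mapping_df = parsed_df.explode("F_ID").reset_index(drop=True)

# Same column order as the stack-based version
mapping_df = mapping_df[["ID", "type", "F_ID"]]
print(mapping_df)
```

This skips the intermediate merge and the padded columns, at the cost of requiring a newer pandas.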
Now to do the filtering, we could probably do some impressive joining, but a query for this example is easier to read should we have to come back to this:
def is_fid_of_same_type(row, df):
    # look for a record whose ID equals this row's F_ID and whose type equals this row's type
    query = "ID == {row_fid} & type == '{row_type}'".format(
        row_fid = row["F_ID"],
        row_type = row["type"]
    )
    matches_df = df.query(query)
    row["fid_in_type_id"] = len(matches_df) > 0
    return row
Now apply this function to each row, and drop the rows whose F_ID appears as an ID within the same type.
df = mapping_df.apply(lambda row: is_fid_of_same_type(row, mapping_df), axis = 1)
df = df[df["fid_in_type_id"] == False].drop("fid_in_type_id", axis = 1)
df
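For comparison, the joining approach mentioned earlier might look roughly like the sketch below. It assumes the F_IDs have been converted to integers so they can be joined against ID; the names id_types, F_type, joined and filtered are made up for illustration.

```python
import pandas as pd

# Same rows as the long-format mapping, written out by hand with integer F_IDs
mapping_df = pd.DataFrame({
    "ID":   [2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10],
    "type": ["A", "B", "B", "B", "B", "A", "A", "B", "A", "A", "A"],
    "F_ID": [0, 7, 8, 9, 10, 0, 2, 0, 0, 0, 0],
})

# Lookup of each ID's type, renamed so it can be attached to the F_ID side
id_types = (mapping_df[["ID", "type"]]
            .drop_duplicates()
            .rename(columns={"ID": "F_ID", "type": "F_type"}))

# Left join: F_type is missing when the F_ID is not a known ID (e.g. 0)
joined = mapping_df.merge(id_types, on="F_ID", how="left")

# True only where the F_ID refers to a record of the same type as the row itself
same_type = joined["F_type"].notna() & (joined["F_type"] == joined["type"])

# Keep everything else and drop the helper column
filtered = joined[~same_type].drop(columns="F_type")
print(filtered)
```

A single merge avoids running one query per row, which may matter on larger frames; the row-wise version is arguably easier to follow.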
Then, to have the F_IDs as a list rather than individual rows, use DataFrame.groupby() and then apply(list).
group_columns = ['type', 'ID']
df = df.groupby(group_columns)['F_ID'].apply(list).reset_index()
df = df.sort_values(group_columns).set_index(group_columns)
df
Which results in one list of F_IDs per type and ID.
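The question asks for separate F_ID_1, F_ID_2, ... columns rather than lists. A minimal sketch of that last step, assuming only the type B rows are kept, the F_IDs are integers, and 0 means "no match" (the grouped frame below is written out by hand, and the names lists, wide and result are illustrative):

```python
import pandas as pd

# Filtered lists for the type B rows, with 0 standing for "no match"
grouped = pd.DataFrame({
    "ID": [3, 4, 7],
    "type": ["B", "B", "B"],
    "F_ID": [[8, 9], [10], [0]],
})

# Drop the 0 placeholder so rows without matches end up empty
lists = grouped["F_ID"].apply(lambda ids: [i for i in ids if i != 0])

# One column per list position; shorter lists are padded with missing values
wide = pd.DataFrame(lists.tolist(), index=grouped.index)
wide.columns = ["F_ID_{}".format(i + 1) for i in range(wide.shape[1])]

# Nullable integers keep 8 from turning into 8.0 next to the missing values
wide = wide.astype("Int64")

result = pd.concat([grouped[["ID", "type"]], wide], axis=1)
print(result)
```

The number of F_ID_n columns follows the longest list, so it varies with the data as the question requires.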