I have a pandas data frame with payments of the following structure:
>> print(df)
id      time    amount   seller   buyer
-------------------------------------------------
1       07:01   16.00    Jack     Rose
2       07:03   14.00    Alice    Bob
3       07:05   95.00    Jim      Larry
...     ...     ...      ...      ...
9999    18:16   81.00    Rose     Alice
How do I find the "closed-members" payments network from this?
For example, if I would like to find a subset of the data which contains only payments that {Rose, Alice, Jim} made strictly between each other, then the below may work:
members = ['Rose', 'Alice', 'Jim']
df_subset = df[df.seller.isin(members) & df.buyer.isin(members)]
But how does one retrieve the largest such network, i.e. not just for 3 people but for the maximum possible number of people in the data frame?
I already tried variations of the below:
df_subset = df[df.seller.isin(df.buyer.unique())]
df_subset = df_subset[df_subset.buyer.isin(df_subset.seller.unique())]
This is not successful, however, since afterwards df_subset.seller.unique() and df_subset.buyer.unique() are not the same.
Any help would be appreciated.
I believe in the end df_subset.seller.unique() and df_subset.buyer.unique() should be the same.
This may be what you are looking for to get the maximum number of people:
a = df['seller'].drop_duplicates()
b = df['buyer'].drop_duplicates()
result = pd.concat([a, b])
IIUC, the following should do what you want:
common_users = set(df["buyer"]).intersection(df["seller"])
df_subset = df[df["buyer"].isin(common_users) & df["seller"].isin(common_users)]
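A single intersection pass is not always enough, because removing rows can itself remove the last occurrence of a name on one side. Below is a minimal sketch on made-up data (the names A to D and the helper one_pass are hypothetical, not from the question) where the first pass leaves the seller and buyer sets different and a second pass is needed:

```python
import pandas as pd

# Hypothetical toy data: D only ever sells, and C only ever buys from D.
df = pd.DataFrame({
    "seller": ["A", "B", "C", "D"],
    "buyer":  ["B", "A", "A", "C"],
})

def one_pass(frame):
    # Keep only payments whose seller and buyer both appear on both sides.
    common = set(frame["buyer"]).intersection(frame["seller"])
    return frame[frame["buyer"].isin(common) & frame["seller"].isin(common)]

first = one_pass(df)      # drops D -> C, so C no longer appears as a buyer
second = one_pass(first)  # drops C -> A, leaving only A <-> B

print(set(first["seller"]), set(first["buyer"]))
print(set(second["seller"]), set(second["buyer"]))
```

After the first pass C is still a seller but no longer a buyer, so the two sets only agree after the filter is applied again.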
The following solution seems to work. I will provide a sandbox solution as it might become useful for others.
First, let's define a similar pandas data frame as in the question:
import random
import string
import numpy as np
import pandas as pd

# generates strings to be used as names, e.g.: 'hlddldxhys'
def randomString(stringLength=10):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))

# let's generate a set of 600 names
participants = []
for k in range(600):
    participants.append(randomString())

# from the generated set, draw 1000 sellers and buyers
seller = np.random.choice(participants, 1000)
buyer = np.random.choice(participants, 1000)

# construct pandas data frame
df = pd.DataFrame([seller, buyer]).T
df.columns = ['seller', 'buyer']
Taking a look at the resulting data frame with print(df):
       seller       buyer
----------------------------
0      bpzroghaxp   evvhhlbiys
1      qsopxbirgn   lwwljadfwg
2      cnllyrzjiz   opbvoodpgw
3      hkzafylzst   slfqtwdeak
...    ...          ...
999    natqsscnlk   ftvjvgtala
While some have hinted towards a solution (replies from PMende, Tal Avissar, and myself), it seems that it does work, but only iteratively: with each iteration of df = df[df.seller.isin(df.buyer.unique()) & df.buyer.isin(df.seller.unique())] the sets df.seller.unique() and df.buyer.unique() become more similar to each other. This is repeated until they are both the same (see the last if-statement, followed by break):
while True:
    df = df[df.seller.isin(df.buyer.unique()) & df.buyer.isin(df.seller.unique())]
    if len(df.seller.unique()) == len(df.buyer.unique()):
        if (np.sort(df.seller.unique()) == np.sort(df.buyer.unique())).all():
            break
A final check confirms that both df.seller.unique() and df.buyer.unique() are of the same length and of the same composition:
>> len(df.seller.unique()), len(df.buyer.unique())
(281, 281)
>> (np.sort(df.seller.unique()) == np.sort(df.buyer.unique())).all()
True
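For reuse, the loop can be wrapped in a function. The following is only a sketch of the same idea: the function name closed_network and the toy payments are my own assumptions, and it compares Python sets instead of sorted arrays, which should be equivalent here.

```python
import pandas as pd

def closed_network(frame):
    """Repeatedly drop payments until sellers and buyers are the same set."""
    while True:
        sellers = set(frame["seller"])
        buyers = set(frame["buyer"])
        if sellers == buyers:
            return frame
        # Keep only payments between people who appear on both sides.
        common = sellers & buyers
        frame = frame[frame["seller"].isin(common) & frame["buyer"].isin(common)]

# Hypothetical payments, reusing the names from the question.
payments = pd.DataFrame({
    "seller": ["Jack", "Alice", "Jim", "Rose", "Alice", "Jim"],
    "buyer":  ["Rose", "Bob", "Larry", "Alice", "Jim", "Rose"],
})

result = closed_network(payments)
print(result)
```

The loop always stops, because every pass that finds a mismatch removes at least one row, and an empty frame trivially has equal (empty) seller and buyer sets.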
The charts below visualise how the sets df.seller.unique() and df.buyer.unique() become similar to each other with each iteration of the loop:
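The same convergence can also be followed numerically by recording the row count and the number of unique sellers and buyers at each step. This is a rough sketch under my own assumptions: the seed, the generated names p0 to p599 and the history list are illustrative only, so the exact numbers will differ from the 281 reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for repeatability
names = [f"p{i}" for i in range(600)]
df = pd.DataFrame({
    "seller": rng.choice(names, 1000),
    "buyer": rng.choice(names, 1000),
})

history = []  # (rows, unique sellers, unique buyers) per step
while True:
    sellers, buyers = set(df["seller"]), set(df["buyer"])
    history.append((len(df), len(sellers), len(buyers)))
    if sellers == buyers:
        break
    common = sellers & buyers
    df = df[df["seller"].isin(common) & df["buyer"].isin(common)]

for step, (rows, n_sellers, n_buyers) in enumerate(history):
    print(step, rows, n_sellers, n_buyers)
```

The last printed line should show equal seller and buyer counts, and the row count should shrink from one step to the next.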