Intersection / subset of pandas string columns

Question

I have a pandas data frame with payments of the following structure:

>> print(df)

id      time      amount      seller     buyer
-------------------------------------------------
1       07:01     16.00       Jack       Rose
2       07:03     14.00       Alice      Bob
3       07:05     95.00       Jim        Larry
...     ...       ...         ...        ...
9999    18:16     81.00       Rose       Alice

How do I find the "closed-members" payments network from this?

For example, if I would like to find a subset of the data which contains only payments that {Rose, Alice, Jim} made strictly between each other, then the below may work:

members = ['Rose', 'Alice', 'Jim']
df_subset = df[df.seller.isin(members) & df.buyer.isin(members)]

But how does one retrieve the largest such network, i.e. not just for 3 people but for the maximum possible number of people in the data frame?

I already tried variations of the below:

df_subset = df[df.seller.isin(df.buyer.unique())]
df_subset = df_subset[df_subset.buyer.isin(df_subset.seller.unique())]

This is not successful, however, since afterwards df_subset.seller.unique() and df_subset.buyer.unique() are not the same.

Any help would be appreciated.

I believe in the end df_subset.seller.unique() and df_subset.buyer.unique() should be the same.

Answer 1

This is what you are looking for to get the maximum number of people:

# df[df.seller] would raise a KeyError; select the columns by name instead
a = df['seller'].drop_duplicates()
b = df['buyer'].drop_duplicates()
result = pd.concat([a, b])

Answer 2

IIUC, the following should do what you want:

common_users = set(df["buyer"]).intersection(df["seller"])
df_subset = df[df["buyer"].isin(common_users) & df["seller"].isin(common_users)]
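A single pass like this may not always be enough: dropping rows can remove a name from one column while it still remains in the other. A minimal sketch on a made-up four-row frame (the names A to D and the rows are hypothetical, not taken from the question's data):

```python
import pandas as pd

# Hypothetical toy data: D sells to A, but D only ever buys from C,
# and C never appears as a buyer.
df = pd.DataFrame({
    "seller": ["A", "B", "C", "D"],
    "buyer":  ["B", "A", "D", "A"],
})

# One pass of the intersection filter from this answer
common_users = set(df["buyer"]).intersection(df["seller"])
one_pass = df[df["buyer"].isin(common_users) & df["seller"].isin(common_users)]

sellers_left = set(one_pass["seller"])
buyers_left = set(one_pass["buyer"])
print(sellers_left == buyers_left)  # False: D is still a seller but no longer a buyer
```

Here the row C -> D is dropped because C is never a buyer, which in turn leaves D without any purchase, so the two sets differ after one pass.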

Answer 3

The following solution seems to work. I will provide a sandbox solution as it might become useful for others.

First, let's define a similar pandas data frame as in the question:

import random
import string

import numpy as np
import pandas as pd

# generates strings to be used as names, e.g.: 'hlddldxhys'
def randomString(stringLength=10):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))

# let's generate a set of 600 names
participants = [randomString() for k in range(600)]

# from the generated set, draw 1000 sellers and buyers
seller = np.random.choice(participants, 1000)
buyer = np.random.choice(participants, 1000)

# construct pandas data frame
df = pd.DataFrame([seller, buyer]).T
df.columns = ['seller', 'buyer']

Taking a look at the resulting data frame with print(df):

     seller       buyer
----------------------------
0    bpzroghaxp  evvhhlbiys
1    qsopxbirgn  lwwljadfwg
2    cnllyrzjiz  opbvoodpgw
3    hkzafylzst  slfqtwdeak
...    ...        ...
999  natqsscnlk  ftvjvgtala

The filters hinted at so far (replies from PMende, Tal Avissar, and myself) do work, but only when applied iteratively: with each iteration of df = df[df.seller.isin(df.buyer.unique()) & df.buyer.isin(df.seller.unique())], the sets df.seller.unique() and df.buyer.unique() become more similar to each other. This is repeated until both are the same (see the last if-statement, followed by break):

while True:
    # keep only payments whose seller also buys and whose buyer also sells
    df = df[df.seller.isin(df.buyer.unique()) & df.buyer.isin(df.seller.unique())]
    if len(df.seller.unique()) == len(df.buyer.unique()):
        if (np.sort(df.seller.unique()) == np.sort(df.buyer.unique())).all():
            break
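The same idea can be wrapped in a small function so that it can be tried on a tiny frame. This is only a sketch: the function name closed_network and the toy data are made up for illustration, and the loop stops as soon as a pass would drop no rows, which I believe is equivalent to the two sets being equal.

```python
import pandas as pd

def closed_network(payments: pd.DataFrame) -> pd.DataFrame:
    """Prune rows until every remaining seller is also a buyer and vice versa."""
    current = payments
    while True:
        # True for rows whose seller appears as a buyer and whose buyer appears as a seller
        keep = (current["seller"].isin(current["buyer"])
                & current["buyer"].isin(current["seller"]))
        if keep.all():  # nothing left to drop (also true for an empty frame)
            return current
        current = current[keep]

# Hypothetical toy data, not from the question
toy = pd.DataFrame({
    "seller": ["A", "B", "C", "D"],
    "buyer":  ["B", "A", "D", "A"],
})
result = closed_network(toy)
print(sorted(set(result["seller"])), sorted(set(result["buyer"])))
# ['A', 'B'] ['A', 'B']
```

On this toy frame the first pass drops C -> D, the second pass drops D -> A, and the third pass finds nothing to remove.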

A final check confirms that both df.seller.unique() and df.buyer.unique() have the same length and the same composition:

>> len(df.seller.unique()), len(df.buyer.unique())
(281, 281)

>> (np.sort(df.seller.unique()) == np.sort(df.buyer.unique())).all()
True
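As a side note, the same check can probably be written with plain Python sets, which compares length and composition in one step. A small sketch (the helper name same_members and the two-row frame are hypothetical):

```python
import pandas as pd

def same_members(frame: pd.DataFrame) -> bool:
    # Set equality ignores order and duplicates, so no sorting or length check is needed
    return set(frame["seller"]) == set(frame["buyer"])

closed = pd.DataFrame({"seller": ["A", "B"], "buyer": ["B", "A"]})
not_closed = pd.DataFrame({"seller": ["A", "D"], "buyer": ["B", "A"]})
print(same_members(closed), same_members(not_closed))  # True False
```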

The charts (see: visualisation of solution) show how the sets df.seller.unique() and df.buyer.unique() become more similar to each other with each iteration of the loop.

3 answers

solution1
0 2019-06-10 20:20:55

solution2
0 2019-06-10 21:10:19

solution3
0 ACCEPTED 2019-06-11 00:50:50
