How to quickly create edge lists (itertools combinations style) from a boolean indexed pandas dataframe (or other fast solution?)

I'm attempting to create an edge list (a unique set of a;b, a;c, a;f, etc., where a;b == b;a) from a very large (long) pandas dataframe which has two columns. The required edges connect every pair of values in one column (B) whose rows share the same value in the other column (A). The example below shows this:

import pandas as pd
df1 = pd.DataFrame({'A':['Mary', 'Mary', 'Mary', 'Clive','Clive','Clive', 'John', 'John'],
                   'B':['Apples','Oranges','Strawberries','Apples','Pears','Bananas','Bananas','Pears']})

And this dataframe looks like:

    A   B
0   Mary    Apples
1   Mary    Oranges
2   Mary    Strawberries
3   Clive   Apples
4   Clive   Pears
5   Clive   Bananas
6   John    Bananas
7   John    Pears

with the intended output looking like this:

Apples; Oranges
Apples; Strawberries
Oranges; Strawberries
Apples; Pears
Apples; Bananas
Pears; Bananas

My current solution is extremely slow, and loops over unique values of A (with some pre-filtering to ensure the count of A is >1 (otherwise no pairwise edge)), taking boolean indexes of the dataframe:

for person in df1['A'].unique():
    temp = df1[df1['A'] == person]  # boolean index: all rows for this person
    ...
    pairs = list(itertools.combinations(temp['B'], 2))  # pairwise edges within this person's rows

However, because my df1 in reality is extremely big, this is taking an inordinate amount of time: is there some trick here using lambdas and stacking that I am missing? Really appreciate any help!

How about this (with itertools imported)?

In [10]: df1.groupby('A')['B'].apply(lambda x : list(itertools.combinations(x,2)))  
Out[10]:
A
Clive    [(Apples, Pears), (Apples, Bananas), (Pears, B...
John                                    [(Bananas, Pears)]
Mary     [(Apples, Oranges), (Apples, Strawberries), (O...
Name: B, dtype: object
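
If the per-group apply is still too slow, one possible alternative is a self-merge on A, which stays inside pandas. This is only a minimal sketch under the assumptions of the example data: the names pairs, edges and edge_strings are made up for illustration, and the merge builds every ordered pair per group in memory before filtering, so it may not suit groups with very many rows. Keeping only rows where B_x < B_y fixes one orientation per edge, so the result contains Bananas;Pears rather than Pears;Bananas.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['Mary', 'Mary', 'Mary', 'Clive', 'Clive', 'Clive', 'John', 'John'],
    'B': ['Apples', 'Oranges', 'Strawberries', 'Apples', 'Pears', 'Bananas', 'Bananas', 'Pears'],
})

# Self-join on A: one row for every ordered (B_x, B_y) pair that shares a value of A.
pairs = df1.merge(df1, on='A')

# Keep a single orientation per pair (B_x < B_y), which also drops self-pairs,
# then drop edges that appear under more than one value of A.
edges = (
    pairs.loc[pairs['B_x'] < pairs['B_y'], ['B_x', 'B_y']]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Optional: the same edges as 'a;b' strings.
edge_strings = sorted(edges['B_x'] + ';' + edges['B_y'])
```
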

This is superb, really great! Many thanks! It doesn't treat a;b and b;a (i.e., specifically, the tuples (Bananas, Pears) and (Pears, Bananas)) as the same, so for the future, here's an (inefficient) expansion to unpack the edges into a set:

df2 = pd.DataFrame(df1.groupby('A')['B'].apply(lambda x: list(itertools.combinations(x, 2))))
set_of_edges = set()
for pairs in df2['B'].tolist():  # one list of tuples per value of A
    for pair in pairs:
        # add the edge only if neither a;b nor b;a has been seen already
        if (pair[0] + ';' + pair[1] not in set_of_edges) and \
           (pair[1] + ';' + pair[0] not in set_of_edges):
            set_of_edges.add(pair[0] + ';' + pair[1])
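
A possibly simpler variant of the same idea, as a hedged sketch rather than a benchmarked solution: sorting each group's values before calling itertools.combinations gives every pair a canonical order, so (Bananas, Pears) and (Pears, Bananas) collapse into one tuple and the set removes duplicates without the two string lookups. The variable names here (fruits, edge_strings) are illustrative only.

```python
import itertools
import pandas as pd

df1 = pd.DataFrame({
    'A': ['Mary', 'Mary', 'Mary', 'Clive', 'Clive', 'Clive', 'John', 'John'],
    'B': ['Apples', 'Oranges', 'Strawberries', 'Apples', 'Pears', 'Bananas', 'Bananas', 'Pears'],
})

set_of_edges = set()
for _, fruits in df1.groupby('A')['B']:
    # sorted(set(...)) removes repeats within a group and fixes the order,
    # so each unordered pair is always produced as the same tuple.
    set_of_edges.update(itertools.combinations(sorted(set(fruits)), 2))

# Optional: render the tuples as 'a;b' strings.
edge_strings = sorted(a + ';' + b for a, b in set_of_edges)
```
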
