
Group based on same items in list in dataframe (python)

I am grouping travellers who travel together, based on the percentage of trips that they take together. This seems similar to another question (Group Python list of lists into groups based on overlapping items), but the conditions are different.

Travellers are grouped together only if they took at least 80% of their trips together. It is fine for the same traveller to appear in more than one group.

Data: (the actual dataset is big and has >1000 trips and travellers)

Traveller  Trips
   A       [Trip_1, Trip_2, Trip_3, Trip_4, Trip_5]
   B       [Trip_1, Trip_2, Trip_3, Trip_4]
   C       [Trip_6, Trip_7]
   D       [Trip_8]
   E       [Trip_2, Trip_3, Trip_4, Trip_5]
   F       [Trip_2, Trip_3, Trip_4, Trip_5]
   G       [Trip_8]

Intended output:

TravelGroup  Traveller
  Group_1       A
  Group_1       B
  Group_2       A
  Group_2       E
  Group_2       F
  Group_3       C
  Group_4       D
  Group_4       G

Note that A and B form one group, and A, E and F form another. However, B and E are not in a group because they match on only 75% of the trips taken (3 of 4).
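To make the percentages concrete, here is a minimal sketch of one possible reading of the rule. The question does not say what the percentage is measured against, so the denominator used here (the longer of the two trip lists) is an assumption, and `overlap_ratio` is a hypothetical helper name. Under that reading A and B match on 80% of trips, while B and E match on only 75%.

```python
def overlap_ratio(trips_a, trips_b):
    """Share of trips two travellers have in common.

    Assumption: the ratio is taken against the longer of the two
    (non-empty) trip lists; the question does not pin this down.
    """
    set_a, set_b = set(trips_a), set(trips_b)
    return len(set_a & set_b) / max(len(set_a), len(set_b))


a = ['Trip_1', 'Trip_2', 'Trip_3', 'Trip_4', 'Trip_5']
b = ['Trip_1', 'Trip_2', 'Trip_3', 'Trip_4']
e = ['Trip_2', 'Trip_3', 'Trip_4', 'Trip_5']

print(overlap_ratio(a, b))  # 4 shared trips out of 5 -> 0.8
print(overlap_ratio(b, e))  # 3 shared trips out of 4 -> 0.75
```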

Really appreciate any help here, thank you very much!

import pandas as pd

df = pd.DataFrame({'Traveller':[*'ABCDE'], 'Trips': [
    ['Trip_1', 'Trip_2', 'Trip_3', 'Trip_4', 'Trip_5'],
    ['Trip_1', 'Trip_2', 'Trip_3', 'Trip_4'],
    ['Trip_1', 'Trip_2'],
    ['Trip_1'],
    ['Trip_2', 'Trip_3', 'Trip_4', 'Trip_5']
    ] })

from itertools import combinations

# Number of distinct trips across all travellers (used as the denominator below)
all_trips = df.explode('Trips')['Trips'].nunique()
all_travelers = set(df.Traveller)

groups, cnt = {'TravelGroup':[], 'Traveller':[]}, 1
# Compare every pair of travellers once
for t1, t2 in combinations(df.Traveller, 2):
    s1 = df.loc[df.Traveller==t1, 'Trips'].iloc[0]
    s2 = df.loc[df.Traveller==t2, 'Trips'].iloc[0]
    # Pair them up if their shared trips cover at least 80% of all distinct trips
    if len(set(s1).intersection(s2)) / all_trips >= 0.8:
        group_name = 'Group_{}'.format(cnt)
        groups['TravelGroup'].extend([group_name, group_name])
        groups['Traveller'].extend([t1, t2])
        cnt += 1

df = pd.DataFrame(groups)
# Travellers that matched nobody each get a group of their own
for t in all_travelers.difference(df.Traveller):
    group_name = 'Group_{}'.format(cnt)
    df.loc[df.shape[0]] = [group_name, t]
    cnt += 1

print(df)

Prints:

  TravelGroup Traveller
0     Group_1         A
1     Group_1         B
2     Group_2         A
3     Group_2         E
4     Group_3         D
5     Group_4         C
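Two things about the code above are worth noting: it divides by the number of distinct trips in the whole dataset, and it only ever emits pairs, so it cannot produce a three-member group such as A, E and F. As a hedged alternative (not the answerer's method), the sketch below treats travellers as nodes, links two travellers when their overlap is at least 80%, and reports every maximal clique as a group. The denominator (the longer of the two trip lists) is an assumption, the helper `maximal_cliques` is a hypothetical name for a plain Bron-Kerbosch search, and only the standard library is used.

```python
from itertools import combinations

# Data from the question, as traveller -> set of trips
trips = {
    'A': {'Trip_1', 'Trip_2', 'Trip_3', 'Trip_4', 'Trip_5'},
    'B': {'Trip_1', 'Trip_2', 'Trip_3', 'Trip_4'},
    'C': {'Trip_6', 'Trip_7'},
    'D': {'Trip_8'},
    'E': {'Trip_2', 'Trip_3', 'Trip_4', 'Trip_5'},
    'F': {'Trip_2', 'Trip_3', 'Trip_4', 'Trip_5'},
    'G': {'Trip_8'},
}
THRESHOLD = 0.8

# Link two travellers when they share at least 80% of trips.
# Assumption: the ratio is measured against the longer trip list.
neighbours = {t: set() for t in trips}
for t1, t2 in combinations(trips, 2):
    shared = len(trips[t1] & trips[t2])
    if shared / max(len(trips[t1]), len(trips[t2])) >= THRESHOLD:
        neighbours[t1].add(t2)
        neighbours[t2].add(t1)


def maximal_cliques(current, candidates, excluded):
    """Bron-Kerbosch: yield every maximal set of mutually linked travellers."""
    if not candidates and not excluded:
        yield sorted(current)
        return
    for v in list(candidates):
        yield from maximal_cliques(current | {v},
                                   candidates & neighbours[v],
                                   excluded & neighbours[v])
        candidates = candidates - {v}
        excluded = excluded | {v}


groups = sorted(maximal_cliques(set(), set(trips), set()))
for i, members in enumerate(groups, start=1):
    for traveller in members:
        print('Group_{}'.format(i), traveller)
```

On the question's data this yields the groups A, B; A, E, F; C; and D, G. The pairwise comparison is quadratic in the number of travellers, and enumerating maximal cliques can be expensive on dense graphs, so for the full dataset it may be worth checking the run time on a sample first.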
