
How to find the most common pair of values in Python from given data

I have data that looks something like this:

Start Time         End Time       Trip Duration    Start Station   End Station 
01/01/17 15:09    01/01/17 15:14     321           A               B
01/02/17 15:09    01/02/17 15:14     321           C               D
12/03/17 15:09    12/03/17 15:14     321           E               F
05/01/17 15:09    05/01/17 15:14     321           B               D
17/02/17 15:09    17/02/17 15:14     321           A               B
12/04/17 15:09    12/04/17 15:14     321           E               H
13/05/17 15:09    13/05/17 15:14     321           S               K
17/01/17 15:09    17/01/17 15:14     321           A               B

Using the following code, I am able to find the most common start station:

start_station = filtered['Start Station'].mode()[0]

I need to find the most common trip, i.e. the pair of start station and end station that occurs most often. According to the data above, the most common trip should be between A and B.

Can anyone please tell me how to find the most common trip?

Use GroupBy.size with nlargest, or sort_values with iloc to select the last value.

The remove_unused_levels function drops MultiIndex level values that are no longer used once the Series has been filtered down.

a = (df.groupby(['Start Station','End Station'])
       .size()
       .nlargest(1)
       .index.remove_unused_levels()
       .tolist()
     )

Or:

a = (df.groupby(['Start Station','End Station'])
       .size()
       .sort_values()
       .iloc[[-1]]
       .index.remove_unused_levels()
       .tolist()
       )

print(a)
[('A', 'B')]
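For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the station columns in the question (the time and duration columns are left out), and idxmax is swapped in as an alternative to nlargest. The names pair_counts and most_common_trip are only illustrative.

```python
import pandas as pd

# Station columns copied from the sample data in the question
df = pd.DataFrame({
    "Start Station": ["A", "C", "E", "B", "A", "E", "S", "A"],
    "End Station":   ["B", "D", "F", "D", "B", "H", "K", "B"],
})

# Number of rows for each (start, end) pair
pair_counts = df.groupby(["Start Station", "End Station"]).size()

# idxmax returns the MultiIndex label of the largest count as a tuple
most_common_trip = pair_counts.idxmax()
print(most_common_trip)
```

Since idxmax returns a single label, it gives a plain tuple rather than a one-element list, and no remove_unused_levels call is needed.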

If you want the output as a DataFrame:

df1 = (df.groupby(['Start Station','End Station'])
       .size()
       .reset_index(name='count')
       .nlargest(1, 'count')[['Start Station','End Station']]
)
print (df1)
  Start Station End Station
0             A           B

Do you need the count as well? Then try this:

import pandas as pd
df = pd.DataFrame({'Start':['A','B','C','D','A'],'End':['B']*5,'Trip Duration':[321]*5})
# count trips per (Start, End) pair, most frequent first
df.groupby(['Start','End'])['Trip Duration'].count().sort_values(ascending=False, na_position='first')
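As a rough sketch of how the top pair and its count could be pulled out of that sorted result, assuming the same small example frame (top_pair and top_count are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "Start": ["A", "B", "C", "D", "A"],
    "End": ["B"] * 5,
    "Trip Duration": [321] * 5,
})

# Count non-null durations per (Start, End) pair, largest first
counts = (
    df.groupby(["Start", "End"])["Trip Duration"]
      .count()
      .sort_values(ascending=False)
)

top_pair = counts.index[0]   # MultiIndex label, a (Start, End) tuple
top_count = counts.iloc[0]   # how many times that pair occurs
print(top_pair, top_count)
```

Note that count() skips rows where Trip Duration is missing, while size() counts every row in the group, so the two can differ on data with NaN values.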

I might do this:

trip = (filtered["Start Station"] + " -> " + filtered["End Station"]).mode()
# mode() returns a Series; its first element here is "A -> B"
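A small runnable version of this idea, assuming filtered holds the station columns from the question (the frame below is typed in by hand for illustration):

```python
import pandas as pd

# Hand-built stand-in for the asker's `filtered` DataFrame
filtered = pd.DataFrame({
    "Start Station": ["A", "C", "E", "B", "A", "E", "S", "A"],
    "End Station":   ["B", "D", "F", "D", "B", "H", "K", "B"],
})

# Join each pair into one label, then take the most frequent label
trips = filtered["Start Station"] + " -> " + filtered["End Station"]
modes = trips.mode()      # a Series; holds several entries if there is a tie
most_common = modes[0]    # first (here: only) most frequent trip
print(most_common)
```

Because mode() returns every value tied for first place, checking len(modes) is a cheap way to notice ties.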

Have a look at the pandas user guide section "Group by: split-apply-combine".

It covers a wide range of aggregation functions.
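To illustrate what those aggregation functions can look like for this question, here is a hedged sketch using named aggregation. The output column names trips and mean_duration are made up for the example, and the frame is rebuilt by hand from the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "Start Station": ["A", "C", "E", "B", "A", "E", "S", "A"],
    "End Station":   ["B", "D", "F", "D", "B", "H", "K", "B"],
    "Trip Duration": [321] * 8,
})

# One row per (start, end) pair with several aggregates at once
summary = df.groupby(["Start Station", "End Station"]).agg(
    trips=("Trip Duration", "count"),         # number of trips for the pair
    mean_duration=("Trip Duration", "mean"),  # average duration for the pair
)

# The pair with the most trips
most_common_trip = summary["trips"].idxmax()
print(summary)
print(most_common_trip)
```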

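Using groupby (this assumes the column names use underscores, e.g. Start_Station; the trip_id column in the output only appears if the dummy column from the value_counts snippet further down has already been added):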

import pandas as pd

counts = df.groupby(["Start_Station","End_Station"]).count()

print(counts)

                           Start_Time  End_Time  Trip_Duration  trip_id
Start_Station End_Station                                              
A             B                     3         3              3        3
B             D                     1         1              1        1
C             D                     1         1              1        1
E             F                     1         1              1        1
              H                     1         1              1        1
S             K                     1         1              1        1

Using value_counts and a dummy column:

import pandas as pd

df["trip_id"] = df.Start_Station + df.End_Station

counts = df["trip_id"].value_counts()

print(counts)

AB    3
BD    1
EH    1
SK    1
EF    1
CD    1
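One caveat with joining the names without a separator is that different pairs can collide, for example "AB" + "C" and "A" + "BC" both give "ABC". A possible alternative, sketched here on the assumption that pandas 1.1 or newer is available (DataFrame.value_counts was added in that release), counts the two columns directly without a dummy column:

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Station": ["A", "C", "E", "B", "A", "E", "S", "A"],
    "End_Station":   ["B", "D", "F", "D", "B", "H", "K", "B"],
})

# Count unique (start, end) rows; the result is sorted by count, descending
counts = df[["Start_Station", "End_Station"]].value_counts()

# Label of the most frequent pair, as a tuple
most_common = counts.idxmax()
print(most_common, counts.iloc[0])
```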

