Python Pandas: How to find patterns of combinations (Combinations of Combinations) - time series

Started from here: unique combinations of values in selected columns in pandas data frame and count

I have found the most to least occurring combinations of 3 columns with this code:

def common_cols(df, n):
    '''n is how many of the top results to show'''

    # Count each distinct (A, B, C) combination
    counts = df.groupby(['A', 'B', 'C']).size().reset_index(name='count')

    # Most frequent combinations first, keep the top n
    return counts.sort_values(by='count', ascending=False).reset_index(drop=True).head(n)

common_data = common_cols(df, 10)

Output of common_data (top 10 results shown):

      A     B       C      count
0    0.00  0.00    0.00     96
1    0.00  1.00    0.00     25
2    0.14  0.86    0.00     19
3    0.13  0.87    0.00     17
4    0.00  0.72    0.28     17
5    0.00  0.89    0.11     16
6    0.01  0.84    0.15     16
7    0.03  0.97    0.00     15
8    0.35  0.65    0.00     15
9    0.13  0.79    0.08     14 

Now I would like to find combinations of consecutive (A, B, C) rows and count how many times each one occurs.

For example, take rows 0 to 4 of the base df:

These are the first combinations of the 3 columns, as they appear in the dataframe (df) BEFORE applying the common_cols function:

# each of these rows is its own combination of values
       A    B     C
0    0.67  0.16  0.17
1    0.06  0.73  0.20
2    0.19  0.48  0.33
3    0.07  0.87  0.06
4    0.07  0.60  0.33

The above 5 rows (in order) would be counted as one pattern of combinations. A pattern could span 2 rows, 3 rows, 4 rows or more (if that is easy enough to do!).

If this pattern was found once (across the entire dataframe), its count would be 1. If it was found 10 times, the count would be 10.

Any ideas on how I can count the combinations found across consecutive rows? Something like the common_cols function, but for 'combinations of combinations'?

The rows have to be in order for it to be a pattern. Any help is massively appreciated!

I used integers for this test dataframe, but if your groupby above works, this should also work for your data:

import numpy as np
import pandas as pd

df_size = 1000000
# Three columns of random integers in [0, 20)
df = pd.DataFrame({'A': np.random.randint(20, size=df_size),
                   'B': np.random.randint(20, size=df_size),
                   'C': np.random.randint(20, size=df_size)})

print(df.head())
    A   B   C
0  12  12   5
1  19  12  12
2  14  11  15
3  11  14   8
4  13  16   2

The code below uses zip to build a list called source of the (A, B, C) triplets, one per row. The tmp variable is a generator that yields successively "shifted" copies of the source list: source[0:], source[1:], source[2:], and so on.

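Finally, zip(*tmp) groups the elements at the same position in each shifted copy. For n_consecutive = 2, for example, it generates [(source[0], source[1]), (source[1], source[2]), ...]. Because zip stops at the shortest input, only complete windows are produced.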

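# One (A, B, C) tuple per row, in row order
source = list(zip(df['A'], df['B'], df['C']))
n_consecutive = 3

# Shifted copies of source, zipped into one tuple of n_consecutive rows per window
tmp = (source[i:] for i in range(n_consecutive))
output = pd.Series(list(zip(*tmp)))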
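To make the shifted-zip step easier to see, here is a minimal sketch on a toy list of five letters instead of the (A, B, C) tuples. The toy data and the names shifted and windows are only for illustration and are not part of the code above.

```python
# Toy stand-in for the list of (A, B, C) tuples.
source = ["a", "b", "c", "d", "e"]
n_consecutive = 3

# Three copies of the list, each starting one position later.
shifted = [source[i:] for i in range(n_consecutive)]

# zip stops at the shortest copy, so only complete windows are produced.
windows = list(zip(*shifted))
print(windows)
# [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
```

A list of length N yields N - n_consecutive + 1 windows, each overlapping the next by n_consecutive - 1 elements.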

For this example, output.value_counts() gives the number of times each sequence of 3 consecutive (A, B, C) triplets occurs:

print(output.value_counts().head())
((6, 19, 14), (19, 12, 6), (13, 7, 10))    2
((2, 18, 12), (17, 2, 19), (7, 19, 19))    1
((10, 2, 3), (1, 18, 8), (3, 6, 19))       1
((16, 15, 14), (11, 2, 9), (14, 14, 8))    1
((3, 3, 7), (13, 9, 3), (18, 15, 6))       1
dtype: int64
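The question also asks about patterns of 2, 3, 4 or more rows. As a rough sketch, the same approach could be wrapped in a helper and run once per window length. The helper name count_consecutive_patterns, its parameters, and the small test frame are hypothetical and chosen only for illustration.

```python
import pandas as pd

def count_consecutive_patterns(df, cols, n_consecutive):
    """Count every run of n_consecutive rows (overlapping windows included)."""
    # One tuple per row, in row order.
    source = list(zip(*(df[c] for c in cols)))
    # Shifted copies, zipped into one tuple of rows per window.
    shifted = [source[i:] for i in range(n_consecutive)]
    return pd.Series(list(zip(*shifted))).value_counts()

# Hypothetical toy frame: two distinct rows that alternate.
small = pd.DataFrame({
    "A": [1, 2, 1, 2, 1],
    "B": [0, 5, 0, 5, 0],
    "C": [3, 3, 3, 3, 3],
})

# One value_counts Series per window length.
counts_by_length = {
    n: count_consecutive_patterns(small, ["A", "B", "C"], n) for n in (2, 3)
}
for n, counts in counts_by_length.items():
    print(n, counts.head(), sep="\n")
```

In this toy frame each of the two 2-row patterns occurs twice. Of the two 3-row patterns, one occurs twice and the other once. Longer windows on real data become rarer quickly, so most counts will be 1.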

Note that overlapping matches are each counted, which may be more than you want depending on what you are looking for. For example, if the base df has the same record three times in a row and you're looking for patterns of 2 consecutive rows:

(1, 3, 4)
(1, 3, 4)
(1, 3, 4)

In that case it will find the pattern (1, 3, 4), (1, 3, 4) twice: once at rows 0-1 and once at rows 1-2.
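If overlapping matches should not be counted, one possible alternative is a greedy left-to-right scan that skips any window overlapping the previous match of the same pattern. This is only a sketch under that assumption. The function count_non_overlapping is hypothetical, and a plain Python loop like this will be much slower than the zip approach on a large dataframe.

```python
from collections import Counter

def count_non_overlapping(source, n_consecutive):
    """Greedy count: a pattern is counted again only after its last match ends."""
    counts = Counter()
    next_free = {}  # pattern -> first index where a new match may start
    for start in range(len(source) - n_consecutive + 1):
        window = tuple(source[start:start + n_consecutive])
        if start >= next_free.get(window, 0):
            counts[window] += 1
            next_free[window] = start + n_consecutive
    return counts

rows = [(1, 3, 4)] * 3
pattern = ((1, 3, 4), (1, 3, 4))

# Overlapping count, equivalent to the zip approach with n_consecutive = 2.
overlapping = Counter(zip(rows, rows[1:]))
non_overlapping = count_non_overlapping(rows, 2)

print(overlapping[pattern])      # 2
print(non_overlapping[pattern])  # 1
```

Which count is right depends on what a "pattern occurrence" should mean for your data. The greedy rule is just one reasonable choice.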
