
Pandas dataframe slicing and manipulation

I have a dataframe df1 as follows:

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | a        |   1 |
|      | a        |   2 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | c        |   4 |
|      | c        |   4 |
|      | b        |   5 |
|      | b        |   6 |
|      | d        |   7 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
+------+----------+-----+

and df2 below is sliced from that.

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   5 |
|      | b        |   6 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
|      | b        |   9 |
+------+----------+-----+
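
For reference, such a per-Location slice could be produced with a boolean mask (a sketch, assuming Location is an ordinary column):

# keep only the rows where Location is 'b' (illustrative)
df2 = df1[df1['Location'] == 'b'].copy()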

The goal is to find the time difference between the Key changes in df2 (e.g. from the last 3 to 5, from 5 to 6, from 6 to the first 8, from the last 8 to the first 9, and so on), add them up, repeat this for every Location, and average them.

Can this process be vectorized, or do we need to slice the dataframe for every machine and compute the average manually?

[EDIT]: Applying a groupby/apply approach to my actual data (columns 'SSCM_ Location' and 'Execution Date') gives the following traceback:

Traceback (most recent call last):
  File "<ipython-input-1142-b85a122735aa>", line 1, in <module>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 930, in apply
    return self._python_apply_general(f)
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 936, in _python_apply_general
    self.axis)
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 2273, in apply
    res = f(group)
  File "<ipython-input-1142-b85a122735aa>", line 1, in <lambda>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 1995, in diff
    result = algorithms.diff(com._values_from_object(self), periods)
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 1823, in diff
    out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
TypeError: unsupported operand type(s) for -: 'str' and 'str'
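
The TypeError above comes from calling .diff() on a column of strings. A minimal sketch of a possible fix, assuming 'Execution Date' holds parseable date strings (names taken from the traceback):

import pandas as pd

# convert the string dates to real datetimes so that .diff() yields timedeltas
temp['Execution Date'] = pd.to_datetime(temp['Execution Date'])

s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())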

You can try it with a double groupby: take the first Date of each (Location, Key) group, subtract the last Date of the previous Key within the same Location, and average the gaps per Location:

g = df.groupby(['Location', 'Key'])

# first Date of each Key minus the last Date of the previous Key within the Location,
# then the mean of those gaps per Location (assumes Date is a datetime column)
(g.first() - g.last().groupby('Location').shift()).mean(level=0)

# per Location: keep the rows where Key changes, then average the Date gaps between those rows
s = df.groupby('Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Date'].diff().mean())

Is this what you mean? It averages the time delta between the Dates at which the Key value changes, per Location. If you meant the average change of 'Key' itself, just change 'Date' to 'Key'.
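
For example, the 'Key' variant would be (a sketch, using the sample column names from the question):

# average jump in Key (instead of the Date gap) at each change point, per Location
s_key = df.groupby('Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Key'].diff().mean())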

You can try:

# numpy is needed for np.nan below
import numpy as np

# group by Location
groups = df1.groupby('Location')

# record the Key changes, marking unchanged rows (diff == 0) as NaN
df1['changes'] = groups.Key.diff().replace({0: np.nan})

# average the changes per Location; the NaNs (unchanged rows) are ignored
groups.changes.mean()

Output:

Location
a    1.0
b    1.5
c    NaN
d    NaN
Name: changes, dtype: float64
