
Pandas dataframe slicing and manipulation

I have a dataframe df1 as follows:

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | a        |   1 |
|      | a        |   2 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | c        |   4 |
|      | c        |   4 |
|      | b        |   5 |
|      | b        |   6 |
|      | d        |   7 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
+------+----------+-----+

and df2 below is sliced from that.

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   5 |
|      | b        |   6 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
|      | b        |   9 |
+------+----------+-----+
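
For reference, such a per-Location slice could be produced with a boolean mask (a sketch, assuming Location is an ordinary column):

# keep only the rows where Location is 'b' (illustrative)
df2 = df1[df1['Location'] == 'b'].copy()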

The goal is to find the time difference between the Key changes in df2 (e.g. from the last 3 to 5, from 5 to 6, from 6 to the first 8, from the last 8 to the first 9, and so on), add them up, repeat this for every Location, and average them.

Can this process be vectorized, or do we need to slice the dataframe for every machine and compute the average manually?

[EDIT]: Applying a groupby/apply approach to my actual data (columns 'SSCM_ Location' and 'Execution Date') gives the following traceback:

Traceback (most recent call last):
  File "<ipython-input-1142-b85a122735aa>", line 1, in <module>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 930, in apply
    return self._python_apply_general(f)
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 936, in _python_apply_general
    self.axis)
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 2273, in apply
    res = f(group)
  File "<ipython-input-1142-b85a122735aa>", line 1, in <lambda>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 1995, in diff
    result = algorithms.diff(com._values_from_object(self), periods)
  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 1823, in diff
    out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
TypeError: unsupported operand type(s) for -: 'str' and 'str'
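
The TypeError above comes from calling .diff() on a column of strings. A minimal sketch of a possible fix, assuming 'Execution Date' holds parseable date strings (names taken from the traceback):

import pandas as pd

# convert the string dates to real datetimes so that .diff() yields timedeltas
temp['Execution Date'] = pd.to_datetime(temp['Execution Date'])

s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())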

You can try it with a double groupby: take the first Date of each (Location, Key) group, subtract the last Date of the previous Key within the same Location, and average the gaps per Location:

g = df.groupby(['Location', 'Key'])

# first Date of each Key minus the last Date of the previous Key within the Location,
# then the mean of those gaps per Location (assumes Date is a datetime column)
(g.first() - g.last().groupby('Location').shift()).mean(level=0)

# per Location: keep the rows where Key changes, then average the Date gaps between those rows
s = df.groupby('Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Date'].diff().mean())

Is this what you mean? It averages the time delta between the Dates at which the Key value changes, per Location. If you meant the average change of 'Key' itself, just change 'Date' to 'Key'.
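
For example, the 'Key' variant would be (a sketch, using the sample column names from the question):

# average jump in Key (instead of the Date gap) at each change point, per Location
s_key = df.groupby('Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Key'].diff().mean())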

You can try:

# numpy is needed for np.nan below
import numpy as np

# group by Location
groups = df1.groupby('Location')

# record the Key changes, marking unchanged rows (diff == 0) as NaN
df1['changes'] = groups.Key.diff().replace({0: np.nan})

# average the changes per Location; the NaNs (unchanged rows) are ignored
groups.changes.mean()

Output:

Location
a    1.0
b    1.5
c    NaN
d    NaN
Name: changes, dtype: float64
