How to split a dataframe each time a string value changes in a column?
I've got a dataframe of the form:
time value label
0 2020-01-01 -0.556014 high
1 2020-01-02 0.185451 high
2 2020-01-03 -0.401111 medium
3 2020-01-04 0.436111 medium
4 2020-01-05 0.412933 high
5 2020-01-06 0.636421 high
6 2020-01-07 1.168237 high
7 2020-01-08 1.205073 high
8 2020-01-09 0.798674 high
9 2020-01-10 0.174116 high
And I'd like to populate a list of dataframes, where a new dataframe starts each time the string in the column label changes. So the first dataframe would be:
time value label
0 2020-01-01 -0.556014 high
1 2020-01-02 0.185451 high
The second dataframe would be:
time value label
2 2020-01-03 -0.401111 medium
3 2020-01-04 0.436111 medium
And so on. The desired list would be [df, df, ...]. If you think that a dict would be a more appropriate container, I wouldn't mind that at all.
There's a similar post named "split data frame pandas if sequence of column value change", but that only handles changes in numeric values. I've made a few attempts but keep running into indexing problems when comparing a row value for label with the previous value. So any suggestions would be great!
Here's a reproducible snippet:
# imports
import pandas as pd
import numpy as np

# settings
observations = 100
np.random.seed(5)
value = np.random.uniform(low=-1, high=1, size=observations).tolist()
# daily dates as 'YYYY-MM-DD' strings
time = pd.date_range('2020', freq='D', periods=observations).strftime('%Y-%m-%d').tolist()
df = pd.DataFrame({'time': time,
                   'value': value})
df['value'] = df['value'].cumsum()

def classify(e):
    if e > 0.75: return 'high'
    if e > 0.25: return 'medium'
    if e >= 0: return 'low'

# min-max normalize value to [0, 1], then map each normalized value to a label
df['label1'] = (df['value'] - df['value'].min()) / (df['value'].max() - df['value'].min())
df['label'] = [classify(elem) for elem in df['label1']]
df = df.drop(columns='label1')
df
I would create a column that increments on each change, then group by that column. If you need separate dataframes, you can collect them in a loop.
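# 'group' increases by 1 each time label differs from the previous row
df['group'] = df['label'].ne(df['label'].shift()).cumsum()
# keep the groupby object under its own name so df is not overwritten
grouped = df.groupby('group')
dfs = []
for name, data in grouped:
    dfs.append(data)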
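To see why the first line produces a usable grouping key, here is a minimal sketch that breaks the expression into its three steps. The frame `demo` and the names `previous`, `changed`, and `run_id` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical six-row frame, used only to show the intermediate steps.
demo = pd.DataFrame({'label': ['high', 'high', 'medium', 'medium', 'high', 'high']})

previous = demo['label'].shift()       # label of the row above; missing for the first row
changed = demo['label'].ne(previous)   # True wherever the label differs from the row above
run_id = changed.cumsum()              # running count of changes, one id per run

print(pd.DataFrame({'label': demo['label'], 'previous': previous,
                    'changed': changed, 'run_id': run_id}))
```

Because the first row is compared with a missing value, it always counts as a change, so the ids start at 1. A label that reappears later (here the second run of 'high') gets a new id rather than being merged with the earlier run, which is what distinguishes this from a plain groupby on label.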
dfs will be a list of dataframes like so:
[ time value label group
0 2020-01-01 -0.556014 high 1
1 2020-01-02 0.185451 high 1,
time value label group
2 2020-01-03 -0.401111 medium 2
3 2020-01-04 0.436111 medium 2,
time value label group
4 2020-01-05 0.412933 high 3
5 2020-01-06 0.636421 high 3
6 2020-01-07 1.168237 high 3
7 2020-01-08 1.205073 high 3
8 2020-01-09 0.798674 high 3
9 2020-01-10 0.174116 high 3]
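Since the question mentions that a dict would also be acceptable, here is a sketch of a variant that builds both containers without adding a helper column to the frame. It assumes that passing the run ids to groupby as a separate Series is acceptable; the five-row frame and the names `run_id` and `dfs_by_run` are made up for the example.

```python
import pandas as pd

# Small frame mirroring the first rows shown in the question.
df = pd.DataFrame({
    'time': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'],
    'value': [-0.556014, 0.185451, -0.401111, 0.436111, 0.412933],
    'label': ['high', 'high', 'medium', 'medium', 'high'],
})

# Run ids kept in a separate Series, so df itself gets no extra column.
run_id = df['label'].ne(df['label'].shift()).cumsum()

# List of dataframes, one per run of identical labels.
dfs = [part for _, part in df.groupby(run_id)]

# Dict of dataframes keyed by run id (1, 2, 3, ...).
dfs_by_run = {key: part for key, part in df.groupby(run_id)}

print(len(dfs))
print(dfs_by_run[2])
```

The dict is keyed by run number rather than by label, because the same label can occur in several separate runs and would otherwise collide.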