Creating ranges from sequential values, while maintaining other columns in pandas
I'm trying to find a way to consolidate sequential (consecutive) numbers into a range, grouped by another column.
I've tried pynumparser and itertools, but I'm not clever enough to implement them to get the results I'm looking for. Looking for some assistance and/or ideas. Thank you!
| test_var | F1 |
|------------|------|
| ABC | 1 |
| ABC | 2 |
| DEF | 3 |
| ABC | 4 |
| ABC | 5 |
| GHI | 1 |
| GHI | 2 |
| ABC | 6 |
F1_range is supposed to represent the min and max of each set of sequential values per test_var, of which there may be several.
A simple example is "GHI". For F1 there is only one set of sequential values, 1-2.
A more complicated example is "ABC", which has two sets of sequential values, 1-2 and 4-6.
| test_var | F1 | F1_range |
|------------|------|------------|
| ABC | 1 | 1-2 |
| ABC | 2 | 1-2 |
| DEF | 3 | 3 |
| ABC | 4 | 4-6 |
| ABC | 5 | 4-6 |
| GHI | 1 | 1-2 |
| GHI | 2 | 1-2 |
| ABC | 6 | 4-6 |
    import pandas as pd

    df = pd.DataFrame(data={
        'test_var': {0: 'ABC', 1: 'ABC', 2: 'DEF', 3: 'ABC',
                     4: 'ABC', 5: 'GHI', 6: 'GHI', 7: 'ABC'},
        'F1': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 1, 6: 2, 7: 6},
    })
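One way the F1_range column described above might be computed is sketched below. This is only an illustration under stated assumptions, not a method taken from the thread: it assumes F1 holds integers with no duplicates within a test_var, and the names `ordered`, `new_run`, `run_id`, `low`, `high` and `labels` are made up for the example. The idea is to sort by test_var and F1, start a new run wherever test_var changes or the gap to the previous value is not exactly 1, and then label each run with its min and max.

```python
import pandas as pd

df = pd.DataFrame({
    'test_var': ['ABC', 'ABC', 'DEF', 'ABC', 'ABC', 'GHI', 'GHI', 'ABC'],
    'F1': [1, 2, 3, 4, 5, 1, 2, 6],
})

# Sort so that consecutive values of the same test_var sit next to each other.
ordered = df.sort_values(['test_var', 'F1'])

# A new run starts when test_var changes or the step from the previous F1 is not 1.
new_run = (ordered['test_var'] != ordered['test_var'].shift()) | (ordered['F1'].diff() != 1)
run_id = new_run.cumsum()

# Min and max of every run, broadcast back to each row of that run.
low = ordered.groupby(run_id)['F1'].transform('min')
high = ordered.groupby(run_id)['F1'].transform('max')

# Single-value runs keep just the value; longer runs become 'low-high'.
labels = low.astype(str).where(low == high, low.astype(str) + '-' + high.astype(str))

# Assignment aligns on the index, so the original row order is preserved.
df['F1_range'] = labels
print(df)
```

Because a repeated F1 value has a gap of 0 to its predecessor, duplicates within a test_var would each start a new run here; deduplicating first would be needed if that matters.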
Setup, using a variant of the sample data with an extra F2 column:

    import pandas as pd

    df = pd.DataFrame({
        'test_var': ['ABC', 'ABC', 'DEF', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI'],
        'F1': [1, 2, 3, 4, 6, 5, 1, 2],
        'F2': [10, 11, 1, 13, 16, 14, 2, 1]
    })
We assume the index is an ordinary RangeIndex starting from 0 with step 1.
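If that assumption might not hold (for example after filtering or sorting rows), the index could be checked and reset first. A minimal sketch; the helper name `has_default_index` and the small sample frame are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1, e.g. left over from filtering.
df = pd.DataFrame({'test_var': ['ABC', 'ABC', 'DEF'], 'F1': [1, 2, 3]},
                  index=[10, 20, 30])

def has_default_index(frame):
    """True if the index is a RangeIndex that starts at 0 and steps by 1."""
    index = frame.index
    return isinstance(index, pd.RangeIndex) and index.start == 0 and index.step == 1

if not has_default_index(df):
    # drop=True discards the old labels instead of keeping them as a column.
    df = df.reset_index(drop=True)
```

Resetting discards the original labels, so keep a copy of them if they are needed later.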
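1. Find the indexes where test_var differs from the previous row.
2. Split the data vertically at these indexes using numpy.vsplit.
3. Join the min/max values across the columns of interest in each group of the previous split.

    import numpy as np

    columns = ['F1', 'F2']
    ranges = [f'{name}_range' for name in columns]
    df[ranges] = ''

    # Mark every row whose test_var differs from the row above it.
    test_var = df['test_var'].to_numpy()
    changed = np.zeros(len(df), dtype=bool)  # np.bool was removed in NumPy 1.24
    changed[1:] = test_var[1:] != test_var[:-1]

    # Split the frame into runs of identical test_var at those row positions.
    # Splitting a DataFrame through NumPy may emit a deprecation warning in
    # recent pandas versions.
    groups = np.vsplit(df, df.index[changed])

    sep = '-'

    def get_range(index, column):
        """Return 'low-high' for the given rows, or just 'low' if they are equal."""
        data = df.loc[index, column]
        low, high = min(data), max(data)
        return f'{low}{sep}{high}' if low < high else str(low)

    for gr in groups:
        for col, rng in zip(columns, ranges):
            df.loc[gr.index, rng] = get_range(gr.index, col)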
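The same run-based grouping could also be sketched in pandas alone, without splitting the frame through NumPy. This is an alternative formulation rather than the code above: it labels each run of identical adjacent test_var values with `shift` and `cumsum`, then broadcasts each run's min and max with `groupby(...).transform`. The names `block`, `low`, `high` and `both` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'test_var': ['ABC', 'ABC', 'DEF', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI'],
    'F1': [1, 2, 3, 4, 6, 5, 1, 2],
    'F2': [10, 11, 1, 13, 16, 14, 2, 1],
})

# Rows share a block id as long as test_var equals the value in the row above.
block = (df['test_var'] != df['test_var'].shift()).cumsum()

for col in ['F1', 'F2']:
    # Min and max of each block, repeated for every row in that block.
    low = df.groupby(block)[col].transform('min')
    high = df.groupby(block)[col].transform('max')
    both = low.astype(str) + '-' + high.astype(str)
    # Keep 'low-high' only where the block spans more than one value.
    df[f'{col}_range'] = both.where(low < high, low.astype(str))

print(df)
```

Like the loop above, this groups by runs of adjacent rows with the same test_var, so it does not check whether the numbers inside a run are actually consecutive.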