Python: how to split pandas DataFrame into subsets based on the value in the first column?
I have a large log file (.txt) from an experiment (up to 100,000 entries), structured as follows:
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
_______________________________________________
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
OSL 75 20 570
OSL 75 20 580
OSL 75 20 590
OSL 75 20 600
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
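I load the log file into Python with pandas' read_table. I want to split the resulting DataFrame into smaller DataFrames based on the value in the first column, so the result would look like this: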
**DATAFRAME 1:**
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
**DATAFRAME 2:**
OSL 75 20 570
OSL 75 20 580
OSL 75 20 590
OSL 75 20 600
**DATAFRAME 3:**
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
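First, I tried to split them using the indices at which the value in the first column changes:
indexSplit = []  # list containing the boundary indices
prevRoutine = log['ROUTINE'][0]  # log is the complete dataframe
i = 1
while i < len(log):
    if prevRoutine != log['ROUTINE'][i]:
        indexSplit.append(i)
        prevRoutine = log['ROUTINE'][i]
    i += 1  # advance to the next row; without this the loop never ends
However, given the size of the log file, doing it this way (obviously) takes a lot of time. Is there an elegant way to do this with pandas? The problem I keep running into is that the values in the first column recur in several separate sequences, so I always end up with dataframe 1 and dataframe 3 in the same subset.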
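You can use a list comprehension that loops over a groupby object whose groups are defined by the Series s. To build s, compare the ROUTINE column with its shifted copy using ne (the same as != but faster), then take the cumsum of the result, so that every run of identical consecutive values gets its own number: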
s = df['ROUTINE'].ne(df['ROUTINE'].shift()).cumsum()
print (s)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ROUTINE, dtype: int32
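To see why this labels each run, it may help to look at the intermediate steps one at a time. The following is a minimal sketch on a shortened stand-in for the ROUTINE column; the names routine, shifted, changed, block_id and steps are made up for illustration and do not appear in the original answer.

```python
import pandas as pd

# Shortened stand-in for the ROUTINE column (values from the question's sample).
routine = pd.Series(["CHANGE T"] * 2 + ["OSL"] * 2 + ["CHANGE T"] * 2,
                    name="ROUTINE")

shifted = routine.shift()      # value of the previous row; missing for row 0
changed = routine.ne(shifted)  # True wherever a new run of values starts
block_id = changed.cumsum()    # running count of run starts = label of the run

# Put the steps side by side for inspection.
steps = pd.DataFrame({"ROUTINE": routine, "shifted": shifted,
                      "changed": changed, "block_id": block_id})
print(steps)
```

Row 0 is always flagged as a change because it is compared with a missing value, which is why the labels start at 1.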
dfs = [g for i,g in df.groupby(df['ROUTINE'].ne(df['ROUTINE'].shift()).cumsum())]
print (dfs)
[ ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
0 CHANGE T 75 0 560
1 CHANGE T 80 0 560
2 CHANGE T 85 0 560
3 CHANGE T 90 0 560, ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
4 OSL 75 20 570
5 OSL 75 20 580
6 OSL 75 20 590
7 OSL 75 20 600, ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
8 CHANGE T 75 0 560
9 CHANGE T 80 0 560
10 CHANGE T 85 0 560
11 CHANGE T 90 0 560]
print (dfs[0])
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
0 CHANGE T 75 0 560
1 CHANGE T 80 0 560
2 CHANGE T 85 0 560
3 CHANGE T 90 0 560
print (dfs[1])
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
4 OSL 75 20 570
5 OSL 75 20 580
6 OSL 75 20 590
7 OSL 75 20 600
print (dfs[2])
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
8 CHANGE T 75 0 560
9 CHANGE T 80 0 560
10 CHANGE T 85 0 560
11 CHANGE T 90 0 560
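If it is useful to know which routine each chunk belongs to, the same grouping can return (name, sub-frame) pairs. This is only a sketch: the DataFrame is rebuilt by hand from the sample in the question rather than read from a file, and the names block_id and chunks are assumptions made for this example.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (not read from the real log).
df = pd.DataFrame({
    "ROUTINE": ["CHANGE T"] * 4 + ["OSL"] * 4 + ["CHANGE T"] * 4,
    "TEMPERATURE": [75, 80, 85, 90, 75, 75, 75, 75, 75, 80, 85, 90],
    "VOLTAGE": [0] * 4 + [20] * 4 + [0] * 4,
    "WAVELENGTH": [560] * 4 + [570, 580, 590, 600] + [560] * 4,
})

# One integer label per run of identical consecutive ROUTINE values.
block_id = df["ROUTINE"].ne(df["ROUTINE"].shift()).cumsum()

# (routine name, sub-frame) pairs; the labels increase with row position,
# so the groups come out in file order.
chunks = [(g["ROUTINE"].iloc[0], g.reset_index(drop=True))
          for _, g in df.groupby(block_id)]

for name, chunk in chunks:
    print(name, len(chunk))
```

The reset_index(drop=True) call is optional; drop it if the original row numbers of each chunk should be kept.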
The solution needs this extra step because if you use groupby on the first column alone, you get only 2 groups:
dfs = [g for i,g in df.groupby('ROUTINE')]
print (dfs)
[ ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
0 CHANGE T 75 0 560
1 CHANGE T 80 0 560
2 CHANGE T 85 0 560
3 CHANGE T 90 0 560
8 CHANGE T 75 0 560
9 CHANGE T 80 0 560
10 CHANGE T 85 0 560
11 CHANGE T 90 0 560, ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
4 OSL 75 20 570
5 OSL 75 20 580
6 OSL 75 20 590
7 OSL 75 20 600]
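The difference between the two groupings can be checked by counting the distinct keys each one uses. The sketch below also shows one possible vectorized way to get the boundary indices that the loop in the question collects into indexSplit; the names by_value, run_id, by_run and index_split are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the ROUTINE column from the question's sample.
routine = pd.Series(["CHANGE T"] * 4 + ["OSL"] * 4 + ["CHANGE T"] * 4,
                    name="ROUTINE")

# Grouping by value: one group per distinct routine name.
by_value = routine.nunique()

# Grouping by run: one group per block of identical consecutive values.
run_id = routine.ne(routine.shift()).cumsum()
by_run = run_id.nunique()

# Row positions where a new run starts, skipping row 0.
starts = routine.ne(routine.shift()).to_numpy()
index_split = np.flatnonzero(starts)[1:].tolist()

print(by_value, by_run, index_split)
```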