Using Python to remove certain rows in a CSV file
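Goal: find the average elapsed time between the GREEN and YELLOW statuses. First, I need to remove all unnecessary rows. To find the elapsed time, I need the first instance of GREEN, then the first instance of YELLOW, repeated over and over. Below is an excerpt from more than 100,000 rows.
In the example below, I want to keep rows 1, 2, 5, 6, 9, 13, 14, 15, 16 and 21.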
Row #  Serial Number  Time Stamp      Status
1      1400004        3/10/14 11:52   GREEN
2      1400004        3/15/14 11:45   YELLOW
3      1400004        3/29/14 7:59    YELLOW
4      1400004        4/16/14 15:59   YELLOW
5      1400004        5/10/14 8:18    GREEN
6      1400004        5/11/14 15:28   YELLOW
7      1400004        5/23/14 14:10   YELLOW
8      1400004        5/24/14 7:56    YELLOW
9      1400004        5/26/14 7:59    GREEN
10     1400004        5/28/14 8:26    GREEN
11     1400004        5/30/14 7:28    GREEN
12     1400004        6/1/14 16:56    GREEN
13     1400004        6/13/14 17:29   YELLOW
14     1400004        6/15/14 15:12   GREEN
15     1400004        6/17/14 8:57    YELLOW
16     1400007        1/3/14 11:55    GREEN
17     1400007        1/4/14 15:31    GREEN
18     1400007        1/15/14 14:44   GREEN
19     1400007        1/17/14 5:37    GREEN
20     1400007        1/18/14 5:35    GREEN
21     1400007        1/18/14 18:32   YELLOW
22     1400007        1/19/14 21:50   YELLOW
The following can be used to get the rows you are looking for:
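import csv
from itertools import groupby

# Python 3: open in text mode with newline='' as the csv module recommends
with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)  # skip the header row

    # groupby collects consecutive rows that share the same status (column 4)
    for k, g in groupby(csv_input, lambda x: x[4]):
        first_in_group = next(g)
        print(first_in_group[0])  # show first column entry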
This will display:
1
2
5
6
9
13
14
15
16
21
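Since the question asks to remove the unwanted rows rather than only print their numbers, the same grouping can feed a `csv.writer`. The following is a minimal sketch, assuming the five-field layout described in this answer. It groups on serial number plus status so that a run never spans two devices. The function name `keep_first_of_each_run` and the in-memory sample are illustrative and do not come from the original post.

```python
import csv
import io
from itertools import groupby

def keep_first_of_each_run(rows):
    """Yield the first row of every consecutive run sharing serial number and status."""
    # Assumed layout: row, serial, date, time, status
    for _, run in groupby(rows, key=lambda r: (r[1], r[4])):
        yield next(run)

# Small in-memory sample standing in for input.csv (hypothetical excerpt)
sample = io.StringIO(
    "Row,Serial,Date,Time,Status\n"
    "1,1400004,3/10/14,11:52,GREEN\n"
    "2,1400004,3/15/14,11:45,YELLOW\n"
    "3,1400004,3/29/14,7:59,YELLOW\n"
    "5,1400004,5/10/14,8:18,GREEN\n"
)
reader = csv.reader(sample)
header = next(reader)

# Write the header and the kept rows to a new CSV (here an in-memory buffer)
output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(header)
writer.writerows(keep_first_of_each_run(reader))

kept_rows = [line.split(",")[0] for line in output.getvalue().splitlines()[1:]]
print(kept_rows)
```

With a real file, the two `io.StringIO` objects would be replaced by files opened with `newline=''`.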
To expand on this, I would suggest the following approach:
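import csv
from datetime import datetime, timedelta
from itertools import groupby

# Python 3: open in text mode with newline='' as the csv module recommends
with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)  # skip the header row

    for k1, g1 in groupby(csv_input, lambda x: x[1]):  # group by serial number
        last = None
        entries = []

        for k, g in groupby(g1, lambda x: x[4]):  # group by status
            first = next(g)  # first row of each run of identical statuses
            start = datetime.strptime('{} {}'.format(first[2], first[3]), '%m/%d/%y %H:%M')

            if last:
                # time elapsed since the previous status change
                entries.append((first[0], k, start - last))
                # timedelta rejects a format spec in Python 3, so convert it to str first
                print('{:4} {:7} {:>20}'.format(first[0], k, str(start - last)))

            last = start

        if entries:  # a serial number with a single status has no transitions to average
            average_seconds = sum((t[2] for t in entries), timedelta()).total_seconds() / len(entries)
            # rounded to 7 decimal places to keep the printed average short
            print("Entries: {} Average mins: {}".format(len(entries), round(average_seconds / 60, 7)))
        print()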
This will display the following output for the given data:
2    YELLOW      4 days, 23:53:00
5    GREEN      55 days, 20:33:00
6    YELLOW        1 day, 7:10:00
9    GREEN      14 days, 16:31:00
13   YELLOW      18 days, 9:30:00
14   GREEN        1 day, 21:43:00
15   YELLOW       1 day, 17:45:00
Entries: 7 Average mins: 20340.7142857

21   YELLOW      15 days, 6:37:00
Entries: 1 Average mins: 21997.0
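Note that the average above mixes both directions of change (GREEN to YELLOW and YELLOW to GREEN). If only the GREEN to YELLOW time from the question is wanted, one option is to keep a transition only when the previous run was GREEN and the current one is YELLOW. This is a sketch under the same five-field assumption; the function name `green_to_yellow_deltas` and the sample rows are illustrative.

```python
from datetime import datetime, timedelta
from itertools import groupby

def green_to_yellow_deltas(rows):
    """Return elapsed times from the first GREEN of a run to the first YELLOW that follows it."""
    deltas = []
    # Group by serial number first so a transition never spans two devices
    for _, by_serial in groupby(rows, key=lambda r: r[1]):
        prev_status, prev_time = None, None
        for status, run in groupby(by_serial, key=lambda r: r[4]):
            first = next(run)
            stamp = datetime.strptime(first[2] + " " + first[3], "%m/%d/%y %H:%M")
            if prev_status == "GREEN" and status == "YELLOW":
                deltas.append(stamp - prev_time)
            prev_status, prev_time = status, stamp
    return deltas

# A few rows taken from the excerpt in the question
rows = [
    ["1", "1400004", "3/10/14", "11:52", "GREEN"],
    ["2", "1400004", "3/15/14", "11:45", "YELLOW"],
    ["3", "1400004", "3/29/14", "7:59", "YELLOW"],
    ["5", "1400004", "5/10/14", "8:18", "GREEN"],
    ["6", "1400004", "5/11/14", "15:28", "YELLOW"],
]
deltas = green_to_yellow_deltas(rows)
average = sum(deltas, timedelta()) / len(deltas)
print(average)  # 3 days, 3:31:30
```

The division assumes at least one GREEN to YELLOW transition exists; an empty `deltas` list would need a guard.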
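One issue is that your timestamps reset for each new serial number, so if you calculated the difference across two serial numbers you would get a large negative time. That is why the script groups by serial number first. Also, it is not clear whether your date and time are in one column or two. The script assumes two columns, giving five fields per data row, e.g.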
Row,#,Serial,Number,Time,Stamp,Status
1,1400004,3/10/14,11:52,GREEN
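If the date and time turn out to share a single column, only the indexing changes. The sketch below handles both layouts by counting fields. `parse_row` is a hypothetical helper, and the four-field layout is an assumption about what a one-column export might look like.

```python
from datetime import datetime

def parse_row(row):
    """Return (row_id, serial, timestamp, status) for either assumed layout."""
    if len(row) == 5:
        # row, serial, date, time, status
        row_id, serial, date, time, status = row
        stamp_text = date + " " + time
    elif len(row) == 4:
        # row, serial, "date time", status
        row_id, serial, stamp_text, status = row
    else:
        raise ValueError("unexpected number of fields: {}".format(len(row)))
    return row_id, serial, datetime.strptime(stamp_text, "%m/%d/%y %H:%M"), status

two_columns = parse_row(["1", "1400004", "3/10/14", "11:52", "GREEN"])
one_column = parse_row(["1", "1400004", "3/10/14 11:52", "GREEN"])
print(two_columns == one_column)  # True
```

With such a helper, the grouping code would read the serial number, timestamp and status from the returned tuple instead of indexing `x[1]` to `x[4]` directly.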