Pandas Data Cleaning
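I'm reading tables from a PDF into a pandas DataFrame, but I'm fairly new to pandas and find the documentation daunting to navigate. I'm sure there is a fairly easy way to do what I need, but I just can't figure out how.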
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 NaN col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 NaN
1 NaN Location Date NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN measure1 1** 40** 30** 20** 20 0.02** 3** 10** 5** 100** 15** NaN
3 NaN measure2 100 400 300 200 200 2 300 100 50 1,000 150 NaN
4 NaN location1 1/15/1994 5900 28000 7600 25000 150 --- --- --- --- --- ---
5 NaN NaN 3/16/1994 4900 12000 4400 11000 60 --- --- --- --- --- ---
6 NaN NaN 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
7 NaN NaN 4/12/2004 8400 34000 4600 17000 <1000 --- --- --- --- --- ---
8 NaN NaN 7/28/2008 3200 15400 4430 17100 172 I --- --- --- --- --- ---
9 NaN NaN 5/19/2011 2000 11000 2500 9200 0.2 1 --- --- --- --- --- ---
10 NaN NaN 8/6/2013 2700 20000 5300 20000 2 6 --- --- --- --- --- ---
11 NaN NaN 11/13/2013 2600 14000 5400 20000 0.1 3 --- --- --- --- --- ---
12 NaN NaN 2/5/2014 3200 19000 6400 25000 18 0 --- --- --- --- --- ---
13 NaN NaN 5/7/2014 2000 15000 4100 16000 22 0 --- --- --- --- --- ---
14 NaN NaN 12/18/2014 2500 32000 5200 20000 8 8 --- --- --- --- --- ---
15 NaN NaN 6/4/2015 1700 15000 5200 21000 44 0 --- --- --- --- --- ---
16 NaN NaN 1/20/2017 1400 15,000 6,300 21,000 1 2 --- --- --- --- --- ---
17 NaN location2 1/15/1994 210 290 39 180 69 --- --- --- --- --- ---
18 NaN NaN 3/24/1994 1500 12000 4100 18000 400 0 --- --- --- --- --- ---
19 NaN NaN 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
20 NaN NaN 2/1/2000 <1000 8900 5200 58000 <10000 --- --- --- --- --- ---
21 NaN NaN 4/12/2004 <5.0 42 78 540 150 --- --- --- --- --- ---
22 NaN NaN 7/28/2008 23.3 27.9 28 409 9.34 --- --- --- --- --- ---
23 NaN NaN 5/19/2011 1.8 12 22 170 0.2 1 --- --- --- --- --- ---
24 NaN NaN 8/6/2013 4.3 23 71 590 0.1 3 --- --- --- --- --- ---
25 NaN NaN 1/19/2017 0.21 I 0.26 I 7.7 42 0.2 4 --- --- --- --- --- ---
26 NaN location3 3/21/1994 <1 <1 <1 <1 <8 --- --- --- --- --- ---
27 2/1/2000 <1 <1 <1 <2 <10 --- --- --- --- --- --- NaN NaN
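So there are three main problems I need to deal with.
First: the last row is out of line with the others. I need to shift all the values in that misaligned row two columns to the right so that the dates line up. This also means the first column should not exist.
Second: because of the awkward way these tables are laid out in the PDF, other things are messed up too. The Date column should contain only dates. I need to somehow move over all the rows that don't hold a date in the Date column, or shift the "Date" label down.
Last: the location NaNs. All the NaN values under each location actually belong to that same location, so I need to fill those in somehow.
So my desired output would look like this...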
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0
1 Location Date col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11
2 measure1 NaN 1** 40** 30** 20** 20 0.02** 3** 10** 5** 100** 15**
3 measure2 NaN 100 400 300 200 200 2 300 100 50 1,000 150
4 location1 1/15/1994 5900 28000 7600 25000 150 --- --- --- --- --- ---
5 location1 3/16/1994 4900 12000 4400 11000 60 --- --- --- --- --- ---
6 location1 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
7 location1 4/12/2004 8400 34000 4600 17000 <1000 --- --- --- --- --- ---
8 location1 7/28/2008 3200 15400 4430 17100 172 I --- --- --- --- --- ---
9 location1 5/19/2011 2000 11000 2500 9200 0.2 1 --- --- --- --- --- ---
10 location1 8/6/2013 2700 20000 5300 20000 2 6 --- --- --- --- --- ---
11 location1 11/13/2013 2600 14000 5400 20000 0.1 3 --- --- --- --- --- ---
12 location1 2/5/2014 3200 19000 6400 25000 18 0 --- --- --- --- --- ---
13 location1 5/7/2014 2000 15000 4100 16000 22 0 --- --- --- --- --- ---
14 location1 12/18/2014 2500 32000 5200 20000 8 8 --- --- --- --- --- ---
15 location1 6/4/2015 1700 15000 5200 21000 44 0 --- --- --- --- --- ---
16 location1 1/20/2017 1400 15,000 6,300 21,000 1 2 --- --- --- --- --- ---
17 location2 1/15/1994 210 290 39 180 69 --- --- --- --- --- ---
18 location2 3/24/1994 1500 12000 4100 18000 400 0 --- --- --- --- --- ---
19 location2 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
20 location2 2/1/2000 <1000 8900 5200 58000 <10000 --- --- --- --- --- ---
21 location2 4/12/2004 <5.0 42 78 540 150 --- --- --- --- --- ---
22 location2 7/28/2008 23.3 27.9 28 409 9.34 --- --- --- --- --- ---
23 location2 5/19/2011 1.8 12 22 170 0.2 1 --- --- --- --- --- ---
24 location2 8/6/2013 4.3 23 71 590 0.1 3 --- --- --- --- --- ---
25 location2 1/19/2017 0.21 I 0.26 I 7.7 42 0.2 4 --- --- --- --- --- ---
26 location3 3/21/1994 <1 <1 <1 <1 <8 --- --- --- --- --- ---
27 location3 2/1/2000 <1 <1 <1 <2 <10 --- --- --- --- --- ---
For the first point, you can try the following:
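# Transpose so the misaligned last row becomes the last column.
df = df.T
# Shift by 2 (the question asks for two columns) so the date lands in column 2.
df.iloc[:, -1] = df.iloc[:, -1].shift(2)
df = df.T
# Column 0 is now empty in every row, so drop it.
df = df.drop(df.columns[0], axis=1)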
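The same realignment can also be done without transposing, by shifting the last row directly. Below is a minimal sketch on a made-up two-row frame; the sample values and the integer column labels are assumptions for illustration, not the real table.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) mimicking the misaligned last row:
# the final row starts in column 0 instead of column 2.
df = pd.DataFrame(
    [
        [np.nan, "location3", "3/21/1994", "<1", "<8"],
        ["2/1/2000", "<1", "<10", np.nan, np.nan],
    ]
)

# Shift only the last row two positions to the right; the vacated
# cells become missing and the two trailing NaN cells fall off the end.
df.iloc[-1] = df.iloc[-1].shift(2)

# Column 0 is now empty in every row, so drop it.
df = df.drop(columns=df.columns[0])

print(df)
```

Note that the shifted row still has no location in column 1; that gap is the kind of missing value a forward fill is meant to handle.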
For the last point:
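# Forward-fill the location column. The labels shown in the question look like integers;
# use df['1'] instead if your column labels are strings.
df[1] = df[1].ffill()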
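To see what the forward fill does, here is a small self-contained sketch. The frame is a hypothetical miniature of the Location and Date columns, and labelling them with the integers 1 and 2 is an assumption; whether your labels are integers or strings depends on how the table was read, and `df.columns` will show which.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Location/Date columns, labelled 1 and 2.
df = pd.DataFrame(
    {
        1: ["location1", np.nan, np.nan, "location2", np.nan],
        2: ["1/15/1994", "3/16/1994", "1/4/1995", "1/15/1994", "3/24/1994"],
    }
)

# Forward-fill: each missing cell takes the most recent non-missing value above it.
df[1] = df[1].ffill()

print(df[1].tolist())
```

Each blank under a location is replaced by that location's name, and the fill stops as soon as the next location name appears.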