pandas replace only part of a column with datetime index
This is a follow-up to the question: pandas replace only part of a column
Here is my current input:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date
pd.set_option('expand_frame_repr', False)
df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 2, 1))
df['x'] = np.where (df['Open'] > df['High'].shift(-2), 1, np.nan)
print (df.round(2))
# this section of code works perfectly for an integer based index.......
ii = df[pd.notnull(df['x'])].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj
for ci in jj:
    df.loc[ci:ci+2,'x'] = 1.0
# end of section that works perfectly for an integer based index......
print (df.round(2))
Here is my current output:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 NaN
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 NaN
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 NaN
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
Traceback (most recent call last):
File "C:\stocks\question4 for stack overflow.py", line 15, in <module>
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
File "C:\stocks\question4 for stack overflow.py", line 15, in <listcomp>
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
TypeError: Cannot cast ufunc greater input from dtype('<m8[ns]') to dtype('<m8') with casting rule 'same_kind'
What I want to do is change column 'x' into non-overlapping sets of three consecutive 1s. The desired output is:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
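So January 5, 18 and 31 change from NaN to 1.0.
As the comments above say, the second part of the code works for an integer-based index. However, it does not work when the index is a datetime index with dtype datetime64[ns]. I think the second part of the code only needs a small tweak to make it work (hopefully).
Thank you, David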
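For readers hitting the same TypeError: with a datetime index, `np.diff(ii)` yields timedelta64 values, and those cannot be compared with the plain integer 2. One possible way around it, sketched below, is to work with integer row positions instead of date labels. The toy Series and the names `s`, `starts` and `filled` are made up for illustration (the dates are generic business days, not the real quotes), and the sketch assumes the runs in 'x' are never longer than three rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for column 'x' from the question (illustrative data only).
x = [1, 1, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1, 1,
     np.nan, np.nan, 1, 1, 1, np.nan, np.nan, 1, 1, np.nan, np.nan]
s = pd.Series(x, index=pd.date_range("2000-01-03", periods=len(x), freq="B"), name="x")

# Integer positions of the non-null rows, rather than their datetime labels.
ii = np.flatnonzero(s.notna().to_numpy())
dd = np.diff(ii)  # plain integers, so "> 2" is well defined

# Position of the first row of each block of 1s.
starts = [int(ii[0])] + [int(ii[i]) for i in range(1, len(ii)) if dd[i - 1] > 2]

filled = s.copy()
for ci in starts:
    # iloc slices exclude the end point, so ci:ci+3 covers three rows.
    filled.iloc[ci:ci + 3] = 1.0

print(filled)
```

On this toy data the rows at positions 2, 10 and 19 change from NaN to 1.0, which corresponds to January 5, 18 and 31 in the table above.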
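--------------------------Follow-up section------------------------------------
Thank you for sticking with me, b2002. Because it is so concise, I would really like to keep the top solution. When I run your code out of the box, this is the output:
Original output
... jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2] ...
... a[ci:ci+2] = 1.0 ...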
Open High Low Close Volume Adj Close x ii dd jj jj desired
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0 1
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0 1
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0 2 x x
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 1.0 3 1
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN 4 1
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN 5 1
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN 6 1
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN 7 1
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0 1
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0 1
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0 10 3 x x x
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 1.0 11 1
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0 1
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0 1
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0 1
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 1.0 15 4 z z
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 1.0 16 1
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0 1
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0 1
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0 19 3 x x x
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 1.0 20 1
I really want to understand what is happening, so I set up the columns ii, dd, jj before, jj after, and desired. When I adjust the input to:
... jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2] ...
... a[ci:ci+1] = 1.0 ...
This is the output:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.45 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.27 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.22 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.60 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.70 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.69 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.74 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 29.84 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.18 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.64 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.05 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.19 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.65 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.29 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.12 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.19 1.0
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.77 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 27.83 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.31 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.31 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.70 NaN
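The only problem is January 25, where np.diff gives a value of 4. I just need the code to skip the value 4 so that the existing three 1s are left alone. I tried modifying dd before it is used to build jj, and neither of these two attempts worked: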
dd[dd == 4] = 1
dd = [3 if x==4 else x for x in dd]
I also tried modifying the jj line like this:
jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
This gives the following error message:
Traceback (most recent call last):
File "C:\stocks\question4 for stack overflow.py", line 109, in <module>
jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
File "C:\stocks\question4 for stack overflow.py", line 109, in <listcomp>
jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
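The ValueError arises because `dd == 4` compares the whole array at once and returns an array of booleans, which `or` cannot reduce to a single True or False. A small sketch of the difference, using a made-up `dd` array; the element-wise condition shown is only one guess at what "skip the value 4" could mean, and whether it selects the rows you want is a separate matter:

```python
import numpy as np

dd = np.array([1, 7, 1, 4, 1])  # made-up gaps, for illustration only

# (dd == 4) is a whole boolean array, so asking for its truth value fails.
try:
    bool(dd == 4)
    ambiguous = False
except ValueError:
    ambiguous = True

# Testing one element at a time gives a plain boolean for each i.
picked = [i for i in range(1, len(dd) + 1)
          if dd[i - 1] > 2 and dd[i - 1] != 4]

print(ambiguous, picked)
```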
Does anyone have any ideas?
The code will work if it does not depend on the index:
#mod version
a = np.array(df.x)
ii = np.where(np.isnan(a))[0]
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj
for ci in jj:
    a[ci:ci+2] = 1.0
df.x = a
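I'm not sure whether the result is exactly what you are looking for...
The code below lets you search for specific patterns and then replace them with other defined patterns. The drawback is that it loops over the whole array several times, once per search pattern, which may or may not matter depending on the size of the data.
"Found" patterns are marked and are not included in subsequent search loops, which avoids overlapping results. The search is therefore done in priority order. Adjust the elements in the patterns and fills lists to change the rules.
I think the pattern rules below produce the output you wanted in your previous question, but they have only been lightly tested...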
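One detail that may matter when moving from `df.loc[ci:ci+2]` to `a[ci:ci+2]`: label slices with `.loc` include the end point, while positional slices on a NumPy array exclude it. A minimal illustration on a made-up Series with the default integer labels:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.zeros(6))   # default integer labels 0..5
a = np.zeros(6)              # plain array of the same length

s.loc[1:3] = 1.0             # label slice: rows 1, 2 and 3
a[1:3] = 1.0                 # positional slice: rows 1 and 2 only

print(s.tolist())
print(a.tolist())
```

So the same slice bounds touch three rows through `.loc` but only two through the array.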
# search patterns in original data (zeros represent nans)
p1 = [1., 1., 1.]
p2 = [1., 0., 1.]
p3 = [1., 1., 0.]
p4 = [1., 0., 0.]
# markers to 'set aside' found patterns (can be any list of floats > 1.0
# for each, the same float for each fill makes it easy to see which
# replacements were done where for testing...)
f1 = [5., 5., 5.]
f2 = [4., 4., 4.]
f3 = [3., 3., 3.]
f4 = [2., 2., 2.]
patterns = [p1, p2, p3, p4]
fills = [f1, f2, f3, f4]
def fill_segments(a, test_patterns, fill_patterns):
    # replace nans with zeros so fast numpy array_equal will work
    nan_idx = np.where(np.isnan(a))[0]
    np.put(a, nan_idx, 0.)
    col_index = list(np.arange(a.size))
    # loop forward through sequence comparing segment patterns
    for j in np.arange(len(test_patterns)):
        this_pattern = test_patterns[j]
        snip = len(this_pattern)
        rng = col_index[:-snip + 1]
        for i in rng:
            seg = a[col_index[i: i + snip]]
            if np.array_equal(seg, this_pattern):
                # when a match is found, replace values in array segment
                # with fill pattern
                pattern_indexes = col_index[i: i + snip]
                np.put(a, pattern_indexes, fill_patterns[j])
    # convert all fillers to ones
    np.put(a, np.where(a > 1.)[0], 1.)
    # convert zeros back to nans
    np.put(a, np.where(a == 0.)[0], np.nan)
    return a
Run the function and assign the result to the df.x column:
df.x = fill_segments(np.array(df.x), patterns, fills)
Input:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 NaN
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 NaN
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 NaN
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
Output:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
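---------------------Final answer / finally solved-----------
OK, this took a few weeks of part-time work and a couple of dozen hours, but I finally got it! I know this code is a blunt instrument, but it works. If anyone has suggestions for reducing the code or speeding it up, please let me know!
Here is the final input: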
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date
df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 6, 1))
df['x'] = np.where (df['Open'] < df['High'].shift(-2), 1, np.nan)
df['x2'] = df['x']
test = 0
# 'test' tracks the position inside the current run of three
for i in np.nditer(df['x2'], op_flags=['readwrite']):
    if test == 4:
        test = 0
    if test == 3:
        i[...] = 3
        test = 4
    if test == 2:
        i[...] = 2
        test = 3
    if (test == 1) & (i[...] == 1):
        i[...] = 1
        test = 2
    if (test == 0) & (i[...] == 1):
        i[...] = 1
        test = 2
    # NaN never compares equal to itself, so test with np.isnan
    if (test == 0) & np.isnan(i[...]):
        i[...] = np.nan
        test = 1
print (df.round(2))
Here is a section of the final output:
Open High Low Close Volume Adj Close x x2
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.45 NaN NaN
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.27 NaN NaN
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.22 1.0 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.60 1.0 2.0
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.70 1.0 3.0
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.69 1.0 1.0
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.74 1.0 2.0
2000-01-12 151.06 153.25 150.56 152.00 18342300 29.84 1.0 3.0
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.18 NaN NaN
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.64 NaN NaN
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.05 1.0 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.19 1.0 2.0
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.65 NaN 3.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.29 NaN NaN
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.12 NaN NaN
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.19 1.0 1.0
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.77 NaN 2.0
2000-01-27 141.56 141.75 137.06 141.75 19243500 27.83 NaN 3.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.31 NaN NaN
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.31 1.0 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.70 1.0 2.0
2000-02-02 137.12 137.62 134.06 134.06 21820200 26.32 1.0 3.0
2000-02-03 135.94 139.81 135.25 139.25 20232000 27.34 1.0 1.0
2000-02-04 141.00 143.12 140.50 141.56 18167100 27.79 NaN 2.0
2000-02-07 141.69 141.75 135.88 136.50 18285000 26.80 NaN 3.0
I changed the values in column x2 to show 1 - 3 rather than just 1, so I could see where a new series starts when the old one ends.
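In response to the request for a shorter version, here is one possible sketch that works on positions only, so the index type does not matter. The helper name `mark_runs_of_three` and its `run` parameter are made up for this example, and it assumes the rule is "start a new run of three at every 1 that is not already inside a run", which is what the x2 column above appears to show.

```python
import numpy as np

def mark_runs_of_three(x, run=3):
    """Hypothetical helper: number non-overlapping runs started by each 1."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    i = 0
    while i < len(x):
        if x[i] == 1:
            # Start a run here, truncated if it would pass the end of the data.
            stop = min(i + run, len(x))
            out[i:stop] = np.arange(1, stop - i + 1)
            i = stop          # jump past the run so runs never overlap
        else:
            i += 1
    return out

# Column 'x' from the section of final output shown above (25 rows).
n = np.nan
x = [n, n, 1, 1, 1, 1, 1, 1, n, n, 1, 1, n, n, n, 1, n, n, n, 1, 1, 1, 1, n, n]
x2 = mark_runs_of_three(x)
print(x2)
```

With a DataFrame this could be applied as `df['x2'] = mark_runs_of_three(df['x'].to_numpy())`. It is still a Python-level loop, so it is shorter rather than necessarily faster.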