[英]pandas replace only part of a column with datetime index
This is a follow up question to this: pandas replace only part of a column 这是一个后续问题: pandas只替换列的一部分
Here is my current input: 这是我目前的输入:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date
pd.set_option('expand_frame_repr', False)
df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 2, 1))
df['x'] = np.where (df['Open'] > df['High'].shift(-2), 1, np.nan)
print (df.round(2))
# this section of code works perfectly for an integer based index.......
ii = df[pd.notnull(df['x'])].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj
for ci in jj:
df.loc[ci:ci+2,'x'] = 1.0
# end of section that works perfectly for an integer based index......
print (df.round(2))
Here is my current output: 这是我目前的输出:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 NaN
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 NaN
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 NaN
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
Traceback (most recent call last):
File "C:\stocks\question4 for stack overflow.py", line 15, in <module>
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
File "C:\stocks\question4 for stack overflow.py", line 15, in <listcomp>
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
TypeError: Cannot cast ufunc greater input from dtype('<m8[ns]') to dtype('<m8') with casting rule 'same_kind'
What I want to do is to change column 'x' to be a set of three 1's in a row, non-overlapping. 我想要做的是将列'x'更改为连续三个1的集合,不重叠。 The desired output is: 所需的输出是:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
So, January 5, 18 and 31 change from NaN to 1.0. 因此,1月5日,18日和31日从NaN变为1.0。
As the comment above states, the second portion of the code works perfect for an integer based index. 正如上面的评论所述,代码的第二部分适用于基于整数的索引。 However it doesn't work when with the datetime index of dtype datetime64[ns]. 但是,当日期时间索引为dtype datetime64 [ns]时,它不起作用。 I think I need just a tiny tweak to the second portion of the code to get this to work (hopefully). 我想我需要对代码的第二部分进行微调,才能使其工作(希望如此)。
Thanks in advance, David 大卫先生,谢谢你
--------------------------follow up section ------------------------------------ --------------------------跟进部分--------------------- ---------------
Thanks for hanging in there with me b2002. 谢谢你和我在一起b2002。 i am really trying to keep with the top solution due to it's brevity. 由于它的简洁,我真的想要保持最佳解决方案。 when i run your code out of the box, here is the output: 当我开箱即用你的代码时,这是输出:
original output with 原始输出
...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]... ... jj = [ii [i] for i in range(1,len(ii))如果dd [i-1]> 2] ......
... a[ci:ci+2] = 1.0... ...... a [ci:ci + 2] = 1.0 ......
Open High Low Close Volume Adj Close x ii dd jj jj desired
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0 1
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0 1
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0 2 x x
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 1.0 3 1
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN 4 1
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN 5 1
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN 6 1
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN 7 1
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0 1
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0 1
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0 10 3 x x x
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 1.0 11 1
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0 1
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0 1
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0 1
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 1.0 15 4 z z
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 1.0 16 1
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0 1
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0 1
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0 19 3 x x x
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 1.0 20 1
I am really trying to understand what is going on so i set up columns ii, dd, jj before, jj after, and desired. 我真的想了解发生了什么,所以我设置了列ii,dd,jj之前,jj之后和期望。 when i tweak the input to: 当我将输入调整为:
...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]... ... jj = [ii [i] for i in range(1,len(ii))如果dd [i-1]> 2] ......
... a[ci:ci+1] = 1.0... ... a [ci:ci + 1] = 1.0 ......
here is the output: 这是输出:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.45 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.27 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.22 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.60 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.70 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.69 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.74 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 29.84 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.18 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.64 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.05 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.19 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.65 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.29 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.12 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.19 1.0
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.77 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 27.83 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.31 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.31 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.70 NaN
the only problem is with January 25th, where the np.diff is giving a value of 4. i just need the code to skip a value of 4 to leave the existing sets of three 1's alone. 唯一的问题是1月25日,其中np.diff给出的值为4.我只需要代码跳过值4就可以单独保留现有的三个1。 i tried to modify dd before it goes to jj with these two attempts which didn't work: 我尝试修改dd,然后转到jj,这两次尝试都没有用:
dd[dd == 4] = 1
dd = [3 if x==4 else x for x in dd]
also tried to modify the jj entry with this: 还尝试用这个修改jj条目:
jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))] jj = [ii [i] for i in range(1,len(ii))if((dd == 4)或(dd [i-1]> 2))]
which gives this error message: 这给出了以下错误消息:
Traceback (most recent call last):
File "C:\stocks\question4 for stack overflow.py", line 109, in <module>
jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
File "C:\stocks\question4 for stack overflow.py", line 109, in <listcomp>
jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
anyone have any ideas? 有人有想法么?
The code will work if it doesn't depend on the index: 如果代码不依赖于索引,代码将起作用:
#mod version
a = np.array(df.x)
ii = np.where(np.isnan(a))[0]
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj
for ci in jj:
a[ci:ci+2] = 1.0
df.x = a
I'm not sure that the results are exactly what you are looking for though... 我不确定结果是否正是你所寻找的......
The code below allows you to search for specific patterns and then replace those patterns with other defined patterns. 下面的代码允许您搜索特定模式,然后用其他定义的模式替换这些模式。 Drawback is that is loops through the entire array several times depending on the number of search patterns, which may or may not matter depending on the size of your data. 缺点是在整个阵列中循环多次,具体取决于搜索模式的数量,根据数据的大小,搜索模式可能也可能不重要。
The 'found' patterns are marked and are not included in subsequent search loops which avoids overlapping results. “发现”模式已标记,并且不会包含在后续搜索循环中,从而避免重叠结果。 So, the searches are done in a priority fashion. 因此,搜索以优先级方式完成。 Adjust the elements in the patterns and fills lists to change the rules. 调整模式和填充列表中的元素以更改规则。
I think the pattern rules below produce the output desired per your previous question but it has only been tested lightly... 我认为下面的模式规则产生了你之前的问题所需的输出,但它只是经过了轻微的测试......
# search patterns in original data (zeros represent nans)
p1 = [1., 1., 1.]
p2 = [1., 0., 1.]
p3 = [1., 1., 0.]
p4 = [1., 0., 0.]
# markers to 'set aside' found patterns (can be any list of floats > 1.0
# for each, the same float for each fill makes it easy to see which
# replacements were done where for testing...)
f1 = [5., 5., 5.]
f2 = [4., 4., 4.]
f3 = [3., 3., 3.]
f4 = [2., 2., 2.]
patterns = [p1, p2, p3, p4]
fills = [f1, f2, f3, f4]
def fill_segments(a, test_patterns, fill_patterns):
# replace nans with zeros so fast numpy array_equal will work
nan_idx = np.where(np.isnan(a))[0]
np.put(a, nan_idx, 0.)
col_index = list(np.arange(a.size))
# loop forward through sequence comparing segment patterns
for j in np.arange(len(test_patterns)):
this_pattern = test_patterns[j]
snip = len(this_pattern)
rng = col_index[:-snip + 1]
for i in rng:
seg = a[col_index[i: i + snip]]
if np.array_equal(seg, this_pattern):
# when a match is found, replace values in array segment
# with fill pattern
pattern_indexes = col_index[i: i + snip]
np.put(a, pattern_indexes, fill_patterns[j])
# convert all fillers to ones
np.put(a, np.where(a > 1.)[0], 1.)
# convert zeros back to nans
np.put(a, np.where(a == 0.)[0], np.nan)
return a
run function and assign to df.x column 运行函数并分配给df.x列
df.x = fill_segments(np.array(df.x), patterns, fills)
Input: 输入:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 NaN
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 NaN
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 NaN
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
Output: 输出:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
--------------------- FINAL ANSWER / FINALLY SOLVED ----------- well, it's been a couple weeks of part time effort and several dozen hours but i finally got it! ---------------------最终答案/最终解决-----------好吧,这是几个星期的兼职工作和几个十几个小时,但我终于明白了! i know this code is a blunt instrument but it works. 我知道这个代码是一个钝器,但它的工作原理。 if anyone has suggestions on reducing the code or speeding it up, please let me know! 如果有人有减少代码或加快速度的建议,请告诉我!
here is the final input: 这是最后的输入:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date
df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 6, 1))
df['x'] = np.where (df['Open'] < df['High'].shift(-2), 1, np.nan)
df['x2'] = df['x']
test = 0
for i in np.nditer(df['x2'], op_flags=['readwrite']):
if test == 4:
test = 0
if test == 3:
i[...] = 3
test = 4
if test == 2:
i[...] = 2
test = 3
if (test == 1) & (i[...] == 1):
i[...] = 1
test = 2
if (test == 0) & (i[...] == 1):
i[...] = 1
test = 2
if (test == 0) & (i[...] == np.nan):
i[...] = np.nan
test = 1
print (df.round(2))
here is a section the final output: 这是最终输出的一节:
Open High Low Close Volume Adj Close x x2
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.45 NaN NaN
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.27 NaN NaN
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.22 1.0 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.60 1.0 2.0
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.70 1.0 3.0
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.69 1.0 1.0
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.74 1.0 2.0
2000-01-12 151.06 153.25 150.56 152.00 18342300 29.84 1.0 3.0
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.18 NaN NaN
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.64 NaN NaN
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.05 1.0 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.19 1.0 2.0
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.65 NaN 3.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.29 NaN NaN
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.12 NaN NaN
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.19 1.0 1.0
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.77 NaN 2.0
2000-01-27 141.56 141.75 137.06 141.75 19243500 27.83 NaN 3.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.31 NaN NaN
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.31 1.0 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.70 1.0 2.0
2000-02-02 137.12 137.62 134.06 134.06 21820200 26.32 1.0 3.0
2000-02-03 135.94 139.81 135.25 139.25 20232000 27.34 1.0 1.0
2000-02-04 141.00 143.12 140.50 141.56 18167100 27.79 NaN 2.0
2000-02-07 141.69 141.75 135.88 136.50 18285000 26.80 NaN 3.0
i altered the values in column x2 to display 1 - 3 instead of just 1 to see when a new series started at the end of an old series. 我将列x2中的值更改为显示1 - 3而不是仅显示1以查看新系列在旧系列结束时的开始时间。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.