pandas replace only part of a column with datetime index

This is a follow-up question to: pandas replace only part of a column

Here is my current input:

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date

pd.set_option('expand_frame_repr', False)

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 2, 1))
df['x'] = np.where (df['Open'] > df['High'].shift(-2), 1, np.nan)
print (df.round(2))

# this section of code works perfectly for an integer based index.......
ii = df[pd.notnull(df['x'])].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    df.loc[ci:ci+2,'x'] = 1.0
# end of section that works perfectly for an integer based index......

print (df.round(2))

Here is my current output:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0 
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  NaN
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  NaN
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  NaN
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN
Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 15, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
  File "C:\stocks\question4 for stack overflow.py", line 15, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
TypeError: Cannot cast ufunc greater input from dtype('<m8[ns]') to dtype('<m8') with casting rule 'same_kind'

What I am trying to do is change column 'x' into non-overlapping sets of three consecutive 1s. The desired output is:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN

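So January 5, 18, and 31 change from NaN to 1.0.

As the comments in the code above note, the second part of the code works perfectly for an integer-based index. However, it does not work when the index is a datetime index with dtype datetime64[ns]. I think I need a minor tweak to the second part of the code to get it working (hopefully).

Thank you in advance, David

--------------------------Follow-up section------------------------------------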
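For context on the TypeError in the traceback above, here is a minimal sketch of what `np.diff` returns for a datetime index. The four dates are a hypothetical subset of the flagged rows, not the full download. The differences are durations rather than integers, so they need to be compared against a duration. Note that this counts calendar days, not rows, so weekends and holidays make it a different test from the integer-index version.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the flagged dates from the question
idx = pd.DatetimeIndex(["2000-01-03", "2000-01-04", "2000-01-13", "2000-01-14"])

dd = np.diff(idx.to_numpy())   # differences between consecutive timestamps
print(dd.dtype)                # a timedelta64 dtype, not an integer dtype

# Comparing against a duration is well defined; this is calendar days, not rows
gaps = dd > np.timedelta64(2, "D")
print(gaps)
```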

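Thank you for sticking with me, b2002. I really want to keep your top solution because it is so concise. When I run your code out of the box, this is the output:

Original output

... jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2] ...

... a[ci:ci+2] = 1.0 ...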

              Open    High     Low   Close    Volume  Adj Close    x  ii  dd  jj  jj  desired
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0  1
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0  1
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0  2          x    x
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  1.0  3   1  
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN  4   1
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN  5   1
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN  6   1
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN  7   1
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0  1
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0  1
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0  10  3   x  x    x
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  1.0  11  1
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0  1
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0  1
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0  1
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  1.0  15  4   z  z
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  1.0  16  1
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0  1
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0  1
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0  19  3   x  x    x
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  1.0  20  1              

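I really wanted to understand what was going on, so I set up the columns ii, dd, jj before, jj after, and desired. When I adjust the input to:

... jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2] ...

... a[ci:ci+1] = 1.0 ...

This is the output: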

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  NaN

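The only problem is January 25, where np.diff gives a value of 4. I just need the code to skip over the value 4 so that the existing three 1s are left alone. I tried modifying dd and then moving on to jj; neither of these attempts worked: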

dd[dd == 4] = 1

dd = [3 if x==4 else x for x in dd]

I also tried modifying the jj line like this:

jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]

This gives the following error message:

Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 109, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
  File "C:\stocks\question4 for stack overflow.py", line 109, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Does anyone have any ideas?

The code will work if it does not depend on the index:
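A note on the ValueError above: `dd == 4` compares the whole array and yields an array of booleans, which `or` cannot reduce to a single True/False. Indexing first (`dd[i-1]`) gives a scalar test per element. The sketch below uses the NaN positions from the table in this follow-up as `ii`. The "skip 4" condition is an assumption about the stated goal, not a tested fix for the full problem.

```python
import numpy as np

# 0-based positions of the NaN rows in the follow-up table (taken as given)
ii = np.array([2, 3, 4, 5, 6, 7, 10, 11, 15, 16, 19, 20])
dd = np.diff(ii)

# Scalar comparisons per element: keep gaps larger than 2, but skip a gap of 4
jj = [ii[i] for i in range(1, len(ii)) if dd[i - 1] > 2 and dd[i - 1] != 4]
print(jj)
```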

#mod version
a = np.array(df.x)
ii = np.where(np.isnan(a))[0]

dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    a[ci:ci+2] = 1.0
df.x = a

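I'm not sure whether the result is exactly what you are looking for...

The code below lets you search for specific patterns and then replace them with other defined patterns. The downside is that it loops through the whole array multiple times, depending on the number of search patterns, which may or may not matter depending on the size of the data.

"Found" patterns are marked and are not included in subsequent search loops, which avoids overlapping results. The search is therefore done in priority order. Adjust the elements in the patterns and fills lists to change the rules.

I think the pattern rules below produce the desired output from your previous question, but they have only been lightly tested...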
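As a sketch of the same index-independent idea, the version below works on integer positions and writes back with `iloc`. It differs from the snippet above in two ways that are assumptions based on the question's integer-index code: it selects the non-NaN rows (`notna`) rather than the NaN rows, and it fills an end-exclusive slice of width 3. The DataFrame is a synthetic stand-in with a business-day index, so the dates do not match the Yahoo data exactly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'x' column in the question (hypothetical data)
idx = pd.bdate_range("2000-01-03", periods=21)
nan = np.nan
x = [1, 1, nan, nan, nan, nan, nan, nan, 1, 1,
     nan, nan, 1, 1, 1, nan, nan, 1, 1, nan, nan]
df = pd.DataFrame({"x": x}, index=idx)

# Integer positions of the non-NaN rows, independent of the index dtype
ii = np.flatnonzero(df["x"].notna().to_numpy())
dd = np.diff(ii)  # plain integers, so "> 2" is well defined

# A run starts at the first flagged row and after every gap larger than 2
starts = [ii[0]] + [ii[i] for i in range(1, len(ii)) if dd[i - 1] > 2]

col = df.columns.get_loc("x")
for s in starts:
    # iloc slices exclude the end point, so s:s+3 covers three rows
    df.iloc[s:s + 3, col] = 1.0

print(df)
```

On this input the filled rows are the third row of each group, which corresponds to the January 5, 18, and 31 changes described in the question.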

# search patterns in original data (zeros represent nans)
p1 = [1., 1., 1.]
p2 = [1., 0., 1.]
p3 = [1., 1., 0.]
p4 = [1., 0., 0.]

# markers to 'set aside' found patterns (can be any list of floats > 1.0 
# for each, the same float for each fill makes it easy to see which
# replacements were done where for testing...)
f1 = [5., 5., 5.]
f2 = [4., 4., 4.]
f3 = [3., 3., 3.]
f4 = [2., 2., 2.]

patterns = [p1, p2, p3, p4]
fills = [f1, f2, f3, f4]

def fill_segments(a, test_patterns, fill_patterns):
    # replace nans with zeros so fast numpy array_equal will work
    nan_idx = np.where(np.isnan(a))[0]
    np.put(a, nan_idx, 0.)
    col_index = list(np.arange(a.size))
    # loop forward through sequence comparing segment patterns
    for j in np.arange(len(test_patterns)):
        this_pattern = test_patterns[j]
        snip = len(this_pattern)
        rng = col_index[:-snip + 1]
        for i in rng:
            seg = a[col_index[i: i + snip]]
            if np.array_equal(seg, this_pattern):
                # when a match is found, replace values in array segment
                # with fill pattern
                pattern_indexes = col_index[i: i + snip]
                np.put(a, pattern_indexes, fill_patterns[j])
    # convert all fillers to ones
    np.put(a, np.where(a > 1.)[0], 1.)
    # convert zeros back to nans
    np.put(a, np.where(a == 0.)[0], np.nan)

    return a

Run the function and assign the result to the df.x column:

df.x = fill_segments(np.array(df.x), patterns, fills)

Input:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800  29.68      1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400  28.49      1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800  28.44     NaN 
2000-01-06  143.12  146.94  142.63  145.67  19873200  28.82     NaN 
2000-01-07  148.00  151.88  147.00  151.31  20141400  29.94     NaN 
2000-01-10  152.69  154.06  151.12  151.25  15226500  29.93     NaN 
2000-01-11  151.00  152.69  150.62  151.50  15123000  29.98     NaN 
2000-01-12  151.06  153.25  150.56  152.00  18342300  30.08     NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500  30.42      1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300  29.88      1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700  29.29     NaN 
2000-01-19  146.50  150.94  146.25  148.72  14849700  29.43     NaN 
2000-01-20  149.06  149.75  142.63  145.94  30759000  28.88      1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400  28.52      1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100  27.33      1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500  27.41     NaN 
2000-01-26  140.50  142.19  138.88  141.44  15856800  27.99     NaN 
2000-01-27  141.56  141.75  137.06  141.75  19243500  28.05      1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700  26.52      1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700  26.52     NaN 
2000-02-01  134.25  137.00  134.00  136.00  27339000  26.91     NaN 

Output:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800  29.68      1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400  28.49      1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800  28.44      1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200  28.82     NaN 
2000-01-07  148.00  151.88  147.00  151.31  20141400  29.94     NaN 
2000-01-10  152.69  154.06  151.12  151.25  15226500  29.93     NaN 
2000-01-11  151.00  152.69  150.62  151.50  15123000  29.98     NaN 
2000-01-12  151.06  153.25  150.56  152.00  18342300  30.08     NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500  30.42      1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300  29.88      1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700  29.29      1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700  29.43     NaN 
2000-01-20  149.06  149.75  142.63  145.94  30759000  28.88      1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400  28.52      1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100  27.33      1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500  27.41     NaN 
2000-01-26  140.50  142.19  138.88  141.44  15856800  27.99     NaN 
2000-01-27  141.56  141.75  137.06  141.75  19243500  28.05      1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700  26.52      1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700  26.52      1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000  26.91     NaN 
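To illustrate why the answer above marks found patterns before searching for the next one, here is a small sketch on a hypothetical eight-element array. `find_pattern` is an illustrative helper, not part of the answer's code, and it uses NumPy's `sliding_window_view` (NumPy 1.20 or later) instead of the explicit loop.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def find_pattern(a, pattern):
    """Return every start position where `pattern` occurs in `a`."""
    windows = sliding_window_view(a, len(pattern))
    return np.flatnonzero((windows == np.asarray(pattern)).all(axis=1))

# Hypothetical data; zeros stand in for NaN as in the answer above
a = np.array([1., 1., 0., 0., 1., 1., 1., 0.])

hits_111 = find_pattern(a, [1., 1., 1.])         # highest-priority pattern
hits_110_before = find_pattern(a, [1., 1., 0.])  # overlaps the 1,1,1 match

# Mark the higher-priority match so later patterns cannot reuse those cells
for start in hits_111:
    a[start:start + 3] = 5.
hits_110_after = find_pattern(a, [1., 1., 0.])

print(hits_111, hits_110_before, hits_110_after)
```

Without the marking step, the lower-priority pattern also matches inside the run of three 1s; after marking, only the non-overlapping match remains.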

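---------------------Final answer / finally solved---------------------

Well, it took several weeks of part-time work and a few dozen hours, but I finally figured it out! I know this code is a blunt instrument, but it works. If anyone has suggestions for reducing the code or speeding it up, please let me know!

Here is the final input: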

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date 

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 6, 1))
df['x'] = np.where (df['Open'] < df['High'].shift(-2), 1, np.nan)
df['x2'] = df['x']

test = 0

for i in np.nditer(df['x2'], op_flags=['readwrite']):

    if test == 4:
        test = 0

    if test == 3:
        i[...] = 3
        test = 4

    if test == 2:
        i[...] = 2
        test = 3

    if (test == 1) & (i[...] == 1):
        i[...] = 1
        test = 2

    if (test == 0) & (i[...] == 1):
        i[...] = 1
        test = 2

    # NaN never compares equal to anything, so test with np.isnan instead of == np.nan
    if (test == 0) & np.isnan(i[...]):
        i[...] = np.nan
        test = 1

print (df.round(2))

Here is a section of the final output:

              Open    High     Low   Close    Volume  Adj Close    x   x2
Date                                                                     
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  NaN  NaN
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  NaN  NaN
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  1.0  2.0
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  1.0  3.0
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  1.0  1.0
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  1.0  2.0
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  1.0  3.0
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  NaN  NaN
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  NaN  NaN
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  1.0  2.0
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  NaN  3.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  NaN  NaN
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  NaN  NaN
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN  2.0
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  NaN  3.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  NaN  NaN
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  1.0  2.0
2000-02-02  137.12  137.62  134.06  134.06  21820200      26.32  1.0  3.0
2000-02-03  135.94  139.81  135.25  139.25  20232000      27.34  1.0  1.0
2000-02-04  141.00  143.12  140.50  141.56  18167100      27.79  NaN  2.0
2000-02-07  141.69  141.75  135.88  136.50  18285000      26.80  NaN  3.0

I changed the values in column x2 to show 1 through 3 instead of just 1, so that it is visible where a new series starts when the old one ends.
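Since the post asks for ways to shorten the loop, here is one possible sketch of the same state machine as a plain positional scan. `mark_runs` and its `width` parameter are hypothetical names, and the input is the `x` column from the output section above, typed in by hand. It is only checked against that excerpt.

```python
import numpy as np

def mark_runs(x, width=3):
    """Label non-overlapping runs of `width` rows, each starting at a 1."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.size, np.nan)
    i = 0
    while i < x.size:
        if x[i] == 1:  # NaN compares unequal, so NaN rows fall through
            stop = min(i + width, x.size)
            out[i:stop] = np.arange(1, stop - i + 1)  # 1, 2, 3 within the run
            i = stop  # skip the rows just filled so runs never overlap
        else:
            i += 1
    return out

nan = np.nan
# The 'x' column from the output excerpt above
x = [nan, nan, 1, 1, 1, 1, 1, 1, nan, nan, 1, 1, nan, nan, nan,
     1, nan, nan, nan, 1, 1, 1, 1, nan, nan]
x2 = mark_runs(x)
print(x2)
```

With a DataFrame this could be applied as `df['x2'] = mark_runs(df['x'].to_numpy())`, which avoids `np.nditer` and works regardless of the index type.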
