简体   繁体   English

pandas仅使用datetime索引替换列的一部分

[英]pandas replace only part of a column with datetime index

This is a follow up question to this: pandas replace only part of a column 这是一个后续问题: pandas只替换列的一部分

Here is my current input: 这是我目前的输入:

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date

pd.set_option('expand_frame_repr', False)

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 2, 1))
df['x'] = np.where (df['Open'] > df['High'].shift(-2), 1, np.nan)
print (df.round(2))

# this section of code works perfectly for an integer based index.......
ii = df[pd.notnull(df['x'])].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    df.loc[ci:ci+2,'x'] = 1.0
# end of section that works perfectly for an integer based index......

print (df.round(2))

Here is my current output: 这是我目前的输出:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0 
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  NaN
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  NaN
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  NaN
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN
Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 15, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
  File "C:\stocks\question4 for stack overflow.py", line 15, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
TypeError: Cannot cast ufunc greater input from dtype('<m8[ns]') to dtype('<m8') with casting rule 'same_kind'

What I want to do is to change column 'x' to be a set of three 1's in a row, non-overlapping. 我想要做的是将列'x'更改为连续三个1的集合,不重叠。 The desired output is: 所需的输出是:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN

So, January 5, 18 and 31 change from NaN to 1.0. 因此,1月5日,18日和31日从NaN变为1.0。

As the comment above states, the second portion of the code works perfect for an integer based index. 正如上面的评论所述,代码的第二部分适用于基于整数的索引。 However it doesn't work when with the datetime index of dtype datetime64[ns]. 但是,当日期时间索引为dtype datetime64 [ns]时,它不起作用。 I think I need just a tiny tweak to the second portion of the code to get this to work (hopefully). 我想我需要对代码的第二部分进行微调,才能使其工作(希望如此)。

Thanks in advance, David 大卫先生,谢谢你

--------------------------follow up section ------------------------------------ --------------------------跟进部分--------------------- ---------------

Thanks for hanging in there with me b2002. 谢谢你和我在一起b2002。 i am really trying to keep with the top solution due to it's brevity. 由于它的简洁,我真的想要保持最佳解决方案。 when i run your code out of the box, here is the output: 当我开箱即用你的代码时,这是输出:

original output with 原始输出

...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]... ... jj = [ii [i] for i in range(1,len(ii))如果dd [i-1]> 2] ......

... a[ci:ci+2] = 1.0... ...... a [ci:ci + 2] = 1.0 ......

              Open    High     Low   Close    Volume  Adj Close    x  ii  dd  jj  jj  desired
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0  1
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0  1
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0  2          x    x
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  1.0  3   1  
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN  4   1
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN  5   1
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN  6   1
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN  7   1
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0  1
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0  1
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0  10  3   x  x    x
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  1.0  11  1
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0  1
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0  1
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0  1
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  1.0  15  4   z  z
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  1.0  16  1
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0  1
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0  1
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0  19  3   x  x    x
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  1.0  20  1              

I am really trying to understand what is going on so i set up columns ii, dd, jj before, jj after, and desired. 我真的想了解发生了什么,所以我设置了列ii,dd,jj之前,jj之后和期望。 when i tweak the input to: 当我将输入调整为:

...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]... ... jj = [ii [i] for i in range(1,len(ii))如果dd [i-1]> 2] ......

... a[ci:ci+1] = 1.0... ... a [ci:ci + 1] = 1.0 ......

here is the output: 这是输出:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  NaN

the only problem is with January 25th, where the np.diff is giving a value of 4. i just need the code to skip a value of 4 to leave the existing sets of three 1's alone. 唯一的问题是1月25日,其中np.diff给出的值为4.我只需要代码跳过值4就可以单独保留现有的三个1。 i tried to modify dd before it goes to jj with these two attempts which didn't work: 我尝试修改dd,然后转到jj,这两次尝试都没有用:

dd[dd == 4] = 1

dd = [3 if x==4 else x for x in dd]

also tried to modify the jj entry with this: 还尝试用这个修改jj条目:

jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))] jj = [ii [i] for i in range(1,len(ii))if((dd == 4)或(dd [i-1]> 2))]

which gives this error message: 这给出了以下错误消息:

Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 109, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
  File "C:\stocks\question4 for stack overflow.py", line 109, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
ValueError: The truth value of an array with more than one element is     ambiguous. Use a.any() or a.all()

anyone have any ideas? 有人有想法么?

The code will work if it doesn't depend on the index: 如果代码不依赖于索引,代码将起作用:

#mod version
a = np.array(df.x)
ii = np.where(np.isnan(a))[0]

dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    a[ci:ci+2] = 1.0
df.x = a

I'm not sure that the results are exactly what you are looking for though... 我不确定结果是否正是你所寻找的......

The code below allows you to search for specific patterns and then replace those patterns with other defined patterns. 下面的代码允许您搜索特定模式,然后用其他定义的模式替换这些模式。 Drawback is that is loops through the entire array several times depending on the number of search patterns, which may or may not matter depending on the size of your data. 缺点是在整个阵列中循环多次,具体取决于搜索模式的数量,根据数据的大小,搜索模式可能也可能不重要。

The 'found' patterns are marked and are not included in subsequent search loops which avoids overlapping results. “发现”模式已标记,并且不会包含在后续搜索循环中,从而避免重叠结果。 So, the searches are done in a priority fashion. 因此,搜索以优先级方式完成。 Adjust the elements in the patterns and fills lists to change the rules. 调整模式和填充列表中的元素以更改规则。

I think the pattern rules below produce the output desired per your previous question but it has only been tested lightly... 我认为下面的模式规则产生了你之前的问题所需的输出,但它只是经过了轻微的测试......

# search patterns in original data (zeros represent nans)
p1 = [1., 1., 1.]
p2 = [1., 0., 1.]
p3 = [1., 1., 0.]
p4 = [1., 0., 0.]

# markers to 'set aside' found patterns (can be any list of floats > 1.0 
# for each, the same float for each fill makes it easy to see which
# replacements were done where for testing...)
f1 = [5., 5., 5.]
f2 = [4., 4., 4.]
f3 = [3., 3., 3.]
f4 = [2., 2., 2.]

patterns = [p1, p2, p3, p4]
fills = [f1, f2, f3, f4]

def fill_segments(a, test_patterns, fill_patterns):
    # replace nans with zeros so fast numpy array_equal will work
    nan_idx = np.where(np.isnan(a))[0]
    np.put(a, nan_idx, 0.)
    col_index = list(np.arange(a.size))
    # loop forward through sequence comparing segment patterns
    for j in np.arange(len(test_patterns)):
        this_pattern = test_patterns[j]
        snip = len(this_pattern)
        rng = col_index[:-snip + 1]
        for i in rng:
            seg = a[col_index[i: i + snip]]
            if np.array_equal(seg, this_pattern):
                # when a match is found, replace values in array segment
                # with fill pattern
                pattern_indexes = col_index[i: i + snip]
                np.put(a, pattern_indexes, fill_patterns[j])
    # convert all fillers to ones
    np.put(a, np.where(a > 1.)[0], 1.)
    # convert zeros back to nans
    np.put(a, np.where(a == 0.)[0], np.nan)

    return a

run function and assign to df.x column 运行函数并分配给df.x列

df.x = fill_segments(np.array(df.x), patterns, fills)

Input: 输入:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800  29.68      1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400  28.49      1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800  28.44     NaN 
2000-01-06  143.12  146.94  142.63  145.67  19873200  28.82     NaN 
2000-01-07  148.00  151.88  147.00  151.31  20141400  29.94     NaN 
2000-01-10  152.69  154.06  151.12  151.25  15226500  29.93     NaN 
2000-01-11  151.00  152.69  150.62  151.50  15123000  29.98     NaN 
2000-01-12  151.06  153.25  150.56  152.00  18342300  30.08     NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500  30.42      1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300  29.88      1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700  29.29     NaN 
2000-01-19  146.50  150.94  146.25  148.72  14849700  29.43     NaN 
2000-01-20  149.06  149.75  142.63  145.94  30759000  28.88      1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400  28.52      1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100  27.33      1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500  27.41     NaN 
2000-01-26  140.50  142.19  138.88  141.44  15856800  27.99     NaN 
2000-01-27  141.56  141.75  137.06  141.75  19243500  28.05      1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700  26.52      1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700  26.52     NaN 
2000-02-01  134.25  137.00  134.00  136.00  27339000  26.91     NaN 

Output: 输出:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800  29.68      1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400  28.49      1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800  28.44      1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200  28.82     NaN 
2000-01-07  148.00  151.88  147.00  151.31  20141400  29.94     NaN 
2000-01-10  152.69  154.06  151.12  151.25  15226500  29.93     NaN 
2000-01-11  151.00  152.69  150.62  151.50  15123000  29.98     NaN 
2000-01-12  151.06  153.25  150.56  152.00  18342300  30.08     NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500  30.42      1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300  29.88      1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700  29.29      1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700  29.43     NaN 
2000-01-20  149.06  149.75  142.63  145.94  30759000  28.88      1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400  28.52      1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100  27.33      1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500  27.41     NaN 
2000-01-26  140.50  142.19  138.88  141.44  15856800  27.99     NaN 
2000-01-27  141.56  141.75  137.06  141.75  19243500  28.05      1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700  26.52      1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700  26.52      1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000  26.91     NaN 

--------------------- FINAL ANSWER / FINALLY SOLVED ----------- well, it's been a couple weeks of part time effort and several dozen hours but i finally got it! ---------------------最终答案/最终解决-----------好吧,这是几个星期的兼职工作和几个十几个小时,但我终于明白了! i know this code is a blunt instrument but it works. 我知道这个代码是一个钝器,但它的工作原理。 if anyone has suggestions on reducing the code or speeding it up, please let me know! 如果有人有减少代码或加快速度的建议,请告诉我!

here is the final input: 这是最后的输入:

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date 

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 6, 1))
df['x'] = np.where (df['Open'] < df['High'].shift(-2), 1, np.nan)
df['x2'] = df['x']

test = 0

for i in np.nditer(df['x2'], op_flags=['readwrite']):

    if test == 4:
        test = 0

    if test == 3:
        i[...] = 3
        test = 4

    if test == 2:
        i[...] = 2
        test = 3

    if (test == 1) & (i[...] == 1):
        i[...] = 1
        test = 2

    if (test == 0) & (i[...] == 1):
        i[...] = 1
        test = 2

    if (test == 0) & (i[...] == np.nan):
        i[...] = np.nan
        test = 1

print (df.round(2))

here is a section the final output: 这是最终输出的一节:

              Open    High     Low   Close    Volume  Adj Close    x   x2
Date                                                                     
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  NaN  NaN
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  NaN  NaN
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  1.0  2.0
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  1.0  3.0
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  1.0  1.0
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  1.0  2.0
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  1.0  3.0
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  NaN  NaN
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  NaN  NaN
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  1.0  2.0
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  NaN  3.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  NaN  NaN
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  NaN  NaN
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN  2.0
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  NaN  3.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  NaN  NaN
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  1.0  2.0
2000-02-02  137.12  137.62  134.06  134.06  21820200      26.32  1.0  3.0
2000-02-03  135.94  139.81  135.25  139.25  20232000      27.34  1.0  1.0
2000-02-04  141.00  143.12  140.50  141.56  18167100      27.79  NaN  2.0
2000-02-07  141.69  141.75  135.88  136.50  18285000      26.80  NaN  3.0

i altered the values in column x2 to display 1 - 3 instead of just 1 to see when a new series started at the end of an old series. 我将列x2中的值更改为显示1 - 3而不是仅显示1以查看新系列在旧系列结束时的开始时间。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM