Dummy variable if the same value appears at a later date in Python (pandas)

Question

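I have a df that contains several successive dates (yyyy-mm-dd) for each name, along with other columns. I want to create a dummy variable in a new column Rep that indicates whether the same name appears again at a later date. I considered looping over the Name and Date columns so that each name's rows with the latest date get 0 and all of its other rows get 1. I also tried duplicated, but because the same Name can occur several times on the same Date, that method does not give the target output.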

df:

Name    Date
A       2006-01-01
B       2006-01-02
A       2006-01-04
A       2006-01-04
B       2006-01-08

Desired df:

Name    Date           Rep
A       2006-01-01     1
B       2006-01-02     1
A       2006-01-04     0
A       2006-01-04     0
B       2006-01-08     0

Code using the duplicated method:

df = df.sort_values(by=["Name", "Date"])
df["Rep"] = df.duplicated(subset=["Name", "Date"], keep = "last")

Result obtained:

Name    Date           Rep
A       2006-01-01     1
B       2006-01-02     1
A       2006-01-04     1 # this should be 0!
A       2006-01-04     0
B       2006-01-08     0

As requested, a sample of one of the csv files:

Name;Date;Name_Parent;Amount_Est
A;2006-01-01;3;646,200.00
B;2006-01-02;2;25,000,000.00
A;2006-01-04;3;18,759,000.00
A;2006-01-04;5;18,759,000.00
C;2006-01-04;4;18,759,000.00
B;2006-01-08;6;945,000.00
C;2006-01-09;2;945,000.00
A;2006-01-10;4;945,000.00
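
For anyone who wants to reproduce the problem, the sample above can be loaded without touching the file system. This is only a sketch: `sample` and `df_sample` are names made up here, and treating `,` as a thousands separator and parsing Date as a datetime are assumptions about how the columns are meant to be read.

```python
import io

import pandas as pd

# The sample rows from the question, pasted in as text
sample = """Name;Date;Name_Parent;Amount_Est
A;2006-01-01;3;646,200.00
B;2006-01-02;2;25,000,000.00
A;2006-01-04;3;18,759,000.00
A;2006-01-04;5;18,759,000.00
C;2006-01-04;4;18,759,000.00
B;2006-01-08;6;945,000.00
C;2006-01-09;2;945,000.00
A;2006-01-10;4;945,000.00
"""

df_sample = pd.read_csv(
    io.StringIO(sample),
    sep=";",
    thousands=",",          # assumption: commas in Amount_Est are thousands separators
    parse_dates=["Date"],   # assumption: Date should be a real datetime column
)
print(df_sample.dtypes)
```

With real datetimes in Date, later comparisons do not depend on the strings happening to sort correctly.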

To create df I used pandas. Because I have 40 individual csv files, I used a loop:

import pandas as pd
import glob2 as glob

# import and merge data
path = r'/Users/...'
all = glob.glob(path + "/*.csv")

l = []

for f in all:
    df1 = pd.read_csv(f, sep =";", index_col = None, header = 0)
    df1 = df1.drop(df1.index[0])
    l.append(df1)

df = pd.concat(l, axis = 0)
del f, all, df1, l, path
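
A possible tidier variant of the loop above, offered only as a sketch: it uses the standard-library glob instead of glob2, avoids shadowing the built-in `all`, and resets the index of the combined frame. `load_csv_folder` is a hypothetical helper name. Unlike the loop above, it does not drop the first data row of each file, so add that step back if it is intentional.

```python
import glob
import os

import pandas as pd


def load_csv_folder(folder):
    """Read every *.csv file in `folder` and stack them into one DataFrame."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(f, sep=";", header=0) for f in files]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, axis=0, ignore_index=True)
```

Called as `df = load_csv_folder(path)`, this returns one frame with a unique index, which keeps later row-wise assignments unambiguous.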

Thanks for your help!

Answer 1

Here is the code:

import glob
import os
from shutil import copyfile

import pandas as pd


def file_len(fname):
    # Count the lines in the file.
    with open(fname) as fp:
        for i, line in enumerate(fp):
            pass
    return i + 1


def read_nth(fname, intNth):
    # Return the intNth line of the file (1-based).
    with open(fname) as fp:
        for i, line in enumerate(fp):
            if i == (intNth - 1):
                return line


def showRepetitions(fname):
    # Rewrite the file in place with an extra column that counts how many
    # times each row's Name appears again further down the file.
    temp8 = []  # output lines, collected bottom-up
    temp3 = []  # names seen so far while walking upwards
    for temp1 in range(file_len(fname), -1, -1):
        if "Name;Date;Name_Parent;Amount_Est" in read_nth(fname, temp1):
            temp8.append("Name;Date;Name_Parent;Amount_Est;Repeats_X_More_Times\n")
            break
        temp2 = read_nth(fname, temp1)
        temp8.append(temp2.strip() + ";" + str(temp3.count(temp2.split(";")[0])) + "\n")
        temp3.append(temp2.split(";")[0])
    with open(fname, "w") as f:
        for temp9 in reversed(temp8):
            f.write(temp9)


path = r'C:\Users\USERname4\Desktop'
all = glob.glob(path + r"\*.csv")
l = []
for f in all:
    # Work on a temporary .txt copy so the original csv stays untouched.
    f2 = f[:-3] + "txt"
    copyfile(f, f2)
    showRepetitions(f2)
    df1 = pd.read_csv(f2, sep=";", index_col=None, header=0)
    os.remove(f2)
    l.append(df1)
df = pd.concat(l, axis=0)
print(df)
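
The per-row count that showRepetitions writes can probably also be obtained inside pandas, without rewriting the files. A minimal sketch, assuming the rows are already in file order and using the Name and Date values from the question's csv sample:

```python
import pandas as pd

# Name and Date columns of the csv sample from the question, in file order
df = pd.DataFrame({
    "Name": ["A", "B", "A", "A", "C", "B", "C", "A"],
    "Date": ["2006-01-01", "2006-01-02", "2006-01-04", "2006-01-04",
             "2006-01-04", "2006-01-08", "2006-01-09", "2006-01-10"],
})

# cumcount(ascending=False) numbers each group's rows from len(group) - 1 down to 0,
# which is the number of rows with the same Name that come later in the frame
df["Repeats_X_More_Times"] = df.groupby("Name").cumcount(ascending=False)
print(df)
```

Note that this counts by row position, not by date, just like the code above: of two rows with the same Name on the same latest Date, the first one still gets a count of 1, which is not the Rep value asked for in the question.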

Answer 2

Solved the problem. Maybe this will help someone in the future:

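In a new df, df_max, I extracted from df each name together with its latest date, because no further entries follow the respective latest date (dummy = 0). I then kept only the columns of df_max that are needed for the merge. Next, in a new column Rep, I set every value to 0. After merging the two dfs df and df_max on the columns Name and Date into df_new, all latest entries have Rep filled with 0, regardless of how often the Name and Date combination occurs. Finally, I filled the NaN values in Rep with 1.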

df = df.sort_values(by=["Name", "Date"])

# One row per Name holding its latest Date
df_max = df.sort_values("Date").groupby("Name").last().reset_index()
df_max = df_max[["Name", "Date"]]
df_max["Rep"] = 0

# Rows matching a name's latest date get Rep = 0; all others get NaN, then 1
df_new = pd.merge(df, df_max, how="left", on=["Name", "Date"])
df_new["Rep"] = df_new["Rep"].fillna(1).astype(int)
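
The same rule (0 for every row on a name's latest date, 1 otherwise) can likely be expressed without the helper frame and the merge. A minimal sketch on the small df from the question; `latest` is a name introduced here for illustration:

```python
import pandas as pd

# The example df from the question
df = pd.DataFrame({
    "Name": ["A", "B", "A", "A", "B"],
    "Date": ["2006-01-01", "2006-01-02", "2006-01-04", "2006-01-04", "2006-01-08"],
})
df["Date"] = pd.to_datetime(df["Date"])

# For each row, the latest date on which that row's Name occurs
latest = df.groupby("Name")["Date"].transform("max")

# 1 if the name shows up again on a later date, 0 if this row is on the latest date
df["Rep"] = (df["Date"] < latest).astype(int)
print(df)
```

Because the comparison is made per row against the group maximum, rows that share the same latest Date all get 0, which is the case that duplicated could not handle.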

Solution 1
0 2021-06-21 10:42:15

Solution 2
0 Accepted 2021-06-21 19:20:50
