Find the latest datetime for each date in a dataframe PANDAS
I have a folder on my computer that contains ~8,500 .csv files, each named after a stock ticker. Within each .csv file there is a 'timestamp' and a 'users_holding' column. I have the 'timestamp' column set up as a datetime index; the entries in that column are hourly readings for each day, e.g. 2019-12-01 01:50, 2020-01-01 02:55... 2020-01-01 01:45, etc. Each of those timestamps has a corresponding integer representing the number of users holding at that time.
I want to create a for loop that iterates through all of the .csv files and tallies up the total users holding across all .csv files at the latest time of each day, starting on February 1st, 2020 (2020-02-01) and running through the last day in each .csv file. The folder updates daily, so I can't really have an end date.
This is the for loop I have set up to establish each ticker as a dataframe:
import glob
import pandas as pd

path = 'C:\\Users\\N****\\Desktop\\r******\\t**\\p*********\\'
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
    df = pd.read_csv(filename, header=0, parse_dates=['timestamp'], index_col='timestamp')
If anyone could show me how to write the for loop that finds the latest entry for each date and tallies up that number for each day, that would be amazing.
Thank you!
First, create a data frame with a Datetime index (in one-hour steps):
import numpy as np
import pandas as pd
idx = pd.date_range(start='2020-01-01', end='2020-01-31', freq='H')
data = np.arange(len(idx) * 3).reshape(len(idx), 3)
columns = ['ticker-1', 'ticker-2', 'ticker-3']
df = pd.DataFrame(data=data, index=idx, columns=columns)
print(df.head())
ticker-1 ticker-2 ticker-3
2020-01-01 00:00:00 0 1 2
2020-01-01 01:00:00 3 4 5
2020-01-01 02:00:00 6 7 8
2020-01-01 03:00:00 9 10 11
2020-01-01 04:00:00 12 13 14
Then, group by the index, keeping the year-month-day part but dropping hours-minutes-seconds. The aggregation function is .last():
result = (df.groupby(by=df.index.strftime('%Y-%m-%d'))
[['ticker-1', 'ticker-2', 'ticker-3']]
.last()
)
print(result.head())
ticker-1 ticker-2 ticker-3
2020-01-01 69 70 71
2020-01-02 141 142 143
2020-01-03 213 214 215
2020-01-04 285 286 287
2020-01-05 357 358 359
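To connect this back to the question's setup (one CSV per ticker rather than one wide frame), the same .last() idea can be applied per file and the daily results summed across tickers. A minimal sketch, assuming each frame is indexed by 'timestamp' with a 'users_holding' column as described; the function name daily_last_total is hypothetical:

```python
import glob
import pandas as pd

def daily_last_total(frames, start='2020-02-01'):
    """Sum, across tickers, the last 'users_holding' reading of each day.

    frames: iterable of DataFrames with a DatetimeIndex and a
    'users_holding' column, as in the question's per-ticker CSVs.
    """
    total = None
    for df in frames:
        # Last reading of each calendar day for this ticker:
        # normalize() drops the time-of-day, so grouping by it
        # is grouping by date, and .last() keeps the latest row.
        daily = df['users_holding'].groupby(df.index.normalize()).last()
        daily = daily.loc[start:]  # open-ended slice: no end date needed
        total = daily if total is None else total.add(daily, fill_value=0)
    return total

# The frames could come from the question's own loop, e.g.:
# frames = (pd.read_csv(f, parse_dates=['timestamp'], index_col='timestamp')
#           for f in glob.glob(path + '/*.csv'))
```

fill_value=0 in .add() keeps the sum correct when one ticker's file is missing a day that another ticker has.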