Pandas DataFrame: generate variables using previous rows
I am attempting to generate more variables for my dataset. My data is stored in multiple files, and using pandas I can only read a single file at a time because of memory limitations. Each csv file has the data for a single month and goes something like this:
Index Date Sender Recipient Quantity Type
------------------------------------------------------------------------
79XT 26-03-19 Adam Tiffany 72 Box
57ZY 14-03-19 Josh Ross 13 Snack
29UQ 19-03-19 Adam Alex 60 Fruit
56PY 06-03-19 Lucy Alex 29 Book
41BR 28-03-19 Josh Steve 33 Snack
Now I am trying to generate more features for each row based on the history of each sender, and join these features to the dataframe. For example:
Index  Date      Sender  Recipient  Quantity  Type   Days Since          Days Since       Cumulative  Quantity Increase        First Shipment
                                                     Previous Shipment   First Shipment   Quantity    from Previous Shipment   to This Recipient?
-------------------------------------------------------------------------------------------------------------------------------------------------
79XT 26-03-19 Adam Tiffany 72 Box 7 62 1792 12 0
57ZY 14-03-19 Josh Ross 13 Snack NaN NaN 13 NaN 1
29UQ 19-03-19 Adam Alex 60 Fruit 5 55 1730 -7 1
56PY 06-03-19 Lucy Alex 29 Book 23 32 88 -4 0
41BR 28-03-19 Josh Steve 33 Snack 14 14 46 20 1
As you can see from the desired dataframe above, the new variables are generated based on the sender's previous observations. What is the least computationally expensive way of generating such features? I will need to obtain information from all my monthly csv files to gather such data. There are over 200,000 unique senders, so it would take weeks to read the csv files, produce a dataframe and a csv file for every unique sender, and merge this data with the monthly csv files. I am aware of dask and dask distributed, but I want to find out if there is a simpler way to implement what I am trying to do.
I see multiple sub-problems in your problem.
first = df.groupby("Sender").agg(first_occurrence_date=("Date", "min")).reset_index()
df = df.merge(first, on="Sender", how="left")
# Computationally likely inefficient, and doesn't immediately solve the multiple-file issue.
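A minimal runnable sketch of that first-occurrence merge, on sample rows adapted from the question (column names assumed from the question's table):

```python
import pandas as pd

# Minimal sketch: compute each sender's first shipment date once,
# then merge it back onto every row of the monthly frame.
df = pd.DataFrame({
    "Sender": ["Adam", "Josh", "Adam"],
    "Date": pd.to_datetime(["2019-03-26", "2019-03-14", "2019-03-19"]),
})
first = df.groupby("Sender").agg(first_occurrence_date=("Date", "min")).reset_index()
df = df.merge(first, on="Sender", how="left")
```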
Consider .feather as an efficient storage format. The standard for this changes, so always keep a .csv as backup. You can write a file as feather like this:
df.to_feather("filename")
Consider factoring your strings with pd.factorize() as described in the Pandas Docs: pd.Factorize() - I have not seen benchmarks on this, but comparing int is faster than comparing string.
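A quick sketch of what factorize does (sender names taken from the question's sample): each string gets an integer code, and the unique values are kept for decoding.

```python
import pandas as pd

# Hedged sketch: factorize maps each string to an integer label,
# so later comparisons and joins can run on ints instead of strings.
senders = pd.Series(["Adam", "Josh", "Adam", "Lucy", "Josh"])
codes, uniques = pd.factorize(senders)
# codes holds the integer labels; uniques holds the original strings
# in order of first appearance.
```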
Lastly, consider setting up a small sqlite3 database that reads the individual files and stores them. Otherwise, getting the first occurrence will be a pain, because you have to keep overwriting the old value and repeat a computationally expensive operation many times.
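A minimal sketch of that idea, assuming ISO-formatted date strings so that MIN() yields the earliest date per sender (the table and column names are hypothetical):

```python
import sqlite3

# Hypothetical schema: append each monthly file's rows to one table,
# then let SQL compute the first shipment date per sender in a
# single aggregate query instead of repeated pandas passes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shipments (idx TEXT, date TEXT, sender TEXT, quantity INTEGER)")
rows = [
    ("79XT", "2019-03-26", "Adam", 72),
    ("29UQ", "2019-03-19", "Adam", 60),
    ("57ZY", "2019-03-14", "Josh", 13),
]
con.executemany("INSERT INTO shipments VALUES (?, ?, ?, ?)", rows)
first = dict(con.execute("SELECT sender, MIN(date) FROM shipments GROUP BY sender"))
con.close()
```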
Here I have a different approach. I'd try to convert every csv to parquet (eventually see this answer), changing dtypes. At least

df['Date'] = df['Date'].astype("M8")

or

df['Date'] = pd.to_datetime(df['Date'])

and save the results to a processed folder. Then repartition by sender:

import dask.dataframe as dd
df = dd.read_parquet('processed')
df.to_parquet('processed2', partition_on='Sender')
Now you have many files in every Sender=username folder; you should merge all of them into a single file. You can then create your function for every Sender=username:
def fun(df):
    df = df.sort_values("Date")
    df["Days Since Previous Shipment"] = df["Date"].diff().dt.days
    df["Days Since First Shipment"] = (df["Date"] - df["Date"].min()).dt.days
    df["Cumulative Quantity"] = df["Quantity"].cumsum()
    df["Quantity Difference"] = df["Quantity"].diff()
    grp = df.groupby("Recipient")["Date"].min().reset_index(name="First Shipment")
    df = pd.merge(df, grp, how="left", on="Recipient")
    df["First Shipment"] = (df["Date"] == df["First Shipment"]).astype("int8")
    return df
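If a single month (or the concatenated history) fits in memory, the same per-sender features can also be computed without splitting into per-sender files, using grouped operations on one frame. A hedged sketch on the question's sample rows:

```python
import pandas as pd

# Hedged sketch: per-sender features via groupby on a single frame
# (sample rows adapted from the question; dates are day-month-year).
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["26-03-19", "14-03-19", "19-03-19", "06-03-19", "28-03-19"],
        format="%d-%m-%y",
    ),
    "Sender": ["Adam", "Josh", "Adam", "Lucy", "Josh"],
    "Quantity": [72, 13, 60, 29, 33],
}).sort_values(["Sender", "Date"])

g = df.groupby("Sender")
df["Days Since Previous Shipment"] = g["Date"].diff().dt.days
df["Days Since First Shipment"] = (df["Date"] - g["Date"].transform("min")).dt.days
df["Cumulative Quantity"] = g["Quantity"].cumsum()
```

Each grouped operation makes one pass over the frame, so this avoids the per-sender file shuffle entirely when the data fits in memory.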