Get unique of a column ordered by date in 800 million rows
Input: Multiple csv files with the same columns (800 million rows): [Time Stamp, User ID, Col1, Col2, Col3]
Memory available: 60 GB of RAM and a 24-core CPU
Input/Output example
Problem: I want to group by User ID, sort by TimeStamp, and take the unique values of Col1, dropping duplicates while retaining the order based on the TimeStamp.
Solutions tried:

1. joblib to load the csv files in parallel and pandas to sort and write to csv (I get an error at the sorting step)
2. dask:

LocalCluster(dashboard_address=f':{port}', n_workers=4, threads_per_worker=4, memory_limit='7GB') ## Cannot use the full 60 GB as there are others on the server
ddf = read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")
Questions:

1. Assuming dask will use the disk to sort and write the multi-part output, will it preserve the order when I read that output back with read_csv?
2. Is there a better way to do the following, which dedupes Col1 per user while keeping order?

def getUnique(user_group):  ## assuming the rows for each user are sorted by timestamp
    res = list()
    for val in user_group["Col1"]:
        if val not in res:
            res.append(val)
    return res
Please direct me if there is a better alternative to dask.
So, I think I would approach this with two passes. In the first pass, I would run through all the csv files and build a data structure holding the keys of user_id and col1 together with the "best" timestamp. In this case, "best" will be the lowest.
Note: the use of dictionaries here only serves to clarify what we are attempting to do; if performance or memory were an issue, I would first look to reimplement without them where possible.
So, starting with CSV data like:
[
    {"user_id": 1, "col1": "a", "timestamp": 1},
    {"user_id": 1, "col1": "a", "timestamp": 2},
    {"user_id": 1, "col1": "b", "timestamp": 4},
    {"user_id": 1, "col1": "c", "timestamp": 3},
]
After processing all the csv files I hope to have an interim representation of:
{
    1: {'a': 1, 'b': 4, 'c': 3}
}
Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a pass 1.5, if you wanted to do that.
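To make that pass 1.5 concrete, here is a sketch of how the per-file dictionaries could be re-distilled into one, keeping the lowest timestamp per user_id/col1 pair. The function name merge_partials and the demo partials are illustrative, not part of the original answer; in practice each partial would come from a worker (e.g. a multiprocessing.Pool mapping the pass-1 scan over the file list).

```python
def merge_partials(partials):
    """Pass 1.5: fold per-file {user_id: {col1: ts}} dicts into one,
    keeping the lowest timestamp for each user_id/col1 pair."""
    merged = {}
    for partial in partials:
        for user_id, cols in partial.items():
            target = merged.setdefault(user_id, {})
            for col1, ts in cols.items():
                ## keep the earliest timestamp seen across all files
                if col1 not in target or ts < target[col1]:
                    target[col1] = ts
    return merged

## demo: two per-file results, as pass 1 might produce them
partials = [
    {1: {"a": 2, "b": 4}},
    {1: {"a": 1, "c": 3}, 2: {"x": 7}},
]
print(merge_partials(partials))
# {1: {'a': 1, 'b': 4, 'c': 3}, 2: {'x': 7}}
```

Because the merge only needs the per-file dicts, the expensive scan of each CSV can proceed fully in parallel and this cheap fold runs once at the end.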
Now we can create a final representation based on the keys of this nested structure, sorted by the innermost value. Giving us:
[
    {'user_id': 1, 'col1': ['a', 'c', 'b']}
]
Here is how I might first approach this task before tweaking things for performance.
import csv

all_csv_files = [
    "some.csv",
    "bunch.csv",
    "of.csv",
    "files.csv",
]

data = {}
for csv_file in all_csv_files:
    # with open(csv_file, "r") as file_in:
    #     rows = csv.DictReader(file_in)

    ## ----------------------------
    ## demo data
    ## ----------------------------
    rows = [
        {"user_id": 1, "col1": "a", "timestamp": 1},
        {"user_id": 1, "col1": "a", "timestamp": 2},
        {"user_id": 1, "col1": "b", "timestamp": 4},
        {"user_id": 1, "col1": "c", "timestamp": 3},
    ]
    ## ----------------------------

    ## ----------------------------
    ## First pass to determine the "best" timestamp
    ## for a user_id/col1
    ## ----------------------------
    for row in rows:
        user_id = row['user_id']
        col1 = row['col1']
        ts_new = row['timestamp']  ## note: csv.DictReader yields strings, so cast with int() for real files
        ts_old = (
            data
            .setdefault(user_id, {})
            .setdefault(col1, ts_new)
        )
        if ts_new < ts_old:
            data[user_id][col1] = ts_new
    ## ----------------------------
print(data)

## ----------------------------
## Second pass to set the order of col1 for a given user_id
## ----------------------------
data_out = [
    {
        "user_id": outer_key,
        "col1": [
            inner_kvp[0]
            for inner_kvp
            in sorted(outer_value.items(), key=lambda v: v[1])
        ]
    }
    for outer_key, outer_value
    in data.items()
]
## ----------------------------
print(data_out)
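As an aside on the getUnique function in the question: `if val not in res` on a list is a linear scan, making the whole function quadratic in the group size. A seen-set keeps the same first-occurrence order in linear time. A sketch (get_unique_ordered is an illustrative name; it assumes, as the question does, that the group's rows are already sorted by timestamp):

```python
def get_unique_ordered(values):
    """Order-preserving de-duplication with O(1) membership checks."""
    seen = set()
    out = []
    for val in values:
        if val not in seen:  ## set lookup instead of a list scan
            seen.add(val)
            out.append(val)
    return out

print(get_unique_ordered(["a", "a", "b", "c", "b"]))
# ['a', 'b', 'c']
```

For 800 million rows the difference between a list scan and a set lookup per row is substantial, whichever framework ends up doing the grouping.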