
Get unique of a column ordered by date in 800 million rows

Input: multiple csv files with the same columns (800 million rows total): [Time Stamp, User ID, Col1, Col2, Col3]

Memory available: 60GB of RAM and a 24-core CPU

[Input/output example image]

Problem: I want to group by User ID, sort each group by Time Stamp, and take the unique values of Col1, dropping duplicates while retaining the order given by the Time Stamp.

Solutions tried:

  1. Tried using joblib to load the csv files in parallel and pandas to sort and write to csv (got an error at the sorting step).
  2. Used dask (new to Dask):

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

client = Client(LocalCluster(dashboard_address=f':{port}', n_workers=4, threads_per_worker=4,
                             memory_limit='7GB'))  ## cannot use the full 60GB as there are others on the server
ddf = dd.read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")

Questions:

  1. Assuming dask will use disk to sort and write the multipart output, will it preserve the order after I read the output back with read_csv?
  2. How do I achieve the second part of the problem in dask? In pandas, I would just apply the function below per group and gather the results into a new dataframe:
def getUnique(user_group):  ## assuming the rows for each user are sorted by timestamp
  res = list()
  for val in user_group["Col1"]:
    if val not in res:
      res.append(val)
  return res
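For completeness, this is roughly the pandas version I have in mind (a sketch only; one.csv stands in for any single input file, and dict.fromkeys is an O(n) order-preserving alternative to the getUnique above):

import pandas as pd

df = pd.read_csv("one.csv").sort_values("Time Stamp")
result = (
    df.groupby("User ID", sort=False)["Col1"]
      .apply(lambda s: list(dict.fromkeys(s)))  ## order-preserving unique
      .reset_index()
)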

Please direct me to a better alternative to dask if there is one.

So, I think I would approach this with two passes. In the first pass, I would run through all the csv files and build a data structure holding the keys of user_id and col1 and the "best" timestamp. In this case, "best" will be the lowest.

Note: the use of dictionaries here only serves to clarify what we are attempting to do; if performance or memory were an issue, I would first look to reimplement without them where possible.
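One possible step in that direction (a sketch only, not part of the solution below): a single flat dict keyed by (user_id, col1) tuples, which drops one level of nesting and roughly halves the number of dict objects:

## sketch: one flat dict keyed by (user_id, col1) tuples
best_ts = {}
for row in rows:  ## rows as in the first pass below
    key = (row["user_id"], row["col1"])
    ts = row["timestamp"]
    if key not in best_ts or ts < best_ts[key]:
        best_ts[key] = ts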

So, starting with CSV data like:

[
    {"user_id": 1, "col1": "a", "timestamp": 1},
    {"user_id": 1, "col1": "a", "timestamp": 2},
    {"user_id": 1, "col1": "b", "timestamp": 4},
    {"user_id": 1, "col1": "c", "timestamp": 3},
]

After processing all the csv files, I hope to have an interim representation of:

{
    1: {'a': 1, 'b': 4, 'c': 3}
}

Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a "pass 1.5", if you wanted to do that.
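A rough sketch of that parallel variant, assuming a hypothetical build_interim(csv_file) that runs the first-pass loop below on a single file and returns its nested dict (the pool size and names are placeholders, and all_csv_files is the file list from the code below):

from multiprocessing import Pool

def merge_interims(parts):
    ## "pass 1.5": fold the per-file dicts into one,
    ## keeping the lowest timestamp per (user_id, col1)
    merged = {}
    for part in parts:
        for user_id, cols in part.items():
            best = merged.setdefault(user_id, {})
            for col1, ts in cols.items():
                if col1 not in best or ts < best[col1]:
                    best[col1] = ts
    return merged

## build_interim is hypothetical; wrap in `if __name__ == "__main__":`
## when using the spawn start method
with Pool(processes=24) as pool:  ## 24 cores available
    parts = pool.map(build_interim, all_csv_files)
data = merge_interims(parts)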

Now we can create a final representation based on the keys of this nested structure, sorted by the innermost value, giving us:

[
    {'user_id': 1, 'col1': ['a', 'c', 'b']}
]

Here is how I might first approach this task before tweaking things for performance.

import csv

all_csv_files = [
    "some.csv",
    "bunch.csv",
    "of.csv",
    "files.csv",
]

data = {}
for csv_file in all_csv_files:
    #with open(csv_file, "r") as file_in:
    #    rows = csv.DictReader(file_in)
    #    ## note: DictReader yields strings, so cast with int(row["timestamp"])
    #    ## before the comparison below, and consume rows inside the with block

    ## ----------------------------
    ## demo data
    ## ----------------------------
    rows = [
        {"user_id": 1, "col1": "a", "timestamp": 1},
        {"user_id": 1, "col1": "a", "timestamp": 2},
        {"user_id": 1, "col1": "b", "timestamp": 4},
        {"user_id": 1, "col1": "c", "timestamp": 3},
    ]
    ## ----------------------------

    ## ----------------------------
    ## First pass to determine the "best" timestamp
    ## for a user_id/col1
    ## ----------------------------
    for row in rows:
        user_id = row['user_id']
        col1 = row['col1']
        ts_new = row['timestamp']
        ts_old = (
            data
                .setdefault(user_id, {})
                .setdefault(col1, ts_new)
        )

        if ts_new < ts_old:
            data[user_id][col1] = ts_new
    ## ----------------------------

print(data)

## ----------------------------
## second pass to set order of col1 for a given user_id
## ----------------------------
data_out = [
    {
        "user_id": outer_key,
        "col1": [
            inner_kvp[0]
            for inner_kvp
            in sorted(outer_value.items(), key=lambda v: v[1])
        ]
    }
    for outer_key, outer_value
    in data.items() 
]
## ----------------------------

print(data_out)
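And if you want the result on disk rather than printed, a minimal sketch (the output file name and the JSON encoding of the ordered list are my choices, not requirements; csv is already imported above):

import json

## persist one row per user_id; JSON-encode the ordered col1 list
## so it survives a CSV round trip
with open("output.csv", "w", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["user_id", "col1"])
    for rec in data_out:
        writer.writerow([rec["user_id"], json.dumps(rec["col1"])])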
