Fastest way to send a dataframe to Redis

Question

I have a dataframe that contains 2 columns. For each row, I simply want to create a Redis set where the first value of the dataframe is the key and the second value is the value of the Redis set. I've done research and I think I found the fastest way of doing this via iterables:

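import struct
import uuid

def send_to_redis(df, r):
    # Pack each subscriber UUID into 16 raw bytes and each rounded score into one unsigned byte.
    df['bin_subscriber'] = df.apply(lambda row: uuid.UUID(row.subscriber).bytes, axis=1)
    df['bin_total_score'] = df.apply(lambda row: struct.pack('B', round(row.total_score)), axis=1)
    df = df[['bin_subscriber', 'bin_total_score']]
    with r.pipeline() as pipe:
        index = 0
        for subscriber, total_score in zip(df['bin_subscriber'], df['bin_total_score']):
            # Queue on the pipeline; calling r.set here would send each command immediately.
            pipe.set(subscriber, total_score)
            if (index + 1) % 2000 == 0:
                pipe.execute()
            index += 1
        # Flush the final partial batch.
        pipe.execute()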

With this, I can send about 400-500k sets to Redis per minute. We may end up processing up to 300 million, which at this rate would take half a day or so. Doable but not ideal. Note that in the outer wrapper I am downloading .parquet files from S3 one at a time and pulling them into Pandas via an in-memory bytes buffer.

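import io

import boto3
import pandas
import redis

def process_file(s3_resource, r, bucket, key):
    buffer = io.BytesIO()
    s3_object = s3_resource.Object(bucket, key)
    s3_object.download_fileobj(buffer)
    # Rewind so the reader starts at the beginning of the downloaded bytes.
    buffer.seek(0)
    send_to_redis(
        pandas.read_parquet(buffer, columns=['subscriber', 'total_score']), r)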

def main():
    args = get_args()
    s3_resource = boto3.resource('s3')
    r = redis.Redis()
    file_prefix = get_prefix(args)
    s3_keys = [
        item.key for item in
        s3_resource.Bucket(args.bucket).objects.filter(Prefix=file_prefix)
        if item.key.endswith('.parquet')
    ]
    for key in s3_keys:
        process_file(s3_resource, r, args.bucket, key)

Is there a way to send this data to Redis without the use of iteration? Is it possible to send an entire blob of data to Redis and have Redis set the key and value for every 1st and 2nd value of the data blob? I imagine that would be slightly faster.

The original parquet files that I am pulling into Pandas are created via PySpark. I've tried using the Spark-Redis plugin, which is extremely fast, but I'm not sure how to convert my data to the above binary format within a Spark dataframe itself. I also don't like how the column name is added as a string to every single value, and that doesn't seem to be configurable. Having that label on every Redis object seems very space-inefficient.

Any suggestions would be greatly appreciated!

Answer 1

Try Redis Mass Insertion and Redis bulk import using --pipe:

Create a new text file input.txt containing the Redis commands:

SET Key0 Value0
SET Key1 Value1
...
SET Keyn Valuen
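A minimal sketch of how such a file could be generated from (key, value) pairs. The helper name `write_set_commands` is hypothetical and not part of any library. It assumes keys and values are written as plain text with no whitespace, because redis-mass.py splits each line on single spaces.

```python
def write_set_commands(pairs, path):
    """Write one 'SET key value' line per (key, value) pair; return the line count."""
    count = 0
    with open(path, "w", newline="\n") as f:
        for key, value in pairs:
            key, value = str(key), str(value)
            # redis-mass.py splits on single spaces, so an empty token or any
            # whitespace inside a token would corrupt the command.
            if not key or not value or any(ch.isspace() for ch in key + value):
                raise ValueError("key and value must be non-empty and contain no whitespace")
            f.write("SET %s %s\n" % (key, value))
            count += 1
    return count
```

With the dataframe from the question, `pairs` could be something like `zip(df['subscriber'], df['total_score'].round().astype(int))`. Note that this stores the UUID string and a decimal score rather than the packed binary form used in the question.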

Use redis-mass.py (see below) to insert into Redis:

python redis-mass.py input.txt | redis-cli --pipe

redis-mass.py, from GitHub:

#!/usr/bin/env python
"""
    redis-mass.py
    ~~~~~~~~~~~~~
    Prepares a newline-separated file of Redis commands for mass insertion.
    :copyright: (c) 2015 by Tim Simmons.
    :license: BSD, see LICENSE for more details.
"""
import sys

def proto(line):
    result = "*%s\r\n$%s\r\n%s\r\n" % (str(len(line)), str(len(line[0])), line[0])
    for arg in line[1:]:
        result += "$%s\r\n%s\r\n" % (str(len(arg)), arg)
    return result

if __name__ == "__main__":
    try:
        filename = sys.argv[1]
        f = open(filename, 'r')
    except IndexError:
        f = sys.stdin.readlines()

    for line in f:
        # end='' stops print from appending a newline after the trailing "\r\n".
        print(proto(line.rstrip().split(' ')), end='')
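The script above works on text lines split on spaces, so it cannot carry the raw binary keys and values from the question. As a sketch under that assumption, the same protocol can be built directly from bytes and written to a binary stream. The function names here (`resp_command`, `encode_row`, `write_resp`) are hypothetical; the packing mirrors the question's `uuid.UUID(...).bytes` and `struct.pack('B', ...)`.

```python
import struct
import uuid


def resp_command(*args):
    """Encode one Redis command as a RESP array of bulk strings (all args are bytes)."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        # Bulk string: "$<byte length>\r\n<raw bytes>\r\n"
        parts.append(b"$%d\r\n%s\r\n" % (len(arg), arg))
    return b"".join(parts)


def encode_row(subscriber, total_score):
    """Pack a UUID string and a score the same way the question's code does."""
    return uuid.UUID(subscriber).bytes, struct.pack("B", round(total_score))


def write_resp(rows, out):
    """Write one SET command per (subscriber, total_score) row to a binary stream."""
    for subscriber, total_score in rows:
        key, value = encode_row(subscriber, total_score)
        out.write(resp_command(b"SET", key, value))
```

Calling `write_resp(zip(df['subscriber'], df['total_score']), sys.stdout.buffer)` from a script and piping that script's output into `redis-cli --pipe` would then skip the intermediate text file. The lengths are byte lengths, so the output stays binary-safe.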


Answer 1
0 2021-02-05 07:07:03
