How to efficiently perform row-wise operations using pandas?

I want to get some basic statistics from some CSV files without loading the whole file into memory. I do it in two ways: one seemingly "smart" way using pandas, and another plain way using the csv module. I expected the pandas way to be faster, but the csv way is actually faster by a very large margin. I was wondering why.

Here is my code:

import pandas as pd
import csv

movies  = pd.read_csv('movies.csv')  # movieId,title,genres
movie_count  = movies.shape[0]       # 9742
movieId_min = movies.movieId.min()
movieId_max = movies.movieId.max()
# map the dense row index to the sparse movieId, and back
movieId_disperse = movies.movieId.sort_values().to_dict()
movieId_squeeze = {v: k for k, v in movieId_disperse.items()}

def get_ratings_stats():
    gp_by_user  = []
    gp_by_movie = [0] * movie_count
    top_rator = (0, 0) # (idx, value)
    top_rated = (0, 0) # (idx, value)
    rating_count = 0
    user_count = 0
    last_user = -1
    for row in csv.DictReader(open('ratings.csv')):
        user = int(row['userId'])-1
        movie = movieId_squeeze[int(row['movieId'])]
        if last_user != user:
            last_user = user
            user_count += 1
            gp_by_user += [0]
        rating_count += 1
        gp_by_user[user]   += 1
        gp_by_movie[movie] += 1
        top_rator = (user,  gp_by_user[user])   if gp_by_user[user]   > top_rator[1] else top_rator
        top_rated = (movie, gp_by_movie[movie]) if gp_by_movie[movie] > top_rated[1] else top_rated
    top_rator = (top_rator[0]+1, top_rator[1])
    top_rated = (movieId_disperse[top_rated[0]], top_rated[1])
    return rating_count, top_rator, top_rated

Now if I replace the line:

for row in csv.DictReader(open('ratings.csv')):

With:

for chunk in pd.read_csv('ratings.csv', chunksize=1000):
    for _,row in chunk.iterrows():

The code actually becomes 10 times slower.

Here are the timing results:

> %timeit get_ratings_stats() # with csv
325 ms ± 9.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %timeit get_ratings_stats() # with pandas
3.45 s ± 67.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Any comments as to how I can make this code better/faster/more readable would be much appreciated.

I think the point is that you shouldn't use pandas if you're then going to treat the big, expensive data structure like a dict. The question shouldn't be how to get pandas to be better at that, it should be how to write your code with pandas to do what you want.

import pandas as pd

def get_ratings_stats():
    movie_rating_data = pd.read_csv('ratings.csv')
    # Movies ranked by the best rating they received
    top_movie = (movie_rating_data.loc[:, ['movieId', 'rating']]
                 .groupby('movieId').agg('max')
                 .sort_values(by='rating', ascending=False)
                 .iloc[:, 0])
    # Users ranked by the best rating they gave
    top_user = (movie_rating_data.loc[:, ['userId', 'rating']]
                .groupby('userId').agg('max')
                .sort_values(by='rating', ascending=False)
                .iloc[:, 0])
    return movie_rating_data.shape[0], top_movie, top_user
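For example, a quick usage check (the variable names are just illustrative):

count, top_movies, top_users = get_ratings_stats()
print(count)               # total number of ratings
print(top_movies.head())   # movies ranked by the best rating they received
print(top_users.head())    # users ranked by the best rating they gave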

def get_ratings_stats_slowly():
    movies = pd.DataFrame(columns=["movieId", "rating"])
    users = pd.DataFrame(columns=["userId", "rating"])
    data_size = 0
    for chunk in pd.read_csv('ratings.csv', chunksize=1000):
        movies = movies.append(chunk.loc[:, ['movieId', 'rating']].groupby('movieId').agg('max').reset_index())
        users = users.append(chunk.loc[:, ['userId', 'rating']].groupby('userId').agg('max').reset_index())
        data_size += chunk.shape[0]
    top_movie = movies.groupby('movieId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    top_user = users.groupby('userId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    return data_size, top_movie, top_user

I'm not really sure that this is what you want to do overall, but your code is incomprehensible - this should be a good place to start (you could replace .agg('max') with .count() if you're interested in the number of ratings, etc).
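For example, a minimal sketch of that count-based variant (the function name is just illustrative); unlike .agg('max'), this matches the question's "most ratings" semantics rather than "highest rating":

import pandas as pd

def get_ratings_counts():
    movie_rating_data = pd.read_csv('ratings.csv')
    # number of ratings per movie / per user, largest first
    ratings_per_movie = movie_rating_data.groupby('movieId')['rating'].count().sort_values(ascending=False)
    ratings_per_user = movie_rating_data.groupby('userId')['rating'].count().sort_values(ascending=False)
    # the most-rated movie and the most active user, each with its count
    top_rated = (ratings_per_movie.index[0], ratings_per_movie.iloc[0])
    top_rator = (ratings_per_user.index[0], ratings_per_user.iloc[0])
    return movie_rating_data.shape[0], top_rated, top_rator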

I think parallel processing is the answer to your question. I've tried doing some parallel processing on your problem, but I had to split the ratings file into multiple files for processing.

What I did initially was to duplicate the ratings data from the CSV file by a factor of 10, and then I executed your script to get an initial execution time, which for me was about 3.6 seconds. By splitting the data into multiple files that can be addressed by multiple child processes, and for example running my script with -k 2 (basically 2 workers), the total execution time reduced to 1.87 seconds. With -k 4 (4 workers), the execution time was 1.13 seconds.

I am not sure if it is possible to read the CSV in chunks and basically do a parallel seek read over a single big file, but that would make it a lot faster; the only drawback is the need for an initial count of the rows in the big CSV file, to know how many rows go to each worker.
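For what it's worth, a minimal sketch of that idea, assuming the total row count is already known (the function name and parameters are just illustrative); read_csv's skiprows and nrows parameters let each worker read only its own slice of the single big file:

import pandas as pd

def read_worker_slice(path, worker_index, number_of_workers, total_rows):
    # each worker reads an equal share of the data rows;
    # the last worker also picks up the remainder
    rows_per_worker = total_rows // number_of_workers
    start = worker_index * rows_per_worker
    nrows = rows_per_worker if worker_index < number_of_workers - 1 else total_rows - start
    # skiprows=range(1, start + 1) keeps the header row and skips the
    # data rows that belong to the previous workers
    return pd.read_csv(path, skiprows=range(1, start + 1), nrows=nrows)

Note that pandas still has to scan past the skipped rows, so this avoids the file splitting but is not a true seek.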

The splitting script:

import csv

file_path = "data/ratings.csv"
out_path = "data/big_ratings_{}.csv"

out_csv = None

# write 10 full copies of the ratings data, one output file per copy
for i in range(10):
    print("Iteration #{}".format(i+1))
    pin = open(file_path, "r")
    pout = open(out_path.format(i), "w")
    in_csv = csv.DictReader(pin)
    out_csv = csv.DictWriter(pout, fieldnames=in_csv.fieldnames)
    out_csv.writeheader()

    for row in in_csv:
        out_csv.writerow(row)

    pin.close()
    pout.close()

The actual rating processing script:

import time
import csv
import argparse
import os
import sys

from multiprocessing import Process, Queue, Value

import pandas as pd


top_rator_queue = Queue()
top_rated_queue = Queue()

DEFAULT_NO_OF_WORKERS = 1
RATINGS_FILE_PATH = "data/big_ratings_{}.csv"

NUMBER_OF_FILES = 10


class ProcessRatings(Process):

    def __init__(self, file_index_range, top_rator_queue, top_rated_queue, movie_id_squeeze):
        super(ProcessRatings, self).__init__()

        self.file_index_range = file_index_range
        self.top_rator_queue = top_rator_queue
        self.top_rated_queue = top_rated_queue
        self.movie_id_squeeze = movie_id_squeeze

    def run(self):

        for file_index in self.file_index_range:
            print("[PID: {}] Processing file index {} .".format(os.getpid(), file_index))

            start = time.time()

            gp_by_user  = []
            gp_by_movie = [0] * len(self.movie_id_squeeze)  # one counter per movie
            top_rator = (0, 0) # (idx, value)
            top_rated = (0, 0) # (idx, value)
            rating_count = 0
            user_count = 0
            last_user = -1

            for row in csv.DictReader(open(RATINGS_FILE_PATH.format(file_index))):
                user = int(row['userId'])-1
                movie = self.movie_id_squeeze[int(row['movieId'])]

                if last_user != user:
                    last_user = user
                    user_count += 1
                    gp_by_user += [0]

                rating_count += 1
                gp_by_user[user]   += 1
                gp_by_movie[movie] += 1

                top_rator = (user,  gp_by_user[user])   if gp_by_user[user]   > top_rator[1] else top_rator
                top_rated = (movie, gp_by_movie[movie]) if gp_by_movie[movie] > top_rated[1] else top_rated

            end = time.time()
            print("[PID: {}] Processing time for file index {}: {}s!".format(os.getpid(), file_index, end-start))

            # report this file's local winners and row count back to the parent
            self.top_rator_queue.put((top_rator, rating_count))
            self.top_rated_queue.put(top_rated)

        print("[PID: {}] WORKER DONE!".format(os.getpid()))


if __name__ == "__main__":
    print("Processing ratings in multiple worker processes.")

    start = time.time()

    # script arguments handling
    parser = argparse.ArgumentParser()
    parser.add_argument("-k", dest="workers", action="store")
    args_space = parser.parse_args()

    # determine the number of workers
    number_of_workers = DEFAULT_NO_OF_WORKERS
    if args_space.workers:
        number_of_workers = int(args_space.workers)
    else:
        print("Number of workers not specified. Assuming: {}".format(number_of_workers))

    # rating data
    rating_count = 0
    movies  = pd.read_csv('data/movies.csv')  # movieId,title,genres
    movie_count  = movies.shape[0]       # 9742
    movieId_min = movies.movieId.min()
    movieId_max = movies.movieId.max()
    movieId_disperse = movies.movieId.sort_values().to_dict()
    movieId_squeeze = {v: k for k, v in movieId_disperse.items()}

    # process data
    processes = []

    # initialize the worker processes; split the file indices evenly,
    # with the last worker taking any remainder
    number_of_files_per_worker = NUMBER_OF_FILES // number_of_workers
    for i in range(number_of_workers):
        first = i * number_of_files_per_worker
        last = NUMBER_OF_FILES if i == number_of_workers - 1 else first + number_of_files_per_worker
        p = ProcessRatings(
            range(first, last),  # this worker's file indices
            top_rator_queue,
            top_rated_queue,
            movieId_squeeze
        )
        p.start()
        processes.append(p)

    print("MAIN: Wait for processes to finish ...")
    # wait until all processes are done
    for p in processes:
        p.join()

    # gather the data and do a final processing
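    # (a minimal sketch of that gathering step, assuming each worker has
    # put its per-file results on the queues as in run() above)
    top_rator = (0, 0)
    top_rated = (0, 0)
    while not top_rator_queue.empty():
        candidate, count = top_rator_queue.get()
        rating_count += count
        if candidate[1] > top_rator[1]:
            top_rator = candidate
    while not top_rated_queue.empty():
        candidate = top_rated_queue.get()
        if candidate[1] > top_rated[1]:
            top_rated = candidate
    print("Top rator: {}".format(top_rator))
    print("Top rated: {}".format(top_rated))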

    end = time.time()

    print("Processing time: {}s".format(end - start))

    print("Rating count: {}".format(rating_count))
