How to efficiently apply a function to the rows of a large pandas DataFrame?

Question

I'm trying to create a training dataset for a model from an existing dataset. It's for blackjack, and each row records how a player played a hand.

The table might look something like this:

|Card1|Card2|Card3|Card4|Card5|PlayerTotal|DealerCard1|Win/Lose
|   7 | 10  |  0  |  0  |  0  |  17       |    10     |  0
|   4 | 3   |  10 |  0  |  0  |  17       |     8     |  1

I'd like to turn it into rows containing just the sum of the player's hand, the dealer's card, and the win/lose result. However, if more than 2 cards have been played (so the player hit), then I'd like to make multiple rows for that sample, one for the game at each stage (so before the player hits each time).

So the example would become:

|PlayerTotal|DealerCard1|Win/Lose
|    17     |     10    |  0
|    7      |     8     |  1
|    17     |     8     |  1

How can I do this efficiently?

I can do this fine with a small dataset using DataFrame.apply and a custom function with if statements, but once I use the whole dataset (~1 million rows) it's very slow and memory intensive.

Something like this:

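import pandas as pd

def extractRounds(x):
    # x is one row: Card1..Card5, PlayerTotal, DealerCard1, Win/Lose
    dealer, result = x.iloc[6], x.iloc[7]
    totals = []
    # The first two cards are always dealt
    totals.append([x.iloc[0:2].sum(), dealer, result])

    # Each further non-zero card is a hit, so record the total after it
    if x.iloc[2] > 0:
        totals.append([x.iloc[0:3].sum(), dealer, result])
    else:
        return pd.Series(totals)

    if x.iloc[3] > 0:
        totals.append([x.iloc[0:4].sum(), dealer, result])
    else:
        return pd.Series(totals)

    if x.iloc[4] > 0:
        totals.append([x.iloc[0:5].sum(), dealer, result])

    return pd.Series(totals)


# a is the original DataFrame, one row per hand
b = (a.apply(extractRounds, axis=1)).stack()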

I'm guessing that the extractRounds(x) function isn't the most efficient approach.

So I'm wondering whether I'm barking up the wrong tree by applying a function to each row, or whether there is a better way.

Let me know if this isn't clear. Cheers!

Answer 1

You can use melt to convert your data into long format, add a cumulative sum, and then exclude the zero card values for cards 3-5. Also exclude card 1, since the player will always have a minimum of 2 cards.

Here's your example as a dataframe:

import pandas as pd
import numpy as np

raw = pd.DataFrame({'Card1': [7, 4],
                    'Card2': [10, 3],
                    'Card3': [0, 10],
                    'Card4': [0, 0],
                    'Card5': [0, 0],
                    'DealerCard1': [10, 8],
                    'PlayerTotal': [17, 17],
                    'Win/Lose': [0, 1]})

raw.index.name = 'Game'

Use melt to create another dataframe in long format:

df = (raw.reset_index()
     .melt(value_vars=['Card1', 'Card2', 'Card3', 'Card4', 'Card5'], 
           id_vars=['Game', 'DealerCard1', 'Win/Lose'],
           value_name='CardValue', 
           var_name='Card')
     .sort_values(['Game', 'Card'])  # keep cards in deal order within each game
     .reset_index(drop=True))

Recreate the PlayerTotal column as a cumulative sum:

df['PlayerTotal'] = df.groupby('Game')['CardValue'].cumsum()
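
To see what the cumulative sum produces for a single hand, here is a small sketch using the card values of the second sample game. The variable names are only illustrative.

```python
import pandas as pd

# Card values of the second sample game, in deal order (Card1..Card5)
cards = pd.Series([4, 3, 10, 0, 0])

# Running total of the hand after each card slot
running_total = cards.cumsum()
print(running_total.tolist())
```

The first entry is the total after a single card, and the last two entries repeat the final total because no card was drawn. Those are the rows the next step filters out.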

And then you can exclude card 1 and the zero cards and select your desired columns:

df.loc[(df['CardValue']!=0) & (df['Card']!='Card1'), ['PlayerTotal', 'DealerCard1', 'Win/Lose']]

That will give you:

   PlayerTotal  DealerCard1  Win/Lose
1           17           10         0
6            7            8         1
7           17            8         1
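
The steps above can be collected into a single function. This is a minimal sketch rather than a definitive implementation: expand_rounds and CARD_COLS are hypothetical names, and it sorts by both Game and Card so the cards stay in deal order within each game.

```python
import pandas as pd

CARD_COLS = ['Card1', 'Card2', 'Card3', 'Card4', 'Card5']

def expand_rounds(raw):
    """Return one row per stage of each hand (hypothetical helper)."""
    long = (raw.rename_axis('Game').reset_index()
               .melt(id_vars=['Game', 'DealerCard1', 'Win/Lose'],
                     value_vars=CARD_COLS,
                     var_name='Card', value_name='CardValue')
               .sort_values(['Game', 'Card'])
               .reset_index(drop=True))
    # Running hand total within each game
    long['PlayerTotal'] = long.groupby('Game')['CardValue'].cumsum()
    # Drop the single-card row and the empty card slots
    keep = (long['CardValue'] != 0) & (long['Card'] != 'Card1')
    cols = ['PlayerTotal', 'DealerCard1', 'Win/Lose']
    return long.loc[keep, cols].reset_index(drop=True)

raw = pd.DataFrame({'Card1': [7, 4],
                    'Card2': [10, 3],
                    'Card3': [0, 10],
                    'Card4': [0, 0],
                    'Card5': [0, 0],
                    'DealerCard1': [10, 8],
                    'PlayerTotal': [17, 17],
                    'Win/Lose': [0, 1]})

result = expand_rounds(raw)
print(result)
```

Because every step is a column-wise pandas operation, no Python function is called once per row, which is where the apply-based version spends its time.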

Answer 2

You can use command-line tools to add the extra lines to the CSV file and do the summation.

Suppose the first few lines of the CSV file data.csv are:

Card1,Card2,Card3,Card4,Card5,PlayerTotal,DealerCard1,Win/Lose
7,10,0,0,0,17,10,0
4,3,10,0,0,17,8,1

Running the following command gives us the desired output:

sed 's/\(.*,\)\(.*,\)\([1-9][0-9]*,\)\(.*,.*,.*,.*,.*\)/\1\2\3\4\n\1\20,\4/' data.csv | cut -d ',' -f 1,2,3,7,8 | awk -F ',' 'NR>1 {print $1+$2+$3 "," $4 "," $5}' > data_2.csv

It creates a file named data_2.csv containing:

17,10,0
17,8,1
7,8,1

--------------------------------

Explanation of the command:

sed 's/\(.*,\)\(.*,\)\([1-9][0-9]*,\)\(.*,.*,.*,.*,.*\)/\1\2\3\4\n\1\20,\4/' data.csv

reads data.csv line by line; if a line has a non-zero value in the third column, it appends a copy of that line with the third column set to 0. Note that only a third card is handled this way; hands with a fourth or fifth card would need further substitutions.
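
The substitution can be checked on the two sample lines with Python's re module. This is only an illustrative sketch: the pattern is the sed expression rewritten in Python's group syntax, and expand is a hypothetical helper name.

```python
import re

# Same pattern as the sed expression, written with Python's group syntax
pattern = re.compile(r'(.*,)(.*,)([1-9][0-9]*,)(.*,.*,.*,.*,.*)')
# \g<2>0 means "group 2 followed by a literal 0" (sed writes this as \20)
replacement = r'\1\2\3\4\n\1\g<2>0,\4'

def expand(line):
    """Append a copy with the third column zeroed when that column is non-zero."""
    return pattern.sub(replacement, line)

print(expand('7,10,0,0,0,17,10,0'))  # third column is 0: line unchanged
print(expand('4,3,10,0,0,17,8,1'))   # third column is 10: a second line is added
```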

| cut -d ',' -f 1,2,3,7,8

reads the data from the previous step and filters it down to columns 1, 2, 3, 7 and 8 (these are the columns we care about).

| awk -F ',' 'NR>1 {print $1+$2+$3 "," $4 "," $5}' > data_2.csv

reads the data from the previous step, skips the header line, adds up the first three columns, and writes the sum to a file called data_2.csv together with the last two columns.
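
The same streaming idea can be sketched in Python with the standard csv module, which keeps memory use low because rows are processed one at a time. This is not the answer's command but a swapped-in equivalent: expand_rows is a hypothetical name, the sample is read from an in-memory string instead of data.csv, and the code assumes zeros only pad the end of a hand. Unlike the sed pipeline it also covers a fourth and fifth card, and it emits the stages of each hand in play order.

```python
import csv
import io

def expand_rows(reader):
    """Yield (running total, dealer card, win/lose) for each stage of a hand."""
    next(reader)  # skip the header line
    for row in reader:
        cards = [int(v) for v in row[:5]]
        dealer, outcome = row[6], row[7]
        total = cards[0]
        for card in cards[1:]:
            if card == 0:
                break  # assumption: zeros only pad the end of the hand
            total += card
            yield total, dealer, outcome

# In-memory stand-in for data.csv
sample = io.StringIO(
    "Card1,Card2,Card3,Card4,Card5,PlayerTotal,DealerCard1,Win/Lose\n"
    "7,10,0,0,0,17,10,0\n"
    "4,3,10,0,0,17,8,1\n"
)

rows = list(expand_rows(csv.reader(sample)))
for total, dealer, outcome in rows:
    print(f"{total},{dealer},{outcome}")
```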

Answer 1
1 Accepted 2019-06-12 16:01:30

Answer 2
0 2019-06-12 06:54:33