
Analyzing Token Data from a Pandas Dataframe

I'm a relative Python noob and also new to natural language processing (NLP).

I have a dataframe containing names and sales. I want to: 1) break out all the tokens and 2) aggregate sales by each token.

Here's an example of the dataframe:

name    sales
Mike Smith  5
Mike Jones  3
Mary Jane   4

Here's the desired output:

token   sales
mike    8
mary    4
Smith   5
Jones   3
Jane    4

Thoughts on what to do? I'm using Python.

Assumption: you have a function tokenize that takes in a string as input and returns a list of tokens.

I'll use this function as a tokenizer for now:

def tokenize(word):
    return word.casefold().split()
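For instance, a quick sanity check of what this tokenizer returns:

```python
def tokenize(word):
    return word.casefold().split()

# lowercases via casefold, then splits on whitespace
print(tokenize("Mike Smith"))  # ['mike', 'smith']
```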

Solution

df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()

In [45]: df
Out[45]:
             name  sales
0      Mike Smith      5
1      Mike Jones      3
2       Mary Jane      4
3  Mary Anne Jane      1

In [46]: df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
Out[46]:
  tokens  sales
0   anne      1
1   jane      5
2  jones      3
3   mary      5
4   mike      8
5  smith      5
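Put together as a runnable script (assuming pandas is installed, and using the four-row example above), this reproduces the table in Out[46]:

```python
import pandas as pd

def tokenize(word):
    return word.casefold().split()

df = pd.DataFrame({
    'name': ['Mike Smith', 'Mike Jones', 'Mary Jane', 'Mary Anne Jane'],
    'sales': [5, 3, 4, 1],
})

# tokenize each name, explode one token per row, then sum sales per token
result = (
    df.assign(tokens=df['name'].apply(tokenize))
      .explode('tokens')
      .groupby('tokens')['sales']
      .sum()
      .reset_index()
)
print(result)
```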

Explanation

  1. The assign step creates a column called tokens that applies the tokenize function

Note: For this particular tokenize function you can use df['name'].str.lower().str.split() - however this won't generalize to custom tokenizers, hence the .apply(tokenize)

this generates a df that looks like

             name  sales              tokens
0      Mike Smith      5       [mike, smith]
1      Mike Jones      3       [mike, jones]
2       Mary Jane      4        [mary, jane]
3  Mary Anne Jane      1  [mary, anne, jane]
  2. use df.explode on this to get
             name  sales tokens
0      Mike Smith      5   mike
0      Mike Smith      5  smith
1      Mike Jones      3   mike
1      Mike Jones      3  jones
2       Mary Jane      4   mary
2       Mary Jane      4   jane
3  Mary Anne Jane      1   mary
3  Mary Anne Jane      1   anne
3  Mary Anne Jane      1   jane
  3. the last step is just a groupby-agg step.
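As the note in step 1 mentions, for this particular tokenizer the vectorized str accessor produces the same tokens (str.lower and str.casefold differ only for a few non-ASCII characters); a quick check:

```python
import pandas as pd

names = pd.Series(['Mike Smith', 'Mike Jones', 'Mary Jane'])

# vectorized form vs. the apply-based tokenizer
vectorized = names.str.lower().str.split()
applied = names.apply(lambda w: w.casefold().split())

print(vectorized.tolist() == applied.tolist())  # True
```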

You can use the str.split() method and keep item 0 for the first name, using that as the groupby key and take the sum, then do the same for item -1 (last name) and concatenate the two.

import pandas as pd
df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
 'sales': {0: 5, 1: 3, 2: 4}})


df = pd.concat([df.groupby(df.name.str.split().str[0])['sales'].sum(),
    df.groupby(df.name.str.split().str[-1])['sales'].sum()]).reset_index()

df.rename(columns={'name':'token'}, inplace=True)
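Run end-to-end on the three-row example, this approach gives the values in the desired output (selecting the sales column before summing avoids attempting to sum the string name column on newer pandas):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mike Smith', 'Mike Jones', 'Mary Jane'],
                   'sales': [5, 3, 4]})

# group once by the first token, once by the last token, then stack the results
token_sales = pd.concat([df.groupby(df.name.str.split().str[0])['sales'].sum(),
                         df.groupby(df.name.str.split().str[-1])['sales'].sum()]).reset_index()
token_sales.rename(columns={'name': 'token'}, inplace=True)
print(token_sales)
```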
df[["fname", "lname"]] = df["name"].str.split(expand=True)  # get tokens, assuming they are separated by a space


tokens_df = pd.concat([df[['fname', 'sales']].rename(columns = {'fname': 'tokens'}),
                       df[['lname', 'sales']].rename(columns = {'lname': 'tokens'})])


pd.DataFrame(tokens_df.groupby('tokens')['sales'].apply(sum), columns=['sales'])

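Putting the three steps together as a self-contained script (a sketch assuming each name has exactly two space-separated tokens, since str.split(expand=True) produces one column per token):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mike Smith', 'Mike Jones', 'Mary Jane'],
                   'sales': [5, 3, 4]})

# split each name into first/last-name columns (one column per token)
df[['fname', 'lname']] = df['name'].str.split(expand=True)

# stack the two token columns into one long frame
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns={'fname': 'tokens'}),
                       df[['lname', 'sales']].rename(columns={'lname': 'tokens'})])

# aggregate sales per token
result = tokens_df.groupby('tokens')['sales'].sum().reset_index()
print(result)
```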
