Analyzing Token Data from a Pandas Dataframe
I'm a relative Python noob and also new to natural language processing (NLP). I have a dataframe containing names and sales. I want to: 1) break out all the tokens and 2) aggregate sales by each token.
Here's an example of the dataframe:
name sales
Mike Smith 5
Mike Jones 3
Mary Jane 4
Here's the desired output:
token sales
mike 8
mary 4
Smith 5
Jones 3
Jane 4
Thoughts on what to do? I'm using Python.
Assumption: you have a function tokenize that takes in a string as input and returns a list of tokens. I'll use this function as a tokenizer for now:
def tokenize(word):
    return word.casefold().split()
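For example, applied to one of the sample names (a minimal check of the function as defined above):

```python
def tokenize(word):
    # Lowercase aggressively and split on whitespace.
    return word.casefold().split()

print(tokenize("Mike Smith"))  # → ['mike', 'smith']
```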
Solution
df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
In [45]: df
Out[45]:
name sales
0 Mike Smith 5
1 Mike Jones 3
2 Mary Jane 4
3 Mary Anne Jane 1
In [46]: df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
Out[46]:
tokens sales
0 anne 1
1 jane 5
2 jones 3
3 mary 5
4 mike 8
5 smith 5
Explanation
df.assign creates a new column tokens that applies the tokenize function to each name.
Note: for this particular tokenize function you can use df['name'].str.lower().str.split(), however this won't generalize to custom tokenizers, hence the .apply(tokenize)
This generates a df that looks like:
name sales tokens
0 Mike Smith 5 [mike, smith]
1 Mike Jones 3 [mike, jones]
2 Mary Jane 4 [mary, jane]
3 Mary Anne Jane 1 [mary, anne, jane]
We then call df.explode('tokens') on this to get:
name sales tokens
0 Mike Smith 5 mike
0 Mike Smith 5 smith
1 Mike Jones 3 mike
1 Mike Jones 3 jones
2 Mary Jane 4 mary
2 Mary Jane 4 jane
3 Mary Anne Jane 1 mary
3 Mary Anne Jane 1 anne
3 Mary Anne Jane 1 jane
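Putting it all together, a self-contained sketch of the pipeline above (the sample rows match the session shown earlier):

```python
import pandas as pd

# Sample data from the session above (including the extra 'Mary Anne Jane' row).
df = pd.DataFrame({
    "name": ["Mike Smith", "Mike Jones", "Mary Jane", "Mary Anne Jane"],
    "sales": [5, 3, 4, 1],
})

def tokenize(word):
    return word.casefold().split()

result = (
    df.assign(tokens=df["name"].apply(tokenize))  # list of tokens per row
      .explode("tokens")                          # one row per (name, token) pair
      .groupby("tokens")["sales"]                 # group the rows by token
      .sum()
      .reset_index()
)
print(result)
```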
You can use the str.split() method and keep item 0 for the first name, use that as the groupby key and take the sum, then do the same for item -1 (the last name) and concatenate the two.
import pandas as pd

df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
                   'sales': {0: 5, 1: 3, 2: 4}})
# Select the 'sales' column before summing so the string 'name' column
# isn't dragged into the aggregation.
df = pd.concat([df.groupby(df.name.str.split().str[0])['sales'].sum(),
                df.groupby(df.name.str.split().str[-1])['sales'].sum()]).reset_index()
df.rename(columns={'name': 'token'}, inplace=True)
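Run end to end, this approach looks like the sketch below. Note that, unlike the tokenize-based answer, it does not lowercase the tokens, which matches the mixed-case desired output in the question:

```python
import pandas as pd

df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
                   'sales': {0: 5, 1: 3, 2: 4}})

# Group once by the first token and once by the last, then stack the results.
first = df.groupby(df.name.str.split().str[0])['sales'].sum()
last = df.groupby(df.name.str.split().str[-1])['sales'].sum()
out = pd.concat([first, last]).reset_index().rename(columns={'name': 'token'})
print(out)
```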
df[["fname", "lname"]] = df["name"].str.split(expand=True)  # split into tokens, assuming they are separated by a space
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns={'fname': 'tokens'}),
                       df[['lname', 'sales']].rename(columns={'lname': 'tokens'})])
tokens_df.groupby('tokens')['sales'].sum().reset_index()
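A runnable version of this approach. One caveat: str.split(expand=True) assumes every name has exactly two space-separated tokens; a three-word name like "Mary Anne Jane" would produce a third column and break the two-column assignment:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mike Smith', 'Mike Jones', 'Mary Jane'],
                   'sales': [5, 3, 4]})

# Split each name into first/last columns (assumes exactly two tokens).
df[["fname", "lname"]] = df["name"].str.split(expand=True)

# Stack the two token columns into one, then aggregate.
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns={'fname': 'tokens'}),
                       df[['lname', 'sales']].rename(columns={'lname': 'tokens'})])
result = tokens_df.groupby('tokens')['sales'].sum().reset_index()
print(result)
```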