Python: Count occurrences of a substring in pandas by row, appending distinct string as column

Initial note: I cannot use many third-party packages, and it's highly likely I won't be able to use any you suggest. Please keep solutions to Pandas, NumPy, or the Python 3.7 built-in libraries. My end goal is a word-bubble-like graph, where word frequency is coded by categoricalInt.

Say I have a pandas data frame like:

index | categoricalInt1 | categoricalInt2 | sanitizedStrings 
0     |    -4           |    -5           |   some lowercase strings
1     |     2           |     4           |   addtnl lowercase strings here
2     |     3           |     3           |   words

Is there any easier way than iterating over every single value in sanitizedStrings to return a structure like

index | categoricalInt1 | categoricalInt2 | sanitizedStrings | some | lowercase | strings | addtnl | here | words
0     |     -4          |    -5           |      ...         |  1   |    1      |   1     |   0    |  0  | 0
1     |      2          |     4           |      ...         |  0   |    1      |   1     |   1    |  1  | 0
2     |      3          |     3           |      ...         |  0   |    0      |   0     |   0    |  0  | 1
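One pandas-only way to build that per-row count layout (a sketch, not the poster's code; the sample frame below is reconstructed from the question) is to split each string, explode to one word per row, and cross-tabulate the words against the original index:

import pandas as pd

# sample frame reconstructed from the question
df = pd.DataFrame({
    'categoricalInt1': [-4, 2, 3],
    'categoricalInt2': [-5, 4, 3],
    'sanitizedStrings': ['some lowercase strings',
                         'addtnl lowercase strings here',
                         'words'],
})

# one row per word, keyed by the original row label
words = df['sanitizedStrings'].str.split().explode()

# count each word per original row, then attach the counts to the frame
result = df.join(pd.crosstab(words.index, words))
print(result)

From there, a groupby on the two categorical columns followed by .sum() over the word columns gives the per-group totals.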

My overall goal is simple: count all substrings by categorical grouping. I've managed to get the strings aggregated together and condensed down into the categorical bins, but I'm struggling to get the counts together.

So far my code looks like:

import string

df['Comments'] = df['Comments'].str.lower()

# strip punctuation, but keep '|' so it can serve as a join/split delimiter
punct = string.punctuation.replace('|', '')
transtab = str.maketrans(dict.fromkeys(punct, ''))

df['Comments'] = '|'.join(df['Comments'].tolist()).translate(transtab).split('|')

# drop common/stop words; commonStrings is defined elsewhere
pattern = '|'.join([r'\b{}\b'.format(w) for w in commonStrings])
# regex=True keeps this a regex replacement on newer pandas, where the default changed to literal
df['SanitizedStrings'] = df['Comments'].str.replace(pattern, '', regex=True)
df = df.drop(columns='Comments')
# end splitting bad values out of strings

# group the dataframe on like categories
groupedComments = df.groupby(['categoricalInt1', 'categoricalInt2'], as_index = False, sort=False).agg(' '.join)

print(groupedComments)

Before realizing I needed to bin these strings by categoricalInt, I was using the following expression: groupedComments['SanitizedStrings'].str.split(expand=True).stack().value_counts()

If I could get that to return counts by row instead of stacking everything into a single column, I bet we'd be pretty close.
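For reference, one way to make that same expression count per row instead of collapsing into one stacked total; a sketch reusing the groupedComments frame built above (the perRowCounts name is introduced here just for illustration):

# count words within each row of the grouped frame, then pivot the words into columns
perRowCounts = (groupedComments['SanitizedStrings']
                .str.split(expand=True)
                .stack()
                .groupby(level=0)
                .value_counts()
                .unstack(fill_value=0))

# re-attach the categorical columns that identify each group
perRowCounts = groupedComments[['categoricalInt1', 'categoricalInt2']].join(perRowCounts)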

This isn't a particularly elegant solution, and I am not sure how much data you are working with, but you could use an apply function to add the additional columns.

After reading your comment, it seems you are also looking at grouping by your categorical columns.

This can be achieved by also tweaking the column you create.

import pandas as pd
import numpy as np

# make test data
string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
categorical_int = [-4, 3, 2]
df = pd.DataFrame(zip(categorical_int, string_list),
                  columns=['categoricalInt1', 'sanitizedStrings'])

# create the apply function: one new column per distinct word in the row
def add_cols(row):
    col_dict = {}
    new_cols = row['sanitizedStrings'].split(' ')
    for col in new_cols:
        if col not in col_dict.keys():
            col_dict[col] = 1
        else:
            col_dict[col] += 1
    for key, value in col_dict.items():
        # prefix with _ so we can pick these columns out later
        row['_' + key] = value
    return row

# run the apply function row-wise and fill missing word columns with 0
final_df = df.apply(add_cols, axis=1).fillna(0)
final_df

   _addtnl  _here  _lowercase  _some  _string  _strings  _words  \
0      0.0    0.0         1.0    1.0      1.0       0.0     0.0   
1      1.0    1.0         1.0    0.0      0.0       1.0     0.0   
2      0.0    0.0         0.0    0.0      0.0       0.0     1.0   

   categoricalInt1               sanitizedStrings  
0               -4          some lowercase string  
1                3  addtnl lowercase strings here  
2                2                          words 

# add the group by and sum over the word-count columns
word_cols = [col for col in final_df.columns if col.startswith('_')]
final_group = final_df.groupby(['categoricalInt1'])[word_cols].sum()
final_group.columns = [col.replace('_', '') for col in final_group.columns]
final_group


                 addtnl  here  lowercase  some  string  strings  words
categoricalInt1                                                       
-4                  0.0   0.0        1.0   1.0     1.0      0.0    0.0
 2                  0.0   0.0        0.0   0.0     0.0      0.0    1.0
 3                  1.0   1.0        1.0   0.0     0.0      1.0    0.0
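The same pattern extends to grouping on both categorical columns from the question; a sketch, assuming final_df also carried a categoricalInt2 column like the original frame:

# group on both categorical columns and sum the word-count columns
word_cols = [col for col in final_df.columns if col.startswith('_')]
final_group = final_df.groupby(['categoricalInt1', 'categoricalInt2'])[word_cols].sum()
final_group.columns = [col.replace('_', '') for col in final_group.columns]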
