Using regex to remove unique words found across all rows of a single column from a pandas DataFrame in Python

Question

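I pass roughly 11,000 texts extracted from a CSV to the remove_unique function as a DataFrame. Inside the function I find the unique words and save them in a list named "unique". That list is built from all the unique words found across the entire column.

Using a regular expression, I am trying to remove the unique words from every row of a single pandas DataFrame column. Instead of removing only the unique words as expected, it removes all words and returns an empty "text".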

import re

def remove_unique(text):
    # Gets all the unique words in the entire corpus
    unique = list(set(text.str.findall(r"\w+").sum()))
    pattern = re.compile(r'\b(?:{})\b'.format('|'.join(unique)))
    # Ideally should remove the unique words from the corpus.
    text = text.apply(lambda x: re.sub(pattern, '', x))
    return text

Can someone point out what the problem is?

before
0    card layout broken window resiz unabl error ex...
1    chart lower rang border patch merg recheck...
2    left align text team close c...
3    descript sma...
4    list disappear navig make contain...
Name: description_plus, dtype: object
after
0
1                                                  ...
2
3
4                                                  ...
Name: description_plus, dtype: object

Answer 1

I'm not sure I fully understand. Do you want to check whether a word appears more than once in the entire column?

Perhaps:

import re
from collections import Counter

a_list = list(df["column"].values)  # column to list
string = " ".join(a_list)  # list of rows to one string
words = re.findall(r"\w+", string)  # split into a single list of words

counts = Counter(words)  # count each word once instead of calling words.count() per item
print([item for item in words if counts[item] > 1])  # words that appear multiple times
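As for why the original function blanks every row: `set(...)` over all the words in the column contains every word that occurs anywhere, so the alternation pattern matches every word in every row. If "unique" is meant as "occurs exactly once in the whole column" (an assumption, since the question does not say so explicitly), a minimal sketch could count the words first and drop only those with a count of 1. The function name `remove_single_occurrence_words` and the sample rows are hypothetical, and the sketch filters tokens with a set lookup rather than building one large regex alternation.

```python
import re
from collections import Counter


def remove_single_occurrence_words(rows):
    """Drop words that occur exactly once across all rows.

    Assumption: 'unique' means a word with a corpus-wide count of 1.
    `rows` can be any iterable of strings (a list, or a pandas Series).
    """
    rows = list(rows)
    # Count every word across the whole column.
    counts = Counter(word for row in rows for word in re.findall(r"\w+", row))
    rare = {word for word, n in counts.items() if n == 1}
    # Rebuild each row from the words that are not in the rare set.
    return [
        " ".join(word for word in re.findall(r"\w+", row) if word not in rare)
        for row in rows
    ]


sample = ["card layout broken", "layout error", "card error window"]
cleaned = remove_single_occurrence_words(sample)
print(cleaned)  # ['card layout', 'layout error', 'card error']
```

With a DataFrame, the returned list can be assigned back to the column, for example `df["description_plus"] = remove_single_occurrence_words(df["description_plus"])`. Note that rebuilding rows from `\w+` tokens also drops punctuation and collapses whitespace, which may or may not be wanted.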

Solution 1
0 2020-03-12 16:07:38
