Remove unique words found in all rows of a single column of a pandas DataFrame in Python using regex

I am passing about 11,000 texts, extracted from a CSV file into a DataFrame, to the remove_unique function. Inside the function I find the unique words and save them in a list named "unique". That list is built from all the unique words found in the entire column.

Using regex, I am trying to remove the unique words from each row (single column) of the pandas DataFrame, but they are not removed as expected. Instead, all the words are removed and an empty "text" is returned.

import re

def remove_unique(text):
    # Gets all the unique words in the entire corpus
    unique = list(set(text.str.findall(r"\w+").sum()))
    pattern = re.compile(r'\b(?:{})\b'.format('|'.join(unique)))
    # Ideally should remove the unique words from the corpus.
    text = text.apply(lambda x: re.sub(pattern, '', x))
    return text

Can somebody point out what the issue is?

before
0    card layout broken window resiz unabl error ex...
1    chart lower rang border patch merg recheck...
2    left align text team close c...
3    descript sma...
4    list disappear navig make contain...
Name: description_plus, dtype: object
after
0
1                                                  ...
2
3
4                                                  ...
Name: description_plus, dtype: object
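
The empty result is consistent with how "unique" is built: it is the set of every distinct word in the column, so the alternation pattern matches every word in every row. A minimal sketch of that effect, using plain lists instead of a DataFrame (the sample rows are made up for illustration):

```python
import re

rows = ["card layout broken", "card layout error"]  # made-up sample rows

# Same construction as in remove_unique: the set of every word in the corpus.
unique = sorted({word for row in rows for word in re.findall(r"\w+", row)})
pattern = re.compile(r"\b(?:{})\b".format("|".join(unique)))

# Every word in every row is a member of `unique`, so every word matches
# and only the whitespace between words is left behind.
cleaned = [pattern.sub("", row) for row in rows]
print([c.strip() for c in cleaned])
```

After stripping the leftover spaces, each row is an empty string, which matches the blank rows in the output above.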

Not sure I totally understand. Are you trying to see if a word appears multiple times throughout the entire column?

Maybe:

import re
from collections import Counter

a_list = df["column"].tolist()  # column to list of strings
string = " ".join(a_list)  # join all rows into one string
words = re.findall(r"\w+", string)  # split into a single list of words

counts = Counter(words)  # count each word once instead of calling words.count() per item
print([item for item in words if counts[item] > 1])  # words that appear multiple times
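
If the goal is to drop words that occur only once in the whole column (an assumption, since the question does not say exactly what "unique" should mean), a rough sketch could count the words first and then filter each row. The helper name remove_single_occurrence_words and the sample rows are hypothetical, and the function works on a plain list of strings rather than a Series:

```python
import re
from collections import Counter


def remove_single_occurrence_words(rows):
    """Drop words that occur exactly once across all rows (hypothetical helper)."""
    tokenized = [re.findall(r"\w+", row) for row in rows]
    counts = Counter(word for words in tokenized for word in words)
    # Keep only the words seen more than once in the whole corpus.
    return [" ".join(w for w in words if counts[w] > 1) for words in tokenized]


rows = ["card layout broken", "card layout error"]  # made-up sample rows
print(remove_single_occurrence_words(rows))  # ['card layout', 'card layout']
```

With a DataFrame, something like df["description_plus"] = remove_single_occurrence_words(df["description_plus"].tolist()) should work, assuming every cell is a string. Note that this sketch rejoins words with single spaces and discards any non-word characters, which may or may not be acceptable for the real data.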
