How to remove common words from list of lists in Python?
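I have a large number of word "groups". If any word from a group appears in both column A and column B, I want to remove that group's words from both columns. How do I loop over all the groups (i.e., iterate over the sublists in a list)?
The flawed code below only removes the common words from the last group, not from all three groups (lists). [I first create an indicator for whether a string contains a word from the group, then another indicator for whether both strings contain a word from the group. Only for A/B pairs where both contain a word from the group do I remove that group's words.]
How do I specify the loop correctly?
EDIT: In my proposed code, each loop iteration starts over from the original columns instead of continuing with the columns from which the previous group's words have already been removed.
The suggested solutions are more elegant and tidy, but they also remove the words when they are part of another word (e.g., the word 'foo' is correctly removed from 'foo hello', but it is also incorrectly removed from 'foobar').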
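The difference between the two behaviours can be seen in a minimal sketch. The helper name `remove_word` is hypothetical and only serves as an illustration; it relies on the `\b` word-boundary anchor of the standard `re` module, the same anchor used in the code further down.

```python
import re

def remove_word(text, word):
    """Remove `word` only where it stands alone, not inside a longer word."""
    pattern = r'\b' + re.escape(word) + r'\b'
    cleaned = re.sub(pattern, ' ', text)
    # Collapse the whitespace left behind by the removal.
    return re.sub(r'\s+', ' ', cleaned).strip()

print(remove_word('foo hello', 'foo'))      # hello
print(remove_word('foobar hello', 'foo'))   # foobar hello
print('foobar hello'.replace('foo', ''))    # bar hello (plain substring removal)
```

A plain `str.replace` or `in` test works on substrings, which is why 'foobar' is affected; the word-boundary pattern leaves it alone.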
import pandas as pd

# Input data:
data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumnwind'],
        'B': ['defg autumn times fourth table', 'not red skies second garnet', 'first blue chair winter']
        }
df = pd.DataFrame(data, columns=['A', 'B'])
                            A                               B
0  summer time third grey abc  defg autumn times fourth table
1      yellow sky hello table     not red skies second garnet
2           fourth autumnwind         first blue chair winter
# Groups of words to be removed:
colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]
# Code below only removes the last list in stuff (numbers):
import re

def fA(S, y):
    # Return 1 if S contains any word of the current group as a whole word.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y

def fB(T, y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            y = 1
    return y

def fARemove(S):
    # Replace every whole-word occurrence of the current group's words with a space.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S

def fBRemove(T):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            T = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', T)
    return T

for listed in stuff:
    df['A_Ind'] = 0
    df['B_Ind'] = 0
    df['A_Ind'] = df.apply(lambda x: fA(x.A, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: fB(x.B, x.B_Ind), axis=1)
    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1
    # A_new and B_new are reset from the original columns on every pass.
    df['A_new'] = df['A']
    df['B_new'] = df['B']
    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fARemove(x.A), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fBRemove(x.B), axis=1)
    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']

# Collapse repeated whitespace; regex=True is required in pandas 2.0+.
df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['A_new'] = df['A_new'].str.strip()
df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['B_new'] = df['B_new'].str.strip()
The expected output is:
         A_new              B_new
0     grey abc         defg table
1  hello table  not second garnet
2   autumnwind  blue chair winter
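This requires Python 3.7+ to work, because it relies on dicts preserving insertion order (otherwise more code is needed). Judging by your keyword lists, I assume you are trying to prioritize multi-word matches.
dummy = 0

def splitter(text):
    # Split text into ((dummy, fragment), group_index) pairs; -1 marks non-keyword text.
    # Note: dummy is never changed here, so identical fragments within one string share a key.
    global dummy
    text = text.strip()
    if not text:
        return []
    for n, s in enumerate(stuff):
        for keyword in s:
            p = text.find(keyword)
            if p >= 0:
                return splitter(text[:p]) + [((dummy, keyword), n)] + splitter(text[p + len(keyword):])
    else:
        # No keyword of any group was found in this fragment.
        return [((dummy, text), -1)]

def remover(row):
    A = dict(splitter(row['A']))
    B = dict(splitter(row['B']))
    # Group indices that occur in both columns.
    s = set(A.values()).intersection(set(B.values()))
    return [' '.join([k[1] for k, v in A.items() if v < 0 or v not in s]),
            ' '.join([k[1] for k, v in B.items() if v < 0 or v not in s])]

pd.concat([df, pd.DataFrame(df.apply(remover, axis=1).to_list(), columns=['newA', 'newB'])], axis=1)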
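One way to give multi-word phrases priority while still respecting word boundaries is to sort each group by length and compile it into a single alternation pattern. The following is only a sketch under that assumption; `build_pattern` is a hypothetical helper name and the short `colors` list is made up for the demonstration.

```python
import re

def build_pattern(keywords):
    """Compile one regex for a group, trying longer phrases first."""
    # Longest keywords first so that 'red skies' wins over 'red'.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = '|'.join(re.escape(k) for k in ordered)
    return re.compile(r'\b(?:' + alternation + r')\b')

colors = ['red', 'red skies', 'yellow', 'yellow sky']
pattern = build_pattern(colors)

print(pattern.findall('not red skies second garnet'))  # ['red skies']
print(pattern.findall('redwood table'))                # []
```

Because regex alternation picks the first alternative that matches at a position, ordering by length means the input lists do not have to be pre-sorted by hand.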
import re

flatten_list = lambda l: [item for subl in l for item in subl]

def remove_recursive(s, l):
    while len(l) > 0:
        s = s.replace(l[0], '')
        l = l[1:]
    return re.sub(r'\ +', ' ', s).strip()

df['A_new'] = df.apply(lambda x: remove_recursive(x.A, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)
df['B_new'] = df.apply(lambda x: remove_recursive(x.B, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)
df.head()
#            A_new              B_new
# 0  time grey abc         defg table
# 1    hello table  not second garnet
# 2           wind         blue chair
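This is similar to the code in the comments. It uses a helper, remove_recursive, to strip the words, and a flattened list to gather the words from every group that matches in both columns. Note that it matches substrings, so 'autumn' is also removed from 'autumnwind'.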
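The "which groups match in both columns" step can also be written with whole-word matching. This is a sketch, not the answer's own code: `contains_any` and `words_to_remove` are hypothetical names, and the shortened `groups` list is invented for the example.

```python
import re
from itertools import chain

def contains_any(text, words):
    """True if any of `words` occurs in `text` as a whole word or phrase."""
    return any(re.search(r'\b' + re.escape(w) + r'\b', text) for w in words)

def words_to_remove(a, b, groups):
    """Flatten the groups that have a match in both strings into one list."""
    shared = [g for g in groups if contains_any(a, g) and contains_any(b, g)]
    return list(chain.from_iterable(shared))

groups = [['red', 'blue'], ['summer', 'autumn'], ['first', 'fourth']]
print(words_to_remove('fourth autumnwind', 'first blue chair winter', groups))
# ['first', 'fourth']
```

Here 'autumnwind' no longer counts as a match for 'autumn', so only the numbers group is selected for that pair.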
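Below is the code from the original question, which uses the regular expression r'\b{}\b', corrected so that each loop iteration works on the latest strings rather than on the original ones.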
import re

# Groups of words to be removed:
colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]

# Initialise the working columns once, outside the loop.
df['A_new'] = df['A']
df['B_new'] = df['B']

def f_indicator(S, y):
    # Return 1 if S contains any word of the current group as a whole word.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y

def fRemove(S):
    # Replace every whole-word occurrence of the current group's words with a space.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S

for listed in stuff:
    df['A_Ind'] = 0
    df['B_Ind'] = 0
    # Test and update the *_new columns so each group sees the previous removals.
    df['A_Ind'] = df.apply(lambda x: f_indicator(x.A_new, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: f_indicator(x.B_new, x.B_Ind), axis=1)
    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1
    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fRemove(x.A_new), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fRemove(x.B_new), axis=1)
    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']

# Collapse repeated whitespace; regex=True is required in pandas 2.0+.
df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['A_new'] = df['A_new'].str.strip()
df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['B_new'] = df['B_new'].str.strip()
del df['A']
del df['B']
print(df)
Output:
         A_new              B_new
0     grey abc         defg table
1  hello table  not second garnet
2   autumnwind  blue chair winter
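The same idea, carrying the latest strings forward from one group to the next, can be sketched without the global loop variable or the temporary indicator columns. This is an alternative sketch, not the code from the answers: `remove_shared`, `_pattern` and `_tidy` are hypothetical names, and it works on a plain pair of strings so it could be called per row (for example from `df.apply`).

```python
import re

def _pattern(words):
    # Longest first so multi-word phrases are tried before their parts.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in ordered) + r')\b')

def _tidy(text):
    # Collapse the whitespace left behind by the removals.
    return re.sub(r'\s+', ' ', text).strip()

def remove_shared(a, b, groups):
    """Remove a group's words from both strings when both contain one of them."""
    for words in groups:
        pat = _pattern(words)
        if pat.search(a) and pat.search(b):
            # Reassign a and b so the next group sees the updated strings.
            a = pat.sub(' ', a)
            b = pat.sub(' ', b)
    return _tidy(a), _tidy(b)

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]

print(remove_shared('summer time third grey abc', 'defg autumn times fourth table', stuff))
# ('grey abc', 'defg table')
print(remove_shared('fourth autumnwind', 'first blue chair winter', stuff))
# ('autumnwind', 'blue chair winter')
```

Compiling one pattern per group also avoids running a separate search and substitution for every single word.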