
How to remove common words from list of lists in Python?

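I have a large number of "groups" of words. If any word from a group appears in both column A and column B, I want to remove that group's words from both columns. How do I loop over all the groups (i.e. over the sublists in the list)?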

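The flawed code below only removes the common words of the last group, not of all three groups (lists). [I first create an indicator for whether a string contains a word from the group, then another indicator for whether both strings contain a word from the group. Only for the A and B pairs in which both contain a word from the group do I remove that group's words.]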

How do I specify the loop correctly?

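Edit: In my proposed code, every loop iteration starts over from the original columns instead of working on the columns from which the previous group's words have already been removed.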
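The restart problem described in this edit can be reproduced without pandas. The following is a minimal sketch: `remove_words` and the sample data are hypothetical stand-ins for illustration, not the code from the question.

```python
import re

def remove_words(text, words):
    # Remove each word as a whole word, then collapse leftover whitespace.
    for word in words:
        text = re.sub(r'\b' + re.escape(word) + r'\b', ' ', text)
    return ' '.join(text.split())

groups = [['red', 'blue'], ['first', 'second']]
original = 'red first chair'

# Flawed: every pass reads `original`, so each pass overwrites the previous result.
for group in groups:
    flawed = remove_words(original, group)

# Corrected: every pass reads the result of the previous pass.
fixed = original
for group in groups:
    fixed = remove_words(fixed, group)

print(flawed)  # red chair
print(fixed)   # chair
```

Only the removal for the last group survives in `flawed`, which matches the symptom described in the question.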

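The suggested solutions are more elegant and tidy, but they also remove the words when they are part of another word (for example, the word 'foo' is correctly removed from 'foo hello', but it is also incorrectly removed from 'foobar').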
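The difference between substring removal and whole-word removal can be seen in a short sketch. The variable names are illustrative only; `\b` is the regex word-boundary anchor already used in the question's code.

```python
import re

word = 'foo'
samples = ['foo hello', 'foobar']

# str.replace works on substrings, so it also cuts 'foo' out of 'foobar'.
substring_removed = [s.replace(word, '').strip() for s in samples]

# A pattern wrapped in \b only matches 'foo' when it stands alone as a word.
pattern = re.compile(r'\b' + re.escape(word) + r'\b')
whole_word_removed = [pattern.sub('', s).strip() for s in samples]

print(substring_removed)   # ['hello', 'bar']
print(whole_word_removed)  # ['hello', 'foobar']
```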


# Input data:

import re
import pandas as pd

data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumnwind'],
        'B': ['defg autumn times fourth table', 'not red skies second garnet', 'first blue chair winter']
}
df = pd.DataFrame(data, columns=['A', 'B'])

                            A                               B
0  summer time third grey abc  defg autumn times fourth table
1      yellow sky hello table     not red skies second garnet
2           fourth autumnwind         first blue chair winter
# Groups of words to be removed:

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']

stuff = [colors, seasons, numbers]



# Code below only removes the last list in stuff (numbers):

def fA(S,y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y


def fB(T,y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            y = 1
    return y



def fARemove(S):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S=re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S



def fBRemove(T):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            T=re.sub(r'\b{}\b'.format(re.escape(word)), ' ', T)
    return T

for listed in stuff:

    df['A_Ind'] = 0
    df['B_Ind'] = 0

    df['A_Ind'] = df.apply(lambda x: fA(x.A, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: fB(x.B, x.B_Ind), axis=1)

    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1

    df['A_new'] = df['A']
    df['B_new'] = df['B']

    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fARemove(x.A), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fBRemove(x.B), axis=1)


    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']
    
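    # regex=True is needed explicitly in recent pandas versions, where str.replace defaults to literal matching
    df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['A_new'] = df['A_new'].str.strip()
    df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['B_new'] = df['B_new'].str.strip()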

The expected output is:

         A_new              B_new
0     grey abc         defg table
1  hello table  not second garnet
2   autumnwind  blue chair winter

This requires Python 3.7+ to work (otherwise more code is needed). Judging by your keyword lists, I assume you are trying to prioritize multi-word matches.

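dummy = 0  # running counter that keeps the dictionary keys unique

def splitter(text):
    # Split text into ((id, fragment), group_index) pairs;
    # group_index is -1 for text that belongs to no keyword group.
    global dummy
    text = text.strip()
    if not text:
        return []
    for n, s in enumerate(stuff):
        for keyword in s:
            p = text.find(keyword)  # plain substring search, not a whole-word match
            if p >= 0:
                dummy += 1  # incremented so that repeated fragments do not collapse into one key
                key = (dummy, keyword)
                return splitter(text[:p]) + [(key, n)] + splitter(text[p + len(keyword):])
    dummy += 1
    return [((dummy, text), -1)]

def remover(row):
    # dicts keep insertion order in Python 3.7+, so the fragments stay in sentence order
    A = dict(splitter(row['A']))
    B = dict(splitter(row['B']))
    s = set(A.values()).intersection(B.values())  # groups found in both columns
    return [' '.join(k[1] for k, v in A.items() if v < 0 or v not in s),
            ' '.join(k[1] for k, v in B.items() if v < 0 or v not in s)]

pd.concat([df, pd.DataFrame(df.apply(remover, axis=1).to_list(), columns=['newA', 'newB'])], axis=1)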
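One possible way to get the multi-word priority mentioned in this answer while still matching whole words only is a regex alternation sorted longest-first. This is a sketch under that assumption and not part of the answer itself; `build_pattern` is a hypothetical helper name.

```python
import re

def build_pattern(words):
    # Longest keywords first, so 'summer times' is tried before 'summer'.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in ordered) + r')\b')

seasons = ['summer', 'summer time', 'summer times']

# Alternation in the given order: 'summer' wins and 'times' is left behind.
naive = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in seasons) + r')\b')
longest_first = build_pattern(seasons)

text = 'summer times ahead'
naive_result = ' '.join(naive.sub(' ', text).split())
longest_result = ' '.join(longest_first.sub(' ', text).split())

print(naive_result)    # times ahead
print(longest_result)  # ahead
```

Python's `re` module tries alternatives from left to right, which is why the ordering of the keywords decides which one matches.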

import re

flatten_list = lambda l: [item for subl in l for item in subl]
def remove_recursive(s, l):
    while len(l) > 0:
        s = s.replace(l[0], '')
        l = l[1:]

    return re.sub(r'\ +', ' ', s).strip()


df['A_new'] = df.apply(lambda x: remove_recursive(x.A, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)
df['B_new'] = df.apply(lambda x: remove_recursive(x.B, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)

df.head()

#            A_new              B_new
# 0  time grey abc         defg table
# 1    hello table  not second garnet
# 2           wind         blue chair

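This is similar to the code in the comments: it uses a recursive lambda to match the words, and a flattened list to collect the words of the lists that match in both columns.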

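Below is the code from the original question, which uses the regular expression r'\b{}\b', corrected so that each loop iteration works on the latest strings rather than the original ones.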

# Groups of words to be removed:

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']

stuff = [colors, seasons, numbers]


df['A_new'] = df['A']
df['B_new'] = df['B']


def f_indicator(S,y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y


def fRemove(S):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S=re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S


for listed in stuff:

    df['A_Ind'] = 0
    df['B_Ind'] = 0

    df['A_Ind'] = df.apply(lambda x: f_indicator(x.A_new, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: f_indicator(x.B_new, x.B_Ind), axis=1)

    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1



    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fRemove(x.A_new), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fRemove(x.B_new), axis=1)


    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']

    
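    # regex=True is needed explicitly in recent pandas versions, where str.replace defaults to literal matching
    df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['A_new'] = df['A_new'].str.strip()
    df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['B_new'] = df['B_new'].str.strip()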

del df['A']
del df['B']
print(df)

Output:

         A_new              B_new
0     grey abc         defg table
1  hello table  not second garnet
2   autumnwind  blue chair winter
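The same logic can be condensed into a single function that works on one pair of strings. The version below is only a sketch: `compile_group` and `strip_shared_groups` are hypothetical names, the data is copied from the question, and each group list is assumed to be ordered the way the question orders it (multi-word entries before single words).

```python
import re

def compile_group(words):
    # One whole-word pattern per group; earlier words take priority.
    return re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')

def strip_shared_groups(a, b, groups):
    # Remove a group's words from both strings only if both contain one of them.
    for pattern in map(compile_group, groups):
        if pattern.search(a) and pattern.search(b):
            a = pattern.sub(' ', a)  # carry the result forward to the next group
            b = pattern.sub(' ', b)
    return ' '.join(a.split()), ' '.join(b.split())

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue',
          'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time',
           'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]

rows = [('summer time third grey abc', 'defg autumn times fourth table'),
        ('yellow sky hello table', 'not red skies second garnet'),
        ('fourth autumnwind', 'first blue chair winter')]

cleaned = [strip_shared_groups(a, b, stuff) for a, b in rows]
for pair in cleaned:
    print(pair)
```

With a DataFrame, the function could presumably be applied row by row, for example through `df.apply(..., axis=1, result_type='expand')`, to fill `A_new` and `B_new` in one step.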
