
How to remove rows based on a column value when some rows' column values are subsets of another row's?

Suppose I have a dataframe df as follows:

index  company  url                     address
0      A .      www.abc.contact.com     16D Bayberry Rd, New Bedford, MA, 02740, USA
1      A .      www.abc.contact.com .   MA, USA
2      A .      www.abc.about.com .     USA
3      B .      www.pqr.com .           New Bedford, MA, USA
4      B.       www.pqr.com/about .     MA, USA

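I want to remove every row from the dataframe whose address is a subset of another row's address when both rows belong to the same company. For example, out of the 5 rows above I want to keep these two: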

index  company  url                     address
0      A .      www.abc.contact.com     16D Bayberry Rd, New Bedford, MA, 02740, USA
3      B .      www.pqr.com .           New Bedford, MA, USA
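To make the "subset" condition concrete, here is a minimal sketch of one possible reading: each address is split on commas into a set of components, and one address counts as a subset of another when all of its components appear in the other. The helper name `address_parts` is hypothetical and not part of the question.

```python
def address_parts(address):
    """Split a comma-separated address into a set of stripped components."""
    return {part.strip() for part in address.split(",") if part.strip()}


full = address_parts("16D Bayberry Rd, New Bedford, MA, 02740, USA")
short = address_parts("MA, USA")

# Every component of the short address also appears in the full one.
print(short.issubset(full))  # True
print(full.issubset(short))  # False
```

Under this reading the order of the components does not matter, only their presence.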

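Maybe it is not an optimal solution, but it works on this small dataframe:

EDIT: added a check on the company name, assuming punctuation has been removed from it.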

import pandas as pd

df = pd.DataFrame({"company": ['A', 'A', 'A', 'B', 'B'],
                   "address": ['16D Bayberry Rd, New Bedford, MA, 02740, USA',
                               'MA, USA',
                               'USA',
                               'New Bedford, MA, USA',
                               'MA, USA']})
# Split each address on ", " and turn it into a set so "issubset" can be used later
addresses = list(df['address'].apply(lambda x: set(x.split(', '))).values)
companies = list(df['company'].values)

rows_to_drop = []  # Index labels of the rows to drop
# Iterate over every (address, company) pair
for i, (address, company) in enumerate(zip(addresses, companies)):
    # All other addresses and companies, excluding row i itself
    rem_addr = addresses[:i] + addresses[(i + 1):]
    rem_comp = companies[:i] + companies[(i + 1):]

    for other_addr, other_comp in zip(rem_addr, rem_comp):
        # If the address is a subset of another address of the same company, mark it for dropping
        if address.issubset(other_addr) and company == other_comp:
            rows_to_drop.append(i)
            break

# The positions in rows_to_drop match the index labels because df uses the default RangeIndex
df = df.drop(rows_to_drop)
print(df)

   company                                       address
0        A  16D Bayberry Rd, New Bedford, MA, 02740, USA
3        B                          New Bedford, MA, USA
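One edge case worth noting: `issubset` is also true for equal sets, so if two rows of the same company had identical addresses, the loop above would drop both of them. The following is a sketch of a variant that compares rows only within the same company and keeps the first copy of an exact duplicate. The function name `drop_subset_addresses` and its parameters are hypothetical, and the comparison is still quadratic per company, so it is meant for small data like the example.

```python
import pandas as pd


def drop_subset_addresses(df, company_col="company", address_col="address"):
    """Drop rows whose address components are contained in another row's
    address for the same company. Exact duplicates keep their first row."""
    parts = [frozenset(p.strip() for p in a.split(",")) for a in df[address_col]]

    # Group row positions by company so only same-company rows are compared.
    by_company = {}
    for pos, company in enumerate(df[company_col]):
        by_company.setdefault(company, []).append(pos)

    keep = [True] * len(df)
    for positions in by_company.values():
        for i in positions:
            for j in positions:
                if i == j:
                    continue
                # "<" on frozensets is a strict subset test; an equal set
                # only removes row i when the other copy appears earlier.
                if parts[i] < parts[j] or (parts[i] == parts[j] and j < i):
                    keep[i] = False
                    break
    return df.loc[keep]


df = pd.DataFrame({"company": ['A', 'A', 'A', 'B', 'B'],
                   "address": ['16D Bayberry Rd, New Bedford, MA, 02740, USA',
                               'MA, USA',
                               'USA',
                               'New Bedford, MA, USA',
                               'MA, USA']})
result = drop_subset_addresses(df)
print(result)
```

On the example data this keeps rows 0 and 3, the same rows as the loop above.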

