
How to extract a substring from a string in a column that matches another string in a list in Python

I have a DataFrame that looks like this:

     col 1                                     col 2
0       59       538 Walton Avenue, Chester, FY6 7NP
1       62 42 Chesterton Road, Peterborough, FR7 2NY
2      179       3 Wallbridge Street, Essex, 4HG 3HT
3      180     6 Stevenage Avenue, Coventry, 7PY 9NP

And a list similar to:

[Stevenage, Essex, Coventry, Chester]

I followed the solution from "How to check if Pandas rows contain any full string or substring of a list?", which went like this:

city_list = list(cities["name"])
df["col3"] = np.where(df["col2"].str.contains('|'.join(city_list)), df["col2"], '')

I found that some rows in col 2 match strings in the list, but col3 ends up identical to col2. I want col3 to hold the matching value from the list rather than a copy of col2. The result would be:

     col 1                                     col 2     col3
0       59       538 Walton Avenue, Chester, FY6 7NP  Chester 
1       62 42 Chesterton Road, Peterborough, FR7 2NY 
2      179       3 Wallbridge Street, Essex, 4HG 3HT    Essex
3      180     6 Stevenage Avenue, Coventry, 7PY 9NP Coventry

I have tried:

pat = "|".join(cities.name)
df.insert(0, "name", df["col2"].str.extract('(' + pat + ')', expand = False))

But this returned an error saying it got 456 inputs when expecting 1.

Also:

df["col2"] = df["col2"].apply(lambda x: difflib.get_close_matches(x, cities["name"])[0])
df.merge(cities)

But this came back with the error "list index out of range".

Is there any way to do this? df1 has around 160,000 entries, and the addresses in col2 come from different countries, so there is no standard way they are written. The city list has around 170,000 entries.

Thank you

You could do as follows:

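import difflib

city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

def get_match(row):
    # Split the address into words; adapt this preprocessing to your data
    words = row["col 2"].replace(",", " ").split()
    for c in city_list:
        # get_close_matches(word, possibilities) returns a list, empty if nothing is close enough.
        # Fuzzy matching also accepts near-misses such as "Chesterton" for "Chester".
        if difflib.get_close_matches(c, words):
            return c
    return ""

df["col 3"] = df.apply(get_match, axis=1)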
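As a side note on difflib, here is a minimal sketch of how `difflib.get_close_matches` behaves when nothing is close enough. It returns an empty list in that case, which is one likely reason for the "list index out of range" error in the question. The helper name `best_match` and the sample `words` list are made up for illustration.

```python
import difflib

def best_match(word, candidates, cutoff=0.6):
    # get_close_matches(word, possibilities, n, cutoff) returns a list,
    # which is empty when no candidate reaches the cutoff
    found = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return found[0] if found else ""

words = ["42", "Chesterton", "Road", "Peterborough"]

print(best_match("Chester", words))              # default cutoff accepts "Chesterton"
print(best_match("Chester", words, cutoff=0.9))  # stricter cutoff rejects it
print(best_match("Essex", words))                # nothing close, so ""
```

Raising `cutoff` makes the match stricter, at the cost of missing misspelled city names.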

Use a helper function like this:

import pandas as pd

df = pd.DataFrame({'col 1': [59, 62, 179, 180],
                   'col 2': ['538 Walton Avenue, Chester, FY6 7NP',
                             '42 Chesterton Road, Peterborough, FR7 2NY',
                             '3 Wallbridge Street, Essex, 4HG 3HT',
                             '6 Stevenage Avenue, Coventry, 7PY 9NP'
                             ]})

def aux_func(x):

    # split by comma and select the interesting part ([1])
    x = x.split(',')
    x = x[1]

    aux_list = ['Stevenage', 'Essex', 'Coventry', 'Chester']
    for v in aux_list:
        if v in x:
            return v
    return ""

df['col 3'] = [aux_func(name) for name in df['col 2']]

Have a look at the str.contains method, which tests whether a pattern matches each element of a Series:

import pandas as pd

df = pd.DataFrame([[59, '538 Walton Avenue, Chester,', 'FY6 7NP'],
                   [62, '42 Chesterton Road, Peterborough', '4HG 3HT'],
                   [179, '3 Wallbridge Street, Essex', '4HG 3HT'],
                   [180, '6 Stevenage Avenue, Coventry', '7PY 9NP']])
city_list = ["Stevenage", "Essex", "Coventry", "Chester"]
for city in city_list:
    df.loc[df[1].str.contains(city), 'match'] = city
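Note that a plain substring test also finds "Chester" inside "Chesterton". A possible variation, sketched below, builds one regular expression with word boundaries and escapes each name with `re.escape` so that characters such as parentheses are matched literally. Taking the last match is an assumption that the city comes after the street name; the column name `col 2` and the sample data are taken from the question.

```python
import re
import pandas as pd

df = pd.DataFrame({"col 2": ["538 Walton Avenue, Chester, FY6 7NP",
                             "42 Chesterton Road, Peterborough, FR7 2NY",
                             "3 Wallbridge Street, Essex, 4HG 3HT",
                             "6 Stevenage Avenue, Coventry, 7PY 9NP"]})
city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

# Escape every name so regex metacharacters are treated as plain text;
# longest names go first so they are tried before shorter alternatives
escaped = [re.escape(c) for c in sorted(city_list, key=len, reverse=True)]
pattern = r"\b(?:" + "|".join(escaped) + r")\b"

# findall returns every whole-word city name found in each address
matches = df["col 2"].str.findall(pattern)

# Assumption: the city follows the street, so keep the last match
df["col 3"] = matches.apply(lambda found: found[-1] if found else "")
print(df)
```

With a city list of around 170,000 names the alternation becomes very large, so this is probably better suited to shorter lists.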

Thanks for your last reply. Try this:

def aux_func(address):
    aux_list = ['Stevenage', 'Essex', 'Coventry', 'Chester']

    # split the address on commas
    address = address.split(',')

    # avoid matches with the first part of the address (the street)
    if len(address)>1:
        # drop the first element of the address
        address = address[1:]

    for v in aux_list:
        for chunk in address:
            if v in chunk:
                return v

    return ""

df['col 3'] = [aux_func(address) for address in df['col 2']]
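With around 170,000 cities, looping over the whole list for every address may be slow. A rough sketch of an alternative is to look each comma-separated chunk up in a dictionary instead. It assumes the city appears as a complete chunk between commas, which may not hold for every country's address format. The names `city_lookup` and `find_city` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"col 2": ["538 Walton Avenue, Chester, FY6 7NP",
                             "42 Chesterton Road, Peterborough, FR7 2NY",
                             "3 Wallbridge Street, Essex, 4HG 3HT",
                             "6 Stevenage Avenue, Coventry, 7PY 9NP"]})
city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

# Map a case-folded name to its original spelling for constant-time lookups
city_lookup = {c.casefold(): c for c in city_list}

def find_city(address):
    # Split on commas and trim surrounding whitespace from each chunk
    chunks = [chunk.strip() for chunk in address.split(",")]
    # Skip the first chunk (assumed to be the street) and test the rest
    for chunk in chunks[1:]:
        city = city_lookup.get(chunk.casefold())
        if city is not None:
            return city
    return ""

df["col 3"] = df["col 2"].map(find_city)
print(df)
```

Each address then costs a handful of dictionary lookups rather than a pass over the full city list, but a city embedded in a longer chunk will not be found.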
