Quickest way to find a partial string match between two pandas DataFrames

I have two location-based pandas DataFrames.

df1: has a column containing a full address, such as "Avon Road, Ealing, London, UK". The address format varies.

df1.address[0] --> "Avon Road, Ealing, London, UK"

df2: contains only UK cities, such as "London".

df2.city[5] --> "London"

Given the full address, I want to find the city for each row of the first DataFrame and store it in a new column of that DataFrame, like this:

df1.city[0] --> "London"

Approach 1: For each city in df2, check whether any df1 address contains that city, and store the matching df1 indexes together with the df2 city in a list.

I am not certain how I would go about doing this, but I assume I would use this code to find partial string matches and locate the indexes:

df1['address'].str.contains("London",na=False).index.values  
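
One caveat about the line above: `str.contains` returns a boolean Series with the same index as `df1`, so calling `.index.values` on it yields every index label, not only the matching ones. A minimal sketch of Approach 1 that filters by the mask instead is shown here. The sample frames are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df1 = pd.DataFrame({"address": [
    "Avon Road, Ealing, London, UK",
    "12 High Street, Leeds, UK",
    None,
]})
df2 = pd.DataFrame({"city": ["Leeds", "London"]})

# str.contains returns a boolean Series aligned with df1's index.
mask = df1["address"].str.contains("London", regex=False, na=False)
london_rows = df1[mask].index.values  # only the labels where the mask is True

# Approach 1: one vectorized pass over df1 per city in df2.
df1["city"] = None
for city in df2["city"]:
    hit = df1["address"].str.contains(city, regex=False, na=False)
    # Only fill rows that have not been assigned a city yet.
    df1.loc[hit & df1["city"].isna(), "city"] = city
```

This scans all of df1 once per city, so the work grows with the number of cities times the number of addresses. Plain substring matching can also match a city name that appears inside a longer word.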

Approach 2: For each df1 address, check whether any of its words matches a city in df2, and store the matching df2 value in a list.

I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.

Apologies if this is a stupid or easy problem. Any direction to the most efficient code would be helpful :)

Approach 2 is indeed a good start. However, looking cities up in a hash-based container such as a Python set, rather than a list, should be much faster. Here is some example code:

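import re

# A set gives average O(1) membership tests for city names.
cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Split the address into alphanumeric words. Caveat: this breaks up
    # cities whose names contain '-' or spaces, so they will not match.
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations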
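
The word-by-word loop above cannot match city names that contain spaces or hyphens. One possible alternative, which is not part of the original answer, is to build a single regular expression from the city names and let `Series.str.extract` return the first match in each address. This is a sketch under assumptions: the sample frames are hypothetical, and the speed with a very large city list has not been measured here.

```python
import re
import pandas as pd

# Hypothetical sample data for illustration only.
df1 = pd.DataFrame({"address": [
    "Avon Road, Ealing, London, UK",
    "1 Station Road, Stoke-on-Trent, UK",
    "Somewhere without a listed city",
]})
df2 = pd.DataFrame({"city": ["London", "Stoke-on-Trent", "Stoke"]})

# Longest names first so "Stoke-on-Trent" is tried before "Stoke".
cities = sorted(df2["city"].dropna().unique(), key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(c) for c in cities) + r")\b"

# expand=False returns a Series; rows with no match get NaN.
df1["city"] = df1["address"].str.extract(pattern, expand=False)
```

Matching here is case-sensitive, like the loop above. Whether this is faster than the set-based loop depends on the number of cities, so it is worth timing both on a sample of the real data.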
