Fastest way to iterate over a dataframe column to find matches in strings

Question

Here's a very truncated extract from a large dataframe:

name   age  city
ben    66   NY
rob    45   LON
james  22   LA

I also have numerous strings that each contain different words, but each will contain exactly one of the values in the name column.

For example:

"rob was born in London"
"ben once lived in New York"

For each string I want to iterate over the "name" column to find the name that matches the name in the string and return the age of the person.

So in the first example the desired result is 45 and in the second example the desired result is 66.

I am new to Pandas and am struggling. Can anyone point me in the right direction?

Answer 1

Hope this helps:

List of all strings. This can be part of another dataframe. Just select the column where these values are and convert it to a list.

l = ['rob was born in London', "ben once lived in New York"]
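If the sentences live in a column of another dataframe, the conversion described above could look like the following minimal sketch. The dataframe name sentences_df and the column name 'text' are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical dataframe holding the sentences; the column name 'text' is an assumption.
sentences_df = pd.DataFrame({'text': ['rob was born in London',
                                      'ben once lived in New York']})

# Select the column and convert it to a plain Python list.
l = sentences_df['text'].tolist()
print(l)
```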

The dataframe from your example:

import pandas as pd

df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})

The final dataset, which will hold each string together with the matching age.

age_dat = pd.DataFrame()

The first for-loop loops over the names in your original dataset (df). The second for-loop loops over the list of sentences (l). If a name is found in a sentence, the sentence and the matching age are added to age_dat.

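rows = []
for x in df['name']:
    for z in l:
        if x in z:
            # look up the age that belongs to the matching name
            age = df.loc[df['name'] == x, 'age'].iloc[0]
            rows.append({'string': z, 'age': age})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
age_dat = pd.DataFrame(rows)

print(age_dat)

                       string  age
0  ben once lived in New York   66
1      rob was born in London   45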
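The nested loop above filters the dataframe again for every match, which may become slow on a large dataframe. As a hedged alternative sketch (not part of the original answer), the ages can be copied into a plain dictionary once, so the dataframe is not filtered inside the loop. The helper name find_age is made up for this illustration, and the sketch assumes, as the question states, that each string contains exactly one name. It uses the same substring test as the loop above and still checks every name against every sentence.

```python
import pandas as pd

df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})
l = ['rob was born in London', 'ben once lived in New York']

# Build the name -> age mapping once, outside the loop.
age_by_name = dict(zip(df['name'], df['age']))

def find_age(sentence):
    """Return the age of the first name found in the sentence, or None."""
    for name, age in age_by_name.items():
        if name in sentence:  # plain substring test, same as the loop above
            return age
    return None

# One row per sentence, in the order of the input list.
age_dat = pd.DataFrame({'string': l, 'age': [find_age(z) for z in l]})
print(age_dat)
```

Unlike the nested loop, this keeps the rows in the order of the input sentences and yields a missing age for a sentence in which no name is found.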

Answer 2

Data

import pandas as pd

s = pd.Series(['rob was born in London', "ben once lived in New York"])
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})

Solution

who = s.str.extract('(' + ')|('.join(df.name) + ')').bfill(axis=1)[0]
age_by_name = dict(zip(df.name, df.age))
pd.DataFrame({'text': s, 'age': who.map(age_by_name)})


                         text  age
0      rob was born in London   45
1  ben once lived in New York   66

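Explanation

Use .str.extract to get the name in each string, then map it against the dataframe to get the age. The pattern has one capture group per name, so .str.extract returns one column per name; bfill(axis=1) moves whichever group matched into column 0.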
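One possible variation on the same idea, offered as a sketch rather than as the answer's own method: put all names in a single capture group with word boundaries and escape them with re.escape, so that names containing regex metacharacters, or names that happen to appear inside longer words, do not produce false matches. The sample data is the same as above. Whether word boundaries suit the real data is an assumption, and the lookup via set_index assumes the names are unique.

```python
import re
import pandas as pd

s = pd.Series(['rob was born in London', 'ben once lived in New York'])
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})

# One alternation inside a single capture group, e.g. \b(ben|rob|james)\b
pattern = r'\b(' + '|'.join(re.escape(n) for n in df['name']) + r')\b'

# expand=False returns a Series because the pattern has exactly one group.
who = s.str.extract(pattern, expand=False)

# Map each extracted name to its age through a name-indexed Series
# (assumes names are unique in df).
result = pd.DataFrame({'text': s, 'age': who.map(df.set_index('name')['age'])})
print(result)
```

With a single group there is no need for bfill, because the matched name always lands in the same place.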

2 answers

Answer 1
1 2022-06-22 16:17:16

Answer 2
1 Accepted 2022-06-22 16:36:07
