
Fastest Way to Iterate Over Dataframe Column to Find Match in Strings

Here's a very truncated extract from a large dataframe:

name   age  city
ben    66   NY
rob    45   LON
james  22   LA

I also have numerous strings. Each contains different words, but each contains exactly one of the values in the name column.

For example:

  1. "rob was born in London"
  2. "ben once lived in New York"

For each string I want to iterate over the "name" column to find the name that appears in the string and return that person's age.

So in the first example the desired result is 45, and in the second example it is 66.

I am new to Pandas and am struggling. Can anyone point me in the right direction?

Hope this helps:

List of all strings. This can be part of another dataframe: just select the column that holds these values and convert it to a list.

l = ['rob was born in London', "ben once lived in New York"]

The dataframe from your example

import pandas as pd

df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})

Final dataset that will hold each string and its matching age.

age_dat = pd.DataFrame()

The first for-loop iterates over the names in your original dataset (df). The second for-loop iterates over the list of sentences (l). If a name is found in a sentence, that sentence and the person's age are appended to age_dat.

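for x in list(df.name):
    for z in l:
        if x in z:
            # One-row frame holding the sentence and the matching age
            dat = pd.DataFrame()
            dat['string'] = [z]
            dat['age'] = [df[df['name'] == x].age.tolist()[0]]
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            age_dat = pd.concat([age_dat, dat])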

print(age_dat)



                          string  age
0  ben once lived in New York   66
0      rob was born in London   45
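Growing a dataframe inside a loop copies it on every iteration, which gets slow on large inputs. Below is a minimal sketch of the same nested loop that collects plain dicts and builds the dataframe once at the end. The names `sentences`, `age_by_name`, and `rows` are mine rather than from the answer above, and the sketch assumes the names in `df` are unique and that each sentence contains exactly one of them, as the question states.

```python
import pandas as pd

sentences = ['rob was born in London', 'ben once lived in New York']
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})

# Look up ages through a dict instead of filtering df on every match.
# Assumption: names are unique, so no entry overwrites another.
age_by_name = dict(zip(df['name'], df['age']))

rows = []
for z in sentences:
    for x, age in age_by_name.items():
        if x in z:  # plain substring test, same as the loop above
            rows.append({'string': z, 'age': age})
            break   # assumption: one name per sentence, so stop at the first hit

# Build the result once rather than concatenating per match.
age_dat = pd.DataFrame(rows)
print(age_dat)
```

The rows come out in sentence order here, rather than in the order of the name column.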

Data

import pandas as pd

s = pd.Series(['rob was born in London', "ben once lived in New York"])
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})

Solution

who = s.str.extract('(' + ')|('.join(df.name) + ')').bfill(axis=1)[0]
age_by_name = dict(zip(df.name, df.age))
pd.DataFrame({'text': s, 'age': who.map(age_by_name)})


                         text  age
0      rob was born in London   45
1  ben once lived in New York   66

Explanation

Use .str.extract to pull the name out of each string, then map that name to its age using the dataframe.
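One caveat with a plain alternation is that a short name can match inside a longer word (for example `rob` inside `robin`), and a name containing regex metacharacters would change the pattern. Here is a hedged variant, assuming names should match only as whole words: escape each name, wrap the alternation in a single capture group with `\b` word boundaries, and map the result to ages. The variable names `pattern` and `result` are illustrative, not from the answer above.

```python
import re
import pandas as pd

s = pd.Series(['rob was born in London', 'ben once lived in New York'])
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                   'age': [66, 45, 22],
                   'city': ['NY', 'LON', 'LA']})

# A single capture group means extract yields one column, so no bfill is needed.
# re.escape guards against names that contain regex metacharacters.
pattern = r'\b(' + '|'.join(re.escape(n) for n in df['name']) + r')\b'
who = s.str.extract(pattern, expand=False)

# Series.map accepts a Series as the lookup table (index = name, value = age).
result = pd.DataFrame({'text': s, 'age': who.map(df.set_index('name')['age'])})
print(result)
```

Strings with no matching name come out as NaN in the `age` column, which makes unmatched rows easy to spot.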

