
Create list based on column value and use that list to extract words from string column in df without overwriting row value with for loop

OK, I admit it, I'm stuck. I hope someone can help me figure this out; I'll explain as best I can. I have two dataframes. One has a string column and a municipality column; the other has municipalities and streets. For each row I want to build a street list for that row's municipality, so that only streets in that municipality are extracted from the string column. The code I have now kind of works, but it uses the streets of all municipalities for every row, so it extracts streets from other municipalities and adds them to the wrong rows. I hope the code examples below make my question clearer.

Create dataframes:

import pandas as pd
import re

# Sample dataframe with the municipality and string column
data1 = {'municipality': ['Urk','Utrecht','Almere','Utrecht','Huizen'],
        'text': ["I'm going to Plantage, Pollux and Oostvaardersdiep","Tomorrow I'm going to Hoog Catharijne", 
                 "I'm not going to the Balijelaan","I'm not going to Socrateshof today",
                 "Next week I'll be going to Socrateshof"]}

df = pd.DataFrame(data1, columns = ['municipality','text'])
print(df)

Output:

  municipality                                               text
0          Urk  I'm going to Plantage, Pollux and Oostvaarders...
1      Utrecht              Tomorrow I'm going to Hoog Catharijne
2       Almere                    I'm not going to the Balijelaan
3      Utrecht                 I'm not going to Socrateshof today
4       Huizen             Next week I'll be going to Socrateshof

# Sample dataframe with the municipality and street
data2 = {'municipality': ['Urk','Urk','Utrecht','Almere','Almere','Huizen'],
        'street_name': ['Plantage','Pollux','Balijelaan','Oostvaardersdiep','Catharijne','Socrateshof']}
df2 = pd.DataFrame(data2, columns = ['municipality','street_name'])
print(df2)

Output:

  municipality       street_name
0          Urk          Plantage
1          Urk            Pollux
2      Utrecht        Balijelaan
3       Almere  Oostvaardersdiep
4       Almere        Catharijne
5       Huizen       Socrateshof

Run the function below:

# Function
street = []
def extract_street(txt):
    mun_list_filter = df['municipality'] # I want the streets for this municipality
    df_bag_filter_mun = df2[df2['municipality'].isin(mun_list_filter)] # Filter second df on the wanted municipality
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities 

# Call function by iterating over rows in string column
for txt in df['text']:
    extract_street(txt)

# Add street list to df
df = df.assign(**{'street_match': street})
df['street_match'] = [', '.join(map(str, l)) for l in df['street_match']]
df

Output:

    municipality text                                                street_match
0   Urk          I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux, Oostvaardersdiep
1   Utrecht      Tomorrow I'm going to Hoog Catharijne               Catharijne
2   Almere       I'm not going to the Balijelaan                     Balijelaan
3   Utrecht      I'm not going to Socrateshof today                  Socrateshof
4   Huizen       Next week I'll be going to Socrateshof              Socrateshof

As you can see, in the first row (municipality 'Urk') the function added the street 'Oostvaardersdiep', even though it should only have been matched if that row's municipality were 'Almere'. Only the last row is correct, since 'Socrateshof' is in fact in the municipality 'Huizen'.

Desired result:

    municipality text                                                street_match
0   Urk          I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1   Utrecht      Tomorrow I'm going to Hoog Catharijne              
2   Almere       I'm not going to the Balijelaan                    
3   Utrecht      I'm not going to Socrateshof today                 
4   Huizen       Next week I'll be going to Socrateshof              Socrateshof

I know what the problem is; I just don't know how to fix it. I've tried apply/lambda, but no luck either. Thanks!

Adding another answer to show a shorter/simpler way to do what you wanted. (The first one was just to fix what was not working in your code.)

Using .apply(), you can call a modified version of your function per row of df and then do the checking with the street names in df2.

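def extract_street(row):
    # Streets belonging to this row's municipality only
    street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
    # \b(...)\b wraps the whole alternation; re.escape guards against regex metacharacters in names
    streets_regex = r'\b(' + '|'.join(map(re.escape, street_list_mun)) + r')\b'
    streets_found = set(re.findall(streets_regex, row['text']))
    # sorted() makes the output order deterministic, since sets are unordered
    return ', '.join(sorted(streets_found))
    ## or if you want this to return a list of streets
    # return sorted(streets_found)

df['street_match'] = df.apply(extract_street, axis=1)
df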

Output:

  municipality                                                text      street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  
2       Almere                     I'm not going to the Balijelaan                  
3      Utrecht                  I'm not going to Socrateshof today                  
4       Huizen              Next week I'll be going to Socrateshof       Socrateshof

Notes:

  1. There's an issue with your regex: the join part of the expression generates strings like Plantage\b|Pollux . That gives a match if (a) the last street name is at the beginning of another word, or (b) any street name except the last is at the end of another word. "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" will match both streets, but it shouldn't. Instead, the word boundary \b should be at both ends of the list of options, with parentheses to group them. It should generate strings like \b(Plantage|Pollux)\b . This won't match "Polluxsss" or "NotPlantage". I've made that change in the code above.

  2. I'm using set to get a unique list of street matches. Without it, if the line was "I'm going to Pollux, Pollux, Pollux", it would have given the result 3 times instead of just once.
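To make note 1 concrete, here is a minimal sketch comparing the two patterns on the example sentence. The variable names (loose, strict and so on) are mine, and re.escape is an extra precaution that only matters if a street name contains regex metacharacters.

```python
import re

streets = ["Plantage", "Pollux"]
text = "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep"

# Pattern as built in the question: r"\b|".join(...) gives 'Plantage\b|Pollux',
# so only 'Plantage' gets a trailing boundary and 'Pollux' gets none at all.
loose = r"\b|".join(streets)

# Pattern as built in the answer: boundaries wrap the whole alternation.
strict = r"\b(" + "|".join(map(re.escape, streets)) + r")\b"

loose_hits = re.findall(loose, text)
strict_hits = re.findall(strict, text)
print(loose_hits)   # ['Plantage', 'Pollux']  (both are false positives)
print(strict_hits)  # []

# Exact words still match, and repeats are reported once per occurrence,
# which is why the answer deduplicates the result.
exact_hits = re.findall(strict, "I'm going to Plantage, Pollux and Pollux")
print(exact_hits)   # ['Plantage', 'Pollux', 'Pollux']
```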

One problem with passing in only the text is that you can't do the municipality filter. That is why you're getting the street 'Oostvaardersdiep' for 'Urk', even though it's in 'Almere': the name 'Oostvaardersdiep' appears in the text for the 'Urk' entry, and your extract_street() function doesn't know which municipality to match against.

The smallest change to get your code to work is this:

  1. Pass in mun along with txt to extract_street()
  2. Filter df2 on mun instead of on mun_list_filter, which held all the municipalities

street = []
def extract_street(txt, mun):  # Pass in municipality
    df_bag_filter_mun = df2[df2['municipality'] == mun]
    ### everything below is COPY-PASTED from your question
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities 

# add the 'municipality' for the extract loop
for txt, mun in zip(df['text'], df['municipality']):  
    extract_street(txt, mun)

# Add street list to df
df = df.assign(**{'street_match': street})

Output:

  municipality                                                text        street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  [Plantage, Pollux]
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  []
2       Almere                     I'm not going to the Balijelaan                  []
3      Utrecht                  I'm not going to Socrateshof today                  []
4       Huizen              Next week I'll be going to Socrateshof       [Socrateshof]

And then join the list to make it a string:

df['street_match'] = df['street_match'].str.join(', ')

Output:

  municipality                                                text      street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  
2       Almere                     I'm not going to the Balijelaan                  
3      Utrecht                  I'm not going to Socrateshof today                  
4       Huizen              Next week I'll be going to Socrateshof       Socrateshof
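
If df is large, filtering df2 again for every row can get slow. One possible variation, sketched below on the sample data from the question, is to build one compiled pattern per municipality up front and look it up per row. The names patterns and find_streets are hypothetical helpers introduced here, not part of the original code; the sketch also uses a non-capturing group, re.escape, and an order-preserving de-duplication, which are my choices rather than anything the answers above require.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "municipality": ["Urk", "Utrecht", "Almere", "Utrecht", "Huizen"],
    "text": [
        "I'm going to Plantage, Pollux and Oostvaardersdiep",
        "Tomorrow I'm going to Hoog Catharijne",
        "I'm not going to the Balijelaan",
        "I'm not going to Socrateshof today",
        "Next week I'll be going to Socrateshof",
    ],
})
df2 = pd.DataFrame({
    "municipality": ["Urk", "Urk", "Utrecht", "Almere", "Almere", "Huizen"],
    "street_name": ["Plantage", "Pollux", "Balijelaan",
                    "Oostvaardersdiep", "Catharijne", "Socrateshof"],
})

# One compiled pattern per municipality, built once instead of once per row.
patterns = {
    mun: re.compile(r"\b(?:" + "|".join(map(re.escape, names.unique())) + r")\b")
    for mun, names in df2.groupby("municipality")["street_name"]
}

def find_streets(text, municipality):
    """Return the unique streets of `municipality` found in `text`, in order of appearance."""
    pattern = patterns.get(municipality)
    if pattern is None:  # municipality has no streets in df2
        return []
    return list(dict.fromkeys(pattern.findall(text)))

df["street_match"] = [
    ", ".join(find_streets(txt, mun))
    for txt, mun in zip(df["text"], df["municipality"])
]
print(df)
```

The early return for an unknown municipality avoids building an empty alternation, which would otherwise match the empty string at every word boundary.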

@aneroid I now want to extract multiple exact matches (which are in a list) from a similar text column. The code below (based on your regex) works for this simple example, but on my larger, more complex dataset I get a bunch of tuples and empty strings. Do you know how I could improve this code?

# String column
data1 = {'text': ["Today I'm going to Utrecht","Tomorrow I'm going to Utrecht and Urk", 
                 "Next week I'll be going to the Amsterdamsestraatweg"]}

df = pd.DataFrame(data1, columns = ['text'])
print(df)

# City column in other df
data2 = {'city': ['Urk','Utrecht','Almere','Huizen','Amsterdam','Urk']}
df2 = pd.DataFrame(data2, columns = ['city'])
print(df2)

# I create a list of all the unique cities in df2
city_list = list(df2['city'].unique())
len(city_list)
len(set(city_list))

# Extract the words if there is an exact match 
df['city_match'] = df['text'].str.findall(r'\b(' + '|'.join(city_list) + r')\b')
df['city_match'] = [', '.join(map(str, l)) for l in df['city_match']]
print(df)

# Output
                                                text    city_match
0                         Today I'm going to Utrecht       Utrecht
1              Tomorrow I'm going to Utrecht and Urk  Utrecht, Urk
2  Next week I'll be going to the Amsterdamsestra...      

As you can see, it works. 'Amsterdamsestraatweg' is not an exact match, so it didn't match. Strangely, in my larger df I get a bunch of tuples and empty strings as output, like so:

0                        ('Wijk bij Duurstede', '', '')
6                                   ('Utrecht', '', '')
7     ('Huizen', '', ''), ('Huizen', '', ''), ('Huiz...
9     ('Utrecht', '', ''), ('Utrecht', '', ''), ('Ut...
10                     ('Urk', '', ''), ('Urk', '', '')
11    ('Amersfoort', '', ''), ('Amersfoort', '', '')...
12                                 ('Lelystad', '', '')
13             ('Utrecht', '', ''), ('Utrecht', '', '')
16    ('Hilversum', '', ''), ('Hilversum', '', ''), ...
18             ('De Bilt', '', ''), ('De Bilt', '', '')
19                                      ('Urk', '', '')

Thanks again
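
A likely cause, which can't be confirmed without seeing the larger dataset: when a pattern contains more than one capture group, re.findall (and therefore Series.str.findall) returns a tuple per match, with an empty string for every group that did not take part. City names containing parentheses would add such groups to the pattern when they are joined unescaped. The sketch below reproduces the symptom with a made-up city list; the two parenthesised entries are assumptions for illustration, not values from the real data. The fix shown escapes the names and uses a non-capturing group, with lookarounds in place of \b so that names ending in ")" can still match.

```python
import re
import pandas as pd

# Hypothetical city list; the parenthesised names are assumed for illustration.
city_list = ["Utrecht", "Urk", "Hengelo (O)", "Bergen (NH)"]
text = pd.Series([
    "Tomorrow I'm going to Utrecht and Urk",
    "Next week Hengelo (O) and Utrecht",
])

# Unescaped: "(O)" and "(NH)" become two extra capture groups, three in total.
raw = r"\b(" + "|".join(city_list) + r")\b"
raw_hits = text.str.findall(raw)
print(raw_hits[0])  # [('Utrecht', '', ''), ('Urk', '', '')]

# Escaped names + non-capturing group: findall returns plain strings.
# (?<!\w) and (?!\w) act like \b but also work next to a closing parenthesis.
safe = r"(?<!\w)(?:" + "|".join(map(re.escape, city_list)) + r")(?!\w)"
safe_hits = text.str.findall(safe)
print(safe_hits.tolist())  # [['Utrecht', 'Urk'], ['Hengelo (O)', 'Utrecht']]
```

Printing re.compile(pattern).groups on the real pattern would show whether it has more than one group.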
