Exact match string in pandas column

Set-up

I scrape housing ad data and analyse it with pandas. I have computed average statistics and inserted them into a pandas dataframe: district_df.

One of the district_df columns contains district names: district_df['district'].

Another of the district_df columns contains subdistrict names: district_df['subdistrict'].

They look like this:

        district           subdistrict      
     Bergen-Enkheim      Bergen-Enkheim    
    Bornheim/Ostend            Bornheim
    Bornheim/Ostend              Ostend
            Harheim             Harheim
       Innenstadt I            Altstadt
       Innenstadt I     Bahnhofsviertel
       Innenstadt I              Gallus
      Innenstadt II          Bockenheim 
      Innenstadt II        Westend-Nord
                  ⋮                   ⋮

Problem

I create a district table (district_table) from district_df for each district. I.e. for the sample above I create five district tables. I do this with the following code:

for district in d_set: # d_set is a set containing all district names 
    district_table = district_df[district_df['district'].str.match(district)]

This code works, that is: a table per district is created.

However, one of the Innenstadt tables also contains the subdistricts of the other: filtering with the pattern Innenstadt I returns the rows for Innenstadt II as well.

It seems to me that .str.match(district) does not match exactly, but partially. I.e. Innenstadt I will match Innenstadt II.
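The partial matching described above can be reproduced on a small illustrative sample (not the real data). This is only a sketch: str.match applies the pattern as a regular expression anchored at the start of each string, so a name that is a prefix of another name matches both. Plain equality, or str.fullmatch (assuming pandas 1.1 or later), compares the whole string instead.

```python
import re

import pandas as pd

# Illustrative sample; the real district_df has more rows and columns.
district_df = pd.DataFrame({
    "district": ["Innenstadt I", "Innenstadt I", "Innenstadt II", "Innenstadt II"],
    "subdistrict": ["Altstadt", "Gallus", "Bockenheim", "Westend-Nord"],
})

names = district_df["district"]

# str.match anchors the pattern at the start only, so "Innenstadt I"
# also matches the beginning of "Innenstadt II".
prefix_mask = names.str.match("Innenstadt I")

# Plain equality compares whole strings and involves no regex at all.
exact_mask = names == "Innenstadt I"

# str.fullmatch requires the entire string to match the pattern;
# re.escape guards against regex metacharacters in district names.
full_mask = names.str.fullmatch(re.escape("Innenstadt I"))

print(prefix_mask.sum(), exact_mask.sum(), full_mask.sum())  # 4 2 2
```

If the names are literal strings rather than patterns, the equality comparison is probably the simplest choice.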

My actual district_df columns contain more than what I display here; the issue occurs for a variety of district names.

How do I get exact matches?

I'd do it this way:

{ dist: df[df.district == dist] for dist in df.district.unique() }

But then again, you might be better off using a MultiIndex:

df.set_index(['district', 'subdistrict'], inplace=True)

This is a lot like the dict solution, but downstream processing is likely to be faster.
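As a rough sketch of how the MultiIndex variant could be used afterwards: the sample below is illustrative, and avg_rent is a hypothetical statistics column that is not in the question. Selecting on the first index level with .loc is an exact label lookup, so no partial matching occurs.

```python
import pandas as pd

# Illustrative sample; avg_rent is a hypothetical statistics column.
df = pd.DataFrame({
    "district": ["Innenstadt I", "Innenstadt I", "Innenstadt II"],
    "subdistrict": ["Altstadt", "Gallus", "Bockenheim"],
    "avg_rent": [14.0, 12.5, 13.0],
})

# Build the two-level index and sort it so label lookups are efficient.
indexed = df.set_index(["district", "subdistrict"]).sort_index()

# Exact lookup on the first level: "Innenstadt I" does not pull in
# the rows of "Innenstadt II". The result is indexed by subdistrict.
inner_one = indexed.loc["Innenstadt I"]
print(inner_one)
```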

I think you need boolean indexing in a loop:

d_set = district_df['district'].unique()

for district in d_set: 
    district_table = district_df[district_df['district'] == district]
    print (district_table)

         district     subdistrict
0  Bergen-Enkheim  Bergen-Enkheim
          district subdistrict
1  Bornheim/Ostend    Bornheim
2  Bornheim/Ostend      Ostend
  district subdistrict
3  Harheim     Harheim
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus
        district   subdistrict
7  Innenstadt II    Bockenheim
8  Innenstadt II  Westend-Nord

If you need a dict of DataFrames, it is better to convert the groupby object:

a = dict(tuple(district_df.groupby('district')))

print (a['Innenstadt I'])
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus
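If the dict itself is not needed, the groupby object can also be iterated directly or queried with get_group. A minimal sketch on an illustrative sample (the row values are copied from the table in the question, and the variable names grouped, sizes and inner_one are only for illustration):

```python
import pandas as pd

# Illustrative sample taken from the table in the question.
district_df = pd.DataFrame({
    "district": ["Bornheim/Ostend", "Bornheim/Ostend",
                 "Innenstadt I", "Innenstadt I", "Innenstadt II"],
    "subdistrict": ["Bornheim", "Ostend", "Altstadt", "Gallus", "Bockenheim"],
})

grouped = district_df.groupby("district")

# Iterating yields (district name, sub-DataFrame) pairs, one per
# distinct district; groups are formed by exact equality of the key.
sizes = {name: len(table) for name, table in grouped}
print(sizes)

# A single group can be fetched by its exact name.
inner_one = grouped.get_group("Innenstadt I")
print(inner_one)
```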
