Exact match string in pandas column
Set-up
I scrape housing ad data and analyse it with pandas. I have computed average statistics and inserted them into a pandas DataFrame: district_df.
One of the district_df columns contains district names: district_df['district'].
Another of the district_df columns contains subdistrict names: district_df['subdistrict'].
They look like this:
district           subdistrict
Bergen-Enkheim     Bergen-Enkheim
Bornheim/Ostend    Bornheim
Bornheim/Ostend    Ostend
Harheim            Harheim
Innenstadt I       Altstadt
Innenstadt I       Bahnhofsviertel
Innenstadt I       Gallus
Innenstadt II      Bockenheim
Innenstadt II      Westend-Nord
⋮                  ⋮
Problem
I create a district table (district_table) from district_df per district, i.e. for the sample above I create five district tables. I do this with the following code:
for district in d_set:  # d_set is a set containing all district names
    district_table = district_df[district_df['district'].str.match(district)]
This code works, that is: a table per district is created.
However, the table for Innenstadt II also contains the subdistricts of Innenstadt I.
It seems to me that .str.match(district) does not match exactly but only partially, i.e. Innenstadt I will match Innenstadt II.
My actual district_df columns contain more than what I display here; the issue occurs for a variety of district names.
How do I get exact matches?
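The partial matching described in the question can be reproduced with a small sketch. The toy frame below is an assumption built from the sample values above, not the asker's real data. Series.str.match anchors the pattern only at the start of each string (it follows re.match), so the pattern Innenstadt I also selects rows whose district is Innenstadt II. Plain equality, or Series.str.fullmatch on an escaped pattern (assuming pandas 1.1 or later), compares the whole string instead.

```python
import re
import pandas as pd

# Toy frame mirroring the question's two columns (values from the sample table).
district_df = pd.DataFrame({
    "district": ["Innenstadt I", "Innenstadt I", "Innenstadt II", "Innenstadt II"],
    "subdistrict": ["Altstadt", "Gallus", "Bockenheim", "Westend-Nord"],
})

# str.match only anchors at the start, so "Innenstadt I" is a prefix match
# for "Innenstadt II" as well.
prefix_mask = district_df["district"].str.match("Innenstadt I")

# Equality compares whole strings and involves no regular expression.
exact_mask = district_df["district"] == "Innenstadt I"

# If a regex is really wanted: escape the name and require a full match.
full_mask = district_df["district"].str.fullmatch(re.escape("Innenstadt I"))

print(prefix_mask.tolist())
print(exact_mask.tolist())
print(full_mask.tolist())
```

Escaping also matters for names that contain regex metacharacters, which may be one reason the issue shows up for a variety of district names.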
I'd do it this way:
{ dist: df[df.district == dist] for dist in df.district.unique() }
But then again you might be better off using a MultiIndex:
df.set_index(['district', 'subdistrict'], inplace=True)
This is a lot like the dict solution, but downstream processing is likely to be faster.
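As a rough illustration of how the MultiIndex variant could be used downstream, the sketch below selects by exact label with .loc. The avg_rent column and its values are hypothetical, added only so the frame has something besides the index; sort_index is included because label lookups on a sorted MultiIndex tend to behave more predictably.

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["Innenstadt I", "Innenstadt I", "Innenstadt II"],
    "subdistrict": ["Altstadt", "Gallus", "Bockenheim"],
    "avg_rent": [14.0, 12.5, 13.0],  # hypothetical statistic, not from the question
})

# Non-inplace equivalent of the set_index call above, followed by a sort.
indexed = df.set_index(["district", "subdistrict"]).sort_index()

# Selecting on the first level is an exact label lookup, not a pattern match,
# so "Innenstadt I" does not pick up "Innenstadt II".
innenstadt_1 = indexed.loc["Innenstadt I"]

# A full (district, subdistrict) key selects a single row as a Series.
gallus = indexed.loc[("Innenstadt I", "Gallus"), :]

print(innenstadt_1)
print(gallus)
```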
I think you need boolean indexing in a loop:
d_set = district_df['district'].unique()
for district in d_set:
    district_table = district_df[district_df['district'] == district]
    print(district_table)
         district     subdistrict
0  Bergen-Enkheim  Bergen-Enkheim
          district  subdistrict
1  Bornheim/Ostend     Bornheim
2  Bornheim/Ostend       Ostend
   district  subdistrict
3   Harheim      Harheim
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus
        district   subdistrict
7  Innenstadt II    Bockenheim
8  Innenstadt II  Westend-Nord
If you need a dict of DataFrames, it is better to convert a groupby object:
a = dict(tuple(district_df.groupby('district')))
print(a['Innenstadt I'])
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus
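If the dict itself is not needed, the groupby object can also be consumed directly. The following is only a sketch on a toy frame assumed from the sample values: iterating yields one (district, table) pair per distinct district, and get_group fetches a single table by its exact key.

```python
import pandas as pd

district_df = pd.DataFrame({
    "district": ["Harheim", "Innenstadt I", "Innenstadt I", "Innenstadt II"],
    "subdistrict": ["Harheim", "Altstadt", "Gallus", "Bockenheim"],
})

grouped = district_df.groupby("district")

# Each iteration yields the group key and the matching sub-frame.
sizes = {name: len(table) for name, table in grouped}

# Keys are compared exactly, so this table holds only Innenstadt II rows.
innenstadt_2 = grouped.get_group("Innenstadt II")

print(sizes)
print(innenstadt_2)
```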