Add new column to Pandas DataFrame and fill with first word from another column from same df
I have a dataset of crimes reported by Gloucestershire Constabulary from 2011-16. It's a .csv file that I have imported into a Pandas dataframe. The data include a column stating the Lower Super Output Area (LSOA) in which the crime occurred, so for crimes in Tewkesbury, for instance, each record has the corresponding LSOA name, e.g. 'Tewkesbury 009D'; 'Tewkesbury 009E'.
I want to group these data by the town/city they relate to, e.g. 'Gloucester', 'Tewkesbury', ignoring the specific LSOAs within each conurbation. Ideally, I would append a new column to the dataframe, with just the place name copied across, and group on that. I am comfortable with how to do the grouping, just not with creating the new column in the first place. Any advice on how to do this is gratefully received.
I am no Pandas expert, but I think you can use string slicing to strip off the last five characters, i.e. the space and the four-character code (the .str accessor supports regex too if I recall correctly, so you can do a proper 'search' if required).
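# x is the original dataframe; lsoa is the column containing the LSOA names
new_col = x.lsoa.str[:-5].rename('town')  # drop the trailing space and four-character code
x = pd.concat([x, new_col], axis=1)  # concat returns a new frame, so assign it back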
The .str accessor can be used to extract a substring from each value in the lsoa column of the dataframe.
Something along these lines should work:
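As a minimal sketch of the slicing idea, the new column can also be assigned directly and then grouped on. The sample rows and the column name 'town' are made up for illustration, and the slice assumes every LSOA name ends in a space followed by a four-character code.

```python
import pandas as pd

# Made-up sample rows; the column name 'lsoa' follows the snippet above.
x = pd.DataFrame({'lsoa': ['Tewkesbury 009D', 'Tewkesbury 009E', 'Gloucester 001A']})

# Drop the last five characters (the space plus the four-character code).
# 'town' is a hypothetical name for the new column.
x['town'] = x['lsoa'].str[:-5]

# Count the records per town.
counts = x.groupby('town').size()
print(counts)
```

Because the slice counts characters from the end, it should also cope with place names that contain spaces, provided the code part really is always four characters long.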
df['town'] = [x.split()[0] for x in df['LSOA']]
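The same idea can be written with the vectorised `.str` methods, which pass missing values through as NaN instead of raising an error the way the list comprehension would. This is only a sketch on made-up rows; 'town' and 'place' are hypothetical column names, and the multi-word row is there to show the difference between taking the first word and dropping the last one.

```python
import pandas as pd

# Made-up sample rows, including a place name with spaces in it.
df = pd.DataFrame({'LSOA': ['Tewkesbury 009D', 'Tewkesbury 009E', 'Forest of Dean 001A']})

# First word only: split on whitespace and take element 0 of each list.
df['town'] = df['LSOA'].str.split().str[0]

# Everything except the trailing code: split once from the right, keep the left part.
df['place'] = df['LSOA'].str.rsplit(n=1).str[0]

print(df)
```

Taking the first word is what the question asks for; the `rsplit` variant may be the safer choice if some place names turn out to be more than one word.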
You can use regex to extract the city name from the DataFrame and then join the result to the original DataFrame. If your initial DataFrame is df
df = pd.DataFrame([ 'Tewkesbury 009D', 'Tewkesbury 009E'], columns=['LSOA'])
In [2]: df
Out[2]:
              LSOA
0  Tewkesbury 009D
1  Tewkesbury 009E
Then you can extract the city name and, optionally, the LSOA code into a new DataFrame df_new
df_new = df['LSOA'].str.extract(r'(\w*)\s(\d+\w*)', expand=True)  # raw string keeps the regex backslashes literal
In [10]: df_new
Out[10]:
            0     1
0  Tewkesbury  009D
1  Tewkesbury  009E
If you want to discard the code and keep only the city name, remove the second capture group (the second pair of parentheses) from the regex, giving r'(\w*)\s\d+\w*'. Now you can join this result to the original DataFrame
In [11]: df.join(df_new)
Out[11]:
              LSOA           0     1
0  Tewkesbury 009D  Tewkesbury  009D
1  Tewkesbury 009E  Tewkesbury  009E
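To go straight from the regex to a named column that can be grouped on, one option is to use a single capture group with `expand=False`, which returns a Series rather than a DataFrame. The sketch below uses made-up rows and a hypothetical column name 'town'; the pattern is a slightly looser variant of the one above, assumed here so that place names containing spaces are kept whole.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({'LSOA': ['Tewkesbury 009D', 'Tewkesbury 009E', 'Gloucester 001A']})

# One capture group: everything before the final " <digits><letters>" code.
# With a single group and expand=False, str.extract returns a Series.
df['town'] = df['LSOA'].str.extract(r'^(.+?)\s\d+\w*$', expand=False)

# Group on the new column, e.g. to count records per town.
per_town = df.groupby('town').size()
print(per_town)
```

Rows whose LSOA value does not match the pattern come back as NaN in 'town', so they are easy to spot before grouping.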