Substring on pandas dataframe column

I want to extract a substring (titles such as Mr., Mrs., Miss, etc.) from a column (Name) in a pandas dataframe and then write the new column (Title) back into the dataframe.

In the Name column of the dataframe I have a name such as "Braund, Mr. Owen Harris". The two delimiters are the comma (,) and the period (.).

I have attempted to use a split method, but this only splits the original string in two within a list. So I still end up with ['Braund', ' Mr. Owen Harris'] in the list.

import pandas as pd
#import re
df_Train = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTliZmavBsJCFDiEwxcSIIftu-0gR9p34n8Bq4OUNL4TxwHY-JMS6KhZEbWr1bp91UqHPkliZBBFgwh/pub?gid=1593012114&single=true&output=csv')
a= df_Train['Name'].str.split(',')
for i in a:
    print(i[1])

I am thinking this might be a situation where regex comes into play. My reading suggests a lookahead (?=,) and lookbehind (?<='.') approach should do the trick. For example:

import re
a= df_Train['Name'].str.split(r'(?=,)*(?<='.'))
for i in a:
    print(i)
    print(i[1])

But I am running into errors (EOL while scanning string literal). Can someone point me in the right direction?

Cheers, Mike

You can do it like this:

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()

Output head(5):

0       Mr
1      Mrs
2     Miss
3      Mrs
4       Mr
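
To write the result back into the dataframe, as the question asks, the chained expression can be assigned to a new column. This is a minimal sketch, assuming a small hand-made sample in place of the CSV (the sample rows are illustrative; the `Title` column name comes from the question):

```python
import pandas as pd

# Hypothetical sample standing in for the CSV from the question.
df_Train = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

# Take the part after the first comma, then the part before the first
# period, and strip the surrounding whitespace.
df_Train["Title"] = (
    df_Train["Name"]
    .str.split(",").str[1]
    .str.split(".").str[0]
    .str.strip()
)

print(df_Train["Title"].tolist())
```

The parentheses around the chain are only there so it can span several lines without backslashes.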

Summary of results:

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()\
             .value_counts()

Output:

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Lady              1
Mme               1
Sir               1
Ms                1
the Countess      1
Jonkheer          1
Don               1
Capt              1
Name: Name, dtype: int64

The error comes from the single quotes around the period inside your single-quoted regex string literal: the inner quote ends the string early. This isn't the correct syntax; I think you meant to use an escaped period, i.e. r'(?=,)*(?<=\.)'. However, you don't need lookahead/lookbehind here. It's simpler and more usual to use a capture group to describe what you want to extract; in this case the regex would be:

df_Train['Name'].str.extract(r", (\w*)\.")
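
A short sketch of the capture-group approach, assuming the same `Name` format as in the question. The sample rows and the `[^.]+` variant are illustrative additions, not part of the original answer. The variant is one possible way to also capture multi-word titles such as "the Countess", which `\w*` does not match because of the space.

```python
import pandas as pd

# Hypothetical sample names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# expand=False returns a Series instead of a one-column DataFrame.
# Rows where the pattern does not match become NaN.
titles = names.str.extract(r", (\w*)\.", expand=False)

# Assumed variant: capture everything between ", " and the first period.
titles_any = names.str.extract(r", ([^.]+)\.", expand=False)

print(titles.tolist())
print(titles_any.tolist())
```

With the `\w*` pattern the third row comes back as NaN, so it may be worth checking for missing values before relying on the extracted column.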

