Applying regex to a pandas dataframe
I'm having trouble applying a regex function to a column in a pandas dataframe. Here is the head of my dataframe:
               Name   Season          School   G    MP  FGA  3P  3PA    3P%
74       Joe Dumars  1982-83   McNeese State  29   NaN  487   5    8  0.625
84      Sam Vincent  1982-83  Michigan State  30  1066  401   5   11  0.455
176  Gerald Wilkins  1982-83     Chattanooga  30   820  350   0    2  0.000
177  Gerald Wilkins  1983-84     Chattanooga  23   737  297   3   10  0.300
243    Delaney Rudd  1982-83     Wake Forest  32  1004  324  13   29  0.448
I thought I had a pretty good grasp of applying functions to dataframes, so maybe my regex skills are lacking.
Here is what I put together:
import re
def split_it(year):
    return re.findall('(\d\d\d\d)', year)
df['Season2'] = df['Season'].apply(split_it(x))
TypeError: expected string or buffer
The output should be a column called Season2 that contains the year before the hyphen. I'm sure there's an easier way to do it without regex, but more importantly, I'm trying to figure out what I did wrong.
Thanks in advance for any help.
When I try (a variant of) your code I get NameError: name 'x' is not defined, which it isn't.
You could use either
df['Season2'] = df['Season'].apply(split_it)
or
df['Season2'] = df['Season'].apply(lambda x: split_it(x))
but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here). Your function will return a list, though:
>>> df["Season"].apply(split_it)
74 [1982]
84 [1982]
176 [1982]
177 [1983]
243 [1982]
Name: Season, dtype: object
although you could easily change that. FWIW, I'd use vectorized string operations and do something like
>>> df["Season"].str[:4].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
or
>>> df["Season"].str.split("-").str[0].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
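To make the difference between passing the function and calling it concrete, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the question's head output (only the Name and Season columns), so treat it as an illustration rather than the asker's real data.

```python
import re

import pandas as pd

# Sample data retyped from the question (Name and Season columns only).
df = pd.DataFrame(
    {
        "Name": ["Joe Dumars", "Sam Vincent", "Gerald Wilkins",
                 "Gerald Wilkins", "Delaney Rudd"],
        "Season": ["1982-83", "1982-83", "1982-83", "1983-84", "1982-83"],
    },
    index=[74, 84, 176, 177, 243],
)


def split_it(year):
    # re.findall returns a list of every four-digit run in the string.
    return re.findall(r"(\d{4})", year)


# Pass the function object itself; apply calls it once per value.
as_lists = df["Season"].apply(split_it)

# Vectorized alternative: take the first four characters, convert to int.
as_ints = df["Season"].str[:4].astype(int)

print(as_lists.tolist())
print(as_ints.tolist())
```

Writing apply(split_it(x)) instead would try to evaluate split_it(x) first, before apply ever runs, which is why the undefined (or non-string) x triggers the error.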
You can simply use str.extract:
df['Season2']=df['Season'].str.extract(r'(\d{4})-\d{2}')
Here the pattern \d{4}-\d{2} locates the whole season (for example 1982-83), but only the captured group between the parentheses, \d{4}, is extracted (for example 1982).
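A small sketch of how this behaves, including a row that does not match. The "unknown" value and the index 999 are made up for illustration, and expand=False is my addition so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Two seasons from the question plus one hypothetical non-matching value.
seasons = pd.Series(["1982-83", "1983-84", "unknown"], index=[74, 177, 999])

# Match the full "YYYY-YY" pattern but keep only the captured four digits.
years = seasons.str.extract(r"(\d{4})-\d{2}", expand=False)

print(years)
```

Rows where the pattern does not match come back as missing values rather than raising an error, and the extracted years are still strings until you convert them.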
The problem can be solved with the following code:
import re
def split_it(year):
    # re.search returns a match object, or None if there is no year
    match = re.search(r'(\d{4})', year)
    if match:
        return match.group(1)
df['Season2'] = df['Season'].apply(split_it)
You were facing this problem because some rows didn't have a year in the string.
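As a rough illustration of the "missing year" case, here is a standalone variant using only the re module. The isinstance guard is an assumption on my part: it covers values such as NaN, which are floats rather than strings and make re raise a TypeError.

```python
import re


def split_it(year):
    # Non-string values (for example NaN, which is a float) cannot be
    # searched by re, so return None for them instead of raising.
    if not isinstance(year, str):
        return None
    match = re.search(r"\d{4}", year)
    # match is None when the string contains no four-digit run.
    return match.group() if match else None


print(split_it("1982-83"))
print(split_it("n/a"))
print(split_it(float("nan")))
```

The first call prints 1982 and the other two print None, so apply can run over the whole column without stopping on a bad row.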
I had the exact same issue. Thanks for the answers @DSM. FYI @itjcms, you can improve the function by removing the repetition in '\d\d\d\d'.
def split_it(year):
    return re.findall('(\d\d\d\d)', year)
Becomes:变成:
def split_it(year):
    return re.findall('(\d{4})', year)
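A quick check that the two spellings are interchangeable, using one season string from the question:

```python
import re

sample = "1982-83"

# Four explicit \d tokens versus the {4} repetition quantifier.
long_form = re.findall(r"(\d\d\d\d)", sample)
short_form = re.findall(r"(\d{4})", sample)

print(long_form, short_form, long_form == short_form)
```

Both calls return the same one-element list, so the change is purely about readability.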
I would use extract:
df['Season2'] = df['Season'].str.extract(r'(\d{4})')