
Applying regex to a pandas dataframe

I'm having trouble applying a regex function to a column in a pandas dataframe. Here is the head of my dataframe:

               Name   Season          School   G    MP  FGA  3P  3PA    3P%
 74       Joe Dumars  1982-83   McNeese State  29   NaN  487   5    8  0.625   
 84      Sam Vincent  1982-83  Michigan State  30  1066  401   5   11  0.455   
 176  Gerald Wilkins  1982-83     Chattanooga  30   820  350   0    2  0.000   
 177  Gerald Wilkins  1983-84     Chattanooga  23   737  297   3   10  0.300   
 243    Delaney Rudd  1982-83     Wake Forest  32  1004  324  13   29  0.448  

I thought I had a pretty good grasp of applying functions to DataFrames, so maybe my regex skills are lacking.

Here is what I put together:

import re

def split_it(year):
    return re.findall('(\d\d\d\d)', year)

df['Season2'] = df['Season'].apply(split_it(x))

TypeError: expected string or buffer

The desired output is a column called Season2 that contains the year before the hyphen. I'm sure there's an easier way to do it without regex, but more importantly, I'm trying to figure out what I did wrong.

Thanks in advance for any help.

When I try (a variant of) your code I get NameError: name 'x' is not defined, which it isn't. apply expects the function itself, whereas split_it(x) tries to call the function immediately with an undefined x.

You could use either

df['Season2'] = df['Season'].apply(split_it)

or

df['Season2'] = df['Season'].apply(lambda x: split_it(x))

but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here). Your function will return a list, though:

>>> df["Season"].apply(split_it)
74     [1982]
84     [1982]
176    [1982]
177    [1983]
243    [1982]
Name: Season, dtype: object

although you could easily change that. FWIW, I'd use vectorized string operations and do something like

>>> df["Season"].str[:4].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64

or

>>> df["Season"].str.split("-").str[0].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64
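
The options above can be compared side by side. This is a minimal sketch, assuming a small hand-built frame that holds only the Season values from the question; first_year is a hypothetical helper name, not part of the original post, and it returns a scalar rather than a list.

```python
import re

import pandas as pd

# Season values and index labels copied from the question's sample rows.
df = pd.DataFrame(
    {"Season": ["1982-83", "1982-83", "1982-83", "1983-84", "1982-83"]},
    index=[74, 84, 176, 177, 243],
)


def first_year(season):
    # Hypothetical helper: return the first four-digit run as an int, or None.
    match = re.search(r"\d{4}", season)
    return int(match.group()) if match else None


# Pass the function object itself; apply calls it once per value.
via_apply = df["Season"].apply(first_year)

# Vectorized alternative: slice the first four characters of each string.
via_slice = df["Season"].str[:4].astype(int)
```

Both Series hold the same years here; the sliced version avoids calling a Python function per row.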

You can simply use str.extract:

df['Season2']=df['Season'].str.extract(r'(\d{4})-\d{2}')

Here the pattern matches \d{4}-\d{2} (for example 1982-83), but only the group captured in parentheses, \d{4} (for example 1982), is extracted.
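
As a sketch of how this behaves, assume a short Series with two season strings from the question plus one made-up non-matching value. With a single capture group, str.extract returns a DataFrame by default; passing expand=False gives a Series instead. Rows that do not match come back as NaN.

```python
import pandas as pd

# Two values from the question plus one non-matching row (added for illustration).
seasons = pd.Series(["1982-83", "1983-84", "unknown"])

# expand=False returns a Series because the pattern has exactly one capture group.
years = seasons.str.extract(r"(\d{4})-\d{2}", expand=False)

# The extracted values are strings; to_numeric converts them and keeps NaN for misses.
years_num = pd.to_numeric(years)
```

Because a NaN is present, years_num is a float Series; with fully matching data, astype(int) would work as well.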

The problem can be solved with the following code:

import re

def split_it(year):
    # re.search returns a match object, or None when there is no four-digit run
    match = re.search(r'(\d{4})', year)
    if match:
        return match.group(1)

df['Season2'] = df['Season'].apply(split_it)

You were facing this problem because some rows didn't have a year in the string.
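
One way a TypeError like the one in the question can arise is when a value passed to the re functions is not a string at all, for example a missing value. The following is a sketch under that assumption; the sample data and the name split_it_safe are hypothetical.

```python
import re

import pandas as pd

# Hypothetical data: a valid season, a string without a year, and a missing value.
seasons = pd.Series(["1982-83", "n/a", None])


def split_it_safe(value):
    # The re functions raise TypeError on non-strings such as None or NaN, so guard first.
    if not isinstance(value, str):
        return None
    match = re.search(r"\d{4}", value)
    return match.group() if match else None


result = seasons.apply(split_it_safe)
```

Only the first row yields a year; the other two rows are left missing instead of raising an error.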

You can use a native pandas function to do it too.

The pandas documentation lists the string functions that accept regular expressions. For your case, you can do

df["Season"].str.extract(r'(\d{4})')

I had the exact same issue. Thanks for the answers, @DSM. FYI @itjcms, you can improve the function by removing the repetition in '\d\d\d\d'.

def split_it(year):  
    return re.findall('(\d\d\d\d)', year)

Becomes:

def split_it(year):
    return re.findall('(\d{4})', year)
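
A quick check, using one season string from the question, suggests the two patterns behave the same way:

```python
import re

season = "1982-83"

# Both patterns describe four consecutive digits; {4} is the counted form.
repeated = re.findall(r"(\d\d\d\d)", season)
counted = re.findall(r"(\d{4})", season)
```

Both calls return a list containing the single match "1982", since "83" is too short to match.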

I would use extract:

df['Season2'] = df['Season'].str.extract(r'(\d{4})')
