How can I split columns with regex to move trailing CAPS into a separate column?

I'm trying to split a column using regex, but can't seem to get the split right. I want to take the trailing run of capital letters and move it into a separate column, so I'm matching runs of 2-4 capital letters. However, only the 'Name' column is populated, while the 'Team' column comes out blank.

Here's my code:

import pandas as pd

url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"

df = pd.read_html(url)[0].join(pd.read_html(url)[1])
df[['Name','Team']] = df['Name'].str.split('[A-Z]{2,4}', expand=True)  

I want this:

print(df.head(5).to_string())
   RK             Name POS  GP   MIN   PTS  FGM   FGA   FG%  3PM  3PA   3P%  FTM  FTA   FT%  REB   AST  STL  BLK   TO  DD2  TD3    PER
0   1  LeBron JamesLA  SF  35  35.1  24.9  9.6  19.7  48.6  2.0  6.0  33.8  3.7  5.5  67.7  7.9  11.0  1.3  0.5  3.7   28    9  26.10
1   2   Ricky RubioPHX  PG  30  32.0  13.6  4.9  11.9  41.3  1.2  3.7  31.8  2.6  3.1  83.7  4.6   9.3  1.3  0.2  2.5   12    1  16.40
2   3   Luka DoncicDAL  SF  32  32.8  29.7  9.6  20.2  47.5  3.1  9.4  33.1  7.3  9.1  80.5  9.7   8.9  1.2  0.2  4.2   22   11  31.74
3   4   Ben SimmonsPHIL  PG  36  35.4  14.9  6.1  10.8  56.3  0.1  0.1  40.0  2.7  4.6  59.0  7.5   8.6  2.2  0.7  3.6   19    3  19.49
4   5    Trae YoungATL  PG  34  35.1  28.9  9.3  20.8  44.8  3.5  9.4  37.5  6.7  7.9  85.0  4.3   8.4  1.2  0.1  4.8   11    1  23.47

to become this:

print(df.head(5).to_string())
   RK             Name    Team    POS  GP   MIN   PTS  FGM   FGA   FG%  3PM  3PA   3P%  FTM  FTA   FT%  REB   AST  STL  BLK   TO  DD2  TD3    PER
0   1  LeBron James        LA    SF  35  35.1  24.9  9.6  19.7  48.6  2.0  6.0  33.8  3.7  5.5  67.7  7.9  11.0  1.3  0.5  3.7   28    9  26.10
1   2   Ricky Rubio        PHX    PG  30  32.0  13.6  4.9  11.9  41.3  1.2  3.7  31.8  2.6  3.1  83.7  4.6   9.3  1.3  0.2  2.5   12    1  16.40
2   3   Luka Doncic        DAL    SF  32  32.8  29.7  9.6  20.2  47.5  3.1  9.4  33.1  7.3  9.1  80.5  9.7   8.9  1.2  0.2  4.2   22   11  31.74
3   4   Ben Simmons        PHIL    PG  36  35.4  14.9  6.1  10.8  56.3  0.1  0.1  40.0  2.7  4.6  59.0  7.5   8.6  2.2  0.7  3.6   19    3  19.49
4   5    Trae Young        ATL    PG  34  35.1  28.9  9.3  20.8  44.8  3.5  9.4  37.5  6.7  7.9  85.0  4.3   8.4  1.2  0.1  4.8   11    1  23.47

You can extract the data into two columns with a regex such as ^(.*?)([A-Z]+)$ or ^(.*[^A-Z])([A-Z]+)$ :

df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)

This keeps everything up to and including the last character that is not an uppercase letter in group 1 ("Name"), and the trailing uppercase letters in group 2 ("Team").

See regex demo #1 and regex demo #2.
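To see why the original str.split call leaves 'Team' blank, it may help to run both approaches on a few values copied from the question. This is a minimal sketch on a hand-built Series (the variable names are illustrative, not from the original post): split treats the matched team code as a separator and discards it, whereas extract returns the captured groups.

```python
import pandas as pd

# Sample values copied from the question's 'Name' column.
names = pd.Series(["LeBron JamesLA", "Ricky RubioPHX", "Ben SimmonsPHIL"])

# str.split uses the match as a delimiter and throws it away, so the
# team code is consumed and only an empty string is left after it.
split_parts = names.str.split("[A-Z]{2,4}", expand=True)

# str.extract returns one column per capturing group instead.
extracted = names.str.extract(r"^(.*?)([A-Z]+)$", expand=True)
extracted.columns = ["Name", "Team"]

print(split_parts)
print(extracted)
```

The second column of split_parts contains only empty strings, which matches the blank 'Team' column described in the question.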

Details

  • ^ - start of string
  • (.*?) - Capturing group 1: any zero or more chars other than line break chars, as few as possible
    or
  • (.*[^A-Z]) - Capturing group 1: any zero or more chars other than line break chars, as many as possible, up to and including the last char that is not an ASCII uppercase letter (provided the subsequent patterns match). Note that this pattern requires at least one char before the trailing uppercase letters.
  • ([A-Z]+) - Capturing group 2: one or more ASCII uppercase letters
  • $ - end of string.
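The two patterns only differ on input that consists entirely of uppercase letters. The sketch below uses the standard re module to show that difference; the helper split_trailing_caps and the sample strings are made up for illustration.

```python
import re

LAZY = re.compile(r"^(.*?)([A-Z]+)$")
GREEDY = re.compile(r"^(.*[^A-Z])([A-Z]+)$")


def split_trailing_caps(pattern, text):
    """Return (prefix, caps) if the pattern matches, otherwise None."""
    match = pattern.match(text)
    return match.groups() if match else None


typical = "Trae YoungATL"   # a name followed by a team code
all_caps = "ATL"            # nothing before the uppercase run

# Both patterns agree on the typical case.
typical_lazy = split_trailing_caps(LAZY, typical)
typical_greedy = split_trailing_caps(GREEDY, typical)

# The lazy pattern accepts an empty prefix; the greedy one needs at
# least one non-uppercase char, so it does not match at all.
caps_lazy = split_trailing_caps(LAZY, all_caps)
caps_greedy = split_trailing_caps(GREEDY, all_caps)

print(typical_lazy, typical_greedy, caps_lazy, caps_greedy)
```

With pandas' str.extract, a row that does not match yields NaN in every extracted column, so the choice between the two patterns decides whether an all-uppercase value ends up as an empty name plus a team, or as missing data.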

I have made a few alterations to the code. You might need to import the re package.

It's a bit manual, but I hope this will suffice. Have a great day!

import re

import pandas as pd

# Collect the split values row by row, then build a new DataFrame.
names, teams = [], []
for name in df['Name']:
    # Trailing run of 2-4 uppercase letters, e.g. 'LA', 'PHX', 'PHIL'.
    match = re.search(r'[A-Z]{2,4}$', name)
    if match is None:
        # No team code found: keep the name and leave the team empty.
        names.append(name)
        teams.append(None)
        continue
    teams.append(match.group(0))
    # Everything before the match is the player's name.
    names.append(name[:match.start()])
df_final = pd.DataFrame({'Name': names, 'Team': teams})
print(df_final)

