How can I split columns with regex to move trailing CAPS into a separate column?
I'm trying to split a column using regex, but can't seem to get the split right. I want to take all the trailing CAPS and move them into a separate column, so I'm matching runs of 2-4 CAPS in a row. However, only the 'Name' column is populated, while the 'Team' column is blank.
Here's my code:
import pandas as pd
url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
df = pd.read_html(url)[0].join(pd.read_html(url)[1])
df[['Name','Team']] = df['Name'].str.split('[A-Z]{2,4}', expand=True)
I want this:
print(df.head(5).to_string())
RK Name POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 LeBron JamesLA SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10
1 2 Ricky RubioPHX PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40
2 3 Luka DoncicDAL SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74
3 4 Ben SimmonsPHIL PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49
4 5 Trae YoungATL PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47
to become this:
print(df.head(5).to_string())
RK Name Team POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 LeBron James LA SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10
1 2 Ricky Rubio PHX PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40
2 3 Luka Doncic DAL SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74
3 4 Ben Simmons PHIL PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49
4 5 Trae Young ATL PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47
You may extract the data into two columns by using a regex like ^(.*?)([A-Z]+)$ or ^(.*[^A-Z])([A-Z]+)$:
df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
This keeps everything up to the last char that is not an uppercase letter in Group "Name" and the trailing uppercase letters in Group "Team".
See regex demo #1 and regex demo #2.
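To try this without hitting the ESPN page, here is a minimal sketch that uses a few names copied from the question's table. The small DataFrame and the `split_demo` and `parts` variable names are assumptions for illustration only. It also shows, with the standard library's `re.split`, why the original split attempt left 'Team' blank: a split treats the matched team code as a delimiter and discards it, whereas extract keeps what each capturing group matched.

```python
import re
import pandas as pd

# Sample rows copied from the question's table, so no network access is needed.
df = pd.DataFrame({
    "Name": ["LeBron JamesLA", "Ricky RubioPHX", "Luka DoncicDAL",
             "Ben SimmonsPHIL", "Trae YoungATL"],
    "POS": ["SF", "PG", "SF", "PG", "PG"],
})

# Splitting consumes the matched team code as a delimiter,
# leaving an empty string where the team was expected.
split_demo = re.split(r"[A-Z]{2,4}", "LeBron JamesLA")

# Extracting keeps the text matched by each capturing group:
# group 1 is the player name, group 2 the trailing capitals.
parts = df["Name"].str.extract(r"^(.*?)([A-Z]+)$", expand=True)
df["Name"] = parts[0]
df["Team"] = parts[1]

print(split_demo)
print(df)
```

Here the new 'Team' column is appended at the end; reorder the columns afterwards if it should sit next to 'Name'.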
Details
^ - start of string
(.*?) - Capturing group 1 (first pattern): any zero or more chars other than line break chars, as few as possible
(.*[^A-Z]) - Capturing group 1 (second pattern): any zero or more chars other than line break chars, as many as possible, up to the last char that is not an ASCII uppercase letter (provided the subsequent patterns match). Note that this pattern implies there is at least 1 char before the trailing uppercase letters.
([A-Z]+) - Capturing group 2: one or more ASCII uppercase letters
$ - end of string.
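The two patterns behave the same on the names in the question and differ only in the edge case noted above. The following sketch, using only the standard `re` module, illustrates that; the `split_name` helper and the bare "LA" test string are hypothetical and not part of the original answer.

```python
import re

LAZY = re.compile(r"^(.*?)([A-Z]+)$")
GREEDY = re.compile(r"^(.*[^A-Z])([A-Z]+)$")

def split_name(pattern, text):
    """Return (name, team) if the pattern matches the whole string, else None."""
    m = pattern.match(text)
    return m.groups() if m else None

# A typical value: both patterns agree.
typical_lazy = split_name(LAZY, "Ben SimmonsPHIL")
typical_greedy = split_name(GREEDY, "Ben SimmonsPHIL")

# A string made only of capitals: the lazy pattern matches with an empty name,
# while the second pattern needs at least one non-uppercase char and fails.
caps_lazy = split_name(LAZY, "LA")
caps_greedy = split_name(GREEDY, "LA")

print(typical_lazy, typical_greedy)
print(caps_lazy, caps_greedy)
```

With pandas' str.extract, a row that does not match yields NaN in both columns, so the second pattern may be preferable if an empty name should count as missing data.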
I have made a few alterations to the functions. You might need to import the re package.
It's a bit manual, but I hope this will suffice. Have a great day!
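import re
import pandas as pd

# Collect the cleaned values column by column.
df_obj_skel = dict()
df_obj_skel['Name'] = list()
df_obj_skel['Team'] = list()
for index, row in df.iterrows():
    Name = row['Name']
    # Look for 2-4 trailing uppercase letters (the team code).
    Findings = re.search('[A-Z]{2,4}$', Name)
    # If no team code is found, leave the team empty and keep the name unchanged.
    Refined_Team = Findings[0] if Findings else ''
    # Cut the team code off the end of the name.
    Refined_Name = Name[:Findings.start()] if Findings else Name
    df_obj_skel['Team'].append(Refined_Team)
    df_obj_skel['Name'].append(Refined_Name)
df_final = pd.DataFrame(df_obj_skel)
print(df_final)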