How to strip words from a column value if the value contains specific substrings?
I have row values as such:
ID MyColumn
0 A "Best Position 3 5"
1 B "Healthy (unexpired)
2 C "At-Large"
3 D "Run 2 Position 1"
4 E "Hello"
4 E "None"
4 E "Tomorrow"
I want to scan this table for any rows that contain the substring "Position", and then for those rows keep only the first instance of an int. I have the lambda / regex for taking the first instance of an int in a value:
...str.replace(r'\D+', '').str.split()
but I'm not sure how to apply it only to the rows where the substring appears.
Resulting set:
ID MyColumn
0 A "3"
1 B "Healthy (unexpired)
2 C "At-Large"
3 D "2"
4 E "Hello"
4 E "None"
4 E "Tomorrow"
We might be able to use str.replace here with a smart regex:
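# Raw strings keep \d and the \1\2 backreferences intact; regex=True is required on pandas 2.0+
regex = r'.*?(\d+).*(?:Position|unexpired).*|.*?(?:Position|unexpired).*?(\d+).*'
df['new'] = df['MyColumn'].str.replace(regex, r'\1\2', case=False, regex=True)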
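To check the idea end to end, here is a minimal sketch on sample data rebuilt from the question. The DataFrame contents (quotes dropped, duplicate rows omitted) are an assumption about what the real data looks like. The pattern has two alternatives because the first integer may appear before or after the keyword. Only one capture group matches in each case, so the replacement `\1\2` yields that single number.

```python
import pandas as pd

# Sample rows rebuilt from the question (an assumption about the real data).
df = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "MyColumn": [
        "Best Position 3 5",
        "Healthy (unexpired)",
        "At-Large",
        "Run 2 Position 1",
        "Hello",
    ],
})

# Alternative 1: first integer appears before the keyword.
# Alternative 2: first integer appears after the keyword.
pattern = (
    r".*?(\d+).*(?:Position|unexpired).*"
    r"|.*?(?:Position|unexpired).*?(\d+).*"
)

# Rows that do not match the pattern at all are left unchanged.
df["new"] = df["MyColumn"].str.replace(pattern, r"\1\2", case=False, regex=True)
result = df["new"].tolist()
print(result)
```

Rows that contain a keyword but no digit, such as "Healthy (unexpired)", do not match either alternative and keep their original text.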
Use Series.str.contains to build a boolean mask and Series.str.extract to get the first integer, substitute it into the matching rows with Series.mask, and finally restore the original values where nothing was extracted with Series.fillna:
mask = df['MyColumn'].str.contains('Position|unexpired', case=False)
df['MyColumn'] = (df['MyColumn'].mask(mask, df['MyColumn'].str.extract(r'(\d+)', expand=False))
                                .fillna(df['MyColumn']))
print(df)
ID MyColumn
0 A 3
1 B "Healthy (unexpired)
2 C "At-Large"
3 D 2
4 E "Hello"
4 E "None"
4 E "Tomorrow"
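The same steps can be split into named intermediate Series to make each stage easier to inspect. This is only a sketch: the sample DataFrame and the variable names `has_keyword` and `first_int` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample rows rebuilt from the question (an assumption about the real data).
df = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "MyColumn": [
        "Best Position 3 5",
        "Healthy (unexpired)",
        "At-Large",
        "Run 2 Position 1",
        "Hello",
    ],
})

# Step 1: boolean mask of rows containing either keyword.
has_keyword = df["MyColumn"].str.contains("Position|unexpired", case=False)

# Step 2: first run of digits in every row (NaN where there is none).
first_int = df["MyColumn"].str.extract(r"(\d+)", expand=False)

# Step 3: swap in the extracted number on keyword rows, then fall back to
# the original text where the keyword row had no digits at all.
df["MyColumn"] = df["MyColumn"].mask(has_keyword, first_int).fillna(df["MyColumn"])

result = df["MyColumn"].tolist()
print(result)
```

The `fillna` step matters for rows like "Healthy (unexpired)": the mask selects them, but `str.extract` finds no digits, so without the fallback they would end up as NaN.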