Python/Pandas: How to process a column of data when it meets certain criteria
I have a CSV like this:
userlabel|country
SZ5GZTD_[56][13631808]|russia
YZ5GZTC-3_[51][13680735]|uk
XZ5GZTA_12-[51][13574893]|usa
testYZ5GZWC_11-[51][13632101]|cuba
I use pandas to read this CSV. I'd like to add a new column ci whose value comes from userlabel, and the following conditions must be met.
The code is like this:
(df['userlabel'].str.lower()).str.extract(r"(test)?([a-z]+).*", expand=True)[1]
When it matches, ci is the number between the first "-" or "_" and the second "-" or "_" in userlabel.
The pseudocode is like this:
ci = (userlabel,r'.*(\_|\-)(\d+)(\_|\-).*',2)
Finally, the result should look like this:
userlabel                      ci  country
SZ5GZTD_[56][13631808]             russia
YZ5GZTC-3_[51][13680735]       3   uk
XZ5GZTA_12-[51][13574893]          usa
testYZ5GZWC_11-[51][13632101]  11  cuba
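import re

def get_val(s):
    # Match labels starting with YZ or testYZ; capture the digits between two separators.
    l = re.findall(r'^(YZ|testYZ).*[_-](\d+)[_-].*', s)
    # findall returns (prefix, digits) tuples; the digits are the second group.
    return None if len(l) == 0 else l[0][1]

df['ci'] = df['userlabel'].apply(get_val)
df = df[['userlabel', 'ci', 'country']]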
userlabel ci country
0 SZ5GZTD_[56][13631808] None russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] None usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
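For completeness, here is a minimal end-to-end sketch of this approach, assuming the file is pipe-delimited as in the question's sample. The data is read from an in-memory string via io.StringIO only to keep the example self-contained; with a real file, its path would be passed to pd.read_csv instead. The names CSV_TEXT, CI_PATTERN and extract_ci are illustrative and do not come from the original post.

```python
import io
import re

import pandas as pd

# Sample data from the question; it is pipe-delimited, so sep="|" is needed.
CSV_TEXT = """userlabel|country
SZ5GZTD_[56][13631808]|russia
YZ5GZTC-3_[51][13680735]|uk
XZ5GZTA_12-[51][13574893]|usa
testYZ5GZWC_11-[51][13632101]|cuba
"""

# Compiled once; same pattern as above, but with a non-capturing prefix group
# so that the digits are group 1.
CI_PATTERN = re.compile(r'^(?:YZ|testYZ).*[_-](\d+)[_-].*')


def extract_ci(label):
    """Return the digits between two separators, or None if there is no match."""
    m = CI_PATTERN.match(label)
    return m.group(1) if m else None


df = pd.read_csv(io.StringIO(CSV_TEXT), sep="|")
df["ci"] = df["userlabel"].apply(extract_ci)
df = df[["userlabel", "ci", "country"]]
print(df)
```

One caveat: the .* in this pattern is greedy, so if a label contained several "separator, digits, separator" runs, the last one would be captured rather than the first. The sample labels contain only one such run, so the result is the same here.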
You can use
import pandas as pd
df = pd.DataFrame({'userlabel':['SZ5GZTD_[56][13631808]','YZ5GZTC-3_[51][13680735]','XZ5GZTA_12-[51][13574893]','testYZ5GZWC_11-[51][13632101]'], 'country':['russia','uk','usa','cuba']})
df['ci'] = df['userlabel'].str.extract(r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]", expand=True)
>>> df['ci']
0 NaN
1 3
2 NaN
3 11
Name: ci, dtype: object
# To rearrange columns, add the following line:
df = df[['userlabel', 'ci', 'country']]
>>> df
userlabel ci country
0 SZ5GZTD_[56][13631808] NaN russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] NaN usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
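Note that str.extract returns strings, which is why the column is shown with dtype: object. If ci is needed as a number, one option is to convert it to pandas' nullable integer dtype, which keeps the missing values. This is a minimal sketch under that assumption; the column name ci_num is illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "userlabel": [
        "SZ5GZTD_[56][13631808]",
        "YZ5GZTC-3_[51][13680735]",
        "XZ5GZTA_12-[51][13574893]",
        "testYZ5GZWC_11-[51][13632101]",
    ],
    "country": ["russia", "uk", "usa", "cuba"],
})

# expand=False returns a Series when the pattern has a single capturing group.
ci = df["userlabel"].str.extract(
    r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]", expand=False
)

# to_numeric parses the digit strings (missing where there was no match);
# astype("Int64") then yields a nullable integer column instead of floats.
df["ci_num"] = pd.to_numeric(ci).astype("Int64")
print(df[["userlabel", "ci_num", "country"]])
```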
See the regex demo.
Regex details:
(?i) - makes the pattern case insensitive (no need to use str.lower())
^ - start of string
(?:yz|testyz) - a non-capturing group matching either yz or testyz
[^_-]* - zero or more chars other than _ and -
[_-] - the first _ or -
(\d+) - Group 1 (Series.str.extract requires a capturing group since it only returns the captured substring): one or more digits
[-_] - a - or _.
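The breakdown above can be checked outside pandas with the standard re module. The sketch below writes the same pattern in verbose form, one commented piece per line, and passes re.IGNORECASE as a flag in place of the inline (?i). The names CI_RE and first_number are illustrative only.

```python
import re

CI_RE = re.compile(
    r"""
    ^               # start of string
    (?:yz|testyz)   # prefix: yz or testyz
    [^_-]*          # zero or more chars other than _ and -
    [_-]            # the first _ or -
    (\d+)           # group 1: one or more digits
    [-_]            # a closing - or _
    """,
    re.IGNORECASE | re.VERBOSE,  # IGNORECASE plays the role of (?i)
)


def first_number(label):
    """Return the captured digits as a string, or None if the label does not match."""
    m = CI_RE.match(label)
    return m.group(1) if m else None


labels = [
    "SZ5GZTD_[56][13631808]",
    "YZ5GZTC-3_[51][13680735]",
    "XZ5GZTA_12-[51][13574893]",
    "testYZ5GZWC_11-[51][13632101]",
]
for label in labels:
    print(label, "->", first_number(label))
```

Because [^_-]* cannot cross a separator, the digits must come immediately after the first _ or - in the label; a label such as YZ5_abc_3_ would not match.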