Python/Pandas: how to process a column when its values meet certain conditions

Question

I have a CSV like this:

userlabel|country
SZ5GZTD_[56][13631808]|russia
YZ5GZTC-3_[51][13680735]|uk
XZ5GZTA_12-[51][13574893]|usa
testYZ5GZWC_11-[51][13632101]|cuba

I use pandas to read this CSV. I'd like to add a new column ci whose value comes from userlabel, and the following conditions must be met:

Convert the values to lowercase
The value starts with 'yz' or 'testyz'

The code is like this:

(df['userlabel'].str.lower()).str.extract(r"(test)?([a-z]+).*", expand=True)[1]

When it matches, ci is the number between the first "-" or "_" and the second "-" or "_" in userlabel.

The pseudocode is like this:

ci = (userlabel,r'.*(\_|\-)(\d+)(\_|\-).*',2)

Finally, the result should look like this:

userlabel                      ci country
SZ5GZTD_[56][13631808]            russia
YZ5GZTC-3_[51][13680735]       3  uk
XZ5GZTA_12-[51][13574893]         usa
testYZ5GZWC_11-[51][13632101]  11 cuba

Answer 1

import re

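def get_val(s):
    # findall returns a list of (prefix, digits) tuples; it is empty when the label does not match
    l = re.findall(r'^(YZ|testYZ).*[_-](\d+)[_-].*', s)
    return None if len(l) == 0 else l[0][1]

# Apply the helper to every label, then reorder the columns
df['ci'] = df['userlabel'].apply(get_val)
df = df[['userlabel', 'ci', 'country']]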

                       userlabel    ci country
0         SZ5GZTD_[56][13631808]  None  russia
1       YZ5GZTC-3_[51][13680735]     3      uk
2      XZ5GZTA_12-[51][13574893]  None     usa
3  testYZ5GZWC_11-[51][13632101]    11    cuba
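The pattern above is case sensitive, and its greedy .* could skip past the first separator if a label had several digit runs between separators. A minimal sketch of a stricter variant follows, assuming the prefix should match in any case and that only the first separator pair matters, as the question describes. The names CI_PATTERN and extract_ci are hypothetical and not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical pattern name: an optional "test", then "yz", then everything up to
# the first "_" or "-", then the digits, then the second "_" or "-".
CI_PATTERN = re.compile(r'^(?:test)?yz[^_-]*[_-](\d+)[_-]', re.IGNORECASE)

def extract_ci(label):
    """Return the digits between the first two separators, or None if no match."""
    m = CI_PATTERN.match(label)
    return m.group(1) if m else None

df = pd.DataFrame({
    'userlabel': ['SZ5GZTD_[56][13631808]', 'YZ5GZTC-3_[51][13680735]',
                  'XZ5GZTA_12-[51][13574893]', 'testYZ5GZWC_11-[51][13632101]'],
    'country': ['russia', 'uk', 'usa', 'cuba'],
})
df['ci'] = df['userlabel'].apply(extract_ci)
df = df[['userlabel', 'ci', 'country']]
```

Because [^_-]* cannot cross a separator, the digits must sit directly after the first "_" or "-", so a label whose first separator is followed by a non-digit yields no match.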

Answer 2

You can use

import pandas as pd
df = pd.DataFrame({'userlabel':['SZ5GZTD_[56][13631808]','YZ5GZTC-3_[51][13680735]','XZ5GZTA_12-[51][13574893]','testYZ5GZWC_11-[51][13632101]'], 'country':['russia','uk','usa','cuba']})
df['ci'] = df['userlabel'].str.extract(r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]", expand=True)
>>> df['ci']
0    NaN
1      3
2    NaN
3     11
Name: ci, dtype: object
# To rearrange columns, add the following line:
df = df[['userlabel', 'ci', 'country']]
>>> df
                       userlabel   ci country
0         SZ5GZTD_[56][13631808]  NaN  russia
1       YZ5GZTC-3_[51][13680735]    3      uk
2      XZ5GZTA_12-[51][13574893]  NaN     usa
3  testYZ5GZWC_11-[51][13632101]   11    cuba

See the regex demo.

Regex details:

(?i) - makes the pattern case insensitive (no need to use str.lower())
^ - start of string
(?:yz|testyz) - a non-capturing group matching either yz or testyz
[^_-]* - zero or more chars other than _ and -
[_-] - the first _ or -
(\d+) - Group 1 (Series.str.extract requires a capturing group since it only returns the captured substring): one or more digits
[-_] - a - or _.
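The extracted values are strings, with NaN where the pattern does not match. If numeric values are wanted, one option is to convert them afterwards. The sketch below assumes a pandas version that supports the nullable Int64 dtype; the column name ci_num is hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'userlabel': ['SZ5GZTD_[56][13631808]', 'YZ5GZTC-3_[51][13680735]',
                  'XZ5GZTA_12-[51][13574893]', 'testYZ5GZWC_11-[51][13632101]'],
    'country': ['russia', 'uk', 'usa', 'cuba'],
})

pattern = r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]"

# expand=False returns a Series when the pattern has a single capturing group
df['ci'] = df['userlabel'].str.extract(pattern, expand=False)

# Optional: convert the digit strings to nullable integers; non-matching rows become <NA>
df['ci_num'] = pd.to_numeric(df['ci']).astype('Int64')
```

Using expand=False here is a design choice, not a requirement: with expand=True the result is a one-column DataFrame, which also works in the assignment shown in the answer.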


Answer 1
2 2021-02-06 12:24:26

Answer 2
2 Accepted 2021-02-06 13:02:00
