Split strings in a DataFrame and keep only certain parts

Question

I have a DataFrame like this:

import pandas as pd

x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data=x, columns=['id'])

    id
0   3.13.1.7-2.1
1   3.21.1.8-2.2
2   4.20.1.6-2.1
3   4.8.1.2-2.0
4   5.23.1.10-2.2

I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional boolean column (in the above example, index 0 would be True and all others would be False). But I could live with multiple additional columns, where one or more contain individual string parts and one holds the boolean value.

I first tried to just split the string into parts:

df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))

This worked. However, if I try to isolate only the second part of the string like this...

df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])

...I get an error that the list index is out of range.

Yet, if I check any individual index in the DataFrame like this...

df['id_split'][0][1]

...this works, producing only the second item in the list of strings.

I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!

Answer 1

Let's use str.split to get the parts, then you can compare:

parts = df['id'].str.split(r'\.', expand=True)

(parts[[1, 2]] == ['13', '1']).all(axis=1)

Output:

0     True
1    False
2    False
3    False
4    False
dtype: bool
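
To attach that result to the DataFrame as a column, something like the following should work. This is a minimal sketch, not part of the original answer: the column name `flag` and the two extra rows (an id without periods and a missing value) are assumptions added to show that `expand=True` pads short rows instead of raising an error.

```python
import pandas as pd

# Two ids from the question, plus two hypothetical problem rows
df = pd.DataFrame({'id': ['3.13.1.7-2.1', '3.21.1.8-2.2', 'no-periods', None]})

# A single-character pattern is treated as a literal '.', not a regex;
# expand=True returns one column per part and pads short rows with missing values
parts = df['id'].str.split('.', expand=True)

# Missing parts compare as False, so short or missing ids do not raise
df['flag'] = parts[1].eq('13') & parts[2].eq('1')
print(df)
```

Only the first row ends up flagged; the rows without enough parts simply come out False.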

Answer 2

You can do something like this:

df['flag'] = df['id'].apply(lambda x: True if x.split('.')[1] == '13' and x.split('.')[2]=='1' else False)

Output:

            id   flag
0   3.13.1.7-2.1   True
1   3.21.1.8-2.2  False
2   4.20.1.6-2.1  False
3    4.8.1.2-2.0  False
4  5.23.1.10-2.2  False
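
Note that indexing with `[1]` and `[2]` raises the "list index out of range" error from the question whenever a row yields fewer than three parts, for example an id with no periods, or a missing value that `str(x)` turns into a single token. A hedged variant, assuming such rows exist in the real data, is to compare a slice instead; the helper name `has_13_then_1` and the extra sample rows are made up for illustration.

```python
import pandas as pd

# Two ids from the question, plus two hypothetical problem rows
df = pd.DataFrame({'id': ['3.13.1.7-2.1', '4.8.1.2-2.0', 'no-periods', None]})

def has_13_then_1(value):
    # Slicing never raises: a list that is too short just yields a shorter slice,
    # which then fails the comparison and returns False
    return str(value).split('.')[1:3] == ['13', '1']

df['flag'] = df['id'].apply(has_13_then_1)
print(df)
```

This also splits each string once rather than twice.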

Answer 3

You can do it directly, like below:

df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')

>>> print(df)
              id    new
0   3.13.1.7-2.1   True
1   3.21.1.8-2.2  False
2   4.20.1.6-2.1  False
3    4.8.1.2-2.0  False
4  5.23.1.10-2.2  False
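
Since the question mentions millions of rows, another option that avoids calling a Python lambda per row is a single regular-expression match. This is only a sketch of an alternative, not one of the posted answers; the pattern and the extra sample ids are assumptions, and whether it is actually faster on a given dataset is worth timing.

```python
import pandas as pd

# Two ids from the question, plus two hypothetical edge cases
df = pd.DataFrame({'id': ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.13.10.6-2.1', '13.1']})

# First part (no dots), then ".13.1", then either another "." or the end of the
# string, so a third part such as "10" is not mistaken for "1"
pattern = r'[^.]*\.13\.1(?:\.|$)'

# str.match anchors the pattern at the start of each string
df['flag'] = df['id'].str.match(pattern)
print(df)
```

If the column can contain missing values, passing `na=False` to `str.match` keeps the result strictly boolean.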

Answer 1
1 2020-11-25 17:16:58

Answer 2
1 2020-11-25 17:19:24

Answer 3
0 Accepted 2020-11-25 17:18:18
