Split a column into multiple columns based on its content in Pandas

Question

I have a column with data like this:

Ticket NO: 123456789; Location ID:ABC123; Type:Network;

Ticket No. 132123456, Location ID:ABC444; Type:App

Tickt#222256789; Location ID:AMC121; Type:Network;

This is what I am trying:

new = data["Description"].str.split(";", n = 1, expand = True)
data["Ticket"]= new[0]
data["Location"]= new[1]  
data["Type"]= new[2]

# Dropping old  columns
data.drop(columns =["Description"], inplace = True)

I can split on ";", but how do I split on both ";" and ","?

Answer 1

Here is a more general solution, one that lets you comfortably add as much processing as you like. Let's start by defining an example dataframe for easy debugging:

import re
import pandas as pd
df = pd.DataFrame({'Description': [
    'Ticket NO: 123456789 , Location ID:ABC123; Type:Network;',
    'Ticket NO: 123456789 ; Location ID:ABC123; Type:Network;']})

Then, let's define our processing function, where you can do anything you like:

def process(row):
    parts = re.split(r'[,;]', row)
    return pd.Series({'Ticket': parts[0], 'Location': parts[1], 'Type': parts[2]})

In addition to splitting on , and ; and separating the result into the 3 sections, you can add code that strips whitespace characters, removes whatever is to the left of the colons, and so on. For example, try:

def process(row):
    parts = re.split(r'[,;]', row)
    data = {}
    for part in parts:
        for field in ['Ticket', 'Location', 'Type']:
            if field.lower() in part.lower():
                data[field] = part.split(':')[1].strip()
    return pd.Series(data)

Finally, apply the function to get the result:

df['Description'].apply(process)

This is much more readable and maintainable than doing everything in a single regex, especially as you might end up needing additional processing.

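The result is a DataFrame with the columns Ticket, Location and Type holding the stripped values (for the example dataframe above: 123456789, ABC123 and Network in both rows).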

To add this output to the original dataframe, simply run:

df[['Ticket', 'Location', 'Type']] = df['Description'].apply(process)
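Note that the keyword-based version of process above relies on every part containing a colon and a correctly spelled field name, which does not hold for all rows in the question ("Ticket No. 132123456" has no colon, "Tickt#222256789" is misspelled). A minimal sketch of a positional variant follows, assuming the three fields always appear in the same order; the FIELDS constant and the "last run of word characters" rule are illustrative choices, not part of the original answer:

```python
import re
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({"Description": [
    "Ticket NO: 123456789; Location ID:ABC123; Type:Network;",
    "Ticket No. 132123456, Location ID:ABC444; Type:App",
    "Tickt#222256789; Location ID:AMC121; Type:Network;",
]})

# Hypothetical helper constant: output column names, in the assumed field order.
FIELDS = ["Ticket", "Location", "Type"]

def process(text):
    # Split on either delimiter and drop empty pieces (e.g. after a trailing ";").
    parts = [p.strip() for p in re.split(r"[,;]", text) if p.strip()]
    # Assumption: fields are always in the order Ticket, Location, Type.
    # Keep only the last run of word characters in each part as its value.
    values = [re.search(r"(\w+)\W*$", p).group(1) for p in parts[:3]]
    return pd.Series(dict(zip(FIELDS, values)))

result = df["Description"].apply(process)
print(result)
```

Relying on position rather than on field names trades one assumption for another: it tolerates typos and missing colons, but it would mislabel values if a row listed the fields in a different order.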

Answer 2

You can use

new = data["Description"].str.split("[;,]", n = 2, expand = True)
new.columns = ['Ticket', 'Location', 'Type']

Output:

>>> new
                  Ticket             Location            Type
0  Ticket NO: 123456789    Location ID:ABC123   Type:Network;
1   Ticket No. 132123456   Location ID:ABC444        Type:App
2       Tickt#222256789    Location ID:AMC121   Type:Network;

The [;,] regex matches either a ; or a , character, and n=2 limits the number of splits to two, so each row yields at most three parts.
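To see what the n=2 limit changes, here is a small sketch on a single made-up row (the shortened values are illustrative). It relies on pandas treating a multi-character pattern as a regular expression by default:

```python
import pandas as pd

# One illustrative row with a trailing ";" like the rows in the question.
s = pd.Series(["Ticket NO: 1; Location ID:A1; Type:Network;"])

# At most two splits: three columns, the last one keeps its trailing ";".
limited = s.str.split(r"[;,]", n=2, expand=True)

# No limit: the trailing ";" produces an extra, empty fourth column.
unlimited = s.str.split(r"[;,]", expand=True)

print(limited.shape)
print(unlimited.shape)

# Optional cleanup: strip surrounding spaces and the leftover ";" from each cell.
cleaned = limited.apply(lambda col: col.str.strip(" ;"))
print(cleaned)
```

Without the limit, assigning three column names to the result would fail for rows that end in a delimiter, which is why capping the split count is convenient here.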

Another solution, using a regex with Series.str.extract:

>>> new[['Ticket', 'Location', 'Type']] = data['Description'].str.extract(r"(?i)Ticke?t\D*(\d+)\W*Location ID\W*(\w+)\W*Type:(\w+)")
>>> new
      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network

See the regex demo. Details:

(?i) - case insensitive flag
Ticke?t - Ticket with an optional e
\D* - zero or more non-digit chars
(\d+) - Group 1: one or more digits
\W* - zero or more non-word chars
Location ID - a literal string
\W* - zero or more non-word chars
(\w+) - Group 2: one or more word chars
\W* - zero or more non-word chars
Type: - a literal string
(\w+) - Group 3: one or more word chars
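The breakdown above can also be kept next to the pattern itself. The following sketch rewrites the same regex with re.VERBOSE comments and named groups, which str.extract uses as column names; the group names and the PATTERN variable are illustrative choices, not part of the original answer:

```python
import re
import pandas as pd

df = pd.DataFrame({"Description": [
    "Ticket NO: 123456789; Location ID:ABC123; Type:Network;",
    "Ticket No. 132123456, Location ID:ABC444; Type:App",
    "Tickt#222256789; Location ID:AMC121; Type:Network;",
]})

# Same pattern as above, spread over several lines for readability.
PATTERN = r"""
    Ticke?t            # "Ticket", also matches the misspelling "Tickt"
    \D*                # skip everything up to the first digit
    (?P<Ticket>\d+)    # ticket number
    \W*
    Location\ ID       # literal text; the space is escaped because of re.VERBOSE
    \W*
    (?P<Location>\w+)  # location code
    \W*
    Type:
    (?P<Type>\w+)      # ticket type
"""

# Named groups become the column names of the returned DataFrame.
result = df["Description"].str.extract(PATTERN, flags=re.IGNORECASE | re.VERBOSE)
print(result)
```

The flags are passed through the flags argument here instead of the inline (?i), so that both options sit in one place.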

Answer 3

One approach using str.extract

Example:

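import re
# Note: [Ticket\sNO:.#] is a character class, so it matches only the single character right before the digits.
df[['Ticket', 'Location', 'Type']] = df['Description'].str.extract(r"[Ticket\sNO:.#](\d+).*ID:([A-Z0-9]+).*Type:([A-Za-z]+)", flags=re.I)
print(df[['Ticket', 'Location', 'Type']])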

Output:

      Ticket Location     Type
0  123456789   ABC123  Network
1  132123456   ABC444      App
2  222256789   AMC121  Network

Answer 1
1 (accepted) 2021-05-18 15:43:23

Answer 2
0 2021-05-18 15:35:17

Answer 3
0 2021-05-18 15:39:21
