Splitting a string into multiple variable fields with a regular expression in Python

Question

I have a dataframe where each row of a certain column is text that comes from a badly formatted form, in which each 'field' follows its 'field title'. For example:

col
Name: Bob Surname: Ross Title: painter age:34
Surname: Isaac Name: Newton Title: coin checker age: 42
age:20 Title: pilot Name: jack
this is some trash text Name: John Surname: Doe

As the example shows, the fields can appear in any order and some of them may be missing.

What I need to do is parse the fields so that the second row becomes something like:

{'Name': 'Newton', 'Surname': 'Isaac', ...}

While I can handle the 'pythonic part', I believe the parsing should be done with a regex (not least because there are thousands of rows), but I have no idea how to design it.

Answer 1

Try:

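# Capture every "key: value" pair; the lookahead ends a value at the next "key:" or at end of string
x = df["col"].str.extractall(r"([^\s:]+):\s*(.+?)\s*(?=[^\s:]+:|\Z)")
# Drop the per-row match counter, then spread keys (column 0) into columns holding the values (column 1)
x = x.droplevel(level="match").pivot(columns=0, values=1)

# Build one dict per row, leaving out the fields that row does not have
print(x.apply(lambda x: x[x.notna()].to_dict(), axis=1).to_list())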

Prints:

[
    {"Name": "Bob", "Surname": "Ross", "Title": "painter", "age": "34"},
    {
        "Name": "Newton",
        "Surname": "Isaac",
        "Title": "coin checker",
        "age": "42",
    },
    {"Name": "jack", "Title": "pilot", "age": "20"},
]
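
For readers who want to see what the pattern does outside pandas, here is a minimal sketch that applies the same regex with the standard `re` module to the four sample rows from the question. The names `FIELD_RE`, `parse_fields`, `rows` and `parsed` are made up for this illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer: a key made of non-space, non-colon characters,
# a colon, then a lazily matched value that ends right before the next "key:"
# or at the end of the string.
FIELD_RE = re.compile(r"([^\s:]+):\s*(.+?)\s*(?=[^\s:]+:|\Z)")


def parse_fields(text):
    """Return a dict mapping each field title to its value for one row of text."""
    return dict(FIELD_RE.findall(text))


# Sample rows copied from the question.
rows = [
    "Name: Bob Surname: Ross Title: painter age:34",
    "Surname: Isaac Name: Newton Title: coin checker age: 42",
    "age:20 Title: pilot Name: jack",
    "this is some trash text Name: John Surname: Doe",
]

parsed = [parse_fields(row) for row in rows]
for record in parsed:
    print(record)
```

Leading text with no colon (the "trash text" in the last row) is skipped because no key can match there. One limitation to keep in mind: any value that itself contains a word followed by a colon, such as a time like 10:30, would be split into a new field. On a dataframe, something like `df["col"].apply(parse_fields)` should give one dict per row, assuming the column holds plain strings.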

Answer 1: 0, accepted, 2022-05-24 09:32:45
