How to decode a string using regex?

Question

I have a file that contains user data in rows, stored in a cryptic format. I want to decode it and create a dataframe.

Sample row: AN04N010105SANDY0205SMITH030802031989

Note: AN04N01 is a standard 7-character string at the start that marks the row as valid.

Here 0105SANDY refers to the 1st column (name), with a value of length 5:

01 -> 1st column (the name column)
05 -> length of the name (SANDY)

Similarly, 0205SMITH refers to:

02 -> 2nd column (the surname column)
05 -> length of the surname (SMITH)

Similarly, 030802031989 refers to:

03 -> 3rd column (DOB)
08 -> length of the DOB
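
To make the layout concrete, here is a minimal sketch that slices the first field out of the sample row by hand. The variable names are only illustrative; the offsets follow from the 7-character prefix and the two-digit code and length described above.

```python
row = "AN04N010105SANDY0205SMITH030802031989"

prefix = row[:7]               # "AN04N01", the validity marker
code = row[7:9]                # "01" -> name column
length = int(row[9:11])        # "05" -> 5 characters follow
value = row[11:11 + length]    # "SANDY"

print(prefix, code, length, value)
```

The next field starts right after the value, at index 11 + length, so the number of characters to read always comes from the two digits that precede each value.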

I want a dataframe like this:

| name  | surname | DOB      |
| SANDY | SMITH   | 02031989 |

I tried to use regex, but I don't know how to put the result into a dataframe after identifying the names. Also, how do you find the number of characters to read?

Answer 1

Rather than using regex for groups that might be out of order and of varying length, it might be simpler to consume the string serially.

With the following, you track an index i through the string and consume two characters for the code, then two for the length, and finally the variable number of characters given by length. Then you store the values in a dict, append the dicts to a list, and turn that list of dicts into a dataframe. As a bonus, it works with the elements in any order.

import pandas as pd

test_strings = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N010103BOB0205SMITH0306210876",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
    ]

code_map = {"01": "name", "02": "surname", "03": "DOB"}

def parse(s):
    i = 7
    d = {}
    while i < len(s):
        code, i = s[i:i+2], i+2  # read code
        length, i = int(s[i:i+2]), i+2  # read length
        val, i = s[i:i+length], i + length  # read value
        d[code_map[code]] = val  # store value
    return d

ds = []

for s in test_strings:
    if not s.startswith("AN04N01"):
        continue
    ds.append(parse(s))

df = pd.DataFrame(ds)

df contains:

    name surname       DOB
0  ALICE   ADAMS  02031989
1    BOB   SMITH    210876
2   FRED    OWEN    010101
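
If you still want a regex in the loop, one option is to use it only for the fixed 4-digit header (code plus length) and slice the value yourself. The sketch below is a variation on the code above, not a replacement for it: the names HEADER, CODE_MAP, PREFIX and parse_with_regex are made up for illustration, and returning None for unknown codes or truncated rows is an assumption about how bad input should be handled.

```python
import re

HEADER = re.compile(r"(\d{2})(\d{2})")  # two-digit code, two-digit length
CODE_MAP = {"01": "name", "02": "surname", "03": "DOB"}
PREFIX = "AN04N01"

def parse_with_regex(s):
    """Return a dict of decoded fields, or None if the row is invalid or malformed."""
    if not s.startswith(PREFIX):
        return None
    pos, fields = len(PREFIX), {}
    while pos < len(s):
        m = HEADER.match(s, pos)  # match the header at the current position
        if m is None or m.group(1) not in CODE_MAP:
            return None
        length = int(m.group(2))
        start, end = m.end(), m.end() + length
        if end > len(s):  # declared length runs past the end of the row
            return None
        fields[CODE_MAP[m.group(1)]] = s[start:end]
        pos = end
    return fields

rows = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
]
parsed = [p for p in map(parse_with_regex, rows) if p is not None]
print(parsed)
```

The resulting list of dicts can be passed to pd.DataFrame exactly as in the code above.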

Answer 2

Here is a regex for this pattern:

(\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d+\w{5})(\d+)

Or use this pattern:

(\D{5})\d+(\D+)\d+(02\d+)
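
A quick check of the second pattern with Python's re module, as a sketch. It does pull the three values out of the sample row, but it depends on the shape of that row: a first name of exactly five letters and a DOB value that begins with 02. The second string below is borrowed from the test data in Answer 1 only to illustrate that limit.

```python
import re

PATTERN = re.compile(r"(\D{5})\d+(\D+)\d+(02\d+)")

sample = "AN04N010105SANDY0205SMITH030802031989"
other = "AN04N010103BOB0205SMITH0306210876"  # 3-letter name, 6-digit DOB

m = PATTERN.search(sample)
print(m.groups() if m else None)  # ('SANDY', 'SMITH', '02031989')

print(PATTERN.search(other))  # None: no match for this row
```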

Answer 3

Try:

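import pandas as pd


def fn(x):
    # drop the 7-character validity prefix
    rv, x = [], x[7:]
    while x:
        # skip the 2-character column code, read the 2-character length
        _, n, x = x[:2], x[2:4], x[4:]
        # take the next n characters as the value
        value, x = x[: int(n)], x[int(n) :]
        rv.append(value)
    return rv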


m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)

Prints:

                                     row   NAME SURNAME       DOB
0  AN04N010105SANDY0205SMITH030802031989  SANDY   SMITH  02031989
1  AN04N010105BANDY0205BMITH030802031989  BANDY   BMITH  02031989
2  AN04N010105CANDY0205CMITH030802031989  CANDY   CMITH  02031989
3  XXXXXXX0105DANDY0205DMITH030802031989    NaN     NaN       NaN

Dataframe used:

                                     row
0  AN04N010105SANDY0205SMITH030802031989
1  AN04N010105BANDY0205BMITH030802031989
2  AN04N010105CANDY0205CMITH030802031989
3  XXXXXXX0105DANDY0205DMITH030802031989
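
For completeness, a self-contained sketch that builds a small input frame and decodes it end to end. It is a variation on the code above, with hypothetical names (PREFIX, decode, valid, decoded), and it builds the new columns from a list of lists instead of .apply(pd.Series), which can be slow on large frames. Like the answer's code, it assumes the fields always appear in the order name, surname, DOB.

```python
import pandas as pd

PREFIX = "AN04N01"

def decode(row):
    """Return the field values of one row in the order they appear."""
    values, rest = [], row[len(PREFIX):]
    while rest:
        length = int(rest[2:4])  # skip the 2-char column code, read the length
        values.append(rest[4:4 + length])
        rest = rest[4 + length:]
    return values

df = pd.DataFrame({"row": [
    "AN04N010105SANDY0205SMITH030802031989",
    "XXXXXXX0105DANDY0205DMITH030802031989",
]})

valid = df["row"].str.startswith(PREFIX)
decoded = pd.DataFrame(
    df.loc[valid, "row"].map(decode).tolist(),
    index=df.index[valid],
    columns=["NAME", "SURNAME", "DOB"],
)
df = df.join(decoded)  # rows without the prefix get NaN
print(df)
```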

3 answers

Answer 1: 2022-08-29 11:54:06
Answer 2: 2022-08-29 12:03:59
Answer 3: 2022-08-29 12:21:29
