
How to decode a string using regex?

I have a file containing user data, one row per user, stored in a cryptic format. I want to decode it and create a dataframe.

Sample row: AN04N010105SANDY0205SMITH030802031989

Note: AN04N01 is a standard 7-character string at the start of the row that marks the row as valid.

  1. Here 0105SANDY refers to the 1st column (name), with a value of length 5
  • 01 -> 1st column (the name column)
  • 05 -> length of the name (SANDY)
  2. Similarly, 0205SMITH refers to
  • 02 -> 2nd column (the surname column)
  • 05 -> length of the surname (SMITH)
  3. Similarly, 030802031989 refers to
  • 03 -> 3rd column (DOB)
  • 08 -> length of the DOB

I want a data frame like:

| name  | surname | DOB      |
| SANDY | SMITH   | 02031989 |

I was trying to use regex, but I don't know how to put the results into a data frame after identifying the names. Also, how do you find the number of characters to read?

Rather than using regex for groups that might be out of order and of varying length, it might be simpler to consume the string serially.

With the following, you track an index i through the string and consume two characters for the code, then two for the length, and finally the variable number of characters given by length. Then you store the values in a dict, append each dict to a list, and turn that list of dicts into a dataframe. As a bonus, it works with the elements in any order.

import pandas as pd

test_strings = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N010103BOB0205SMITH0306210876",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
    ]

code_map = {"01": "name", "02": "surname", "03": "DOB"}

def parse(s):
    i = 7
    d = {}
    while i < len(s):
        code, i = s[i:i+2], i+2  # read code
        length, i = int(s[i:i+2]), i+2  # read length
        val, i = s[i:i+length], i + length  # read value
        d[code_map[code]] = val  # store value
    return d

ds = []

for s in test_strings:
    if not s.startswith("AN04N01"):
        continue
    ds.append(parse(s))

df = pd.DataFrame(ds)

df contains:

    name surname       DOB
0  ALICE   ADAMS  02031989
1    BOB   SMITH    210876
2   FRED    OWEN    010101
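The parser above assumes every row is well formed: an unknown code raises a KeyError, and a row that ends early yields a silently shortened value. A minimal sketch of a stricter variant, assuming you would rather get a ValueError in those cases. The names `parse_strict`, `CODE_MAP` and `PREFIX` are illustrative and not part of the original answer:

```python
CODE_MAP = {"01": "name", "02": "surname", "03": "DOB"}
PREFIX = "AN04N01"


def parse_strict(s, code_map=CODE_MAP, prefix=PREFIX):
    """Decode one row of code/length/value fields, raising ValueError on bad input."""
    if not s.startswith(prefix):
        raise ValueError(f"missing prefix {prefix!r}")
    i = len(prefix)
    record = {}
    while i < len(s):
        header = s[i:i + 4]  # two digits of code, two digits of length
        if len(header) < 4 or not header.isdigit():
            raise ValueError(f"bad field header {header!r} at offset {i}")
        code, length = header[:2], int(header[2:])
        i += 4
        if code not in code_map:
            raise ValueError(f"unknown field code {code!r}")
        value = s[i:i + length]
        if len(value) != length:
            raise ValueError(f"field {code!r} is truncated")
        record[code_map[code]] = value
        i += length
    return record


print(parse_strict("AN04N010105SANDY0205SMITH030802031989"))
# {'name': 'SANDY', 'surname': 'SMITH', 'DOB': '02031989'}
```

The resulting dicts can be collected in a list and passed to pd.DataFrame exactly as before. Rows that raise can be logged or skipped, depending on how strict you want the import to be.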

Here is a regex for this pattern:

(\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d+\w{5})(\d+)

Or use this pattern:

(\D{5})\d+(\D+)\d+(02\d+)
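Both patterns above hard-code field widths or digit values taken from the sample row, and Python's `re` has no way to use a captured number as a repeat count. If you still want a regex involved, one option is to match only the four-digit field header and slice the value yourself. A sketch under that assumption; `decode_row`, `HEADER` and `COLUMNS` are illustrative names:

```python
import re

HEADER = re.compile(r"(\d{2})(\d{2})")  # field code, then value length
COLUMNS = {"01": "name", "02": "surname", "03": "DOB"}


def decode_row(row, prefix="AN04N01"):
    """Return a dict of column -> value, or None for an invalid row."""
    if not row.startswith(prefix):
        return None
    pos, out = len(prefix), {}
    while pos < len(row):
        m = HEADER.match(row, pos)  # anchor the match at the current offset
        if m is None:
            return None  # no 4-digit header where one was expected
        code, length = m.group(1), int(m.group(2))
        start = m.end()
        # Unknown codes are kept under their raw code (an assumption of this sketch).
        out[COLUMNS.get(code, code)] = row[start:start + length]
        pos = start + length
    return out


print(decode_row("AN04N010103BOB0205SMITH0306210876"))
# {'name': 'BOB', 'surname': 'SMITH', 'DOB': '210876'}
```

Because each field is looked up by its code, the fields can appear in any order and have any length up to 99 characters.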

Try:

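import pandas as pd


def fn(x):
    # Drop the 7-character prefix, then read fields until the string is used up.
    rv, x = [], x[7:]
    while x:
        # 2-digit column code (ignored), 2-digit length, then the rest.
        _, n, x = x[:2], x[2:4], x[4:]
        value, x = x[: int(n)], x[int(n) :]
        rv.append(value)  # values are kept in the order they appear
    return rv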


m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)

Prints:

                                     row   NAME SURNAME       DOB
0  AN04N010105SANDY0205SMITH030802031989  SANDY   SMITH  02031989
1  AN04N010105BANDY0205BMITH030802031989  BANDY   BMITH  02031989
2  AN04N010105CANDY0205CMITH030802031989  CANDY   CMITH  02031989
3  XXXXXXX0105DANDY0205DMITH030802031989    NaN     NaN       NaN

DataFrame used:

                                     row
0  AN04N010105SANDY0205SMITH030802031989
1  AN04N010105BANDY0205BMITH030802031989
2  AN04N010105CANDY0205CMITH030802031989
3  XXXXXXX0105DANDY0205DMITH030802031989
