How to decode a string using regex?
I have a file which contains user data in rows, stored in some cryptic format. I want to decode it and create a dataframe.
Sample row: AN04N010105SANDY0205SMITH030802031989
Note: AN04N01 is a standard 7-character string at the start that denotes the row is valid.
0105SANDY refers to the 1st column (name), with length 5.
0205SMITH refers to the 2nd column (surname), with length 5.
030802031989 refers to the 3rd column (DOB), with length 8.
I want a data frame like this:
| name | surname | DOB |
|Sandy | SMITH | 02031989 |
I was trying to use regex, but I don't know how to put the result into a data frame after identifying the names. Also, how would you find the number of characters to read?
Rather than using regex for groups that might be out of order and of varying length, it might be simpler to consume the string serially.
With the following, you track an index i through the string and consume two characters for the code, then two for the length, and finally the variable number of characters given by the length. Then you store the values in a dict, append the dicts to a list, and turn that list of dicts into a dataframe. As a bonus, it works with the elements in any order.
import pandas as pd

test_strings = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N010103BOB0205SMITH0306210876",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
]

code_map = {"01": "name", "02": "surname", "03": "DOB"}

def parse(s):
    i = 7  # skip the 7-character validity prefix
    d = {}
    while i < len(s):
        code, i = s[i:i+2], i+2  # read code
        length, i = int(s[i:i+2]), i+2  # read length
        val, i = s[i:i+length], i + length  # read value
        d[code_map[code]] = val  # store value
    return d

ds = []
for s in test_strings:
    if not s.startswith("AN04N01"):
        continue  # ignore rows without the valid prefix
    ds.append(parse(s))

df = pd.DataFrame(ds)
df
df contains:
name surname DOB
0 ALICE ADAMS 02031989
1 BOB SMITH 210876
2 FRED OWEN 010101
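The parse function above assumes every row is well formed. Here is a minimal sketch of a stricter variant, assuming that an unknown code, a non-numeric length, or a truncated value should raise ValueError. The name parse_strict and that error policy are assumptions for illustration, not part of the answer above.

```python
CODE_MAP = {"01": "name", "02": "surname", "03": "DOB"}
PREFIX = "AN04N01"

def parse_strict(s, code_map=CODE_MAP, prefix=PREFIX):
    """Parse one row into a dict; raise ValueError on malformed input."""
    if not s.startswith(prefix):
        raise ValueError(f"missing prefix {prefix!r}")
    i, d = len(prefix), {}
    while i < len(s):
        header = s[i:i + 4]  # 2-character code + 2-digit length
        if len(header) < 4 or not header[2:].isdecimal():
            raise ValueError(f"bad field header at offset {i}: {header!r}")
        code, length = header[:2], int(header[2:])
        if code not in code_map:
            raise ValueError(f"unknown code {code!r} at offset {i}")
        i += 4
        val = s[i:i + length]
        if len(val) != length:  # the row ended before the field did
            raise ValueError(f"field {code!r} truncated at offset {i}")
        d[code_map[code]] = val
        i += length
    return d
```

Rows that raise can be logged and skipped before building the dataframe, so one corrupt line does not silently shift the fields that follow it.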
Here is a regex for this pattern:
(\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d+\w{5})(\d+)
Or use this pattern:
(\D{5})\d+(\D+)\d+(02\d+)
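Both patterns above appear to rely on details of the sample row (five-letter names, and a date that starts with 02), so they may not carry over to rows with other field lengths. Python's re has no quantifier whose count comes from an earlier group, so the length prefix cannot be honored by a single pattern. Here is a sketch of one way to still use re for the fixed-width headers, assuming two-digit codes and two-digit lengths as in the sample. The names HEADER, CODE_MAP and decode are hypothetical.

```python
import re

HEADER = re.compile(r"(\d{2})(\d{2})")  # two-digit code, two-digit length
CODE_MAP = {"01": "name", "02": "surname", "03": "DOB"}  # assumed mapping

def decode(row, prefix="AN04N01"):
    """Return a dict of fields, or None if the row lacks the valid prefix."""
    if not row.startswith(prefix):
        return None
    pos, fields = len(prefix), {}
    while pos < len(row):
        m = HEADER.match(row, pos)  # match the header at the current offset
        if m is None:
            raise ValueError(f"expected a 4-digit header at offset {pos}")
        code, length = m.group(1), int(m.group(2))
        start = m.end()
        # Unknown codes are kept under their raw code rather than dropped.
        fields[CODE_MAP.get(code, code)] = row[start:start + length]
        pos = start + length
    return fields

print(decode("AN04N010105SANDY0205SMITH030802031989"))
# {'name': 'SANDY', 'surname': 'SMITH', 'DOB': '02031989'}
```

The regex only validates and splits each four-digit header; the value itself is taken by slicing, because its length is only known at run time.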
Try:
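import pandas as pd

def fn(x):
    rv, x = [], x[7:]  # drop the 7-character validity prefix
    while x:
        # the 2-character code is ignored; fields are taken in order of appearance
        _, n, x = x[:2], x[2:4], x[4:]
        value, x = x[: int(n)], x[int(n) :]
        rv.append(value)
    return rv

# df is the single-column dataframe shown further down
m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)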
Prints:
row NAME SURNAME DOB
0 AN04N010105SANDY0205SMITH030802031989 SANDY SMITH 02031989
1 AN04N010105BANDY0205BMITH030802031989 BANDY BMITH 02031989
2 AN04N010105CANDY0205CMITH030802031989 CANDY CMITH 02031989
3 XXXXXXX0105DANDY0205DMITH030802031989 NaN NaN NaN
Dataframe used:
row
0 AN04N010105SANDY0205SMITH030802031989
1 AN04N010105BANDY0205BMITH030802031989
2 AN04N010105CANDY0205CMITH030802031989
3 XXXXXXX0105DANDY0205DMITH030802031989