
Python formatting text to a Dict

I have a text document that I need to format as a dictionary. I am new to Python, so I have no idea how to implement this. Could someone please help?

Hi this is a employee document 
200 Name 200 Order # 200 code # 200 Case #
george,bendti  11-11111 1111111111 11111111
below are the details report 
200 Birth Date 200 Age 200 Gender 200 Area   200 Account #
10/14/1944 75 Y Female  Newyork 111111

Expected output:

{
       "Name":"george,bendti",
       "Order #":"11-11111",
       "code #":"1111111111",
       "Case #":"11111111",
       "Birth Date":"10/14/1944",
       "Age":"75 Y",
       "Gender":"Female",
       "Area":"Newyork",
       "Account":"111111", 
}

Here is how to do it using re.split():

d = '''Hi this is a employee document 
200 Name 200 Order # 200 code # 200 Case #
george,bendti  11-11111 1111111111 11111111
below are the details report 
200 Birth Date 200 Age 200 Gender 200 Area   200 Account #
10/14/1944 75 Y Female  Newyork 111111
'''
import re

lines = d.split('\n') # Split the text into lines
# Keep only the header and value lines; every third line (index 0, 3, 6) is prose or empty
rows = [re.split(' 200 | 200|200 ', line) for i, line in enumerate(lines) if i % 3] # Split the column names on the "200" marker
rows = [re.split(' (?!Y)', row[0]) if i % 2 else row for i, row in enumerate(rows)] # Split the values on spaces, keeping "75 Y" together
rows = [[a.strip() for a in row if a.strip()] for row in rows] # Strip stray whitespace (e.g. "Area  ") and drop empty strings
result = {k: v for i in range(0, len(rows), 2) for k, v in zip(rows[i], rows[i + 1])} # Pair each header row with its value row

print(result)

Output:

{'Name': 'george,bendti', 'Order #': '11-11111', 'code #': '1111111111', 'Case #': '11111111', 'Birth Date': '10/14/1944', 'Age': '75 Y', 'Gender': 'Female', 'Area': 'Newyork', 'Account #': '111111'}
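
The approach above relies on the header lines sitting at fixed positions. As a rough sketch of a variation, the function below instead treats any line that starts with the "200" marker as a header and pairs it with the line that follows. The name `parse_report` and its `marker` parameter are made up for illustration, and the value splitting assumes that fields are separated by whitespace and that the only multi-word value is a number followed by "Y", as in the sample.

```python
import re

def parse_report(text, marker="200"):
    """Pair each header line (one starting with `marker`) with the next line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    result = {}
    for header, values in zip(lines, lines[1:]):
        if not header.startswith(marker + " "):
            continue  # not a header line, e.g. free-form prose
        # Column names are whatever sits between the markers.
        key_split = rf"(?:^|\s+){re.escape(marker)}\s+"
        keys = [k.strip() for k in re.split(key_split, header) if k.strip()]
        # Split on whitespace, but keep a "Y" unit attached to its number
        # (assumption taken from the sample value "75 Y").
        vals = [v for v in re.split(r"\s+(?!Y\b)", values) if v]
        result.update(zip(keys, vals))
    return result

sample = """Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti  11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area   200 Account #
10/14/1944 75 Y Female  Newyork 111111
"""

record = parse_report(sample)
print(record)
```

Because headers are detected by content rather than by line index, extra prose lines between the sections should not shift the pairing, although a value that itself contains a space (other than the "Y" case) would still be split incorrectly.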

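Assuming you have multiple inputs that follow a similar structure with a consistent pattern, you can use re. Parsing text without an obvious structure into a dictionary is hard, but regular expressions can be very powerful.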

import re
from pprint import pprint

input_string = '''Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti  11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area   200 Account #
10/14/1944 75 Y Female  Newyork 111111'''


my_dict = {}
# re.IGNORECASE keeps the original casing of the match (searching input_string.lower() would return lowercase text)
my_dict['Name'] = re.search(r'[a-z]+,[a-z]+', input_string, re.IGNORECASE).group()
my_dict['Order'] = re.search(r'[0-9]{2,}-[0-9]{5,}', input_string).group()
# \b...\b matches a standalone 8-digit number, so the 10-digit code is not picked up by mistake
my_dict['Case'] = re.search(r'\b[0-9]{8}\b', input_string).group()
my_dict['Birth Date'] = re.search(r'[0-9]{2,}/[0-9]{2,}/[0-9]{2,}', input_string).group()
my_dict['Age'] = re.search(r'[0-9]+ Y', input_string).group()
my_dict['Gender'] = re.search(r'\b(female|male)\b', input_string, re.IGNORECASE).group()

pprint(my_dict)
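
One thing to watch when running this over many inputs: re.search returns None when a pattern does not match, so calling .group() directly raises AttributeError on an incomplete document. A minimal sketch of one way to guard against that is shown below. The `FIELD_PATTERNS` table and the `extract_fields` helper are hypothetical names, and the patterns are modelled on the single sample document only, so they would likely need adjusting for real data.

```python
import re

# Hypothetical field patterns, modelled on the sample document only.
FIELD_PATTERNS = {
    "Name": r"[A-Za-z]+,[A-Za-z]+",
    "Order": r"\b\d{2}-\d{5}\b",
    "Case": r"\b\d{8}\b",
    "Birth Date": r"\b\d{2}/\d{2}/\d{4}\b",
    "Age": r"\b\d+ Y\b",
    "Gender": r"\b(?:Female|Male)\b",
}

def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return {field: matched text, or None if the pattern is absent}."""
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        # re.search returns None when nothing matches, so check before .group()
        record[field] = match.group() if match else None
    return record

sample = """Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti  11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area   200 Account #
10/14/1944 75 Y Female  Newyork 111111"""

complete = extract_fields(sample)
partial = extract_fields("george,bendti  11-11111")  # most fields missing

print(complete)
print(partial)
```

Keeping the patterns in one dictionary also makes it easier to add or adjust a field without touching the extraction loop.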

I recommend regex101.com. The site offers a clear, interactive way to test regular expressions.

