Python formatting text to a Dict
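I have a text document that I need to convert into a dictionary. I'm new to Python and have no idea how to approach this, so could someone please help? The input text is shown first, followed by the dictionary I want to get from it.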
Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti 11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area 200 Account #
10/14/1944 75 Y Female Newyork 111111
{
"Name":"george,bendti",
"Order #":"11-11111",
"code #":"1111111111",
"Case #":"11111111",
"Birth Date":"10/14/1944",
"Age":"75 Y",
"Gender":"Female",
"Area":"Newyork",
"Account":"111111",
}
Here is how you can do it with re.split():
import re

text = '''Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti 11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area 200 Account #
10/14/1944 75 Y Female Newyork 111111
'''

lines = text.split('\n')  # Split the text into lines
# Drop every third line (the prose lines and the trailing empty string),
# then split the header lines on the "200" marker
rows = [re.split(' 200 | 200|200 ', line) for i, line in enumerate(lines) if i % 3]
# Odd rows hold values: split on spaces, except a space followed by "Y" (keeps "75 Y" together)
rows = [re.split(' (?!Y)', row[0]) if i % 2 else row for i, row in enumerate(rows)]
rows = [[cell for cell in row if cell] for row in rows]  # Remove empty strings
# Pair each header row with the value row that follows it
result = {k: v for i in range(0, len(rows), 2) for k, v in zip(rows[i], rows[i + 1])}
print(result)
Output:
{'Name': 'george,bendti', 'Order #': '11-11111', 'code #': '1111111111', 'Case #': '11111111', 'Birth Date': '10/14/1944', 'Age': '75 Y', 'Gender': 'Female', 'Area': 'Newyork', 'Account #': '111111'}
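The re.split() approach above depends on the exact line positions (every third line is prose). As a rough sketch of the same idea that does not count lines, the hypothetical helper below (parse_report is my own name, not from the answer) treats any line starting with the "200" marker as a header row and the line right after it as its values. It assumes values never contain spaces except for the "75 Y" style age.

```python
import re

sample = """Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti 11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area 200 Account #
10/14/1944 75 Y Female Newyork 111111
"""


def parse_report(text, marker="200"):
    """Pair each header line (starting with `marker`) with the line after it."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    result = {}
    for header_line, value_line in zip(lines, lines[1:]):
        if not header_line.startswith(marker + " "):
            continue  # not a header row, skip it
        # Column names are separated by the marker token.
        pieces = re.split(rf"(?:^|\s+){re.escape(marker)}\s+", header_line)
        keys = [piece.strip() for piece in pieces if piece.strip()]
        # Values are separated by whitespace, except before a lone "Y"
        # (assumption: only the age field looks like "75 Y").
        values = re.split(r"\s+(?!\s|Y\b)", value_line)
        result.update(zip(keys, values))
    return result


parsed = parse_report(sample)
print(parsed)
```

Because headers are found by their marker rather than by position, extra prose lines between the sections should not break the pairing, as long as each header row is immediately followed by its value row.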
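Assuming you have multiple inputs that follow a similar structure with a consistent pattern, you can use re.
Text without an obvious structure is hard to parse into a dictionary, but regular expressions can be very powerful.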
import re
from pprint import pprint

input_string = '''Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti 11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area 200 Account #
10/14/1944 75 Y Female Newyork 111111'''

my_dict = {}
# IGNORECASE keeps the original casing instead of lowercasing the whole input
my_dict['Name'] = re.search(r'[a-z]+,[a-z]+', input_string, re.IGNORECASE).group()
my_dict['Order'] = re.search(r'[0-9]{2,}-[0-9]{5,}', input_string).group()
# \b...\b requires exactly 8 digits, so the 10-digit code # is not matched by mistake
my_dict['Case'] = re.search(r'\b[0-9]{8}\b', input_string).group()
my_dict['Birth Date'] = re.search(r'[0-9]{2,}/[0-9]{2,}/[0-9]{2,}', input_string).group()
my_dict['Age'] = re.search(r'[0-9]+ Y', input_string).group()
my_dict['Gender'] = re.search(r'\b(female|male)\b', input_string, re.IGNORECASE).group()
pprint(my_dict)
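If every document really does follow the same layout, the per-field searches can also be folded into one pattern per value line using named groups. This is only a sketch: the pattern names IDENTITY and DETAILS, the group names, and the field widths (a 10-digit code, an 8-digit case number, a single-word area) are assumptions read off the one sample shown here, not rules from the question.

```python
import re

text = """Hi this is a employee document
200 Name 200 Order # 200 code # 200 Case #
george,bendti 11-11111 1111111111 11111111
below are the details report
200 Birth Date 200 Age 200 Gender 200 Area 200 Account #
10/14/1944 75 Y Female Newyork 111111
"""

# First value line: name, order number, code, case number.
IDENTITY = re.compile(
    r"(?P<name>[A-Za-z]+,[A-Za-z]+)\s+"
    r"(?P<order>\d{2}-\d{5})\s+"
    r"(?P<code>\d{10})\s+"
    r"(?P<case>\d{8})\b"
)

# Second value line: birth date, age, gender, area, account number.
DETAILS = re.compile(
    r"(?P<birth_date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<age>\d+ Y)\s+"
    r"(?P<gender>Female|Male)\s+"
    r"(?P<area>\S+)\s+"
    r"(?P<account>\d+)"
)

record = {}
for pattern in (IDENTITY, DETAILS):
    match = pattern.search(text)
    if match:  # skip silently if a line does not fit the assumed layout
        record.update(match.groupdict())

print(record)
```

Matching a whole line at once means each field is identified by its position among its neighbours, so a long digit run such as the code number cannot be mistaken for the case number.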
I recommend regex101.com; the site offers a clear, interactive way to test regular expressions.