Dictionary from a text file in Python

Problem:

I have a txt file with this format:

Intestinal infectious diseases (001-003)  
001 Cholera  
002 Fever  
003 Salmonella   
Zoonotic bacterial diseases (020-022)  
020 Plague  
021 Tularemia  
022 Anthrax  
External Cause Status (E000)  
E000 External cause status  
Activity (E001-E002)  
E001 Activities involving x and y  
E002 Other activities

where each line that begins with a three-digit code, optionally prefixed with E or V (for example 001, E001 or V001), is a value for the preceding header; the headers are the keys of my dictionary. In other questions I've seen, delimiters such as columns or colons are used to split each line into a key/value pair, but the format of my txt file doesn't allow that.

Is there a way to turn a txt file like this into a dictionary where the keys are the group names and the values are the code + disease names?

I also need to parse the codes and disease names into a second dictionary, so that I end up with a dictionary whose keys are the group names and whose values are nested dictionaries mapping each code to its disease name.

Script:

def process_file(filename):
    myDict={}
        f = open(filename, 'r')
        for line in f:
            if line[0] is not int:
                if line.startswith("E"):
                    if line[1] is int:
                        line = dictionary1_values
                    else:
                        break
                else:
                    line = dictionary1_key
            myDict[dictionary1_key].append[line]

Desired output format is:
{"Intestinal infectious diseases (001-003)": {"001": "Cholera", "002": "Fever", "003": "Salmonella"}, "Zoonotic bacterial diseases (020-022)": {"020": "Plague", "021": "Tularemia", "022": "Anthrax"}, "External Cause Status (E000)": {"E000": "External cause status"}, "Activity (E001-E002)": {"E001": "Activities involving x and y", "E002": "Other activities"}}

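Try using regular expressions to determine whether a line is a header or a disease entry:

import re

mydict = {}
# `filename` is assumed to hold the path to the text file
with open(filename, "r") as f:
    header = None
    for line in f:
        line = line.strip()  # drop the newline and surrounding whitespace
        if not line:
            continue  # skip blank lines
        # A disease line starts with an optional E or V and three digits
        match_disease = re.match(r"([EV]?\d{3}) (.*)", line)
        if not match_disease:
            # Anything else is a header, which starts a new nested dict
            header = line
            mydict[header] = {}
        else:
            code = match_disease.group(1)
            disease = match_disease.group(2)
            mydict[header][code] = disease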
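
To try this approach on the sample data without creating a file, the same loop can be wrapped in a function that accepts any iterable of lines. This is only a sketch: the names `parse_lines` and `CODE_LINE` are illustrative, and silently skipping code lines that appear before the first header is an assumption about the desired behaviour.

```python
import re

# Optional E or V prefix, three digits, one space, then the name
CODE_LINE = re.compile(r"([EV]?\d{3}) (.*)")

def parse_lines(lines):
    """Group code lines under the most recent header line."""
    result = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = CODE_LINE.match(line)
        if match is None:
            # Not a code line, so treat it as a new group header
            header = line
            result[header] = {}
        elif header is not None:
            # Assumption: code lines before any header are ignored
            code, name = match.groups()
            result[header][code] = name
    return result

sample = """\
Intestinal infectious diseases (001-003)
001 Cholera
002 Fever
003 Salmonella
Activity (E001-E002)
E001 Activities involving x and y
E002 Other activities
"""

groups = parse_lines(sample.splitlines())
print(groups["Activity (E001-E002)"]["E001"])  # Activities involving x and y
```

Because the function only needs an iterable of strings, an open file object can be passed in directly as well.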

One solution would be to use regular expressions to characterize and parse the two types of lines you might encounter in this file:

import re
header_re = re.compile(r'([\w\s]+) \(([\w\s\-]+)\)')
entry_re = re.compile(r'([EV]?\d{3}) (.+)')

This lets you easily check which type of line you're looking at and break it apart as needed:

# Check if a line is a header:
header = header_re.match(line)
if header:
    header_name, header_codes = header.groups()  # e.g. ('Intestinal infectious diseases', '001-003')
    # Do whatever you need to do when you encounter a new group
    # ...
else:
    entry = entry_re.match(line)
    # If the line wasn't a header, it ought to be an entry,
    # otherwise we've encountered something we didn't expect
    assert entry is not None
    entry_number, entry_name = entry.groups()  # e.g. ('001', 'Cholera')
    # Do whatever you need to do when you encounter an entry in a group
    # ...
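
To see how the two patterns classify lines from the sample file, here is a small sketch. The helper name `classify` and the `'unknown'` fallback are illustrative assumptions, not part of the answer itself.

```python
import re

# Same patterns as in the answer
header_re = re.compile(r'([\w\s]+) \(([\w\s\-]+)\)')
entry_re = re.compile(r'([EV]?\d{3}) (.+)')

def classify(line):
    """Return ('header', name, codes), ('entry', code, name) or ('unknown', line)."""
    line = line.strip()
    header = header_re.match(line)
    if header:
        return ('header',) + header.groups()
    entry = entry_re.match(line)
    if entry:
        return ('entry',) + entry.groups()
    return ('unknown', line)

for sample in ["Intestinal infectious diseases (001-003)",
               "001 Cholera",
               "E001 Activities involving x and y"]:
    print(classify(sample))
# ('header', 'Intestinal infectious diseases', '001-003')
# ('entry', '001', 'Cholera')
# ('entry', 'E001', 'Activities involving x and y')
```

Note that the header pattern is tried first, so an entry whose name ends in a parenthesised word would be classified as a header; the sample file has no such line.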

Using that to rework your function, we could write the following:

import re

def process_file(filename):
    header_re = re.compile(r'([\w\s]+) \(([\w\s\-]+)\)')
    entry_re = re.compile(r'([EV]?\d{3}) (.+)')

    all_groups = {}
    current_group = None

    with open(filename, 'r') as f:
        for line in f:
            # Check if a line is a header:
            header = header_re.match(line)
            if header:
                current_group = {}
                all_groups[header.group(0)] = current_group
            else:
                entry = entry_re.match(line)
                # If the line wasn't a header, it ought to be an entry,
                # otherwise we've encountered something we didn't expect
                assert entry is not None
                entry_number, entry_name = entry.groups()  # e.g. ('001', 'Cholera')

                # rstrip() drops any trailing whitespace captured by (.+)
                current_group[entry_number] = entry_name.rstrip()

    return all_groups

def process_file(filename):
    myDict = {}
    rootkey = None
    with open(filename, 'r') as f:
        for line in f:
            line = line.rstrip()  # drop the newline and any trailing whitespace
            # Code lines have digits as their second and third characters,
            # e.g. "001 ..." or "E001 ..."
            if line[1:3].isdigit():
                # Split once: the code (with or without a leading "E") and the name
                subkey, data = line.split(" ", 1)
                myDict[rootkey][subkey] = data
            else:
                # Any other line is a group header and starts a new nested dict
                rootkey = line
                myDict[rootkey] = {}
    return myDict

Testing in the Python console:

>>> d = process_file('/tmp/file.txt')
>>>
>>> d['Intestinal infectious diseases (001-003)']
{'003': 'Salmonella', '002': 'Fever', '001': 'Cholera'}
>>> d['Intestinal infectious diseases (001-003)']['002']
'Fever'
>>> d['Activity (E001-E002)']
{'E001': 'Activities involving x and y', 'E002': 'Other activities'}
>>> d['Activity (E001-E002)']['E001']
'Activities involving x and y'
>>>
>>> d
{'Activity (E001-E002)': {'E001': 'Activities involving x and y', 'E002': 'Other activities'}, 'External Cause Status (E000)': {'E000': 'External cause status'}, 'Intestinal infectious diseases (001-003)': {'003': 'Salmonella', '002': 'Fever', '001': 'Cholera'}, 'Zoonotic bacterial diseases (020-022)': {'021': 'Tularemia', '020': 'Plague', '022': 'Anthrax'}}

Warning: the first line of the file must be a group header (a "rootkey"), not a code line (a "subkey" plus data). Otherwise myDict[rootkey] is looked up while rootkey is still None, which raises a KeyError.
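
One way to make that failure explicit is to raise a descriptive error instead of letting the dictionary lookup fail. A minimal sketch, assuming the same `line[1:3].isdigit()` test as above; the function name `process_file_checked`, the blank-line handling and the error message are hypothetical additions.

```python
def process_file_checked(filename):
    my_dict = {}
    rootkey = None
    with open(filename, 'r') as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip()
            if not line:
                continue  # ignore blank lines
            if line[1:3].isdigit():
                # A code line is only valid once a header has been seen
                if rootkey is None:
                    raise ValueError(
                        "line %d: code line %r appears before any group header"
                        % (lineno, line))
                subkey, data = line.split(" ", 1)
                my_dict[rootkey][subkey] = data
            else:
                # Header line: start a new group
                rootkey = line
                my_dict[rootkey] = {}
    return my_dict
```

With this variant, a file that starts with a line such as "001 Cholera" fails with a ValueError that names the offending line number.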

A note: maybe you should remove the leading "E" character from the codes. Or is that not possible? Do you need to keep the "E" somewhere?
