
How to convert this text file into a dictionary?

I have a file f that looks something like:

#labelA
there
is
something
here
#label_Bbb
here
aswell
...

It can have any number of labels, any number of elements (only str) on a line, and several lines for each label. I would like to store this data in a dictionary like:

d = {'labelA': 'thereissomethinghere', 'label_Bbb': 'hereaswell', ...}

I have a number of sub-questions:

  1. How can I make use of the # character to know when a new entry starts?
  2. How can I remove it and keep whatever follows until the end of the line?
  3. How can I append every string on the following lines until # pops up again?
  4. How can I stop when the file finishes?
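One way the four points above could map onto code, as a minimal sketch. It reads from an in-memory io.StringIO that stands in for the file f, and the helper name parse_labels is hypothetical, chosen only for illustration.

```python
import io

def parse_labels(stream):
    """Hypothetical helper: build {label: concatenated text} from '#label' blocks."""
    result = {}
    key = None
    for line in stream:               # (4) the for loop ends when the file does
        line = line.rstrip("\n")
        if line.startswith("#"):      # (1) '#' marks the start of a new entry
            key = line[1:]            # (2) drop '#', keep the rest of the line
            result[key] = ""
        elif key is not None:         # ignore any text before the first label
            result[key] += line       # (3) append until the next '#'
    return result

# In-memory stand-in for the file from the question.
sample = io.StringIO(
    "#labelA\nthere\nis\nsomething\nhere\n#label_Bbb\nhere\naswell\n"
)
d = parse_labels(sample)
print(d)  # {'labelA': 'thereissomethinghere', 'label_Bbb': 'hereaswell'}
```

With a real file, the same function should work on the object returned by open().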

First, mydict maps each key (taken from a line that starts with #) to a list, because a list keeps the lines in the order they were appended. We append lines to this list until we reach the next line that starts with #. Then we only need to join each list of lines into one single string.

I am using Python 3; if you use Python 2, replace mydict.items() with mydict.iteritems() to iterate over the key-value pairs.

mydict = dict()
with open("sample.csv") as inputs:
    for line in inputs:
        if line.startswith("#"):
            key = line.strip()[1:]
            mydict.setdefault(key,list())
        else:
            mydict[key].append(line.strip())

result = dict()
for key, vlist in mydict.items():
    result[key] = "".join(vlist)

print(result)

Output:

{'labelA': 'thereissomethinghere', 'label_Bbb': 'hereaswell'}

Shortest solution, using the re.findall() function:

import re 

with open("lines.txt", 'r') as fh:
    d = {k:v.replace('\n', '') for k,v in re.findall(r'^#(\w+)\s([^#]+)', fh.read(), re.M)}

print(d)

The output:

{'label_Bbb': 'hereaswell', 'labelA': 'thereissomethinghere'}

re.findall returns a list of tuples; each tuple contains two items, one for each of the two consecutive capturing groups.
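To make the list-of-tuples shape concrete, here is a small sketch that runs the same pattern on an in-memory string instead of a file. The sample text is made up for illustration.

```python
import re

# Made-up sample with two labels; stands in for fh.read().
text = "#labelA\nthere\nis\n#label_Bbb\nhere\naswell\n"

# Group 1 captures the label, group 2 everything up to the next '#'.
pairs = re.findall(r'^#(\w+)\s([^#]+)', text, re.M)
print(pairs)
# [('labelA', 'there\nis\n'), ('label_Bbb', 'here\naswell\n')]

# Stripping the newlines from each value gives the final dictionary.
d = {k: v.replace('\n', '') for k, v in pairs}
print(d)
# {'labelA': 'thereis', 'label_Bbb': 'hereaswell'}
```

Because the second group is [^#]+, a value that itself contains a # character would be cut short at that point.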

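d = {}
last_key = None
last_element = ''
with open('untitled.txt', 'r') as f:
    line = f.readline()
    while line:  # readline() returns '' at end of file
        if line.startswith('#'):
            if last_key is not None:
                d[last_key] = last_element  # store the entry just finished
            last_key = line[1:].rstrip('\n')  # drop the leading '#' and the newline
            last_element = ''
        else:
            last_element += line.rstrip('\n')
        line = f.readline()

if last_key is not None:
    d[last_key] = last_element  # store the final entry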

Use collections.defaultdict:

from collections import defaultdict

d = defaultdict(list)

with open('f.txt') as file:
    for line in file:
        if line.startswith('#'):
            key = line.lstrip('#').rstrip('\n')
        else:
            d[key].append(line.rstrip('\n'))
for key in d:
    d[key] = ''.join(d[key])

As a single pass, without building interim dictionaries:

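res = {}
with open("sample") as lines:
    # The first line is assumed to be a "#label" header.
    line = next(lines, None)  # None if the file is empty
    while line is not None:
        key = line[1:].rstrip("\n")  # drop the leading '#' and the newline
        entry = ""
        following = next(lines, None)
        # Collect lines until the next header or the end of the file.
        while following is not None and not following.startswith("#"):
            entry += following.rstrip("\n")
            following = next(lines, None)
        res[key] = entry
        line = following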

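I would do something like this, reading the file one line at a time with readline():

result = {}
with open("f") as fh:
    # The first line is assumed to be a "#label" header.
    key = fh.readline()[1:].rstrip("\n")
    line = fh.readline()
    while key:
        text = ""
        # readline() returns "" at end of file
        while line and not line.startswith("#"):
            text += line.rstrip("\n")
            line = fh.readline()

        result[key] = text
        key = line[1:].rstrip("\n")  # next header, or "" at end of file
        line = fh.readline()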

Here is my approach:

def eachChunk(stream):
  key = None
  value = ''
  for line in stream:
    line = line.rstrip('\n')
    if line.startswith('#'):
      if key is not None:
        yield key, value  # emit the chunk just finished
      key = line[1:]
      value = ''
    else:
      value += line
  if key is not None:
    yield key, value  # emit the last chunk

You can then quickly build the desired dictionary like this:

with open('f') as data:
  d = dict(eachChunk(data))

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the original source.
