
Read text-file into program as dictionary

Using Python 3.

I have to write a function that takes one argument (a string) and returns a dictionary built from a txt file that contains the names of the sequences (keys) and the sequences (values). Both keys and values must be strings.

The text file:

Read1 GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC
Read2 CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG
Read4 TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG
Read3 GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT
Read5 CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
Read6 TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT

I've come this far, but I think I am missing something, and I don't know whether my work here is correct. I've marked (with #) the lines where I'm in doubt.

def read_data(file_name):
    input_file=open(sequencing_reads.txt)
    #sequence_dict={}
    for line in input_file:
        #x=line.split(",")
    #return sequence_dict
    input_file.close()

I know it must return the dictionary with the following content:

{'Read1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
 'Read2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
 'Read4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
 'Read3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
 'Read5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
 'Read6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT'}

Can you help me fill in the gaps?

EDIT: I need to keep it simple, so please no package imports or smart tricks :-)

EDIT 2:

I've tried this too:

with open('sequencing_reads.txt', 'r') as document:
    answer = {}
    for line in document:
        line = line.split()
        if not line:  
            continue
        answer[line[0]] = line[1:]
print(answer)

The output is:

{'Read1': ['GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC'], 'Read2': ['CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG'], 'Read4': ['TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG'], 'Read3': ['GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT'], 'Read5': ['CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'], 'Read6': ['TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT']}

How do I get rid of the "[ ]" around my sequences?

EDIT 4:

def read_data(file_name):
    with open("sequencing_reads.txt", "r") as document:
        answer = {}
        for line in document:
            line = line.split()
            if not line:
                continue
                answer[line[0]] = line[1:]
                final_answer = {a:b[0] for a, b in answer.items()}
final_answer = read_data("sequencing_reads.txt")
print(final_answer)

prints:

None

You can try this:

import re
def read_data(file_name):
   # Assumes the file holds a dictionary literal, one 'name': 'sequence' pair per line
   with open(file_name) as document:
      data = document.read()
   # list(...) is needed in Python 3, where filter() returns an iterator
   keys = [list(filter(None, i))[0][1:-1] for i in re.findall(r"{(.*?):|(?<=,\n\s)(.*?):", data)]
   values = [list(filter(None, i))[0][1:-1] for i in re.findall(r"(?<=:\s)(.*?)(?=,\n)|(?<=\s)(.*?)(?=})", data)]
   final_data = {a:b for a, b in zip(keys, values)}
   return final_data

Output:

{'Read1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC', 'Read3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT', 'Read2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG', 'Read5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC', 'Read4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG', 'Read6': "'Read6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT"}

Edit:

import ast
def read_data(file_name):
   # Assumes the file holds a Python dictionary literal
   with open(file_name) as document:
      final_data = ast.literal_eval(document.read())
   return final_data

Edit 1: Regarding the removal of the brackets, just access the value by indexing:

final_answer = {a:b[0] for a, b in answer.items()}
print(final_answer)
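
To illustrate why the brackets appear in the first place: str.split() returns a list, so the slice line[1:] is a one-element list, while line[1] is the string itself. A minimal sketch using one sample line from the question (the variable names are only for illustration):

```python
# One line as it would be read from the file, including the trailing newline
line = "Read4 TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG\n"

parts = line.split()   # split on whitespace; the newline is discarded
as_slice = parts[1:]   # a slice is always a list, hence the "[ ]"
as_item = parts[1]     # indexing returns the element itself, a string

print(as_slice)
print(as_item)
```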

If you are having issues printing the value returned from read_data, you can try this:

answer = read_data("the_file.txt")
print(answer)

Edit 3:

def read_data(file_name):
   with open(file_name, "r") as document:
      answer = {}
      for line in document:
         line = line.split()
         if line:
            answer[line[0]] = line[1:]
      return {a:b[0] for a, b in answer.items()}

print(read_data("sequencing_reads.txt"))
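
A slightly shorter variant, sketched under the assumption that every non-empty line holds a name followed by a sequence separated by whitespace, stores the second field directly so that no list is ever created. It needs no imports:

```python
def read_data(file_name):
    """Return {name: sequence} from a file with one 'name sequence' pair per line."""
    sequence_dict = {}
    with open(file_name, "r") as document:
        for line in document:
            parts = line.split()
            if len(parts) < 2:
                # Skip blank lines and lines without a sequence
                continue
            # parts[0] is the name, parts[1] the sequence; both are strings
            sequence_dict[parts[0]] = parts[1]
    # The return must sit inside the function but outside the loop
    return sequence_dict
```

It is called the same way as above, for example print(read_data("sequencing_reads.txt")).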

Your file "sequencing_reads.txt" is in JSON format. You can use the json module in the Python standard library to load its content into a dictionary quite easily.

import json

with open("sequencing_reads.txt") as f:
    sequence_dict = json.load(f)
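
Note that json.load only succeeds when the file holds valid JSON with double-quoted strings. A small check, using made-up sample strings rather than the real file, shows the difference between JSON text and plain "name sequence" lines:

```python
import json

# Valid JSON: double-quoted keys and values
valid = '{"Read1": "GGCT", "Read2": "CTTT"}'
parsed_valid = json.loads(valid)
print(parsed_valid)

# Plain whitespace-separated lines are not JSON
plain = "Read1 GGCT\nRead2 CTTT"
try:
    json.loads(plain)
    parsed_plain = True
except json.JSONDecodeError:
    parsed_plain = False
print(parsed_plain)
```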

Firstly, if your file is in JSON format and spread over separate lines, you should read it into a single line, maybe like this:

def read_data(file_name):
    lines = open(file_name).readlines()
    merged_line = " ".join([line.strip() for line in lines])

Secondly, json.loads requires double quotes around strings (e.g. {"a":"a"}). If you are using single quotes (as in your example), there may be errors. So you can do it like this:

# 1,use json.loads, but replace first
import json
merged_line = merged_line.replace("'", '"')
data = json.loads(merged_line)

# 2,use ast
import ast
data = ast.literal_eval(merged_line)
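
Putting the two steps together, here is a minimal sketch, assuming the text really is a single-quoted dictionary literal spread over several lines. The helper name parse_dict_text and the shortened sample text are illustrative and not part of the original:

```python
import ast
import json

def parse_dict_text(text):
    """Parse a multi-line, single-quoted dict literal in two different ways."""
    # Merge the lines into one, as in the first step
    merged_line = " ".join(line.strip() for line in text.splitlines())
    # Option 1: swap the quotes so the text becomes valid JSON
    via_json = json.loads(merged_line.replace("'", '"'))
    # Option 2: let ast evaluate the literal as written
    via_ast = ast.literal_eval(merged_line)
    return via_json, via_ast

sample = "{'Read1': 'GGCT',\n 'Read2': 'CTTT'}"
via_json, via_ast = parse_dict_text(sample)
print(via_json == via_ast)
```

The quote replacement is only safe when no key or value contains an apostrophe, which holds for sequence data like this.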
