Difficulty in selecting elements in a Python list

I've been trying to clean a txt file, and I'm almost done with it. I have a problem with the list: I cannot select elements of the list I created in process_line(), at the line marked with ###.

Below is a snippet of the code:

def process_line(line):
    # receiving a line or string as function
    # argument and replacing '-' 'D00-D09' & 'F00-F09' to '' if it exists
    line = re.sub('D0+\d|F0+\d|-', '', line)
    seq = str(line.split())
    line = re.sub('\'|\\[|\\]|,', '', seq)
    ###  line = (seq[0] + '|' seq[3] + '-' seq[5]) # this is for shorter lines
    print line
    return line  + '\n'

Here is a sample of the data after removing some unwanted data:

12asA   1  A    4  A  330 
12asB   1  B    4  B  330 
12caA   1  A    5  A  260 
12e8H   1  H    1  H  113   1  H  114  H  212   H  213  H  214  (2)
12e8L   1  L    1  L  107   1  L  108  L  211   L  212  L  214  (3)   

I was hoping to achieve a format like this. However, I need to learn how to extract the elements needed so I can rearrange the data into the required format:

12asA|4-330
12asB|4-330
12caA|5-260
12e8H|1-113,114-212
12e8L|1-107,108-211

Instead of getting, e.g., 23reA|1-14,56-65, I get something like [2|1-A].

I'm not really sure what you're trying to do here, but this appears to match your desired output:

import re

data = '''
12asA   1  A    4  A  330  
12asB   1  B    4  B  330 
12caA   1  A    5  A  260 
12e8H   1  H    1  H  113   1  H  114  H  212   H  213  H  214  (2)
12e8L   1  L    1  L  107   1  L  108  L  211   L  212  L  214  (3)
'''
lines = filter(None, data.split('\n')) # filter to remove blank lines

def process_line(line):
    line = re.sub(r'D0\d|F0\d|-', '', line)
    for char in "'[],":
        line = line.replace(char, '')
    seq = line.split()
    if len(seq) == 6:
        return '{}|{}-{}'.format(seq[0], seq[3], seq[5])
    elif len(seq) == 16:
        return '{}|{}-{},{}-{}'.format(seq[0], seq[3], seq[5], seq[8], seq[10])

result = [process_line(line) for line in lines]
for r in result:
    print(r)

Output:

12asA|4-330
12asB|4-330
12caA|5-260
12e8H|1-113,114-212
12e8L|1-107,108-211

The following regex in your code:

line = re.sub('\'|\\[|\\]|,', '', seq)

is a real mess. I have replaced it with a sequence of simple str.replace calls instead. In future, when writing regular expressions, please use raw strings (e.g. r'...') for readability and to help you avoid bugs.
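As a rough sketch of what the raw-string advice looks like in practice, the same substitution could be written with a character class instead of escaped alternatives. The sample string here is made up for illustration and is not taken from your file:

```python
import re

# Hypothetical sample: the text that str() produces from a list of fields.
text = "['12asA', '1', 'A', '4', 'A', '330']"

# Raw string plus a character class: remove quotes, brackets and commas.
cleaned = re.sub(r"['\[\],]", '', text)
print(cleaned)  # 12asA 1 A 4 A 330
```

With a raw string, each backslash in the pattern reaches the regex engine unchanged, so there is no need to double them.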

If you only added that line to get rid of the brackets, commas, and quotes introduced by calling str(line.split()) (rather than to deal with garbage in your original data), you should go ahead and remove its equivalent in the code I posted, because it does nothing useful.
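To illustrate why wrapping the split in str() gets in the way, here is a small sketch using the first line of the sample data. The variable names fields and as_text are only for this example. Indexing the list returns whole fields, while indexing the string returns single characters:

```python
line = "12asA   1  A    4  A  330"

fields = line.split()        # a list: ['12asA', '1', 'A', '4', 'A', '330']
as_text = str(line.split())  # one string: "['12asA', '1', 'A', '4', 'A', '330']"

# Indexing the list selects whole fields.
print(fields[0], fields[3], fields[5])     # 12asA 4 330

# Indexing the string selects individual characters.
print(as_text[0], as_text[3], as_text[5])  # [ 2 s
```

This is likely why selecting elements by position in the original process_line() produced fragments rather than the expected columns.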
