Difficulty selecting elements in a Python list
I've been trying to clean a txt file, and I'm almost done with it. I have a problem with the list: I cannot select elements of the list I created in process_line(), at the line marked with ###.
Below is a snippet of the code:
def process_line(line):
    # receiving a line or string as function
    # argument and replacing '-' 'D00-D09' & 'F00-F09' to '' if it exists
    line = re.sub('D0+\d|F0+\d|-', '', line)
    seq = str(line.split())
    line = re.sub('\'|\\[|\\]|,', '', seq)
    ### line = (seq[0] + '|' seq[3] + '-' seq[5]) # this is for shorter lines
    print line
    return line + '\n'
Here is a sample set of data after removing some unwanted data:
12asA 1 A 4 A 330
12asB 1 B 4 B 330
12caA 1 A 5 A 260
12e8H 1 H 1 H 113 1 H 114 H 212 H 213 H 214 (2)
12e8L 1 L 1 L 107 1 L 108 L 211 L 212 L 214 (3)
I was hoping to achieve a format like this. However, I need to learn how to extract the elements needed, so I can rearrange the data into the required format:
12asA|4-330
12asB|4-330
12caA|5-260
12e8H|1-113,114-212
12e8L|1-107,108-211
Instead of getting, e.g.,
23reA|1-14,56-65
I get something like [2|1-A].
I'm not really sure what you're trying to do here, but this appears to match your desired output:
import re

data = '''
12asA 1 A 4 A 330
12asB 1 B 4 B 330
12caA 1 A 5 A 260
12e8H 1 H 1 H 113 1 H 114 H 212 H 213 H 214 (2)
12e8L 1 L 1 L 107 1 L 108 L 211 L 212 L 214 (3)
'''

lines = filter(None, data.split('\n')) # filter to remove blank lines

def process_line(line):
    line = re.sub(r'D0\d|F0\d|-', '', line)
    for char in "'[],":
        line = line.replace(char, '')
    seq = line.split()
    if len(seq) == 6:
        return '{}|{}-{}'.format(seq[0], seq[3], seq[5])
    elif len(seq) == 16:
        return '{}|{}-{},{}-{}'.format(seq[0], seq[3], seq[5], seq[8], seq[10])

result = [process_line(line) for line in lines]
for r in result:
    print(r)
Output:
12asA|4-330
12asB|4-330
12caA|5-260
12e8H|1-113,114-212
12e8L|1-107,108-211
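One thing to be aware of: process_line as written above only handles lines with exactly 6 or 16 fields, and implicitly returns None for anything else. Below is a minimal sketch of one way to surface such lines instead of silently printing None. The `unhandled` list and the second sample line are made up for illustration, and the str.replace loop is left out to keep the sketch short.

```python
import re

def process_line(line):
    # Same branching as the answer's function, repeated so this sketch runs on its own.
    line = re.sub(r'D0\d|F0\d|-', '', line)
    seq = line.split()
    if len(seq) == 6:
        return '{}|{}-{}'.format(seq[0], seq[3], seq[5])
    elif len(seq) == 16:
        return '{}|{}-{},{}-{}'.format(seq[0], seq[3], seq[5], seq[8], seq[10])
    return None  # made explicit: any other field count is not handled

lines = [
    "12asA 1 A 4 A 330",
    "1xyzA 1 A 4",  # hypothetical line with an unexpected field count
]

results = []
unhandled = []
for line in lines:
    formatted = process_line(line)
    if formatted is None:
        unhandled.append(line)  # keep it for inspection rather than dropping it
    else:
        results.append(formatted)

print(results)
print(unhandled)
```

Collecting the unhandled lines separately makes it easier to decide later whether they need their own format or can be discarded.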
The following regex in your code:
line = re.sub('\'|\\[|\\]|,', '', seq)
is a real mess. I have replaced it with a sequence of simple str.replace calls instead. In future, when writing regular expressions, please use raw strings (e.g. r'...') for readability and to help you avoid bugs.
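As a rough illustration of the raw-string advice, the sketch below writes the same pattern as a raw-string character class and checks that both versions strip the same characters. The names `original`, `readable`, and `sample` are only for this example.

```python
import re

original = '\'|\\[|\\]|,'   # the pattern from the question, with doubled backslashes
readable = r"['\[\],]"      # the same four characters as a raw-string character class

sample = "['12asA', '1', 'A']"

cleaned_original = re.sub(original, '', sample)
cleaned_readable = re.sub(readable, '', sample)

print(cleaned_original)
print(cleaned_readable)
```

Both calls remove the quotes, brackets, and commas, but the raw string needs only one backslash per escaped character, which makes it easier to read and check.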
If you only added that line to get rid of the brackets, commas, and quotes introduced by calling str(line.split()) (rather than to deal with garbage in your original data), you should go ahead and remove its equivalent in the code I posted, because it does nothing useful.
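To see why indexing went wrong in the original snippet, here is a small sketch (the names `tokens` and `as_text` are only illustrative). Calling str() on the result of line.split() produces the printed form of the list as one string, so indexing it yields single characters rather than fields.

```python
line = "12asA 1 A 4 A 330"

tokens = line.split()         # a real list of fields
as_text = str(line.split())   # the printed form of that list, as one string

print(tokens[0])    # the first field
print(as_text[0])   # the first character of the string, i.e. the opening bracket

# Indexing the list directly gives the fields needed for the short format.
formatted = "{}|{}-{}".format(tokens[0], tokens[3], tokens[5])
print(formatted)
```

Keeping the list returned by split() and indexing it directly avoids both the str() call and the cleanup step that follows it.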