
Grab matching column from text file in python list

I have a text file that looks like this (from IPython, cat path_to_file):

0   0.25    truth fact 
1   0.25    train home find travel
........
199 0.25    video box store office

I also have another list:

vec = [(76, 0.04334748761500331),
 (128, 0.03697806086341099),
 (81, 0.03131634819532892),
 (1, 0.03131634819532892)]

Now I want to match the first column of vec against the first column of the text file, and output the first and second columns of vec together with the third column of the text file for every match.

If the text file were in the same format as vec, I could have used set(a) & set(b). But the values in the text file are tab-separated (that's what it looks like when doing the following):

with open(path_to_file) as f:
    lines = f.read().splitlines()

Output is:

['0\t0.25\ttruth fact lie
.........................
 '198\t0.25\tfan genre bit enjoy ',
 '199\t0.25\tvideo box store office  ']
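A minimal sketch of how the set(a) & set(b) idea could still apply: split each line on the tab character, then intersect only the keys. The sample lines below are made-up stand-ins for the real file contents, and the names rows, common and matches are illustrative.

```python
# Sample lines standing in for f.read().splitlines() (assumed data, not the real file).
lines = [
    "0\t0.25\ttruth fact ",
    "1\t0.25\ttrain home find travel",
    "199\t0.25\tvideo box store office  ",
]
vec = [(76, 0.04334748761500331), (1, 0.03131634819532892)]

# Map the integer in the first column to the text in the third column.
rows = {}
for line in lines:
    fields = line.split("\t")
    rows[int(fields[0])] = fields[2].strip()

# Keys present both in the file and in vec.
common = set(rows) & {key for key, _ in vec}

# Keep the order of vec and attach the matching text.
matches = [(key, score, rows[key]) for key, score in vec if key in common]
print(matches)
```

Only the first column is intersected here, because the second columns of the file and of vec hold different values and whole tuples would never compare equal.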

Using NumPy:

import numpy as np
import numpy.lib.recfunctions as rfn

dtype = [('index', int), ('text', object)]
table = np.loadtxt(path_to_file, dtype=dtype, usecols=(0,2), delimiter='\t')

dtype = [('index', int), ('score', float)]
array = np.array(vec, dtype=dtype)

joined = rfn.join_by('index', table, array)

for row in joined:
    print(row['index'], row['score'], row['text'])

If you care a lot about performance you can use np.savetxt() to do the output too, but I thought it was easier to understand this way.

Converting vec to a dict and splitting the lines using "\t" as the delimiter should work:

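# Look up scores by index: {76: 0.0433..., 128: 0.0369..., ...}
vecdict = dict(vec)

output = []
with open(path_to_file) as f:
    for l in f:
        words = l.rstrip('\n').split('\t')
        key = int(words[0])  # first column is the integer index
        if key in vecdict:
            output.append("%s %f %s" % (words[0], vecdict[key], ' '.join(words[2:]).strip()))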

output should then be a list of strings.
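To try this without touching the real file, here is a small self-contained sketch that wraps the same logic in a function and feeds it an in-memory file. The function name match_rows and the sample data are illustrative assumptions, not part of the original question.

```python
import io

def match_rows(vec, fileobj):
    """Return 'index score text' strings for rows whose index appears in vec."""
    scores = dict(vec)
    output = []
    for line in fileobj:
        words = line.rstrip("\n").split("\t")
        if len(words) < 3:
            continue  # skip blank or malformed lines
        key = int(words[0])
        if key in scores:
            text = " ".join(words[2:]).strip()
            output.append("%s %f %s" % (words[0], scores[key], text))
    return output

# Assumed sample data in the same tab-separated layout as the file.
sample = io.StringIO(
    "0\t0.25\ttruth fact \n"
    "1\t0.25\ttrain home find travel\n"
    "199\t0.25\tvideo box store office  \n"
)
vec = [(76, 0.04334748761500331), (1, 0.03131634819532892)]
result = match_rows(vec, sample)
print(result)
```

Passing an open file object instead of a path keeps the function easy to test; with a real file it would be called as match_rows(vec, open(path_to_file)).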

PS: If you have multiple delimiters, or are not sure which one is used, you could use either repeated calls to split or the re module, e.g.:

import re
print(re.findall(r"[\w]+", "this has    multiple delimiters\tHere"))

>> ['this', 'has', 'multiple', 'delimiters', 'Here']
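When the delimiter may be any mix of spaces and tabs, a sketch like the following splits on runs of whitespace instead of extracting word characters. Because the third column of the file in the question contains spaces itself, maxsplit is used to keep that column in one piece. The sample line is made up for illustration.

```python
import re

# Assumed sample line with mixed spaces and tabs between the columns.
line = "1  \t 0.25\ttrain home find travel"

# Split on runs of whitespace, at most twice, so the text column stays whole.
index, score, text = re.split(r"\s+", line.strip(), maxsplit=2)
print(index, score, text)

# str.split() with no separator also splits on whitespace runs.
same = line.split(None, 2)
print(same)
```

Unlike the [\w]+ pattern, this keeps the decimal point in 0.25, since only whitespace is treated as a separator.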
