
python, compare strings in columns located in two different text files

I have two text files, "animals.txt" and "colors.txt", as follows. The two strings in each row are separated by a tab.

"animals.txt"

12345  dog
23456  sheep
34567  pig

"colors.txt"

34567  pink
12345  black
23456  white

I want to write Python code that:

  1. For every row in "animals.txt", takes the string in the first column (12345, then 23456, then 34567)
  2. Compares this string to the strings in the first column of "colors.txt"
  3. If it finds a match (12345 == 12345, etc.), writes two output files:

output1, containing the rows of animals.txt plus the value in the second column of colors.txt that corresponds to the querying value (12345):

12345 dog   black
23456 sheep white
34567 pig   pink 

output2, containing a list of the values in the second column of colors.txt that correspond to the querying value (12345, then 23456, then 34567):

black
white
pink

If order doesn't matter, this becomes a pretty easy problem:

with open('animals.txt') as f1, open('colors.txt') as f2:
    animals = {}
    for line in f1:
        # Strip the trailing newline so it does not end up inside the value.
        animal_id, animal_type = line.rstrip('\n').split('\t')
        animals[animal_id] = animal_type

    # animals = dict(map(str.split, f1)) would work instead of the above loop if there are no multi-word entries.

    colors = {}
    for line in f2:
        color_id, color_name = line.rstrip('\n').split('\t')
        colors[color_id] = color_name

    # colors = dict(map(str.split, f2)) would work instead of the above loop if there are no multi-word entries.
    # Thanks @Sven for pointing this out.

common = set(animals) & set(colors)  # set intersection of the ids
with open('output1.txt', 'w') as f1, open('output2.txt', 'w') as f2:
    for i in common:  # iterate over sorted(common, key=int) instead to sort by id
        f1.write("%s\t%s\t%s\n" % (i, animals[i], colors[i]))
        f2.write("%s\n" % colors[i])

You might be able to do this a little more elegantly with a defaultdict, where you append to a list each time a particular key is encountered and then, when writing, check that the list has length 2 before you output. I'm not convinced that approach is better, though.
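The defaultdict idea described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name `join_rows` and the in-memory sample lines (including the extra `99999` row) are made up for the example, and it assumes each id appears at most once per file, since otherwise a list of length 2 would not reliably mean "one animal and one color".

```python
from collections import defaultdict


def join_rows(animal_lines, color_lines):
    """Group second-column values by id and keep ids that collected two values."""
    grouped = defaultdict(list)
    # Animals are read first, so values[0] is the animal and values[1] the color.
    for lines in (animal_lines, color_lines):
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            key, value = line.split("\t")
            grouped[key].append(value)
    # Assumption: ids are unique within each file, so length 2 means
    # the id was seen once in each input.
    return [(key, values[0], values[1])
            for key, values in grouped.items() if len(values) == 2]


# Hypothetical sample data mirroring the question, plus one unmatched id.
animal_lines = ["12345\tdog\n", "23456\tsheep\n", "34567\tpig\n"]
color_lines = ["34567\tpink\n", "12345\tblack\n", "23456\twhite\n", "99999\tgrey\n"]

rows = join_rows(animal_lines, color_lines)
for key, animal, color in rows:
    print("%s\t%s\t%s" % (key, animal, color))
```

On Python 3.7 and later, dictionaries keep insertion order, so the rows come out in the order the ids first appear in the animal lines.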

Do you need to use Python? If you are using bash and your inputs are not sorted, do:

$ join -t $'\t' <( sort animals.txt ) <( sort colors.txt ) > output1
$ cut -f 3 output1 > output2

If you do not have a shell that supports process substitution, then sort your input files and do:

$ join -t '<tab>' animals.txt colors.txt > output1
$ cut -f 3 output1 > output2

Where <tab> is an actual tab character. Depending on your shell, you may be able to enter it with Ctrl-V followed by the Tab key. (Or use a different delimiter for cut.)

I would use pandas:

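from pandas import read_table  # assumes each file starts with a header row (id, animal / id, color)
animals, colors = read_table('animals.txt', index_col=0), read_table('colors.txt', index_col=0)
df = animals.join(colors)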

This results in:

animals.join(colors)
Out[73]: 
       animal  color
id
12345  dog     black
23456  sheep   white
34567  pig     pink

Then, to output the colors in order of id to a file:

df.color.to_csv(r'out.csv', index=False, header=False)  # header=False keeps the column name out of the file

If you are unable to add column headings to the text file, they can be added on import:

animals = read_table('animals.txt', index_col=0, names=['id','animal'])

Under the assumption that each line in the input files is structured exactly as in the examples:

with open("c:\\python27\\output1.txt","w") as out1, \
     open("c:\\python27\\output2.txt","w") as out2:

    # Nested comprehension: pair every animal row with every color row that has the same id.
    for outline in [animal[0]+"\t"+animal[1]+"\t"+color[1]
                    for animal in [line.strip('\n').split("\t")
                    for line in open("c:\\python27\\animals.txt","r").readlines()]
                    for color in [line.strip('\n').split("\t")
                    for line in open("c:\\python27\\colors.txt","r").readlines()]
                    if animal[0] == color[0]]:

        out1.write(outline+'\n')
        out2.write(outline[outline.rfind('\t')+1:]+'\n')

I think that would do it for you.

Perhaps not the most elegant/fast/clear method - but pretty short. Technically that's 4 lines, I believe.
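For comparison, here is a longer but arguably clearer sketch of the same matching, using a dictionary lookup instead of comparing every pair of rows. The helper names `match_rows` and `write_outputs` are invented for this illustration, and `io.StringIO` objects stand in for the real files so the example runs on its own. Unlike the nested comprehension, it assumes ids in colors.txt are unique: if an id were repeated there, only its last color would be kept.

```python
import io


def match_rows(animal_lines, color_lines):
    """Return (id, animal, color) tuples in the order ids appear in animal_lines."""
    colors = {}
    for line in color_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:  # skip blank or malformed lines
            colors[fields[0]] = fields[1]
    matches = []
    for line in animal_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2 and fields[0] in colors:
            matches.append((fields[0], fields[1], colors[fields[0]]))
    return matches


def write_outputs(matches, out1, out2):
    """Write the joined rows to out1 and only the colors to out2."""
    for animal_id, animal, color in matches:
        out1.write("\t".join((animal_id, animal, color)) + "\n")
        out2.write(color + "\n")


# In-memory stand-ins for animals.txt, colors.txt and the two output files.
animals_file = io.StringIO("12345\tdog\n23456\tsheep\n34567\tpig\n")
colors_file = io.StringIO("34567\tpink\n12345\tblack\n23456\twhite\n")
out1, out2 = io.StringIO(), io.StringIO()

write_outputs(match_rows(animals_file, colors_file), out1, out2)
print(out1.getvalue(), end="")
print(out2.getvalue(), end="")
```

With real files, the same two functions can be called with the objects returned by `open(...)`, since they only iterate over lines and call `write`.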

