Comparing items in large list - finding items differing in 1 letter by length - Python

I am trying to solve a problem I've been stuck on in Python. I have a single file with one column, containing approximately 6,000 lines. Every line is unique (the file was filtered to remove duplicates from a 40,000-line file). The items vary in length, and some are equal in length to others.

An example of a single line:

IGHV3-30/33rn-IGHJ4-CARDPSLSSMITFGGVIVTRGYFDYW

Or more examples, tab-separated at the third "-" (with differing first parts):

IGHV3-23-IGHJ4  CAKDRGYTGYGVYFDYW
IGHV4-39-IGHJ4  CARHDILTGYSYYFDYW
IGHV3-23-IGHJ3  CAKSGGWYLSDAFDIW
IGHV4-39-IGHJ4  CARTGFGELGFDYW
IGHV1-2-IGHJ2   CARDSDYDWYFDLW
IGHV1-8-IGHJ3   CARGQTYYDILTGPSDAFDIW
IGHV4-39-IGHJ5  CARSTGDWFDPW
IGHV3-9-IGHJ3   CANVPIYSSSYDAFDIW
IGHV3-23-IGHJ4  CAKDWELYYFDYW
IGHV3-23-IGHJ4  CAKDRGYTGFGVYFDYW
IGHV4-39-IGHJ4  CARHLGYNNSWYPFDYW
IGHV1-2-IGHJ4   CAREGYNWNDEGRFDYW
IGHV3-23-IGHJ3  CAKSSGWYLSDAFDIW
IGHV4-39-IGHJ4  CARYLGYNSNWYPFDYW
IGHV3-23-IGHJ6  CAKEGCSSGCPYYYYGMDVW
IGHV3-23-IGHJ3  CAKWGPDAFDIW
IGHV3-11-IGHJ   CATSGGSP
IGHV3-11-IGHJ4  CARDGDGYNDYW
IGHV1-2-IGHJ4   CARRIGYSSGSEDYW
IGHV1-2-IGHJ4   CARDIAVPGHGDYW
IGHV6-1-IGHJ4   CASGGAVPGYYFDYW

In the first column, some items differ from one another. In the second column, every item is unique. The first-column items need to match, and then the second-column items need to be sorted by a minimum mismatch of 2. Ideally this would be done with the Levenshtein module, since I can set a maximum distance, but it needs two strings. Is there a way to use Levenshtein on every item in a single list?

What I need to do is open this file (I think sorting it by length first may help, but I'm not sure). After all the items are grouped by length, I need to sort them into groups that differ by 1 character (the strings before the third "-" need to be identical, and the strings after the third "-" should differ by only 1 character).

I think my problem is writing a proper for loop to iterate over the item lengths.

The code I have thus far:

import sys
import os
import Levenshtein

inp = sys.argv[1] # Input file containing single column of items

with open(inp, "r") as f1:
        vj = [line.strip() for line in f1]

lengths = []
for k in vj:
        i = len(k)
        lengths.append(i)

lengths_sort = sorted(lengths, reverse = True)

uniq_len = []
for i in lengths_sort:
       if i not in uniq_len:
                uniq_len.append(i)

print uniq_len #For QC purposes

def get_new_list(strings, counts, outlist=[]):
        for s in strings:
                if len(s) == counts[0]:
                        outlist.append(s)
        return outlist

new_vj = get_new_list(vj, uniq_len, outlist=[])
print new_vj
ham = Levenshtein.hamming(new_vj[0], new_vj[1])
print ham

So the output is what I was looking for, but it is not yet complete:

[46, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18]
46
['IGHV3-30/33rn-IGHJ4-CAKDPSLSSMITFGGVIVTRGYFDYW', 'IGHV3-30/33rn-IGHJ4-CARDPSLSSMITFGGVIVTRGYFDYW']
1

There are two items of length 46 (coincidentally, the strings before the third "-" are the same; great) and the two strings differ by only one character.

My trouble is: 1. How can I iterate through the numbers in the uniq_len list as input for matching lengths in the "strings" list (see the function in the code)? 2. I want to create a new list for each distinct length. 3. If there are multiple items in each new list, all items must differ by only 1 character.

Note: The "-" separators were created using the UNIX paste -d- command on 3 files containing 1 column each. Would it be easier to paste these files together with \t as the delimiter to create 3 columns?

Then one could open the file, strip the lines, match up the 1st and 2nd columns, and see whether the third column differs by one or more characters?

All help is appreciated.

Update: Modified to handle a variable number of "id" sub-fields and to print the results as a single string. Note that several test cases were added to the end of the input so that some lines have a different number of leading fields making up the id (i.e. 2 instead of 3).

I also renamed the num_mismatches() function to hamming_distance() because that's what it is.

Using the following input:

IGHV3-23-IGHJ4-CAKDRGYTGYGVYFDYW
IGHV4-39-IGHJ4-CARHDILTGYSYYFDYW
IGHV3-23-IGHJ3-CAKSGGWYLSDAFDIW
IGHV4-39-IGHJ4-CARTGFGELGFDYW
IGHV1-2-IGHJ2-CARDSDYDWYFDLW
IGHV1-8-IGHJ3-CARGQTYYDILTGPSDAFDIW
IGHV4-39-IGHJ5-CARSTGDWFDPW
IGHV3-9-IGHJ3-CANVPIYSSSYDAFDIW
IGHV3-23-IGHJ4-CAKDWELYYFDYW
IGHV3-23-IGHJ4-CAKDRGYTGFGVYFDYW
IGHV4-39-IGHJ4-CARHLGYNNSWYPFDYW
IGHV1-2-IGHJ4-CAREGYNWNDEGRFDYW
IGHV3-23-IGHJ3-CAKSSGWYLSDAFDIW
IGHV4-39-IGHJ4-CARYLGYNSNWYPFDYW
IGHV3-23-IGHJ6-CAKEGCSSGCPYYYYGMDVW
IGHV3-23-IGHJ3-CAKWGPDAFDIW
IGHV3-11-IGHJ-CATSGGSP
IGHV3-11-IGHJ4-CARDGDGYNDYW
IGHV1-2-IGHJ4-CARRIGYSSGSEDYW
IGHV1-2-IGHJ4-CARDIAVPGHGDYW
IGHV6-1-IGHJ4-CASGGAVPGYYFDYW
IGHV1-2-CAREGYNWNDEGRFDYW
IGHV4-39-CARSTGDWFDPW
IGHV1-2-CARDSDYDWYFDLW

and this script:

from collections import defaultdict
from itertools import izip, tee
import os
import sys

# http://en.wikipedia.org/wiki/Hamming_distance#Algorithm_example
def hamming_distance(s1, s2):
    """ Count number of mismatched characters in equal length strings. """
    if not isinstance(s1, basestring): raise ValueError('s1 is not a string')
    if not isinstance(s2, basestring): raise ValueError('s2 is not a string')
    if len(s1) != len(s2): raise ValueError('string lengths do not match')
    return sum(a != b for a, b in izip(s1, s2))

def pairwise(iterable):  # itertools recipe
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

inp = sys.argv[1]  # Input file

unique = defaultdict(list)
with open(inp, 'rb') as file:
    for fields in (line.strip().split('-') for line in file):
        id = '-'.join(fields[:-1])  # recombine all but last field into an id
        unique[id].append(fields[-1])  # accumulate ending fields with same id

for id in sorted(unique):
    final_fields = unique[id]
    final_fields.sort(key=lambda field: len(field))  # sort by length
    print id + ':' + '-'.join(final_fields)
    if len(final_fields) > 1:  # at least one pair to compare for mismatches?
        for a, b in pairwise(final_fields):
            if len(a) == len(b) and hamming_distance(a, b) < 2:
                print '  {!r} and {!r} differ by < 2 characters'.format(a, b)

Output:

IGHV1-2:CARDSDYDWYFDLW-CAREGYNWNDEGRFDYW
IGHV1-2-IGHJ2:CARDSDYDWYFDLW
IGHV1-2-IGHJ4:CARDIAVPGHGDYW-CARRIGYSSGSEDYW-CAREGYNWNDEGRFDYW
IGHV1-8-IGHJ3:CARGQTYYDILTGPSDAFDIW
IGHV3-11-IGHJ:CATSGGSP
IGHV3-11-IGHJ4:CARDGDGYNDYW
IGHV3-23-IGHJ3:CAKWGPDAFDIW-CAKSGGWYLSDAFDIW-CAKSSGWYLSDAFDIW
  'CAKSGGWYLSDAFDIW' and 'CAKSSGWYLSDAFDIW' differ by < 2 characters
IGHV3-23-IGHJ4:CAKDWELYYFDYW-CAKDRGYTGYGVYFDYW-CAKDRGYTGFGVYFDYW
  'CAKDRGYTGYGVYFDYW' and 'CAKDRGYTGFGVYFDYW' differ by < 2 characters
IGHV3-23-IGHJ6:CAKEGCSSGCPYYYYGMDVW
IGHV3-9-IGHJ3:CANVPIYSSSYDAFDIW
IGHV4-39:CARSTGDWFDPW
IGHV4-39-IGHJ4:CARTGFGELGFDYW-CARHDILTGYSYYFDYW-CARHLGYNNSWYPFDYW-CARYLGYNSNWYPFDYW
IGHV4-39-IGHJ5:CARSTGDWFDPW
IGHV6-1-IGHJ4:CASGGAVPGYYFDYW
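
The script above targets Python 2 (izip, basestring, the print statement) and only compares neighbours in the length-sorted list, so two near-identical strings that are not adjacent after sorting would go unreported. As a rough sketch, assuming Python 3 and that every same-length pair within an id should be checked, the same idea could look like this. The helper names group_by_id and near_pairs and the max_mismatches parameter are made up for illustration and are not part of the original answer:

```python
from collections import defaultdict
from itertools import combinations


def hamming_distance(s1, s2):
    """Count mismatched positions in two equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError('string lengths do not match')
    return sum(a != b for a, b in zip(s1, s2))


def group_by_id(lines):
    """Map each id (everything before the last '-') to its final fields."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        prefix, _, last = line.rpartition('-')
        groups[prefix].append(last)
    return groups


def near_pairs(fields, max_mismatches=1):
    """Yield every equal-length pair with at most max_mismatches mismatches."""
    for a, b in combinations(fields, 2):
        if len(a) == len(b) and hamming_distance(a, b) <= max_mismatches:
            yield a, b


# A few lines taken from the sample input above.
sample = [
    'IGHV3-23-IGHJ4-CAKDRGYTGYGVYFDYW',
    'IGHV3-23-IGHJ4-CAKDWELYYFDYW',
    'IGHV3-23-IGHJ4-CAKDRGYTGFGVYFDYW',
    'IGHV3-23-IGHJ3-CAKSGGWYLSDAFDIW',
    'IGHV3-23-IGHJ3-CAKSSGWYLSDAFDIW',
]
groups = group_by_id(sample)
for id_ in sorted(groups):
    for a, b in near_pairs(groups[id_]):
        print('{}: {!r} and {!r} differ by < 2 characters'.format(id_, a, b))
```

Comparing all pairs is quadratic in the size of each id group, which should be acceptable for a file of about 6,000 lines split across many ids, but that is an assumption about the data rather than something measured.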

Hope this update is also helpful...

Let's assume that you want to sort some lines by A) the length of a column and B) the Levenshtein distance of that column versus the entry above or below it.

The issue that immediately arises is that Levenshtein distance is a relation between 2 objects; i.e., Levenshtein(a, b) is not a property of a single item like len(a). The value of the Levenshtein distance changes as the lines in the list change position relative to each other.
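
To make the two-argument nature concrete, here is a minimal pure-Python sketch of the classic dynamic-programming edit distance, applied to every pair in a small list with itertools.combinations. It assumes Python 3; this levenshtein function is an illustrative stand-in written for this example and is not the Levenshtein module mentioned in the question:

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


items = ['CAKSGGWYLSDAFDIW', 'CAKSSGWYLSDAFDIW', 'CAKWGPDAFDIW']
# The distance is defined per pair, so every pair has to be visited.
for x, y in combinations(items, 2):
    print(x, y, levenshtein(x, y))
```

Because each value belongs to a pair rather than to one item, there is no single number per row that a key function could return.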

Python 2.X supports the older cmp argument for sort in addition to key. It is inefficient, since it must be re-evaluated on every pass through the sort. However, a Levenshtein sort is relative to the items above and below the entry.

As an example, let's make a matrix of your example data:

txt='''\
IGHV3-23-IGHJ4  CAKDRGYTGYGVYFDYW
IGHV4-39-IGHJ4  CARHDILTGYSYYFDYW
IGHV3-23-IGHJ3  CAKSGGWYLSDAFDIW
IGHV4-39-IGHJ4  CARTGFGELGFDYW
IGHV1-2-IGHJ2   CARDSDYDWYFDLW
IGHV1-8-IGHJ3   CARGQTYYDILTGPSDAFDIW
IGHV4-39-IGHJ5  CARSTGDWFDPW
IGHV3-9-IGHJ3   CANVPIYSSSYDAFDIW
IGHV3-23-IGHJ4  CAKDWELYYFDYW
IGHV3-23-IGHJ4  CAKDRGYTGFGVYFDYW
IGHV4-39-IGHJ4  CARHLGYNNSWYPFDYW
IGHV1-2-IGHJ4   CAREGYNWNDEGRFDYW
IGHV3-23-IGHJ3  CAKSSGWYLSDAFDIW
IGHV4-39-IGHJ4  CARYLGYNSNWYPFDYW
IGHV3-23-IGHJ6  CAKEGCSSGCPYYYYGMDVW
IGHV3-23-IGHJ3  CAKWGPDAFDIW
IGHV3-11-IGHJ   CATSGGSP
IGHV3-11-IGHJ4  CARDGDGYNDYW
IGHV1-2-IGHJ4   CARRIGYSSGSEDYW
IGHV1-2-IGHJ4   CARDIAVPGHGDYW
IGHV6-1-IGHJ4   CASGGAVPGYYFDYW'''

data=[line.split() for line in txt.splitlines()] 
# [['IGHV3-23-IGHJ4', 'CAKDRGYTGYGVYFDYW'], ['IGHV4-39-IGHJ4', 'CARHDILTGYSYYFDYW'], ['IGHV3-23-IGHJ3', 'CAKSGGWYLSDAFDIW'], ['IGHV4-39-IGHJ4', 'CARTGFGELGFDYW'], ['IGHV1-2-IGHJ2', 'CARDSDYDWYFDLW'], ['IGHV1-8-IGHJ3', 'CARGQTYYDILTGPSDAFDIW'], ['IGHV4-39-IGHJ5', 'CARSTGDWFDPW'], ['IGHV3-9-IGHJ3', 'CANVPIYSSSYDAFDIW'], ['IGHV3-23-IGHJ4', 'CAKDWELYYFDYW'], ['IGHV3-23-IGHJ4', 'CAKDRGYTGFGVYFDYW'], ['IGHV4-39-IGHJ4', 'CARHLGYNNSWYPFDYW'], ['IGHV1-2-IGHJ4', 'CAREGYNWNDEGRFDYW'], ['IGHV3-23-IGHJ3', 'CAKSSGWYLSDAFDIW'], ['IGHV4-39-IGHJ4', 'CARYLGYNSNWYPFDYW'], ['IGHV3-23-IGHJ6', 'CAKEGCSSGCPYYYYGMDVW'], ['IGHV3-23-IGHJ3', 'CAKWGPDAFDIW'], ['IGHV3-11-IGHJ', 'CATSGGSP'], ['IGHV3-11-IGHJ4', 'CARDGDGYNDYW'], ['IGHV1-2-IGHJ4', 'CARRIGYSSGSEDYW'], ['IGHV1-2-IGHJ4', 'CARDIAVPGHGDYW'], ['IGHV6-1-IGHJ4', 'CASGGAVPGYYFDYW']]

Now sort those rows by the length of row[1]:

data.sort(key=lambda l: len(l[1])) 

Now sort based on the Levenshtein distance of the entries. Note that you need to use cmp since Levenshtein is based on two values:

data.sort(cmp=levenshtein)

# [['IGHV3-11-IGHJ', 'CATSGGSP'], ['IGHV4-39-IGHJ5', 'CARSTGDWFDPW'], ['IGHV3-23-IGHJ3', 'CAKWGPDAFDIW'], ['IGHV3-11-IGHJ4', 'CARDGDGYNDYW'], ['IGHV3-23-IGHJ4', 'CAKDWELYYFDYW'], ['IGHV4-39-IGHJ4', 'CARTGFGELGFDYW'], ['IGHV1-2-IGHJ2', 'CARDSDYDWYFDLW'], ['IGHV1-2-IGHJ4', 'CARDIAVPGHGDYW'], ['IGHV1-2-IGHJ4', 'CARRIGYSSGSEDYW'], ['IGHV6-1-IGHJ4', 'CASGGAVPGYYFDYW'], ['IGHV3-23-IGHJ3', 'CAKSGGWYLSDAFDIW'], ['IGHV3-23-IGHJ3', 'CAKSSGWYLSDAFDIW'], ['IGHV3-23-IGHJ4', 'CAKDRGYTGYGVYFDYW'], ['IGHV4-39-IGHJ4', 'CARHDILTGYSYYFDYW'], ['IGHV3-9-IGHJ3', 'CANVPIYSSSYDAFDIW'], ['IGHV3-23-IGHJ4', 'CAKDRGYTGFGVYFDYW'], ['IGHV4-39-IGHJ4', 'CARHLGYNNSWYPFDYW'], ['IGHV1-2-IGHJ4', 'CAREGYNWNDEGRFDYW'], ['IGHV4-39-IGHJ4', 'CARYLGYNSNWYPFDYW'], ['IGHV3-23-IGHJ6', 'CAKEGCSSGCPYYYYGMDVW'], ['IGHV1-8-IGHJ3', 'CARGQTYYDILTGPSDAFDIW']]

If you want to change the two-element sort cmp to a key type (for Python 3, which does not support cmp), you can produce comparison objects:

def cmp_to_key(mycmp):
    'Convert a cmp= function into a key= function'
    class K(object):
        def __init__(self, obj, *args):
            self.obj = obj
        def __lt__(self, other):
            return mycmp(self.obj, other.obj) < 0
        def __gt__(self, other):
            return mycmp(self.obj, other.obj) > 0
        def __eq__(self, other):
            return mycmp(self.obj, other.obj) == 0
        def __le__(self, other):
            return mycmp(self.obj, other.obj) <= 0
        def __ge__(self, other):
            return mycmp(self.obj, other.obj) >= 0
        def __ne__(self, other):
            return mycmp(self.obj, other.obj) != 0
    return K 

Then you call sort using data.sort(key=cmp_to_key(levenshtein))
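
The standard library also provides functools.cmp_to_key (Python 2.7 and 3.2+), so the wrapper class does not have to be written by hand. The sketch below assumes Python 3. Since the levenshtein comparison function used above is not shown in the answer, it substitutes a hypothetical comparator, by_length_then_text, purely to show the calling convention:

```python
from functools import cmp_to_key


def by_length_then_text(row_a, row_b):
    """Hypothetical cmp-style function: returns negative, zero or positive."""
    a, b = row_a[1], row_b[1]
    if len(a) != len(b):
        return len(a) - len(b)      # shorter second column sorts first
    return (a > b) - (a < b)        # tie-break alphabetically


data = [
    ['IGHV3-23-IGHJ3', 'CAKSSGWYLSDAFDIW'],
    ['IGHV3-11-IGHJ', 'CATSGGSP'],
    ['IGHV3-23-IGHJ3', 'CAKSGGWYLSDAFDIW'],
    ['IGHV3-23-IGHJ3', 'CAKWGPDAFDIW'],
]
data.sort(key=cmp_to_key(by_length_then_text))
for row in data:
    print(row)
```

Any two-argument function that returns a negative, zero or positive number can be passed the same way.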
