简体   繁体   中英

Python equivalent of bash sort lexicographical and numerical

So I've been working on a Python script that combines some information into a "bed" format. Which means that I'm working with features on a genome, my first column is the scaffold name (string), the second the start position on that scaffold (integer) and the third column is the stop position (integer), the other columns contain other information which is not relevant to my question. My issue is that my output is unsorted.

Now I know I can sort my files using this bash command:

$sort -k1,1 -k2,2n -k3,3n infile > outfile

But in the interest efficacy I'd like to know if there's a way to do this in Python. So far I've only seen list based sorts that deal with one either lexicographical or numerical sort. Not a combination of the two. So, do you guys have any ideas?

Snippet of my data (I want to sort by column 1, 2 and 3 (in that order)):

Scf_3R  8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
Scf_3R  8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
Scf_3L  16209309    16236428    FBgn0000017 FBgn0184183 -1.19105585707
Scf_2L  10630469    10632308    FBgn0000018 FBgn0193617 0.073153454539
Scf_3R  12087670    12124207    FBgn0000024 FBgn0022516 -0.023946795475
Scf_X   14395665    14422243    FBgn0000028 FBgn0187465 0.00300558969397
Scf_3R  25163062    25165316    FBgn0000032 FBgn0189058 0.530118698187
Scf_3R  19757441    19808894    FBgn0000036 FBgn0189822 -0.282508464261

Load data, sort them with sorted , write to a new file.

# Load data 
lists = list()
with open(filename, 'r') as f:
    for line in f:
        lists.append(line.rstrip().split())

# Sort data
results = sorted(lists, key=lambda x:(x[0], int(x[1]), int(x[2])))

# Write to a file
import csv
with open(filename, 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(results)

To sort by your own sort criteria, just pass the corresponding key function:

with open('infile', 'rb') as file:
    lines = file.readlines()

def sort_key(line):
    fields = line.split()
    try:
        return fields[0], int(fields[1]), int(fields[2])
    except (IndexError, ValueError):
        return () # sort invalid lines together
lines.sort(key=sort_key)

with open('outfile', 'wb') as file:
    file.writelines(lines)

It assumes there is a newline at the end of the input file (append it if needed).

The code sorts text data by its bytes values (it is ok if the first column is ASCII), open the file in a text mode (use io.open() on Python 2) if it is not the case (to sort by Unicode code point values). The result of the sort command in the shell may depend on locale. You could use PyICU collator in Python .

If you need to sort files that do not fit in memory, see Sorting text file by using Python

The solution of @sparkandshine seems to be short and to the point for the specific sort pattern. Also the one offered by @jf-sebastian to me looks excellent, terse and with important hints/links on internationalization and off-memory sorting strategies.

Maybe the following more explicit show case offers additional helpful info for the OP or someone with similar tasks to adapt to their needs. Please see comments in the mostly pep8 conforming code:

#! /usr/bin/env python
"""Provide a show case for hierarchical sort, that offers flexible
hierarchical lexcical, numeric column sort mixes at runtime.
Hopefully this draft solution offers ideas for helping migrate
the sort shell level operation into a pythonic solution - YMMV."""
from __future__ import print_function

from functools import partial  # We use this to tailor the key function


def text_in_lines_gen(text_in_lines):
    """Mock generator simulating a line source for the data."""
    for line in text_in_lines.split('\n'):
        if line:
            yield line.split()


def sort_hier_gen(iterable_lines, hier_sort_spec):
    """Given iterator of text lines, sort all lines based on
    sort specification in hier_sort_spec.

    Every entry in hier_sort_spec is expected to be a pair with first value
    integer for index in columns of text blocks lines and second entry
    type of sorting in ('int', 'float') numeric or any other for text
    (lexical) ordering regime."""

    num_codes = ('int', 'float')
    converter_map = dict(zip(num_codes, (int, float)))

    # Extract facts from sort spec, prepare processing:
    key_ordered = tuple(k for k, _ in hier_sort_spec)

    # Prepare key function: Step 1 ...
    def _key_in(selected, r):
        """Inject the indexing into the key at sort time
        via partial application, as key function in sort
        has single argument only."""
        return tuple(r[k] for k in selected)

    _key = partial(_key_in, key_ordered)  # ... step 2

    convert_these_by = {}
    for k, t in hier_sort_spec:
        if t in num_codes:
            convert_these_by[k] = converter_map[t]

    if not convert_these_by:  # early out
        for row in sorted(iterable_lines, key=_key):
            yield row
    else:
        def flow_converter(row_iter, converter_map):
            """Row based converter - Don't block the flow ;-)."""
            for row in row_iter:
                for k, convert in converter_map.items():
                    row[k] = convert(row[k])
                yield row

        for row in sorted(flow_converter(iterable_lines,
                          convert_these_by), key=_key):
            yield row


def main():
    """Drive the hierarchical text-int-int sort."""

    data_1 = """Scf_3R  8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
    Scf_3R  8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
    Scf_3L  16209309    16236428    FBgn0000017 FBgn0184183 -1.19105585707
    Scf_2L  10630469    10632308    FBgn0000018 FBgn0193617 0.073153454539
    Scf_3R  12087670    12124207    FBgn0000024 FBgn0022516 -0.023946795475
    Scf_X   14395665    14422243    FBgn0000028 FBgn0187465 0.00300558969397
    Scf_3R  25163062    25165316    FBgn0000032 FBgn0189058 0.530118698187
    Scf_3R  19757441    19808894    FBgn0000036 FBgn0189822 -0.282508464261"""

    bar = []
    x = 0
    for a in range(3, 0, -1):
        for b in range(3, 0, -1):
            for c in range(3, 0, -1):
                x += 1
                bar.append('a_%d %d %0.1f %d' % (a, b, c * 1.1, x))
    data_2 = '\n'.join(bar)

    hier_sort_spec = ((0, 't'), (1, 'int'), (2, 'int'))
    print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
        print(sorted_row)

    hier_sort_spec = ((0, 't'), (1, None), (2, False))
    print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
        print(sorted_row)

    hier_sort_spec = ((0, 't'), (2, 'float'), (1, 'int'))
    print("# Test data set 2 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_2), hier_sort_spec):
        print(sorted_row)

if __name__ == '__main__':
    main()

On my machine the three test cases (including the questions sample data) yield:

First:

# Test data set 1 and sort spec=((0, 't'), (1, 'int'), (2, 'int')):
['Scf_2L', 10630469, 10632308, 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', 16209309, 16236428, 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', 8497493, 8503049, 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', 8599253, 8621866, 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_3R', 12087670, 12124207, 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', 19757441, 19808894, 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', 25163062, 25165316, 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_X', 14395665, 14422243, 'FBgn0000028', 'FBgn0187465', '0.00300558969397']

Second:

# Test data set 1 and sort spec=((0, 't'), (1, None), (2, False)):
['Scf_2L', '10630469', '10632308', 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', '16209309', '16236428', 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', '12087670', '12124207', 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', '19757441', '19808894', 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', '25163062', '25165316', 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_3R', '8497493', '8503049', 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', '8599253', '8621866', 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_X', '14395665', '14422243', 'FBgn0000028', 'FBgn0187465', '0.00300558969397']

Third:

# Test data set 2 and sort spec=((0, 't'), (2, 'float'), (1, 'int')):
['a_1', 1, 1.1, '27']
['a_1', 2, 1.1, '24']
['a_1', 3, 1.1, '21']
['a_1', 1, 2.2, '26']
['a_1', 2, 2.2, '23']
['a_1', 3, 2.2, '20']
['a_1', 1, 3.3, '25']
['a_1', 2, 3.3, '22']
['a_1', 3, 3.3, '19']
['a_2', 1, 1.1, '18']
['a_2', 2, 1.1, '15']
['a_2', 3, 1.1, '12']
['a_2', 1, 2.2, '17']
['a_2', 2, 2.2, '14']
['a_2', 3, 2.2, '11']
['a_2', 1, 3.3, '16']
['a_2', 2, 3.3, '13']
['a_2', 3, 3.3, '10']
['a_3', 1, 1.1, '9']
['a_3', 2, 1.1, '6']
['a_3', 3, 1.1, '3']
['a_3', 1, 2.2, '8']
['a_3', 2, 2.2, '5']
['a_3', 3, 2.2, '2']
['a_3', 1, 3.3, '7']
['a_3', 2, 3.3, '4']
['a_3', 3, 3.3, '1']

Updated to use mostly generators for only having one copy of the data "around" as this is needed anyway (in-memory) for the global sort, but no need for having more copies ;-)

Added also functools.partial as this for me is the fastest way, to adapt a key function to the flexible sort order.

One last update removed the remaining non-generator copy cruft in the case where a conversion is implemented, by defining a local generator function ofr the row based conversions. HTH.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM