简体   繁体   中英

sorting csv file with python

I am trying to sort a csv file by column. The file has many columns and looks like:

Tom,01AA01,234.56,334

Alice,01AS01,546.09,3434.3

Sam,01NA01,4574.3,65.45

Joy,01BA01,2897.03,455

Pam,01MA01,434.034,454

John,01AA02,343,24

Alice,01AS02,454,454.54

Tom,02BA01,3434,3454.2

And it continues for about 20 columns and 250 rows.

I want it to be sorted by the second column and ordered alphabetically for AA , AS , BA in the second portion, and numerically for the third section '01', '02', '03', and numerically for the first section '01', '02', '03' . And then create a new csv file from this sort. They are not usually just 6 characters long, others are look like '02BAA', '01MAA', '02NAA' and so on.

So in the end it would hopefully look like this for column 2.

01AA01
01AS01
01BA01
01MA01
01NA01
01AA02
01AS02
02BA01

I'm new to coding and not quite sure how to go about doing this. Thank you in advance.

The default sort order for ASCII strings from Python's sorted function is lexicographic (or 'ASCIIbetical'):

>>> li=['1', '20', '100', '11']
>>> sorted(li)
['1', '100', '11', '20']

Compared to integer magnitude when those list values are integers:

>>> sorted(map(int, li))
[1, 11, 20, 100]

ie, the magnitude of the numbers in strings to the human eye is different than the same strings to the computer's eye. (Written about more extensively in Codinghorror )

To fix it, we need to separate the letters from the numbers and convert the numbers to integers (or floats).

The easiest way is with a regex that captures all numbers, converts to ints then all letters.

This sorts into your target:

li1='''\
01AA01
01AS01
01NA01
01BA01
01MA01
01AA02
01AS02
02BA01'''.splitlines()

tgt='''\
01AA01
01AS01
01BA01
01MA01
01NA01
01AA02
01AS02
02BA01'''.splitlines()


import re

def kf(s):
    nums=map(int, re.findall(r'(\d+)', s)) 
    lets=re.findall(r'([a-zA-Z]+)', s)
    return nums+lets   

print tgt==sorted(li1, key=kf)
# True

Or, one line:

>>> tgt==sorted(li1, key=lambda s: map(int, re.findall(r'(\d+)', s))+re.findall(r'(\D+)', s))
True

Edit based on comments

The text of the question states:

I want it to be ordered numerically in the first section 01,02,03... and then alphabetically for AA, AS, BA in the second portion, and numerically again for the third section.

However, the example shows that this is not the case.

We can sort based on the pattern of (int, letters, int) with split:

>>> [re.split(r'(\D+)', e) for e in li1]
[['01', 'AA', '01'], ['01', 'AS', '01'], ['01', 'NA', '01'], ['01', 'BA', '01'], ['01', 'MA', '01'], ['01', 'AA', '02'], ['01', 'AS', '02'], ['02', 'BA', '01']]
>>> sorted(li1, key=lambda s: [int(e) if e.isdigit() else e for e in re.split(r'(\D+)', s)])
['01AA01', '01AA02', '01AS01', '01AS02', '01BA01', '01MA01', '01NA01', '02BA01']
#             ^^        ^^        etc '01AA02', before '01AS01' in the example

By inspection, the pattern of the POSTED example is (int, int, letters) which can be seen here:

>>> [map(int, re.findall(r'(\d+)', s))+re.findall(r'(\D+)', s) for s in li1]
[[1, 1, 'AA'], [1, 1, 'AS'], [1, 1, 'NA'], [1, 1, 'BA'], [1, 1, 'MA'], [1, 2, 'AA'], [1, 2, 'AS'], [2, 1, 'BA']]

If the TEXT is correct, use the split form of sort I have; if the EXAMPLE is correct, use the nums+lets form.

sorted() and the list's .sort() method take an optional key argument.

Where:

key specifies a function of one argument that is used to extract a comparison key from each list element: key=str.lower.

In other words, the function ( that you will write ) given to the key argument parses and returns the sortable value for the given object.

So, given your input, "01AS01" , you want to break it down into pieces that can be easily sorted.

As you mentioned, you want the results sorted by ( int, str, int ). Since sorted() and .sort() automatically will sort by number, in the case of ints , and alphabetically, in case of strings , all you need to do for your key function is to break your value, "01AS01" into [1, "AS", 1] and sorted() / .sort() will take care of the rest.

This is a similar example to dawg's but without using map() and re .

col = ['01AA01',
 '01AS01',
 '01NA01',
 '01BA01',
 '01MA01',
 '01AA02',
 '01AS02',
 '02BA01'] 

def create_sort_key(value):
    int_indexes = (0, 4)
    str_indexes = (2,)
    parsed_values = []
    # get the starting index for groups of two
    for i in range(0, 6, 2):
        pair = value[i:i+2]
        if i in int_indexes:
            parsed_value = int(pair)
        elif i in str_indexes:
            parsed_value = str(pair)
        else:
            raise IndexError("unexpected index: {}".format(i))
        parsed_values.append(parsed_value)
    return parsed_values

col.sort(key=create_sort_key)

Assuming this is a csv file, each line is a row and each column is marked with a comma. Since haven't given us an exmaple of your csv, I made up one that has two columns, with your data in col[1].

>>> print open('mycsv.csv').read()
fred, 01AA01
brenda, 01BA01
bob, 01AA02
alice, 01NA01
jane, 01AS01
blane, 02BA01
larry, 01MA01
mary, 01AS02

These can all be read into a list with the csv module. You end up with a list of rows, where the columns are another list

>>> import csv
>>> table=[row for row in csv.reader(open('mycsv.csv')) if row]
>>> print table
[['fred', ' 01AA01'], ['brenda', ' 01BA01'], ['bob', ' 01AA02'], ['alice', ' 01NA01'], ['jane', ' 01AS01'], ['blane', ' 02BA01'], ['larry', ' 01MA01'], ['mary', ' 01AS02']]

You can sort that list. By default, sort starts with the first key, then the second key if the first is the same, and etc. So, it wil lsort by 'fred' etc... But you can select a different sort key. Python calls key function with each list item so that you can transform it into what you want. The transformations can be simple like make lower case, or complex.

Its common to use lambdas for sort keys, but that may be a bit advanced, so here's a function that just grabs the key you want.

>>> def item_1(row):
...     return row[1]
... 
>>> print table
[['fred', ' 01AA01'], ['bob', ' 01AA02'], ['jane', ' 01AS01'], ['mary', ' 01AS02'], ['brenda', ' 01BA01'], ['larry', ' 01MA01'], ['alice', ' 01NA01'], ['blane', ' 02BA01']]
>>> 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM