
Sorting a csv file with Python

I am trying to sort a csv file by column. The file has many columns and looks like:

Tom,01AA01,234.56,334
Alice,01AS01,546.09,3434.3
Sam,01NA01,4574.3,65.45
Joy,01BA01,2897.03,455
Pam,01MA01,434.034,454
John,01AA02,343,24
Alice,01AS02,454,454.54
Tom,02BA01,3434,3454.2

And it continues for about 20 columns and 250 rows.

I want it to be sorted by the second column: ordered alphabetically for AA, AS, BA in the second portion, numerically for the third section '01', '02', '03', and numerically for the first section '01', '02', '03'. I then want to create a new csv file from this sort. The values are not always just 6 characters long; others look like '02BAA', '01MAA', '02NAA' and so on.

So in the end, column 2 would hopefully look like this:

01AA01
01AS01
01BA01
01MA01
01NA01
01AA02
01AS02
02BA01

I'm new to coding and not quite sure how to go about doing this. Thank you in advance.

The default sort order for ASCII strings from Python's sorted function is lexicographic (or 'ASCIIbetical'):

>>> li=['1', '20', '100', '11']
>>> sorted(li)
['1', '100', '11', '20']

Compare that to sorting by integer magnitude when those list values are integers:

>>> sorted(map(int, li))
[1, 11, 20, 100]

That is, the magnitude of the numbers in those strings as a person reads them differs from how the computer compares the same strings. (This is written about more extensively on Codinghorror.)
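As a small sketch of the same point in Python 3: passing key=int to sorted() compares the items by numeric value while leaving them as strings. The variable names here are illustrative only.

```python
li = ['1', '20', '100', '11']

# Default string comparison goes character by character.
lexicographic = sorted(li)

# key=int compares by numeric value; the items themselves stay strings.
numeric = sorted(li, key=int)

print(lexicographic)  # ['1', '100', '11', '20']
print(numeric)        # ['1', '11', '20', '100']
```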

To fix it, we need to separate the letters from the numbers and convert the numbers to integers (or floats).

The easiest way is with a regex that captures all the numbers and converts them to ints, then captures all the letters.

This sorts into your target:

li1='''\
01AA01
01AS01
01NA01
01BA01
01MA01
01AA02
01AS02
02BA01'''.splitlines()

tgt='''\
01AA01
01AS01
01BA01
01MA01
01NA01
01AA02
01AS02
02BA01'''.splitlines()


import re

def kf(s):
    # all digit runs as ints first, then all letter runs
    nums=[int(n) for n in re.findall(r'(\d+)', s)]
    lets=re.findall(r'([a-zA-Z]+)', s)
    return nums+lets

print(tgt==sorted(li1, key=kf))
# True

Or, in one line:

>>> tgt==sorted(li1, key=lambda s: list(map(int, re.findall(r'(\d+)', s)))+re.findall(r'(\D+)', s))
True

Edit based on comments

The text of the question states:

I want it to be ordered numerically in the first section 01,02,03... and then alphabetically for AA, AS, BA in the second portion, and numerically again for the third section.

However, the example shows that this is not the case.

We can sort based on the pattern of (int, letters, int) with split:

>>> [re.split(r'(\D+)', e) for e in li1]
[['01', 'AA', '01'], ['01', 'AS', '01'], ['01', 'NA', '01'], ['01', 'BA', '01'], ['01', 'MA', '01'], ['01', 'AA', '02'], ['01', 'AS', '02'], ['02', 'BA', '01']]
>>> sorted(li1, key=lambda s: [int(e) if e.isdigit() else e for e in re.split(r'(\D+)', s)])
['01AA01', '01AA02', '01AS01', '01AS02', '01BA01', '01MA01', '01NA01', '02BA01']
#             ^^        ^^        etc '01AA02', before '01AS01' in the example

By inspection, the pattern of the POSTED example is (int, int, letters), which can be seen here:

>>> [list(map(int, re.findall(r'(\d+)', s)))+re.findall(r'(\D+)', s) for s in li1]
[[1, 1, 'AA'], [1, 1, 'AS'], [1, 1, 'NA'], [1, 1, 'BA'], [1, 1, 'MA'], [1, 2, 'AA'], [1, 2, 'AS'], [2, 1, 'BA']]

If the TEXT is correct, use the split form of the sort; if the EXAMPLE is correct, use the nums+lets form.
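To connect this back to the csv file itself, here is a rough Python 3 sketch that reads rows, sorts them by the second column using the (int, int, letters) key, and writes the result out. It assumes the sample rows from the question; the io.StringIO objects stand in for real files, and the name example_key is made up for illustration.

```python
import csv
import io
import re

# The sample rows from the question, held in memory instead of a real file.
raw = """\
Tom,01AA01,234.56,334
Alice,01AS01,546.09,3434.3
Sam,01NA01,4574.3,65.45
Joy,01BA01,2897.03,455
Pam,01MA01,434.034,454
John,01AA02,343,24
Alice,01AS02,454,454.54
Tom,02BA01,3434,3454.2
"""

def example_key(code):
    """(int, int, letters) key, matching the order shown in the question's example."""
    nums = [int(n) for n in re.findall(r'\d+', code)]
    lets = re.findall(r'[a-zA-Z]+', code)
    return nums + lets

rows = list(csv.reader(io.StringIO(raw)))
rows.sort(key=lambda row: example_key(row[1]))  # column 2 is index 1

out = io.StringIO()  # stand-in for open('sorted.csv', 'w', newline='')
csv.writer(out, lineterminator='\n').writerows(rows)
print(out.getvalue())
```

One caveat: in Python 3 an int cannot be compared with a str, so if some codes lack the trailing number (such as '02BAA') the keys have different shapes and the sort may raise a TypeError. The key would need adjusting for that case.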

sorted() and the list's .sort() method take an optional key argument.

Where:

key specifies a function of one argument that is used to extract a comparison key from each list element: key=str.lower.

In other words, the function (that you will write) given to the key argument parses the given object and returns the sortable value for it.
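A tiny illustration of what the key argument does, using the key=str.lower case quoted above. The word list is made up for the example.

```python
words = ['banana', 'apple', 'Cherry']

# Without a key, uppercase letters sort before lowercase ones.
print(sorted(words))                 # ['Cherry', 'apple', 'banana']

# str.lower is called once per item; the results are compared,
# but the original items are what sorted() returns.
print(sorted(words, key=str.lower))  # ['apple', 'banana', 'Cherry']
```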

So, given your input, "01AS01", you want to break it down into pieces that can be easily sorted.

As you mentioned, you want the results sorted by (int, str, int). Since sorted() and .sort() automatically sort by number in the case of ints, and alphabetically in the case of strings, all your key function needs to do is break your value "01AS01" into [1, "AS", 1], and sorted() / .sort() will take care of the rest.

This is a similar example to dawg's, but without using map() and re.

col = ['01AA01',
 '01AS01',
 '01NA01',
 '01BA01',
 '01MA01',
 '01AA02',
 '01AS02',
 '02BA01'] 

def create_sort_key(value):
    int_indexes = (0, 4)
    str_indexes = (2,)
    parsed_values = []
    # get the starting index for groups of two
    for i in range(0, 6, 2):
        pair = value[i:i+2]
        if i in int_indexes:
            parsed_value = int(pair)
        elif i in str_indexes:
            parsed_value = str(pair)
        else:
            raise IndexError("unexpected index: {}".format(i))
        parsed_values.append(parsed_value)
    return parsed_values

col.sort(key=create_sort_key)
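The function above assumes every code is exactly three pairs of characters. The question mentions codes such as '02BAA', so here is a possible variation, still without re, that splits on runs of digits and non-digits using itertools.groupby. The name flexible_sort_key and the sample codes are made up for illustration, and it assumes every code starts with digits and then alternates letters and digits in the same pattern.

```python
from itertools import groupby

def flexible_sort_key(value):
    """Split a code into runs of digits and non-digits; digit runs become ints."""
    key = []
    for is_digit, chars in groupby(value, key=str.isdigit):
        run = ''.join(chars)
        key.append(int(run) if is_digit else run)
    return key

codes = ['02BAA', '01MAA', '01AA02', '01AA01', '02BA01']
print(flexible_sort_key('01AA02'))  # [1, 'AA', 2]
print(sorted(codes, key=flexible_sort_key))
# ['01AA01', '01AA02', '01MAA', '02BA01', '02BAA']
```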

Assuming this is a csv file, each line is a row and the columns are separated by commas. Since you haven't given us an example of your csv, I made up one that has two columns, with your data in col[1].

>>> print(open('mycsv.csv').read())
fred, 01AA01
brenda, 01BA01
bob, 01AA02
alice, 01NA01
jane, 01AS01
blane, 02BA01
larry, 01MA01
mary, 01AS02

These can all be read into a list with the csv module. You end up with a list of rows, where each row is itself a list of columns.

>>> import csv
>>> table=[row for row in csv.reader(open('mycsv.csv')) if row]
>>> print(table)
[['fred', ' 01AA01'], ['brenda', ' 01BA01'], ['bob', ' 01AA02'], ['alice', ' 01NA01'], ['jane', ' 01AS01'], ['blane', ' 02BA01'], ['larry', ' 01MA01'], ['mary', ' 01AS02']]

You can sort that list. By default, sort starts with the first key, then the second key if the first is the same, and so on. So it will sort by 'fred' etc. But you can select a different sort key. Python calls the key function with each list item so that you can transform it into what you want. The transformations can be simple, like making lower case, or complex.

It's common to use lambdas for sort keys, but that may be a bit advanced, so here's a function that just grabs the key you want.

>>> def item_1(row):
...     return row[1]
... 
>>> table.sort(key=item_1)
>>> print(table)
[['fred', ' 01AA01'], ['bob', ' 01AA02'], ['jane', ' 01AS01'], ['mary', ' 01AS02'], ['brenda', ' 01BA01'], ['larry', ' 01MA01'], ['alice', ' 01NA01'], ['blane', ' 02BA01']]
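Note that the values in the second column above keep the space that follows each comma. As a hedged sketch in Python 3, the csv module's skipinitialspace option can drop that space on reading, and operator.itemgetter(1) can replace the small helper function. The io.StringIO object is an in-memory stand-in for mycsv.csv with only a few of its rows.

```python
import csv
import io
from operator import itemgetter

# In-memory stand-in for mycsv.csv, including the space after each comma.
data = io.StringIO("fred, 01AA01\nbrenda, 01BA01\nbob, 01AA02\nalice, 01NA01\n")

# skipinitialspace=True ignores whitespace immediately after the delimiter.
table = [row for row in csv.reader(data, skipinitialspace=True) if row]

# itemgetter(1) does the same job as a function that returns row[1].
table.sort(key=itemgetter(1))
print(table)
# [['fred', '01AA01'], ['bob', '01AA02'], ['brenda', '01BA01'], ['alice', '01NA01']]
```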

