
Issue sorting lots of files in Python

I have a directory with over 10,000 files, all with the same extension and all named in the same form, e.g.:

 20150921(1)_0001.sgy
 20150921(1)_0002.sgy
 20150921(1)_0003.sgy
 20150921(1)_0004.sgy
...
20150921(1)_13290.sgy

The code I'm currently using is:

from os import listdir

files = listdir('full data')
files.sort()

However, this returns a list in the following order:

20150921(1)_0001.sgy
...
20150921(1)_0998.sgy
20150921(1)_0999.sgy
20150921(1)_1000.sgy
20150921(1)_10000.sgy
20150921(1)_10001.sgy
20150921(1)_10002.sgy
20150921(1)_10003.sgy
20150921(1)_10004.sgy
20150921(1)_10005.sgy
20150921(1)_10006.sgy
20150921(1)_10007.sgy
20150921(1)_10008.sgy
20150921(1)_10009.sgy
20150921(1)_1001.sgy
20150921(1)_10010.sgy

The problem only arises once the file numbers go past 9999: it seems sort can't order the names correctly when the number has five digits instead of four. Can anyone see a way around this?

This is called a natural sort. You can use the third-party natsort package to do this:

from natsort import natsorted
import pprint

files = ['20150921(1)_0001.sgy',
'20150921(1)_0102.sgy',
'20150921(1)_0011.sgy',
'20150921(1)_0003.sgy',
'20150921(1)_0004.sgy',
'20150921(1)_0010.sgy',
'20150921(1)_1001.sgy',
'20150921(1)_0012.sgy',
'20150921(1)_0101.sgy',
'20150921(1)_1003.sgy',
'20150921(1)_0103.sgy',
'20150921(1)_10002.sgy',
'20150921(1)_1002.sgy',
'20150921(1)_10001.sgy',
'20150921(1)_0002.sgy',
]

pprint.pprint(natsorted(files))

This outputs:

['20150921(1)_0001.sgy',
 '20150921(1)_0002.sgy',
 '20150921(1)_0003.sgy',
 '20150921(1)_0004.sgy',
 '20150921(1)_0010.sgy',
 '20150921(1)_0011.sgy',
 '20150921(1)_0012.sgy',
 '20150921(1)_0101.sgy',
 '20150921(1)_0102.sgy',
 '20150921(1)_0103.sgy',
 '20150921(1)_1001.sgy',
 '20150921(1)_1002.sgy',
 '20150921(1)_1003.sgy',
 '20150921(1)_10001.sgy',
 '20150921(1)_10002.sgy']

Alternatively, sort on the integer that follows the underscore:

import os

sorted_filenames = sorted(os.listdir('full data'), key=lambda s: int(s.rsplit('.', 1)[0].split('_', 1)[1]))
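
To see what that key does, here is a rough sketch that unpacks the one-liner into a named helper and runs it on an in-memory list instead of a directory. The helper name `sequence_number` and the sample names are illustrative assumptions, not part of the original answer.

```python
def sequence_number(name):
    """Return the integer after the first underscore in a file name."""
    stem = name.rsplit('.', 1)[0]        # drop the extension
    return int(stem.split('_', 1)[1])    # keep the part after the first '_'

# Hypothetical sample names in the same form as the question's files
sample = [
    '20150921(1)_10000.sgy',
    '20150921(1)_1001.sgy',
    '20150921(1)_0999.sgy',
    '20150921(1)_1000.sgy',
]

by_number = sorted(sample, key=sequence_number)
print(by_number)
```

This key looks only at the trailing number, so it assumes every file shares the same prefix. A name without an underscore raises an IndexError, and a non-numeric suffix raises a ValueError.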

The names are being sorted alphabetically, as strings, so '10000' comes before '1001'. If you want to sort them by number, you will need to do a bit of parsing first:

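   import os
   import re

   def filename_to_tuple(name):
       # Turn e.g. '20150921(1)_0042.sgy' into the tuple (20150921, 1, 42)
       match = re.match(r'(\d+)\((\d+)\)_(\d+)\.sgy', name)
       if not match:
           raise ValueError("Filename doesn't match expected pattern: " + name)
       # Convert each captured group to int so the tuples compare numerically
       return tuple(int(i) for i in match.groups())

   sorted_files = sorted(os.listdir('full data'), key=filename_to_tuple)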
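
If installing a package is not an option, a general natural-sort key can probably be approximated with the standard library alone. The sketch below uses a hypothetical helper, `natural_key`, that splits each name into alternating text and digit runs and compares the digit runs as integers. It only illustrates the idea and is not how natsort is implemented.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so numbers compare by value."""
    # A capturing group makes re.split keep the digit runs; the result
    # alternates text, digits, text, ... so odd indexes are always digits.
    parts = re.split(r'(\d+)', name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Hypothetical sample names in the same form as the question's files
names = [
    '20150921(1)_1001.sgy',
    '20150921(1)_10000.sgy',
    '20150921(1)_1000.sgy',
]

plain = sorted(names)                     # character-by-character comparison
natural = sorted(names, key=natural_key)  # digit runs compared as integers
print(plain)
print(natural)
```

The plain sort compares strings one character at a time, which is why '10000' lands between '1000' and '1001'. The key function avoids that by comparing whole numbers.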


 