unix - automatically determine field separator and record (EOL) separator?

Say you have 20 files and you don't want to look at each one, but would rather have a script determine the format of each file.

i.e. bash findFileFormat direcName

The script would then loop through each file in the directory and print the filename, whether the fields are delimited (and if so, whether by a comma, a pipe, or something else) or fixed width, and what the record separator is, e.g. CR, LF, or a Ctrl+Z character.

I was thinking that, because some files may have a lot of pipes and commas in the data, the script could count each candidate character per line to determine the delimiter. If no character produces a consistent count on every line, it is safe to assume that the file uses fixed-width fields.

Is there a command or script that can be used to determine these 2 bits of info for each file?

Here's a small Python script that will do as a starting point for what you need:

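import sys
from functools import reduce  # reduce is no longer a builtin in Python 3

# Candidate field separators; extend as needed (e.g. '\t', ';')
separators = [',', '|']
file_name = sys.argv[1]

def sep_cnt(line):
  # Count the occurrences of each candidate separator in one line
  return {sep: line.count(sep) for sep in separators}

with open(file_name, 'r') as inf:
  lines = inf.readlines()

if not lines:
  sys.exit('empty file: ' + file_name)

cnts = [sep_cnt(line) for line in lines]
print(cnts)

def cnts_red(a, b):
  # Keep only the separators whose non-zero count is the same in both lines
  c = {}
  for k, v in a.items():
    if v > 0 and v == b[k]:
      c[k] = v
  return c

# A separator survives only if it occurs equally often on every line
final = reduce(cnts_red, cnts[1:], cnts[0])

if len(final) == 0:
  ftype = 'fixed'
else:
  ftype = 'sep by ' + next(iter(final))

print(ftype)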

Save the script above as heur_sep.py, then run the following somewhere safe (e.g. /tmp):

# Prepare: remove leftover .txt files (only do this in a scratch directory)
rm -f *.txt

# Commas
cat >f1.txt <<e
a,a,a,a
b,b,b,b
c,c,c,c
e

# Pipes
cat >f2.txt <<e
a|a|a|a
b|b|b|b
c|c|c|c
e

# Fixed width
cat >f3.txt <<e
1  2  3
1  2  3
1  2  3
e

# Fixed width with commas
cat >f4.txt <<e
1, 2  3
1  2, 3
1  2, 3,
e

for i in *.txt; do
  echo "--- $i"
  python heur_sep.py "$i"
done

You would have to do some more work to make this robust against different kinds of errors, but it should be a good starting point. Hope this helps.
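The script above only looks at the field separator. For the record separator part of the question, something along these lines might work as a rough sketch. It is not part of the original script: the function name detect_record_separator and the sample_size parameter are made up for illustration. It reads the file in binary mode so that Python does not translate line endings, then counts CRLF pairs, bare CRs and bare LFs in the first chunk of the file and reports whichever is most common. A trailing Ctrl+Z (byte 0x1A) is not checked here.

```python
import sys

def detect_record_separator(path, sample_size=65536):
  """Guess the record separator from the first sample_size bytes of a file.

  Hypothetical helper for illustration; returns 'CRLF', 'CR', 'LF'
  or 'none found'.
  """
  # Binary mode: text mode would silently convert \r\n and \r to \n
  with open(path, 'rb') as f:
    data = f.read(sample_size)

  crlf = data.count(b'\r\n')
  cr = data.count(b'\r') - crlf  # CRs that are not part of a CRLF pair
  lf = data.count(b'\n') - crlf  # LFs that are not part of a CRLF pair

  counts = {'CRLF': crlf, 'CR': cr, 'LF': lf}
  best = max(counts, key=counts.get)
  return best if counts[best] > 0 else 'none found'

if __name__ == '__main__':
  # Usage: python rec_sep.py file1 [file2 ...]
  for path in sys.argv[1:]:
    print(path, detect_record_separator(path))
```

Like the field separator script, this is only a heuristic: a file with mixed line endings is reported by its most frequent one, and a CRLF pair that happens to straddle the end of the sample is counted as a bare CR.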

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.
