Parsing a file containing non-printable ASCII characters

Question

I've got a file (possibly binary) that contains mostly non-printable ASCII characters as the output of the octal dump utility, below, shows.

od  -a MyFile.log 
0000000  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
0000020 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
0000040 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0000100 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
0000160 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000200 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
0000220 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
0000240 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
0000260 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx
0000300 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul

I'd like to do the following:

Parse or break the file into paragraph-like sections that start with either of the characters esc , fs , gs and us (ASCII numbers 27, 28, 29 and 31).
Have the output file contain human-readable ASCII characters like octal dump.
Store the result in a file.

What would be the best way of doing this? I'd prefer to use UNIX/Linux shell utilities, eg grep, to perform this task instead of a C program.

Thanks.

Edit I've used the octal dump utility command od -A n -a -v MyFile.log in order to remove the offsets from the file as follows:

  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx

I'd like to proceed from or perhaps pipe this file to some other utility eg awk.

Answer 1

If you have access to an awk that supports regexes in RS (eg, gawk) you can do:

awk 'BEGIN{ ORS = ""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" }
    { print | cmd; close( cmd ) }' MyFile.log > output

This will dump all of the output to a single file. If you want each "paragraph" in a different output file, you can do:

awk 'BEGIN{ ORS=""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" }
    { print | cmd "> output"NR }' MyFile.log

to write files output1, output2, etc.

Note that the standard for awk states that the behavior is unspecified if RS contains more than one character, but many implementations of awk will support regular expressions like this.

Answer 2

od -a -An -v file | perl -0777ne 's/\n//g,print "$_\n " for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs'

od -a -An -v file → octal dump of file with named characters ( -a ), no addresses ( -An ), and no suppressed duplicate lines ( -v ).
-0777 → slurp whole file (the line separator is the non-existent 0777 character).
-n → use an implicit loop to read input (the whole 1 line).
for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs → for every ( /g ) section that optionally begins in esc , fs , gs or us , and contains a maximal sequence of characters (including newline: /s ) without esc , fs , gs or us .
s/\\n//g → remove newlines from od print "$_\\n " → print the section and a newline (and a space to match od 's formatting)

Answer 3

I think the easier thing to do would be a flex program:

/*
 * This file is part of flex.
 * 
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 
 * Neither the name of the University nor the names of its contributors
 * may be used to endorse or promote products derived from this software
 * without specific prior written permission.
 * 
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE.
 */

    /************************************************** 
        start of definitions section

    ***************************************************/

%{
/* A template scanner file to build "scanner.c". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
/*#include "parser.h" */

//put your variables here
char FileName[256];
FILE *outfile;
char inputName[256];


// flags for command line options
static int output_flag = 0;
static int help_flag = 0;

%}


%option 8bit 
%option nounput nomain noyywrap 
%option warn

%%
    /************************************************ 
        start of rules section

    *************************************************/


    /* these flex patterns will eat all input */ 
\x1B { fprintf(yyout, "\n\n"); }
\x1C { fprintf(yyout, "\n\n"); }
\x1D { fprintf(yyout, "\n\n"); }
\x1F { fprintf(yyout, "\n\n"); }
[:alnum:] { ECHO; }
.  { }
\n { ECHO; }


%%
    /**************************************************** 
        start of code section


    *****************************************************/

int main(int argc, char **argv);

int main (argc,argv)
int argc;
char **argv;
{
    /****************************************************
        The main method drives the program. It gets the filename from the
        command line, and opens the initial files to write to. Then it calls the lexer.
        After the lexer returns, the main method finishes out the report file,
        closes all of the open files, and prints out to the command line to let the
        user know it is finished.
    ****************************************************/

    int c;

    // the gnu getopt library is used to parse the command line for flags
    // afterwards, the final option is assumed to be the input file

    while (1) {
        static struct option long_options[] = {
            /* These options set a flag. */
            {"help",   no_argument,     &help_flag, 1},
            /* These options don't set a flag. We distinguish them by their indices. */

            {"useStdOut", no_argument,       0, 'o'},
            {0, 0, 0, 0}
        };
           /* getopt_long stores the option index here. */
        int option_index = 0;
        c = getopt_long (argc, argv, "ho",
            long_options, &option_index);

        /* Detect the end of the options. */
        if (c == -1)
            break;

        switch (c) {
            case 0:
               /* If this option set a flag, do nothing else now. */
               if (long_options[option_index].flag != 0)
                 break;
               printf ("option %s", long_options[option_index].name);
               if (optarg)
                 printf (" with arg %s", optarg);
               printf ("\n");
               break;

            case 'h':
                help_flag = 1;
                break;

            case 'o':
               output_flag = 1;
               break;

            case '?':
               /* getopt_long already printed an error message. */
               break;

            default:
               abort ();
            }
    }

    if (help_flag == 1) {
        printf("proper syntax is: cleaner [OPTIONS]... INFILE OUTFILE\n");
        printf("Strips non printable chars from input, adds line breaks on esc fs gs and us\n\n");
        printf("Option list: \n");
        printf("-o                      sets output to stdout\n");
        printf("--help                  print help to screen\n");
        printf("\n");
        printf("If infile is left out, then stdin is used for input.\n");
        printf("If outfile is a filename, then that file is used.\n");
        printf("If there is no outfile, then infile-EDIT is used.\n");
        printf("There cannot be an outfile without an infile.\n");
        return 0;
    }

    //get the filename off the command line and redirect it to input
    //if there is no filename then use stdin


    if (optind < argc) {
        FILE *file;

        file = fopen(argv[optind], "rb");
        if (!file) {
            fprintf(stderr, "Flex could not open %s\n",argv[optind]);
            exit(1);
        }
        yyin = file;
        strcpy(inputName, argv[optind]);
    }
    else {
        printf("no input file set, using stdin. Press ctrl-c to quit");
        yyin = stdin;
        strcpy(inputName, "\b\b\b\b\bagainst stdin");
    }

    //increment current place in argument list
    optind++;


    /********************************************
        if no input name, then output set to stdout
        if no output name then copy input name and add -EDIT.csv
        otherwise use output name

    *********************************************/
    if (optind > argc) {
        yyout = stdout;
    }   
    else if (output_flag == 1) {
        yyout = stdout;
    }
    else if (optind < argc){
        outfile = fopen(argv[optind], "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",FileName);
                exit(1);
            }
        yyout = outfile;
    }
    else {
        strncpy(FileName, argv[optind-1], strlen(argv[optind-1])-4);
        FileName[strlen(argv[optind-1])-4] = '\0';
        strcat(FileName, "-EDIT");
        outfile = fopen(FileName, "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",FileName);
                exit(1);
            }
        yyout = outfile;
    }

    yylex();
    if (output_flag == 0) {
        fclose(yyout);
    }
    printf("Flex program finished running file %s\n", inputName);
    return 0;
}

To compile for windows or linux, use a linux box with flex , and mingw . Then run this make file in the same dir as the above scanner.l file.

TARGET = cleaner.exe
TESTBUILD = cleaner
LEX = flex
LFLAGS = -Cf
CC = i586-mingw32msvc-gcc
CFLAGS = -O -Wall 
INSTALLDIR = 

.PHONY: default all clean install uninstall cleanall

default: $(TARGET)

all: default install

OBJECTS = $(patsubst %.l, %.c, $(wildcard *.l))

%.c: %.l
    $(LEX) $(LFLAGS) -o $@ $<

.PRECIOUS: $(TARGET) $(OBJECTS)

$(TARGET): $(OBJECTS)
    $(CC) $(OBJECTS) $(CFLAGS) -o $@

linux: $(OBJECTS)
    gcc $(OBJECTS) $(CFLAGS) -o $(TESTBUILD)

cleanall: clean uninstall

clean:
    -rm -f *.c
    -rm -f $(TARGET)
    -rm -f $(TESTBUILD)

uninstall:
    -rm -f $(INSTALLDIR)/$(TARGET)

install:
    cp -f $(TARGET) $(INSTALLDIR)

After compiling and placing on your path somewhere, just use with od -A n -a -v MyFile.log | cleaner od -A n -a -v MyFile.log | cleaner .

Answer 4

I've wrote a simple program
main.c :

#include <stdio.h>

char *human_ch[]=
{
"NILL",
"EOL"
};
char code_buf[3];

// you can implement whatever you want for coversion to human-readable format
const char *human_readable(int ch_code)
{
    switch(ch_code)
    {
    case 0:
        return human_ch[0];
    case '\n':
        return human_ch[1];
    default:
        sprintf(code_buf,"%02x", (0xFF&ch_code) );
        return code_buf;
    }
}

int main( int argc, char **argv)
{
    int ch=0;
    FILE *ofile;
    if (argc<2)
        return -1;

    ofile=fopen(argv[1],"w+");
    if (!ofile)
        return -1;

    while( EOF!=(ch=fgetc(stdin)))
    {

        fprintf(ofile,"%s",human_readable(ch));
        switch(ch)
        {
            case 27:
            case 28:
            case 29:
            case 31:
                fputc('\n',ofile); //paragraph separator
                break;
            default:
                fputc(' ',ofile); //characters separator
                break;
        }
    }

    fclose(ofile);
    return 0;
}

Program reads stdin by bytes and converts every byte using human_readable() function to user specified value. In my example I've implemented jus EOL and NILL values and in all other ways program writes to output file hex code of character
compilation: gcc main.c
program usage: ./a.out outfile <infile

Answer 5

Here's a little Python program that does what you want (at least the splitting bit):

#!/usr/bin/python

import sys

def main():
    if len(sys.argv) < 3:
        return

    name = sys.argv[1]
    codes = sys.argv[2]

    p = '%s.out.%%.4d' % name
    i = 1

    fIn = open(name, 'r')
    fOut = open(p % i, 'w')

    c = fIn.read(1)
    while c != '':
        fOut.write(c)
        c = fIn.read(1)

        if c != '' and codes.find(c) != -1:
            fOut.close()
            i = i + 1
            fOut = open(p % i, 'w')

    fOut.close()
    fIn.close()

if __name__ == '__main__':
    main()

Usage:

python split.py file codes

eg

On the bash command line:

python split.py input.txt $'\x1B'$'\x1C'

Will produce files input.txt.out.0001 , input.txt.out.0002 , ... after splitting input.txt on any of the codes specified (in this example, 127 and 128).

You can then iterate over these files and convert them to a printable format by passing them to od .

for f in `ls input.txt.out.*`; do od $f > $f.od; done

Parsing a file containing non-printable ASCII characters

Question

5 answers

solution1
2 2012-03-06 14:11:09

solution2
2 ACCPTED 2012-03-24 23:36:59

solution3
1 2012-03-06 14:59:39

solution4
1 2012-03-06 15:23:39

solution5
0 2012-03-06 16:09:50

Parsing a file containing non-printable ASCII characters

Question

5 answers

solution1 2 2012-03-06 14:11:09

solution2 2 ACCPTED 2012-03-24 23:36:59

solution3 1 2012-03-06 14:59:39

solution4 1 2012-03-06 15:23:39

solution5 0 2012-03-06 16:09:50

solution1
2 2012-03-06 14:11:09

solution2
2 ACCPTED 2012-03-24 23:36:59

solution3
1 2012-03-06 14:59:39

solution4
1 2012-03-06 15:23:39

solution5
0 2012-03-06 16:09:50