Combining multiple fixed width text files and separating data into pipe delimited columns

I need help processing hundreds of split text files (001, 002, 003) that all share the same fixed-width format. Each field should end up in its own column, with columns separated by a pipe (|). For example, the raw data might look like:

123456789HA02HANKS       PAUL       123 3rd Ave #2     NEW YORK      NY10023198601042012235245

and defined in a data dictionary as:

Field 1: SSN, start 1, end 9, length 9
Field 2: Name ID, start 10, end 11, length 2 
Field 3: Transaction Number, start 12, end 13, length 2
Field 4: Last Name, start 14, end 29, length 16
Field 5: First Name, start 30, end 41, length 12
Field 6: Mailing Address, start 42, end 76, length 35
Field 7: City, start 77, end 92, length 16
Field 8: State, start 93, end 94, length 2 
Field 9: Zip, start 95, end 99, length 5
Field 10: DOB, start 100, end 107, length 8
Field 11: Phone Number, start 108, end 117, length 10

I need it to look like:

123456789|HA|02|HANKS|PAUL|123 3rd Ave #2|NEW YORK|NY|10023|19860104|2012235245

I have a C# console file reader that combines multiple files, but I do not know how to separate them into columns. Here is my code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace EmilysFileReader
{
    public class Program
    {
        static void Main(string[] args)
        {
            Program prog = new Program();
            Console.WriteLine("This program will attempt to combine all the files of a given directory.");
            Console.WriteLine("Enter path to the directory:");
            var path = Console.ReadLine();
            string[] files = prog.CollectFiles(path);
            Console.WriteLine("Name for the new file?");
            string filename = Console.ReadLine();
            prog.DoWork(Path.Combine(path, filename), files);
            Console.WriteLine("Finished new file is " + Path.Combine(path, filename));
            Console.WriteLine("Press enter to close.");
            Console.ReadLine();
        }

        private void DoWork(string path, string[] files)
        {
            string filename = path + ".txt";
            foreach (string file in files)
            {
                File.AppendAllText(filename, GetFileContent(file));
            }
        }

        public string[] CollectFiles(string path)
        {
            string[] files = Directory.GetFiles(path);
            Console.WriteLine("Found Files:");
            foreach (string file in files)
            {
                Console.WriteLine(file);
            }
            return files;
        }

        public string GetFileContent(string file)
        {
            return File.ReadAllText(file);
        }
    }
}

I need a way to do this in either C#, Java, SAS, or SSMS. Can anyone point me in the right direction?

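If every line is formatted the same, you can slice it by position in Java. The offsets below match the sample line as it appears in the question, whose fields are narrower than the data dictionary states, so adjust them to your real widths: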

char delimiter = '|';

String text = "123456789HA02HANKS       PAUL       123 3rd Ave #2     NEW YORK      NY10023198601042012235245";

StringBuilder sb = new StringBuilder();

sb.append(text.substring(0, 9)).append(delimiter);
sb.append(text.substring(9, 11)).append(delimiter);
sb.append(text.substring(11, 13)).append(delimiter);
sb.append(text.substring(13, 25).trim()).append(delimiter);
sb.append(text.substring(25, 36).trim()).append(delimiter);
sb.append(text.substring(36, 55).trim()).append(delimiter);
sb.append(text.substring(55, 69).trim()).append(delimiter);
sb.append(text.substring(69, 71)).append(delimiter);
sb.append(text.substring(71, 76)).append(delimiter);
sb.append(text.substring(76, 84)).append(delimiter);
sb.append(text.substring(84));

System.out.println(sb);

Granted, there is no shortcut such as splitting on whitespace, because some fields are separated by spaces, some are not, and some contain several words, so the offsets have to be hard-coded. Hopefully you'll only need to run this once.

Edit: A better way of doing this might be to insert your delimiter, |, at the indices where you know the end of an element will be, and trim each element.
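
As a rough sketch of that idea, the positions can live in a table instead of being repeated in every substring call. The class name FixedWidthToPipe, the method names, and the UTF-8 encoding are my own assumptions, not something from the question. The start/end pairs are copied from the data dictionary, so they will only line up with records that are really padded to those widths.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper class; all names here are illustrative.
class FixedWidthToPipe {
    // 1-based, inclusive {start, end} pairs taken from the data dictionary in the question.
    static final int[][] FIELDS = {
        {1, 9}, {10, 11}, {12, 13}, {14, 29}, {30, 41}, {42, 76},
        {77, 92}, {93, 94}, {95, 99}, {100, 107}, {108, 117}
    };

    // Slice one record at the given positions, trim each field, and join with '|'.
    static String toDelimited(String line, int[][] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            // Convert 1-based inclusive positions to substring's 0-based, end-exclusive
            // indices. Clamping makes a short line yield empty fields instead of throwing.
            int from = Math.min(fields[i][0] - 1, line.length());
            int to = Math.min(fields[i][1], line.length());
            if (i > 0) {
                sb.append('|');
            }
            sb.append(line.substring(from, to).trim());
        }
        return sb.toString();
    }

    // Convert every regular file in inputDir (sorted by path) into one pipe-delimited file.
    // Assumption: the input files are UTF-8 text with one record per line.
    static void convertDirectory(Path inputDir, Path outputFile, int[][] fields) throws IOException {
        List<Path> inputs = new ArrayList<>();
        try (Stream<Path> entries = Files.list(inputDir)) {
            entries.filter(Files::isRegularFile)
                   // Skip the output file in case it lives in the input directory.
                   .filter(p -> !p.toAbsolutePath().equals(outputFile.toAbsolutePath()))
                   .forEach(inputs::add);
        }
        Collections.sort(inputs);
        try (BufferedWriter out = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            for (Path input : inputs) {
                try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (line.isEmpty()) {
                            continue; // ignore blank lines between records
                        }
                        out.write(toDelimited(line, fields));
                        out.newLine();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: FixedWidthToPipe <inputDir> <outputFile>");
            return;
        }
        convertDirectory(Paths.get(args[0]), Paths.get(args[1]), FIELDS);
    }
}
```

Sorting the paths keeps the 001, 002, 003 pieces in order. Note that this does nothing special if a field itself contains a | character, so check your data for that before relying on the output.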

This is a simple problem in SAS. To read the fixed-length values from your source files you just need a simple formatted input statement. Just read everything as character strings.

input field1 $9. field2 $2. .... ;

You could build that list of name/informat pairs into a macro variable from your metadata (here a dataset named metadata with variables field and length) by using the INTO clause of PROC SQL.

proc sql noprint ;
  select catx(' ',field,cats('$',length,'.'))
    into :varlist separated by ' '
    from metadata
  ;
quit;

Now it is easy to build a simple data step that will read all of the input files and write the new delimited file. You can use a single wildcard in the input filename to have SAS read all of the files at once.

data _null_;
   infile '/mypath/*.dat' truncover ;
   input &varlist ;
   file '/myoutpath/newfile.txt' dsd dlm='|' ;
   put (_all_) (:);
run;

If the simple input with a wildcard in the filename doesn't work, you could build the list of filenames into a dataset and use that dataset to drive the data step.

data _null_;
   set filelist;
   infile fixed filevar=filename end=eof truncover ;
   do while (not eof);
     input &varlist ;
     file '/myoutpath/newfile.txt' dsd dlm='|' ;
     put (_all_) (:);
   end;
run;
