简体   繁体   中英

Formatting Text Awk Sed

Hi I have a file that I need to put in a format that I can pull into excel spreadsheet I do not know how to do this and I would appreciate if you could help me out.

Here is the input sample

#1

Indiana University—​Bloomington (Kelley) 
Bloomington, IN

90  58  82  86 
#1

Temple University (Fox) 
Philadelphia, PA

95  66  97  95 
#1

University of North Carolina—​Chapel Hill (Kenan-​Flagler) 
Chapel Hill, NC

73  58  100     75 
#4

Here is the output

#1, Indiana University—​Bloomington (Kelley) Bloomington, IN,   90, 58, 82, 86,
#1, Temple University (Fox) Philadelphia, PA,           95,     66,     97,     95, 

I'm using shell scripting in linux

Thanks

This is rather simple with GNU awk and mawk if you don't try to use it in a line-based manner. We'll use a # at the beginning of a line as record separator and a newline as a field separator. Then:

awk -v RS='(^|\n)#' -F'\n' 'NR > 1 { gsub(/ +/, ", ", $6); print "#" $1 ", " $3 " " $4 ", " $6 }' filename

That is:

NR > 1 {                              # the first record is the empty bit before
                                      # the first separator, so we skip it
  gsub(/ +/, ", ", $6)                # then: insert commas in the number row
  print "#" $1 ", " $3 " " $4 ", " $6 # and reassemble the record in the right
                                      # format for printing.
}

The use of a regex as record separator is not strictly POSIX-conforming, but between gawk and mawk, you'll have most bases covered.

Awk script to solve the problem:

/^#[0-9]/ {current = $0}

/\([A-Za-z ]+\)/ { current = current "," $0}

/[A-Z]+$/ { current = current $0}

/^[0-9]+/ {current = current "," $1 "," $2 "," $3 "," $4; print current}

Usage:

cat yourdatafile | awk -f script.awk > output.csv

Explanation:

Each of the regexes match the patterns on the different lines and executes the action for that line beside the regex.

  • For the #number, initialize/overwrite a current variable with the #number.
  • For the text info without a state, add it to the current variable with a comma at the start
  • For the text info with a State, add it to the current variable without a comma at the start
  • For the list of numbers, add them to the current variable with a comma at the start and in between each then print the current variable

While it's entirely possible you can do that with a bit of awk scripting, I'd recommend you don't do that.

Actually, awk is handy for anything that's not too complex, but here, since you're already planning to use Excel, you might as well just import the plain file, and then process it in excel, pivoting, reshaping, splitting it there.

However, I hate Excels complexity, so here's my python2 approach (saving it as program.py and making it executable as chmod 755 program.py ):

#!/usr/bin/python
import sys

wholefile = open(sys.argv[1], "r").read()
parts = wholefile.split("#")

for item in parts:
    lines = item.split("\n")
    output = [ int(lines[0]), lines[2], lines[3],lines[5].split() ]
    print ";".join(output)

and run this as

program.py input.txt > output.csv

EDIT: typo, and:

I tend to say this too often, but doing something in a shell script that isn't very centered on the wish to invoke a lot of commands is often far less effective than using any general purpose scripting language. Python is so abundant everywhere that I seldom find myself writing bash scripts.

EDIT2: Ok, so no python on your host. scary ;P. Use bash 's built-in read function ( man read ).

 sed '#n;/[0-9 ]/ s/  */, /g;/^ *$/d;H;$!b;g;s/.//;s/\n\([^#]\)/, \1/g;p' YourFile
  • remove and preformat entry line
  • hold the remaining info
  • at the end, load the buffer
  • remove first newline
  • replace any new line not followed by a # by a , and the followed char itself
  • print the result

if the last , is mandatory (normaly not in a csv/excel file) adapt the /[0-9 ]/ s/ */, /g with this ;/[0-9 ]/ {s/ */, /g; s/$/,/;} ;/[0-9 ]/ {s/ */, /g; s/$/,/;}

Here is an alternative way of doing it with awk by only manipulating the output field separator ( OFS ) and the output record separator ( ORS ):

grep -v '^$' infile |      # remove empty lines
awk 'NR%4 { ORS=", "; OFS=" " } NR%4 == 0 { ORS="\n"; OFS=", " } $1=$1'

Output:

#1, Indiana University—​Bloomington (Kelley), Bloomington, IN, 90, 58, 82, 86
#1, Temple University (Fox), Philadelphia, PA, 95, 66, 97, 95
#1, University of North Carolina—​Chapel Hill (Kenan-​Flagler), Chapel Hill, NC, 73, 58, 100, 75
#4, 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM