Formatting Text Awk Sed

Hi, I have a file that I need to convert into a format I can import into an Excel spreadsheet. I do not know how to do this, and I would appreciate it if you could help me out.

Here is the input sample:

#1

Indiana University—​Bloomington (Kelley) 
Bloomington, IN

90  58  82  86 
#1

Temple University (Fox) 
Philadelphia, PA

95  66  97  95 
#1

University of North Carolina—​Chapel Hill (Kenan-​Flagler) 
Chapel Hill, NC

73  58  100     75 
#4

Here is the desired output:

#1, Indiana University—​Bloomington (Kelley) Bloomington, IN,   90, 58, 82, 86,
#1, Temple University (Fox) Philadelphia, PA,           95,     66,     97,     95, 

I'm using shell scripting on Linux.

Thanks

This is rather simple with GNU awk and mawk if you don't try to use awk in a line-based manner. We'll use a # at the beginning of a line as the record separator and a newline as the field separator. Then:

awk -v RS='(^|\n)#' -F'\n' 'NR > 1 { gsub(/ +/, ", ", $6); print "#" $1 ", " $3 " " $4 ", " $6 }' filename

That is:

NR > 1 {                              # the first record is the empty bit before
                                      # the first separator, so we skip it
  gsub(/ +/, ", ", $6)                # then: insert commas in the number row
  print "#" $1 ", " $3 " " $4 ", " $6 # and reassemble the record in the right
                                      # format for printing.
}

The use of a regex as record separator is not strictly POSIX-conforming, but between gawk and mawk, you'll have most bases covered.

Awk script to solve the problem:

/^#[0-9]/ {current = $0}

# school line: contains "(...)"; .+ also covers hyphenated names such as (Kenan-Flagler)
/\(.+\)/ { current = current "," $0}

/[A-Z]+$/ { current = current $0}

/^[0-9]+/ {current = current "," $1 "," $2 "," $3 "," $4; print current}

Usage:

awk -f script.awk yourdatafile > output.csv

Explanation:

Each regex matches the pattern found on one kind of line, and awk executes the action next to that regex for every matching line.

  • For the #number, initialize/overwrite a current variable with the #number.
  • For the text info without a state, add it to the current variable with a comma at the start.
  • For the text info with a state, add it to the current variable without a comma at the start.
  • For the list of numbers, add them to the current variable with a comma at the start and between each number, then print the current variable.
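
The same line-by-line accumulation can be sketched in Python for comparison. This is only an illustration, not part of the answer above: the function name rows_from_lines is made up, and it uses elif, so each line triggers at most one rule, whereas awk tests every pattern against every line (for the sample data the result is the same).

```python
import re

def rows_from_lines(lines):
    """Build one CSV row per record by classifying each input line."""
    current = ""
    for line in lines:
        line = line.rstrip("\n")
        if re.match(r"#[0-9]", line):        # rank line: start a new row
            current = line
        elif re.search(r"\(.+\)", line):     # school line: append with a comma
            current += "," + line
        elif re.search(r"[A-Z]+$", line):    # location ending in a state code: no comma
            current += line
        elif re.match(r"[0-9]+", line):      # number line: append the fields, emit the row
            current += "," + ",".join(line.split())
            yield current
```

A row is only emitted when a number line is seen, so a dangling rank such as the final #4 in the sample produces no output.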

While it's entirely possible to do that with a bit of awk scripting, I'd recommend you don't.

Actually, awk is handy for anything that's not too complex, but here, since you're already planning to use Excel, you might as well just import the plain file and then process it in Excel: pivoting, reshaping and splitting it there.

However, I hate Excel's complexity, so here's my python2 approach (save it as program.py and make it executable with chmod 755 program.py):

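#!/usr/bin/python
import sys

# Read the whole file and split it into records at each "#".
wholefile = open(sys.argv[1], "r").read()
parts = wholefile.split("#")

for item in parts:
    lines = item.split("\n")
    if len(lines) < 6:
        continue  # empty chunk before the first "#", or an incomplete record
    # rank (without the leading "#"), school, location, then each number as its own field
    output = [lines[0].strip(), lines[2].strip(), lines[3].strip()] + lines[5].split()
    print(";".join(output))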

and run this as

./program.py input.txt > output.csv

EDIT: typo, and:

I tend to say this too often, but doing something in a shell script that isn't centered on invoking a lot of commands is often far less effective than using a general-purpose scripting language. Python is available almost everywhere, so I seldom find myself writing bash scripts.

EDIT2: OK, so no Python on your host. Scary ;P. Use bash's built-in read function (man read).

sed -n '/^[0-9 ]/ s/  */, /g;/^ *$/d;H;$!b;g;s/.//;s/\n\([^#]\)/, \1/g;p' YourFile
  • format the number line (runs of spaces become ", ") and delete empty lines
  • append every remaining line to the hold space
  • at the last line, load the hold space into the pattern space
  • remove the leading newline
  • replace every newline that is not followed by a # with ", ", keeping the following character
  • print the result

If the trailing comma is mandatory (normally it is not in a CSV/Excel file), replace /^[0-9 ]/ s/  */, /g with /^[0-9 ]/ {s/  */, /g; s/$/,/;}
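
For readers less used to sed's hold space, the buffering idea can be sketched in Python. This is a rough equivalent under the assumptions of the sample data, not a translation of the exact command; the function name join_records is made up, and unlike sed this sketch trims trailing spaces.

```python
import re

def join_records(text):
    """Buffer all non-empty lines, then turn every newline
    that is not followed by '#' into ', '."""
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue                                # drop empty lines
        if stripped[0].isdigit():
            stripped = ", ".join(stripped.split())  # format the number line
        kept.append(stripped)
    buffered = "\n".join(kept)                      # plays the role of the hold space
    return re.sub(r"\n(?!#)", ", ", buffered)       # keep newlines only before '#'
```

The lookahead (?!#) does the job of the \([^#]\) capture group: the character after the newline is checked but not consumed.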

Here is an alternative way of doing it with awk by only manipulating the output field separator (OFS) and the output record separator (ORS):

grep -v '^$' infile |      # remove empty lines
awk 'NR%4 { ORS=", "; OFS=" " } NR%4 == 0 { ORS="\n"; OFS=", " } $1=$1'

Output:

#1, Indiana University—​Bloomington (Kelley), Bloomington, IN, 90, 58, 82, 86
#1, Temple University (Fox), Philadelphia, PA, 95, 66, 97, 95
#1, University of North Carolina—​Chapel Hill (Kenan-​Flagler), Chapel Hill, NC, 73, 58, 100, 75
#4, 
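
The group-of-four idea behind the NR%4 test can also be sketched in Python. This is an illustration only: the function name group_in_fours is made up, and it assumes, as the awk version does, that every record has exactly four non-empty lines.

```python
def group_in_fours(lines):
    """Drop empty lines, then join every four lines into one row,
    turning the fourth (number) line into comma-separated values."""
    nonempty = [line for line in lines if line.strip()]
    rows = []
    for i in range(0, len(nonempty), 4):
        chunk = nonempty[i:i + 4]
        # First three lines: collapse whitespace, like OFS=" " with $1=$1.
        fields = [" ".join(part.split()) for part in chunk[:3]]
        if len(chunk) == 4:
            # Fourth line: separate the numbers with ", ", like OFS=", ".
            fields.append(", ".join(chunk[3].split()))
        rows.append(", ".join(fields))
    return rows
```

An incomplete final group, such as the lone #4 in the sample, comes out as a short row, much like the "#4, " line in the awk output.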
