Regular expression as field delimiter in awk

I have a large dataset with 586696 lines and 40 columns. However, I'm only interested in some of these columns. One has names in it and the other has numbers.

I'm having a hard time dealing with the field delimiters in this file. All of the column delimiters are spaces. Suppose my file is called test.txt and has 5 people in it; it looks like this:

Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

Hence, if I run

awk '{print $1 " " $2}' test.txt

the result is

Name Salary
FirstName01 LastName01
FirstName02 MiddleName02
FirstName03 MiddleName03
FirstName04 LastName04
FirstName05 MiddleName05

but what I want is

Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

For the sake of this problem, assume there are columns before the Name column and after the Salary column.

How can I solve my problem? I guess I have to use some regular expression as the field delimiter to use awk here, but I couldn't find a way to do it.

Edit: I think I wasn't clear in the original post. I know awk is giving me exactly what I asked for. My problem is that my full dataset looks something like this:

Column01 Column02 Column03 Name Salary Column06 ...
Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 ...
Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 ...
Text0301 Text0302 Text0303 FirstName03 MiddleName03 LastName03 Salary03 ...
Text0401 Text0402 Text0403 FirstName04 LastName04 Salary04 ...
Text0501 Text0502 Text0503 FirstName05 MiddleName05 LastName05 Salary05 ...

Given the above table, I want an awk command that produces the following result:

Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

Sorry about my misleading question.

As @jas commented, you can check the number of fields in each line with awk's NF variable. So something like this should do the trick for your test.txt:

awk '{name=$4; for (i = 5; i <= NF - 2; i++) name=name " " $i; salary=$i; print name " " salary}' test.txt

This builds the name starting at field 4 and appends every field up to and including the third-to-last one. When the loop ends, i equals NF - 1, so the second-to-last field is taken as the salary.

Of course you must adjust the values in 'name=$4', 'i = 5' and 'NF - 2' to your needs.
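To make those adjustments easier, the same idea can be sketched with the positions passed in as awk variables. The helper name name_salary and the variables first and after are hypothetical, chosen only for illustration; the sketch assumes the name starts at field 4 and exactly one field follows the salary. It separates name and salary with a tab so the result is unambiguous for later processing.

```shell
# name_salary [FILE...]: hypothetical helper, reads stdin when no file is given.
# first = field where the name starts (assumed 4)
# after = number of fields following the salary (assumed 1)
name_salary() {
    awk -v first=4 -v after=1 '{
        sal = NF - after                  # salary position, counted from the right
        name = $first
        for (i = first + 1; i < sal; i++)
            name = name " " $i            # rejoin the parts of the name
        print name "\t" $sal              # tab between name and salary
    }' "$@"
}

printf '%s\n' 'Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 Text0106' | name_salary
```

Counting the salary from the right end of the record is what makes the varying number of name fields harmless.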

As others pointed out, it would be better to change the process that generates the data set so that it uses a unique field delimiter.
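As a minimal sketch of why a unique delimiter helps: assuming the data could be regenerated with tabs between columns (an assumption, the question's file uses spaces), the whole name stays in one field and no loop is needed. The helper name name_salary_tsv is hypothetical.

```shell
# name_salary_tsv [FILE...]: hypothetical helper for tab-separated input,
# reads stdin when no file is given.
name_salary_tsv() {
    # With -F'\t' only tabs split fields, so spaces inside the name are kept.
    awk -F'\t' '{ print $4, $5 }' "$@"
}

# Assumed tab-separated version of one record from the question.
printf 'Text0101\tText0102\tText0103\tFirstName01 LastName01\tSalary01\tText0106\n' | name_salary_tsv
```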

Your problem is a bad original format! If Name is the only column that expands to multiple fields, you can compare the number of fields in each row with the number in the header and adjust the column selection accordingly.

awk 'NR==1{c=NF} {t=$4; for(i=5;i<6+(NF-c);i++) t=t " " $i; print t}' badformat.txt

If none of your other "columns" contain spaces and there's always the same number of "columns" in each row then the way to approach this is to start at field X and print fields to (NF-Y). That way it doesn't matter how many fields are contained in each "column" of the name since the end point is dictated by how many columns should remain after the name.

If your input isn't like that - edit your question to show us what it's really like!

This appears to work on the sample input you provided, but it may be completely wrong for your real input: the sample doesn't contain the values that exist in your real data, and its field positions are inconsistent between the first record and the rest:

$ awk '{e=NF-1; for (i=4;i<=e;i++) printf "%s%s", $i, (i<e?OFS:ORS)}' file
Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

The above was run on this input file, whose first line has been modified to make it at least consistent with the subsequent lines:

$ cat file
Column01 Column02 Column03 Name Salary ...
Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 ...
Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 ...
Text0301 Text0302 Text0303 FirstName03 MiddleName03 LastName03 Salary03 ...
Text0401 Text0402 Text0403 FirstName04 LastName04 Salary04 ...
Text0501 Text0502 Text0503 FirstName05 MiddleName05 LastName05 Salary05 ...
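The "start at field X and print up to NF-Y" approach described in this answer can also be written with X and Y as awk variables, so the command does not need editing when the layout changes. This is only a sketch: the helper name print_span and the variable names first and skip are hypothetical, and it assumes that no column other than the name contains spaces.

```shell
# print_span FIRST SKIP [FILE...]: hypothetical helper, reads stdin when no file is given.
# FIRST = first field to print (X); SKIP = number of trailing fields to drop (Y).
print_span() {
    first=$1; skip=$2; shift 2
    awk -v first="$first" -v skip="$skip" '{
        last = NF - skip                  # end point is fixed relative to the right edge
        for (i = first; i <= last; i++)
            printf "%s%s", $i, (i < last ? OFS : ORS)
    }' "$@"
}

# Assumed layout: 3 leading columns, then name and salary, then 1 trailing column.
printf '%s\n' 'Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 Text0206' | print_span 4 1
```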
