How to subset all rows that match a pattern in given columns using Bash?

Question

Given the tab-delimited file:

1    cat      1|1    0|1    0|0    0|0
2    mouse    0|1    1|1    1|1    0|0
3    horse    0|1    0|1    1|1    1|0
4    dog      0|0    0|0    0|0    0|0
5    human    0|0    0|0    0|0    0|0

How can I subset all rows that have one or more "1|1" in the last three columns only? That is, the subset should return:

2    mouse    0|1    1|1    1|1    0|0
3    horse    0|1    0|1    1|1    1|0

The file I need to subset has 2500 columns and 100000 rows. Each of columns 9 to 2500 contains one of 0|0, 1|1, 1|0 or 0|1. How can I subset all rows that have the string 1|1 in at least one of columns 9 to 2500 using Bash?

I have tried:

awk '/^1|1$/' dummy.vcf > dummy.vcf1

However, this does not seem to work. Furthermore, it considers all columns as opposed to columns 9 to 2500.

If anyone is able to help it will be greatly appreciated!

Thanks

Answer 1

This may be what you want:

$ awk '{ for (i=4;i<=NF;i++) if ($i == "1|1") { print; next } }' file
2    mouse    0|1    1|1    1|1    0|0
3    horse    0|1    0|1    1|1    1|0

For your real data just change 4 to 9:

awk '{ for (i=9;i<=NF;i++) if ($i == "1|1") { print; next } }' file

or given your sample data:

$ awk 'match($0,/^([^\t]+\t){3}.*1\|1/)' file
2       mouse   0|1     1|1     1|1     0|0
3       horse   0|1     0|1     1|1     1|0

and change the 3 to 8 for your real data. That last script assumes every field contains ONLY single digits separated by |. If a field could be, for example, 11|10, the substring 1|1 inside it would produce a false match.
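
If the starting column changes between files, the field loop can be wrapped in a small function. This is only a sketch: the name subset_rows and the temporary sample file are made up for illustration, and -F'\t' is added on the assumption that the file really is tab-delimited.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the question): print every row that has
# an exact "1|1" in column $1 or any later column of the file $2.
subset_rows() {
    local start=$1 file=$2
    awk -F'\t' -v start="$start" '
        {
            for (i = start; i <= NF; i++)
                if ($i == "1|1") { print; next }   # first hit is enough
        }
    ' "$file"
}

# Rebuild the sample from the question with real tabs
# (printf reuses the format for each group of six arguments).
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 cat   '1|1' '0|1' '0|0' '0|0' \
    2 mouse '0|1' '1|1' '1|1' '0|0' \
    3 horse '0|1' '0|1' '1|1' '1|0' \
    4 dog   '0|0' '0|0' '0|0' '0|0' \
    5 human '0|0' '0|0' '0|0' '0|0' > "$sample"

# Columns 4 onward, as in the sample; use 9 for the real data.
subset_rows 4 "$sample"
```

Because $i == "1|1" is a string comparison, the | needs no escaping. In a regex such as /^1|1$/ an unescaped | means alternation, so that pattern selects lines that start with 1 or end with 1.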

Answer 2

You can use grep:

grep $'^\([^\t]*\t\)\{7\}.*\t1|1' file

$'...' makes Bash interpret \t as a tab
\{7\} means the previous token is repeated seven times
[^\t]* matches zero or more non-tab characters, i.e. the contents of one column
^ matches the start of a line
.* follows the seven leading columns and is itself followed by a tab, so the 1|1 that comes next begins column 9 or a later column
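
To see how the repeat count relates to the first column searched, here is a sketch that builds the same pattern from a column number. The function name grep_from_col and the sample file are assumptions for illustration. The count is the column number minus 2, because .* still has to cover one more column before the final tab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print rows where some column >= $1 starts with "1|1".
# Requires $1 >= 2, since the pattern always skips at least the ".*" column.
grep_from_col() {
    local col=$1 file=$2
    local tab=$'\t'
    local skip=$((col - 2))   # whole columns consumed before ".*"
    # Basic regex: \( \) group, \{n\} repeat, and | is a literal character.
    grep "^\([^${tab}]*${tab}\)\{${skip}\}.*${tab}1|1" "$file"
}

# Sample from the question, written with real tabs.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 cat   '1|1' '0|1' '0|0' '0|0' \
    2 mouse '0|1' '1|1' '1|1' '0|0' \
    3 horse '0|1' '0|1' '1|1' '1|0' \
    4 dog   '0|0' '0|0' '0|0' '0|0' \
    5 human '0|0' '0|0' '0|0' '0|0' > "$sample"

# Column 4 onward for the sample; 9 gives the \{7\} used above.
grep_from_col 4 "$sample"
```

Unlike an exact field comparison, this matches any field that merely begins with 1|1, which is fine as long as the fields only ever hold 0|0, 0|1, 1|0 or 1|1.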

2 answers

solution1
2 ACCEPTED 2018-10-30 21:32:54

solution2
1 2018-10-30 21:27:24
