
How to subset all rows that match a pattern in given columns using Bash?

Given the tab delimited file:

1    cat      1|1    0|1    0|0    0|0
2    mouse    0|1    1|1    1|1    0|0
3    horse    0|1    0|1    1|1    1|0
4    dog      0|0    0|0    0|0    0|0
5    human    0|0    0|0    0|0    0|0

How can I subset all rows that have one or more "1|1" in the last three columns only? That is, the subset should return:

2    mouse    0|1    1|1    1|1    0|0
3    horse    0|1    0|1    1|1    1|0

The file I need to subset has 2500 columns and 100000 rows. Columns 9 to 2500 each contain one of 0|0, 1|1, 1|0, or 0|1. How can I subset all rows that have the string 1|1 in one or more of columns 9 to 2500 using Bash?

I have tried:

awk '/^1|1$/' dummy.vcf > dummy.vcf1

However, this does not seem to work. Furthermore, it considers all columns as opposed to columns 9 to 2500.

If anyone is able to help it will be greatly appreciated!

Thanks

This may be what you want:

$ awk '{ for (i=4;i<=NF;i++) if ($i == "1|1") { print; next } }' file
2    mouse    0|1    1|1    1|1    0|0
3    horse    0|1    0|1    1|1    1|0

For your real data just change 4 to 9:

awk '{ for (i=9;i<=NF;i++) if ($i == "1|1") { print; next } }' file
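If you would rather not edit the script each time, the starting column can be passed in as a variable. This is only a sketch: the function name subset_rows, the start variable, and the temporary sample file are my own inventions for illustration, and -F'\t' assumes the file really is tab-delimited.

```shell
# Hypothetical helper: print rows that have "1|1" in any field from column $1 onward.
# Usage: subset_rows START_COLUMN FILE
subset_rows() {
    awk -F'\t' -v start="$1" '
        {
            # Scan only the requested columns; stop at the first exact match.
            for (i = start; i <= NF; i++)
                if ($i == "1|1") { print; next }
        }
    ' "$2"
}

# Rebuild the sample data from the question as a tab-delimited temp file.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 cat   '1|1' '0|1' '0|0' '0|0' \
    2 mouse '0|1' '1|1' '1|1' '0|0' \
    3 horse '0|1' '0|1' '1|1' '1|0' \
    4 dog   '0|0' '0|0' '0|0' '0|0' \
    5 human '0|0' '0|0' '0|0' '0|0' > "$sample"

# Prints the mouse and horse rows; use 9 instead of 4 for the real data.
subset_rows 4 "$sample"
```

Because awk moves on to the next row as soon as one field matches, rows with an early 1|1 are not scanned all the way to column 2500.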

or given your sample data:

$ awk 'match($0,/^([^\t]+\t){3}.*1\|1/)' file
2       mouse   0|1     1|1     1|1     0|0
3       horse   0|1     0|1     1|1     1|0

and change the 3 to 8 for your real data. This last version assumes every field holds ONLY single digits separated by |; it would give a false match on a value such as 11|10, which contains 1|1 as a substring.
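To see that caveat in action, here is a small sketch using a made-up row (the "bird" line is not from the question). The {3} repetition is written out longhand so it also runs on awk builds without interval-expression support. The second pattern is one possible way to tighten the match to whole fields; treat it as a suggestion rather than a tested drop-in.

```shell
# Hypothetical row: column 4 holds "11|10", which contains "1|1" as a substring.
row=$(printf '6\tbird\t0|0\t11|10\t0|0\t0|0')

# Loose pattern (same idea as above): matches "1|1" anywhere after column 3.
loose=$(printf '%s\n' "$row" |
    awk '/^[^\t]+\t[^\t]+\t[^\t]+\t.*1\|1/ { n++ } END { print n + 0 }')

# Stricter sketch: "1|1" must start right after a tab (or at column 4)
# and be followed by a tab or the end of the line.
strict=$(printf '%s\n' "$row" |
    awk '/^[^\t]+\t[^\t]+\t[^\t]+\t(.*\t)?1\|1(\t|$)/ { n++ } END { print n + 0 }')

echo "loose: $loose, strict: $strict"   # loose: 1, strict: 0
```

The field-by-field comparison with == avoids the problem entirely, which is why it is the safer choice if the data might contain multi-digit values.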

You can use grep:

grep $'^\([^\t]*\t\)\{7\}.*\t1|1' file
  • $'' makes Bash interpret \t as a tab
  • \{7\} means the previous token is repeated seven times
  • [^\t]* matches zero or more non-tab characters, i.e. the contents of one column
  • ^ matches the start of a line
  • .* comes after the first seven columns and must be followed by a tab, so the 1|1 that follows it is in column 9 or later
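The same pattern can be tried against the sample data from the question, where the search should start at column 4, so the repeat count becomes 2 instead of 7. This is a sketch under a few assumptions: the names sample, tab, first_col and pattern are mine, and the tab is produced with printf instead of $'' so the snippet also works in shells without that quoting form.

```shell
# Recreate the question's sample as a tab-delimited temp file.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 cat   '1|1' '0|1' '0|0' '0|0' \
    2 mouse '0|1' '1|1' '1|1' '0|0' \
    3 horse '0|1' '0|1' '1|1' '1|0' \
    4 dog   '0|0' '0|0' '0|0' '0|0' \
    5 human '0|0' '0|0' '0|0' '0|0' > "$sample"

tab=$(printf '\t')   # a literal tab character
first_col=4          # assumed first column to search (9 for the real data)

# Skip first_col - 2 whole columns; ".*" then covers at least one more column
# before the tab that introduces the matching 1|1 field.
pattern="^\([^${tab}]*${tab}\)\{$((first_col - 2))\}.*${tab}1|1"

# Prints the mouse and horse rows.
grep "$pattern" "$sample"
```

Note that in a basic regular expression a bare | is a literal character, which is why 1|1 needs no escaping here. Like the awk regex version, this matches 1|1 as a prefix of a field, so it relies on the fields being single digits only.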
