Given the tab delimited file:
1 cat 1|1 0|1 0|0 0|0
2 mouse 0|1 1|1 1|1 0|0
3 horse 0|1 0|1 1|1 1|0
4 dog 0|0 0|0 0|0 0|0
5 human 0|0 0|0 0|0 0|0
How can I subset all rows that have one or more "1|1" in the last three columns only? That is, the subset should return:
2 mouse 0|1 1|1 1|1 0|0
3 horse 0|1 0|1 1|1 1|0
The file I need to subset has 2500 columns and 100000 rows. Each of columns 9 to 2500 contains one of 0|0, 1|1, 1|0 or 0|1. How can I subset all rows that have the string 1|1 in one or more of columns 9 to 2500 using Bash?
I have tried:
awk '/^1|1$/' dummy.vcf > dummy.vcf1
However, this does not seem to work. Furthermore, it considers all columns as opposed to columns 9 to 2500.
If anyone is able to help it will be greatly appreciated!
Thanks
This may be what you want:
$ awk '{ for (i=4;i<=NF;i++) if ($i == "1|1") { print; next } }' file
2 mouse 0|1 1|1 1|1 0|0
3 horse 0|1 0|1 1|1 1|0
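To check the one-liner above end to end, here is a minimal sketch that rebuilds the sample as a real tab-delimited file and runs the same loop. The variable name first_col and the explicit -F'\t' are my additions, not part of the answer; the loop itself is unchanged.

```shell
# Recreate the sample from the question as a tab-delimited temp file.
# printf reuses the six-field format for each group of six arguments.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 cat   '1|1' '0|1' '0|0' '0|0' \
    2 mouse '0|1' '1|1' '1|1' '0|0' \
    3 horse '0|1' '0|1' '1|1' '1|0' \
    4 dog   '0|0' '0|0' '0|0' '0|0' \
    5 human '0|0' '0|0' '0|0' '0|0' > "$sample"

# first_col is a hypothetical variable name: 4 for this sample, 9 for the real file.
# The comparison is an exact string test, so only a field equal to 1|1 counts.
result=$(awk -F'\t' -v first_col=4 '
    { for (i = first_col; i <= NF; i++) if ($i == "1|1") { print; next } }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

This should print only the mouse and horse rows; the cat row is skipped because its only 1|1 sits in column 3. As a side note, the attempt in the question, /^1|1$/, selects the wrong rows because an unescaped | is alternation in an awk regex, so it matches lines that start with 1 or end with 1.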
For your real data just change 4 to 9:
awk '{ for (i=9;i<=NF;i++) if ($i == "1|1") { print; next } }' file
or given your sample data:
$ awk 'match($0,/^([^\t]+\t){3}.*1\|1/)' file
2 mouse 0|1 1|1 1|1 0|0
3 horse 0|1 0|1 1|1 1|0
and change the 3 to 8 for your real data. That last one assumes every field contains ONLY single digits separated by |; you can't have 11|10, for example.
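The reason for that caveat can be seen in isolation: an unanchored regex finds 1|1 as a substring, while the loop-based answer compares whole fields. A small sketch, using a made-up field value that does not occur in the question's data:

```shell
# A hypothetical multi-digit field; the question's data only has single digits.
field='11|10'

# Unanchored regex: matches, because "11|10" contains the substring "1|1".
regex_hit=$(printf '%s\n' "$field" |
    awk '{ print (($0 ~ /1\|1/) ? "match" : "no match") }')

# Exact string comparison, as in the loop-based answer: does not match.
exact_hit=$(printf '%s\n' "$field" |
    awk '{ print (($0 == "1|1") ? "match" : "no match") }')

echo "regex: $regex_hit, exact: $exact_hit"
```

So if the real file could ever contain multi-digit values, the field-by-field loop is probably the safer of the two.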
You can use grep:
grep $'^\([^\t]*\t\)\{7\}.*\t1|1' file
$'...' interprets \t as a tab.
\{7\} means the previous token is repeated seven times.
[^\t]* matches non-tabs zero or more times, i.e. the contents of a column.
^ matches the start of a line.
.* here follows the seven previous columns and is followed by a tab, i.e. the 1|1 after that tab starts in column 9 at the earliest.
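If your shell does not support $'...' quoting, the same pattern can be built with a literal tab from printf. This is only a sketch run against the six-column sample from the question; the names tab and skip are mine, and skip is 2 here (search from column 4) where the real file would use 7 (search from column 9).

```shell
# Build a literal tab without relying on bash's $'...' quoting.
tab=$(printf '\t')

# skip is a hypothetical name: columns to skip before the ".*<tab>1|1" part.
skip=2

# The sample from the question, tab-delimited.
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 cat   '1|1' '0|1' '0|0' '0|0' \
    2 mouse '0|1' '1|1' '1|1' '0|0' \
    3 horse '0|1' '0|1' '1|1' '1|0' \
    4 dog   '0|0' '0|0' '0|0' '0|0' \
    5 human '0|0' '0|0' '0|0' '0|0')

# In a basic regex, \( \) and \{ \} are grouping and repetition; | is literal.
matches=$(printf '%s\n' "$sample" |
    grep "^\([^$tab]*$tab\)\{$skip\}.*${tab}1|1")

printf '%s\n' "$matches"
```

On the sample this should return the mouse and horse rows. Like the awk regex, the pattern is not anchored at the end of the field, so it relies on every field holding single digits.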