Removing rows that don't contain strings from csv file, using one-line reg exp grep/sed
I have idsfile.csv, a comma-separated file of ids (containing no newline characters), and I would like to grab only the lines from a second file, datafile.txt, that contain one of those ids (surrounded by tabs).
Sample idsfile.csv:
000001,000002,000005,000007,000008,000009,000011,000021,000029,000040,...
Sample datafile.txt:
titl e1 000001 description1
title2 000003 descr iption2
ti tle3 000021 des cripti on3
title4 000023 description4
If I were doing this without having to read the ids in from a file, I would try:
grep -Ev '/\t000001\t|\t000002\t|\t000003\t/' datafile.txt > output.txt
but I am unsure how to read in the comma-separated values in a way that lets me use them in the regular expression.
Does anyone know how I might assemble this as a one-line command, please? Perhaps with textscan?
Edit: Actually, if I changed idsfile.csv to have one id on each line (with a tab before and after), would a line similar to this work, or, as I expect, is the syntax quite wrong:
grep -Evf idsfile.csv datafile.txt > output.txt
Use sed to convert the contents of idsfile.csv into a regular expression for use with grep.
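A minimal sketch of that idea, assuming the ids contain only digits (so they are safe to paste into a regular expression) and that the data columns really are separated by tabs. The scratch directory and the sample rows are made up for illustration. Two things differ from the attempt in the question: grep -E does not portably understand "\t", so a literal tab is passed in, and -v is dropped because the goal is to keep the matching lines rather than discard them.

```shell
# Hypothetical sample files in a scratch directory, for illustration only.
workdir=$(mktemp -d)
printf '000001,000002,000021\n' > "$workdir/idsfile.csv"
{
  printf 'title1\t000001\tdescription1\n'
  printf 'title2\t000003\tdescription2\n'
  printf 'title3\t000021\tdescription3\n'
} > "$workdir/datafile.txt"

# A literal tab character to embed in the pattern.
tab=$(printf '\t')

# sed rewrites every comma as "|", turning the id list into one alternation.
ids=$(sed 's/,/|/g' "$workdir/idsfile.csv")

# Keep only the lines where one of the ids sits between two tabs.
grep -E "${tab}(${ids})${tab}" "$workdir/datafile.txt" > "$workdir/output.txt"
cat "$workdir/output.txt"
```

With a very long id list the resulting pattern becomes one huge command-line argument, which is one reason the line-based approaches may scale better.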
The single line of data in idsfile.csv is hostile to this workflow; you will have to transform it into a series of lines. The Unix toolset is based around lines!
So, we need to transliterate the commas into newlines:
tr , '\012' < idsfile.csv > idsfile.lines
fgrep -f idsfile.lines datafile.txt
A POSIX-compliant 'grep' will also recognize:
grep -F -f idsfile.lines datafile.txt
You might even be able to get away with:
tr , '\012' < idsfile.csv |
grep -F -f - datafile.txt
This tells 'grep' to read the list of names to search for from its standard input.
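To try the pipeline end to end, something like the following sketch should do. It builds small stand-in files in a scratch directory (the contents are invented to mirror the samples in the question) and assumes a grep that accepts "-f -" for reading patterns from standard input.

```shell
# Hypothetical stand-ins for idsfile.csv and datafile.txt.
workdir=$(mktemp -d)
printf '000001,000002,000021\n' > "$workdir/idsfile.csv"
{
  printf 'titl e1\t000001\tdescription1\n'
  printf 'title2\t000003\tdescr iption2\n'
  printf 'ti tle3\t000021\tdes cripti on3\n'
  printf 'title4\t000023\tdescription4\n'
} > "$workdir/datafile.txt"

# Split the single CSV line into one id per line, then hand that list to grep
# on standard input; -F treats each id as a fixed string rather than a regex.
tr ',' '\012' < "$workdir/idsfile.csv" |
    grep -F -f - "$workdir/datafile.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

One caveat: a trailing comma in the CSV would produce an empty pattern line, and an empty pattern matches every line, so blank lines are worth filtering out before they reach grep.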
Finally, if you're using GNU grep, you could add '-w' to search for words; it requires each match to be surrounded by non-word characters (spaces in the examples). With the '-w' option, if a line in datafile.txt contains
something 000002100 kkkk
the entry '000021' will not select that line (without the '-w', it would be selected).
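The difference is easy to check with a small experiment. This is only a sketch: the file names and the two data lines are made up, and it assumes a grep that supports -w (GNU grep does).

```shell
# Hypothetical files: one id to look for, and two candidate lines.
workdir=$(mktemp -d)
printf '000021\n' > "$workdir/idsfile.lines"
{
  printf 'something 000002100 kkkk\n'
  printf 'ti tle3\t000021\tdes cripti on3\n'
} > "$workdir/datafile.txt"

# Without -w the id also matches inside the longer number 000002100.
plain=$(grep -F -c -f "$workdir/idsfile.lines" "$workdir/datafile.txt")

# With -w the id must stand alone as a whole word, so only the second line counts.
words=$(grep -F -w -c -f "$workdir/idsfile.lines" "$workdir/datafile.txt")

echo "without -w: $plain matching line(s)"
echo "with -w:    $words matching line(s)"
```

On this sample it should report two matching lines without -w and one with it.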
The following one-liner uses awk to turn each field of the csv file into a list of regexes for grep to match via the -f option. We then use Bash's process substitution syntax <( ) to treat the output of the awk command as a file (named pipe).
$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt
$ cat sample.csv
000001,000003,000005,000007,000008,000009,000011,000023,000029
$ cat title.txt
titl e1 000001 description1
title2 000003 descr iption2
ti tle3 000021 des cripti on3
title4 000023 description4
$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt
titl e1 000001 description1
title2 000003 descr iption2
title4 000023 description4
Note that the line containing 000021 did not match. Also not apparent is that each 6-digit number in title.txt is surrounded by tabs, not spaces.
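If Bash's process substitution or a regex-valued RS is not available (a multi-character RS is not guaranteed by POSIX awk), a similar result can probably be had with a plain pipe. The sketch below is a variant, not the one-liner above: it swaps in tr for the record splitting and grep -F for the matching, and the sample files are invented for illustration. The last data line uses spaces instead of tabs to show why the tab wrapping matters.

```shell
# Hypothetical stand-ins for sample.csv and title.txt.
workdir=$(mktemp -d)
printf '000001,000003,000005,000023\n' > "$workdir/sample.csv"
{
  printf 'titl e1\t000001\tdescription1\n'
  printf 'title2\t000003\tdescr iption2\n'
  printf 'ti tle3\t000021\tdes cripti on3\n'
  printf 'title4\t000023\tdescription4\n'
  printf 'title5 000005 spaces-not-tabs\n'
} > "$workdir/title.txt"

# One id per line, skip blank lines (NF), wrap each id in literal tabs,
# then let grep read the fixed-string patterns from standard input.
tr ',' '\012' < "$workdir/sample.csv" |
    awk 'NF { print "\t" $0 "\t" }' |
    grep -F -f - "$workdir/title.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

Because every pattern already carries its surrounding tabs, -w is not needed here, and the space-separated title5 line is not selected.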