Remove lines that do not contain a string from a CSV file using a one-line regexp with grep / sed

Question

I have idsfile.csv, which is a comma-separated file of ids (with no newline characters in it), and I would like to grab only the lines from a second file, datafile.txt, which contain one of those ids (surrounded by tabs).

Sample idsfile.csv:

000001,000002,000005,000007,000008,000009,000011,000021,000029,000040,...

Sample datafile.txt:

titl e1   000001   description1 
title2   000003   descr iption2 
ti tle3   000021   des cripti on3 
title4   000023   description4

If I were doing this without having to read in the ids from a file, I would try:

grep -Ev '/\t000001\t|\t000002\t|\t000003\t/' datafile.txt > output.txt

but I am unsure how to read in the comma-separated values in a way that lets me use them in the regular expression.

Does anyone know how I might assemble this as a one-line command, please? Perhaps with textscan?

Edit: Actually, if I changed idsfile.csv to have one id on each line (with a tab before and after), would a line similar to this work, or, as I expect, is the syntax quite wrong:

grep -Evf idsfile.csv datafile.txt > output.txt

Answer 1

Use sed to convert the contents of idsfile.csv into a regular expression to use with grep.
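
The answer gives no command, so the following is only a minimal sketch of one way to do it, not necessarily what the answerer had in mind. It assumes idsfile.csv holds a single comma-separated line of purely numeric ids, and it builds small sample files (with made-up contents modelled on the question) so that it can be run as-is. Run it in an empty scratch directory, since it overwrites files with these names.

```shell
# Sample inputs modelled on the question; fields in datafile.txt are tab-separated.
printf '000001,000002,000021\n' > idsfile.csv
printf 'titl e1\t000001\tdescription1\n'   >  datafile.txt
printf 'title2\t000003\tdescr iption2\n'   >> datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt
printf 'title4\t000023\tdescription4\n'    >> datafile.txt

# Turn "a,b,c" into the alternation "a|b|c".
ids=$(sed 's/,/|/g' idsfile.csv)

# A literal tab character; basic and extended regular expressions
# do not define "\t", so the tab has to be passed to grep as-is.
tab=$(printf '\t')

# Keep only the lines where one of the ids sits between two tabs.
grep -E "${tab}(${ids})${tab}" datafile.txt > output.txt
```

Note that there is no -v here: the question asks to keep the lines that contain an id, and -v would select the lines that do not.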

Answer 2

The single line of data in idsfile.csv is hostile to this workflow: you will have to transform it into a series of lines. The Unix toolset is based around lines!

So, we need to transliterate the commas into newlines:

tr , '\012' < idsfile.csv > idsfile.lines
fgrep -f idsfile.lines datafile.txt

A POSIX-compliant 'grep' will also recognize:

grep -F -f idsfile.lines datafile.txt

You might even be able to get away with:

tr , '\012' < idsfile.csv |
grep -F -f - datafile.txt

This tells 'grep' to read the list of ids to search for from its standard input.

Finally, if you're using GNU grep, you could add '-w' to search for whole words: the match must then be bounded by non-word characters, that is, anything other than letters, digits, and underscores (the whitespace in the examples). The '-w' option means that if a line in datafile.txt contains

something 000002100  kkkk

the entry '000021' will not select that line (without the '-w', it would be selected).
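
To make the difference concrete, here is a small sketch that runs the same search with and without '-w'. The file contents are made-up samples based on the lines quoted above, and the output file names without_w.txt and with_w.txt are chosen only for this illustration. Run it in an empty scratch directory, since it overwrites the files it creates.

```shell
# One id to search for, and two data lines: one where the id is only a
# substring of a longer number, one where it is a tab-delimited field.
printf '000021\n' > idsfile.lines
printf 'something 000002100  kkkk\n'        >  datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n'  >> datafile.txt

# Plain fixed-string search: the substring inside 000002100 also matches.
grep -F -f idsfile.lines datafile.txt > without_w.txt

# Whole-word search: only the line with the standalone field is kept.
grep -F -w -f idsfile.lines datafile.txt > with_w.txt
```

After running this, without_w.txt should contain both lines and with_w.txt only the 'ti tle3' line.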

Answer 3

The following one-liner uses awk to turn each field of the CSV file into a list of regexes for grep to match via the -f option. We then use Bash's process substitution syntax <( ) to treat the output of the awk command as a file (named pipe). Note that a multi-character RS is treated as a regular expression by gawk and mawk; POSIX leaves that behavior unspecified, so other awk implementations may differ.

$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt

Input

$ cat sample.csv
000001,000003,000005,000007,000008,000009,000011,000023,000029

$ cat title.txt
titl e1 000001  description1
title2  000003  descr iption2
ti tle3 000021  des cripti on3
title4  000023  description4

Output

$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt
titl e1 000001  description1
title2  000003  descr iption2
title4  000023  description4

Note that the line containing 000021 did not match. Also, although it is not apparent here, each 6-digit number in title.txt is surrounded by tabs, not spaces.
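
Since tabs and spaces look the same in the listings above, one way to check which is which is sed's 'l' command, which prints each line in an unambiguous form. This is only a small sketch on a made-up one-line copy of title.txt; visible.txt is a name chosen for the illustration. Run it in an empty scratch directory, since it overwrites title.txt.

```shell
# A one-line sample with real tabs around the id.
printf 'titl e1\t000001\tdescription1\n' > title.txt

# 'l' writes tabs as \t and marks the end of each line with $,
# so tab-delimited fields become visible.
sed -n l title.txt > visible.txt
cat visible.txt
```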

Answer 1
1 2010-12-04 17:26:27

Answer 2
1 Accepted 2010-12-04 17:30:53

Answer 3
1 2010-12-04 17:40:26