
Removing rows that don't contain strings from a csv file, using a one-line regexp with grep/sed

I have idsfile.csv, a comma-separated file of ids (containing no newline characters), and I would like to grab only the lines from a second file, datafile.txt, that contain one of those ids (surrounded by tabs).

Sample idsfile.csv:

000001,000002,000005,000007,000008,000009,000011,000021,000029,000040,...

Sample datafile.txt:

titl e1   000001   description1 
title2   000003   descr iption2 
ti tle3   000021   des cripti on3 
title4   000023   description4 

If I were doing this without having to read in the ids from a file, I would try:

grep -Ev '/\t000001\t|\t000002\t|\t000003\t/' datafile.txt > output.txt

but I am unsure how to read in the comma-separated values in a way that lets me use them in the regular expression.

Does anyone know how I might assemble this as a one-line command, please? Perhaps with textscan?

Edit: Actually, if I changed idsfile.csv to have one id per line (with a tab before and after), would a line similar to this work, or, as I expect, is the syntax quite wrong:

grep -Evf idsfile.csv datafile.txt > output.txt

Use sed to convert the contents of idsfile.csv into a regular expression for use with grep.
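One way the sed idea could look is sketched below. This is only an illustration under stated assumptions: the sample data is made up to resemble the question's files, the variable names `alternation`, `pattern` and `matches` are hypothetical, and the tab delimiters are built with printf because grep -E does not itself interpret `\t`.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data; printf turns \t into real tab characters.
printf '000001,000002,000005,000021' > idsfile.csv
printf 'titl e1\t000001\tdescription1\n'   >  datafile.txt
printf 'title2\t000003\tdescr iption2\n'   >> datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt
printf 'title4\t000023\tdescription4\n'    >> datafile.txt

# Turn "a,b,c" into "a|b|c" ...
alternation=$(sed 's/,/|/g' idsfile.csv)
# ... then wrap the alternation in literal tabs: TAB(a|b|c)TAB
pattern=$(printf '\t(%s)\t' "$alternation")

# Keep only the lines that contain one of the ids between two tabs.
matches=$(grep -E "$pattern" datafile.txt)
printf '%s\n' "$matches"
```

With a very long id list the single alternation can become unwieldy, which is one reason to prefer a pattern file read with -f.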

The single line of data in idsfile.csv is hostile to this workflow - you will have to transform it into a series of lines. The Unix toolset is based around lines!

So, we need to transliterate the commas into newlines:

tr , '\012' < idsfile.csv > idsfile.lines
fgrep -f idsfile.lines datafile.txt

A POSIX-compliant 'grep' will also recognize:

grep -F -f idsfile.lines datafile.txt

You might even be able to get away with:

tr , '\012' < idsfile.csv |
grep -F -f - datafile.txt

This tells 'grep' to read the list of strings to search for from its standard input.
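The pipeline above matches an id anywhere in a line. As a hedged variation, the sketch below wraps each id in real tab characters before handing the list to grep, so that only tab-delimited ids match, as the question asks. The sample data and the variable name `matches` are assumptions, and reading patterns with `-f -` relies on the grep implementation accepting `-` as standard input (GNU grep does).

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data resembling the question's files.
printf '000001,000002,000005,000021' > idsfile.csv
printf 'titl e1\t000001\tdescription1\n'   >  datafile.txt
printf 'title2\t000003\tdescr iption2\n'   >> datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt
printf 'title4\t000023\tdescription4\n'    >> datafile.txt

# 1. tr:   one id per line
# 2. awk:  surround each non-empty id with literal tabs
#          (the NF guard skips blank lines, which would otherwise
#           produce a pattern that matches far too much)
# 3. grep: treat each line of stdin as a fixed string to search for
matches=$(tr , '\n' < idsfile.csv |
          awk 'NF { printf "\t%s\t\n", $0 }' |
          grep -F -f - datafile.txt)
printf '%s\n' "$matches"
```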

Finally, if you're using GNU grep, you could add ' -w ' to search for words - it requires each match to be bounded by non-word characters (anything other than letters, digits and underscores; spaces in the examples) or by the start or end of the line. The ' -w ' option means that if a line in datafile.txt contains

something 000002100  kkkk

the entry '000021' will not select that line (without the ' -w ', it would be selected).
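The effect of ' -w ' can be checked with a small experiment. The following is a minimal sketch that assumes GNU grep; the two-line data file and the variable names `without_w` and `with_w` are made up for the demonstration.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# One id to search for, and two candidate lines:
# the first only contains the id inside a longer number.
printf '000021\n' > idsfile.lines
printf 'something 000002100  kkkk\n'       >  datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt

# -c prints the number of matching lines instead of the lines themselves.
without_w=$(grep -F -c -f idsfile.lines datafile.txt)
with_w=$(grep -F -w -c -f idsfile.lines datafile.txt)

printf 'without -w: %s line(s)\n' "$without_w"
printf 'with -w:    %s line(s)\n' "$with_w"
```

Without ' -w ' both lines are counted, because 000021 occurs inside 000002100; with ' -w ' only the tab-delimited occurrence counts.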

The following 1-liner uses awk to turn each field of the csv file into a regex, one per line, which grep reads via the -f option. Note that a multi-character RS is treated as a regular expression by gawk and mawk, but POSIX leaves that behaviour unspecified. We then use Bash's process substitution syntax <( ) to treat the output of the awk command as a file (named pipe).

$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt

Input

$ cat sample.csv
000001,000003,000005,000007,000008,000009,000011,000023,000029

$ cat title.txt
titl e1 000001  description1
title2  000003  descr iption2
ti tle3 000021  des cripti on3
title4  000023  description4

Output

$ grep -w -f <(awk -v 'RS=,|\n' '{print "\t"$0"\t"}' sample.csv) title.txt
titl e1 000001  description1
title2  000003  descr iption2
title4  000023  description4

Note that the line containing 000021 did not match. Also, although it is not apparent here, each 6-digit number in title.txt is surrounded by tabs, not spaces.
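Because the ids are tab-delimited fields, an exact field comparison is another option that avoids regex matching and process substitution altogether. The sketch below is an alternative technique, not the method from the answer above; it assumes the id is always the second tab-separated column (as in the sample), and the names `parts`, `ids` and `result` are hypothetical.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Recreate the sample input with real tab characters.
printf '000001,000003,000005,000007,000008,000009,000011,000023,000029\n' > sample.csv
printf 'titl e1\t000001\tdescription1\n'   >  title.txt
printf 'title2\t000003\tdescr iption2\n'   >> title.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> title.txt
printf 'title4\t000023\tdescription4\n'    >> title.txt

# First file (NR == FNR): split the csv line on commas and remember each id.
# Second file: print lines whose second tab-separated field is a known id.
result=$(awk -F'\t' '
    NR == FNR {
        n = split($0, parts, ",")
        for (i = 1; i <= n; i++) ids[parts[i]] = 1
        next
    }
    $2 in ids
' sample.csv title.txt)
printf '%s\n' "$result"
```

This selects the same three lines as the grep command above for this sample, and it cannot be fooled by an id that appears inside a title or description.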
