GREP in part of line using keywords file

I need to check for multiple phrases in txt files and, if a file contains one of them in a particular line, remove that line from the txt file.

Using inverse grep with a file containing the phrases that need to be removed works like a charm.

THE PROBLEM is that I need to search only part of each line, rather than the whole line.

I need to check only the part of the line up to the 10th comma character. If grep finds a phrase after that point, I want to keep the line; if grep matches before that point, I want to remove the line.

I can't figure out how I could use a regex alongside the phrases file. Any suggestions welcome.

#!/bin/bash

shopt -s globstar

for f in /uploads/txt/original/**/*.txt ; do

  grep -i -v -w -f phrase.txt "$f" > tmp
  mv tmp "$f"

done

echo "Finished!"

EDIT

# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++){
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}

Does this mean that every column is run against PATS?

Would it be possible to merge the 10 columns into one string and then check it against all patterns, instead of checking 10 columns against all patterns?

Input data /tmp/test

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
foo,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, BAR,  Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, FOO,   Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, BAR,   Val11, Val12

Phrases /tmp/phrases

FOO
BAR

Awk Script with comments

#!/usr/bin/gawk -f

BEGIN {
    FS         = " *, *" # field separator regex: a comma with optional surrounding spaces
    IGNORECASE = 1       # gawk-specific: case-insensitive regex matching

    # Read the phrases file into the keys of an array.
    # Prepend '^' and append '$' so a phrase must match a whole field.
    # The '> 0' test ends the loop at end of file and also on a read error.
    while ((getline a < "/tmp/phrases") > 0) PATS["^" a "$"]
    close("/tmp/phrases")
}

# Rule to set the flag that decides whether the line is printed
{
    ok = 1
    # loop up to the tenth column, stopping at the first match
    for (i = 1; i <= 10 && ok; i++) {
        # match the field against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
                break
            }
        }
    }
}

# Rule to actually print the line if the flag is set
ok {print}

# Debugging rule. Remove it for actual code.
END { for (p in PATS) print p }

# One-liner
#  gawk 'BEGIN{FS=" *, *";IGNORECASE=1;while((getline a < "/tmp/phrases")>0)PATS["^"a"$"]}{ok=1;for(i=1;i<=10;i++){for(p in PATS){if($i ~ p){ok=0}}}} ok {print}' /tmp/test

Output:

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12
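
On the follow-up question about merging the first ten columns into one string: the sketch below shows one way that might look. It is not the script from the answer but a variation under stated assumptions: it uses plain awk with tolower() in place of gawk's IGNORECASE, it treats each phrase as a literal string rather than a regex, and filter_prefix is a hypothetical helper name. The first ten fields are trimmed, lower-cased and joined with commas, so each phrase needs only one substring lookup per line.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer):
#   filter_prefix PHRASES_FILE DATA_FILE
# Prints the lines of DATA_FILE whose first ten comma-separated fields
# contain none of the phrases (one per line) in PHRASES_FILE.
filter_prefix() {
    awk -F',' '
        # First file: store each phrase lower-cased and wrapped in commas,
        # so that it can only match a whole field of the joined prefix.
        FILENAME == ARGV[1] {
            if ($0 != "") pats["," tolower($0) ","]
            next
        }
        {
            # Join the first ten fields (trimmed, lower-cased) into one string.
            n = (NF < 10) ? NF : 10
            prefix = ","
            for (i = 1; i <= n; i++) {
                field = $i
                gsub(/^[ \t]+|[ \t]+$/, "", field)
                prefix = prefix tolower(field) ","
            }
            # One literal substring lookup per phrase instead of one per field.
            for (p in pats)
                if (index(prefix, p)) next
            print
        }
    ' "$1" "$2"
}
```

Called as filter_prefix /tmp/phrases /tmp/test, this should print the same five lines as the output above. Because index() does a literal comparison, a phrase containing regex characters such as a dot is matched as written, which may or may not be what is wanted.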

Credit goes to this answer: https://stackoverflow.com/a/14471194/2032943
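
To apply the same per-field check to every file, as the loop in the question does, something like the following could work. It is only a sketch under assumptions: filter_tree is a hypothetical name, the phrases file and the root directory are passed as arguments, the awk part is a portable variation (tolower() and exact string comparison instead of IGNORECASE and regexes), and bash 4 or later is assumed for globstar.

```shell
#!/bin/bash
shopt -s globstar nullglob   # ** recurses; unmatched globs expand to nothing

# Hypothetical helper (not from the original answer):
#   filter_tree PHRASES_FILE ROOT_DIR
# Rewrites every .txt file under ROOT_DIR, dropping the lines in which one
# of the first ten fields equals a phrase, ignoring case.
filter_tree() {
    local phrases=$1 root=$2 f tmp
    for f in "$root"/**/*.txt; do
        tmp=$(mktemp) || return 1
        if awk -F' *, *' '
            # First file: remember each phrase in lower case.
            FILENAME == ARGV[1] { if ($0 != "") pats[tolower($0)]; next }
            {
                # Drop the line if any of the first ten fields is a phrase.
                n = (NF < 10) ? NF : 10
                for (i = 1; i <= n; i++)
                    if (tolower($i) in pats) next
                print
            }
        ' "$phrases" "$f" > "$tmp"; then
            # Copy back instead of mv so the file keeps its permissions.
            cat "$tmp" > "$f"
        else
            echo "skipped $f: awk failed" >&2
        fi
        rm -f -- "$tmp"
    done
}
```

With the paths from the question, the call would be filter_tree phrase.txt /uploads/txt/original. A file is overwritten only when awk exits successfully, so a failed run leaves the original untouched.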
