Extracting partially repeating patterns in lines of text file

Given a text file of the form:

firstword<number1>,<string1>:<number2>,<string2>:<number3>,<string3>:<number4>...
firstword<number1>,<string1>:<number2>,<string2>:<number3>,<string3>:<number4>...
firstword<number1>,<string1>:<number2>,<string2>:<number3>,<string3>:<number4>...
...

where each line can differ from the others and can have any number of string:number pairs. "firstword" is always the same. The contents of the strings and numbers vary, e.g. a number could be "12345" and a string could be "abc" (without the quotes).

In addition, a line can contain the same string multiple times (how many times is unknown and differs per line), each with a different associated number. For example:

firstword123,abc:123,cde:234,abc:345,def:456

If one now wants to extract only the first word and number (in this case firstword123), as well as all string:number pairs in a line for a specific string, how can one do this? In the above example, if one chooses the value "abc" for the string, then the extracted line should look like:

firstword123,abc:123,abc:345

I am looking for a solution which works with Bash (and possibly other commands).

You can use Perl for this:

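#!/usr/bin/perl
use strict;
use warnings;

my $first = 'firstword123';   # leading word and number to look for
my $str   = 'abc';            # string whose pairs should be kept

while (<DATA>) {
    next unless /^\Q$first\E(?:,|$)/;
    # collect every "$str:<digits>" pair that starts right after a comma
    print join(',', $first, /(?<=,)\Q$str\E:\d+/g), "\n";
}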

__DATA__
firstword123,abc:123,cde:234,abc:345,def:456

Output:

firstword123,abc:123,abc:345

Not a one-liner, but an all-Bash solution. If you need faster code, we can write something in awk or Perl.

$: cat keyscan
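#!/usr/bin/env bash
# usage: keyscan KEY < file

key="$1"
while IFS= read -r line
do start=${line%%,*}            # first word and number: everything before the first comma
   line=${line#"$start"}
   line=${line#,}
   while [[ -n "$line" ]]
   do case "$line" in
      "$key":[0-9]*) lead="${line%%,*}"       # current pair is for the requested key: keep it
                     start="$start,$lead"
                     line="${line#"$lead"}"
                     line="${line#,}"  ;;
                *,*) line="${line#*,}" ;;     # some other pair: drop it and move on
                  *) line='' ;;               # last pair and not a match: done
      esac
   done
   printf '%s\n' "$start"
done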

$: cat data
firstword123,abc:123,cde:234,abc:345,def:456

$: ./keyscan abc < data
firstword123,abc:123,abc:345

$: ./keyscan def < data
firstword123,def:456

$: ./keyscan cde < data
firstword123,cde:234

It will not be fast because it runs an inner processing loop for every line of input, but it works on the sample line of data you gave.
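
For the faster awk variant mentioned above, one possible sketch is shown below. It splits each line on commas and keeps the fields that consist of the key, a colon, and digits. The function name keyscan_awk is invented for this example, and the sketch assumes that numbers contain digits only and that the key contains no backslashes (awk processes escape sequences in -v assignments).

```shell
#!/usr/bin/env bash
# keyscan_awk KEY [FILE...]   (hypothetical name, not part of the answer)
keyscan_awk() {
    local key=$1
    shift
    awk -F, -v key="$key" '
    {
        out = $1                       # first word and number
        prefix = key ":"
        for (i = 2; i <= NF; i++) {
            # keep the field only if it is exactly: key, a colon, then digits
            if (index($i, prefix) == 1 && substr($i, length(prefix) + 1) ~ /^[0-9]+$/)
                out = out "," $i
        }
        print out
    }' "$@"
}

# Demo on the sample line from the question
keyscan_awk abc <<'EOF'
firstword123,abc:123,cde:234,abc:345,def:456
EOF
```

The key is compared with index() rather than interpolated into a regular expression, so characters such as "." or "*" in the key are treated literally. Called as `keyscan_awk abc < data`, it is used the same way as ./keyscan.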
