sed to get string between two patterns

I am working on a LaTeX file from which I need to pick out the references marked by \citep{}. This is what I am doing using sed:

    cat file.tex | grep citep | sed 's/.*citep{\(.*\)}.*/\1/g'

This works if there is only one \citep on a line. If a line contains more than one \citep, it fails. It also fails when there is only one \citep but more than one closing brace }. What should I do so that it works for every \citep on a line and stops at the closing brace that belongs to it?

I am working in bash. Part of the file looks like this:

of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and 
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic 
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}. 
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple 
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction 
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh 
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography, 
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study 
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of 

For one line, I get output like this:

    BilhamE01, TapponnierM76} and by distributed seismicity across the region (Fig. \ref{fig1_2

whereas I am looking for

    BilhamE01, TapponnierM76

Another example with more than one \citep pattern gives output like this:

    Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study

whereas I am looking for

    Pauletal2015 Mitraetal2005

Can anyone please help?

It's a greedy match. Change the regex so that the capture stops at the first closing brace:

.*citep{\([^}]*\)}

Test:

$ echo "\citep{string} xyz {abc}" |  sed 's/.*citep{\([^}]*\)}.*/\1/'
string

Note that this extracts only one instance per line (the last one, because the leading .* is greedy).
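To see the difference on a line with several braces, the two expressions can be run side by side. This is only a sketch: the function names greedy_extract and first_brace_extract and the sample line are made up for illustration, and a sed that accepts these basic regular expressions (such as GNU sed) is assumed.

```shell
# Original expression: \(.*\) is greedy and runs to the LAST } on the line.
greedy_extract() {
    sed 's/.*citep{\(.*\)}.*/\1/'
}

# Fixed expression: [^}]* cannot cross a closing brace, so the capture
# ends at the first } after citep{.
first_brace_extract() {
    sed 's/.*citep{\([^}]*\)}.*/\1/'
}

line='\citep{Pauletal2015} and \citep{Mitraetal2005} for {x}'
printf '%s\n' "$line" | greedy_extract
printf '%s\n' "$line" | first_brace_extract
```

On this sample line the first command prints `Mitraetal2005} for {x` and the second prints `Mitraetal2005`. Both report only the last \citep, since the leading .* swallows everything before it.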

If you are using grep anyway, you may as well stick with it (assuming GNU grep):

$ echo "$str" | grep -oP '(?<=\\citep{)[^}]+(?=})'
BilhamE01, TapponierM76
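Building on the grep approach, a pipeline along the following lines could list each cited key once, including keys from \citet{...} as in the asker's second desired output. Treat it as a sketch: the function name cite_keys is hypothetical, and it assumes GNU grep with PCRE support (-P) and citations that do not span lines.

```shell
# List every key cited with \citep{...} or \citet{...}, one per line, deduplicated.
# Reads the files given as arguments, or standard input if there are none.
#   grep -ohP : print only the matched text, without file names; \K drops the
#               \citep{ or \citet{ prefix from what is printed
#   tr        : split comma-separated keys onto separate lines
#   sed       : trim blanks around each key
#   sort -u   : sort the keys and remove duplicates
cite_keys() {
    grep -ohP '\\cite[pt]\{\K[^}]+' "$@" |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        sort -u
}
```

It would be called as `cite_keys file.tex`. The output is sorted rather than in order of appearance, which is the price of using sort -u for deduplication.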

For what it's worth, this can be done with sed:

echo "\citep{string} xyz {abc} \citep{string2},foo" | \
  sed 's/\\citep{\([^}]*\)}/\n\1\n\n/g; s/^[^\n]*\n//; s/\n\n[^\n]*\n/, /g; s/\n.*//g'

Output:

string, string2

But wow, is that ugly. The sed script is more easily understood in this form, which happens to be suitable for feeding to sed via a -f argument:

# change every \citep{string} to <newline>string<newline><newline>
s/\\citep{\([^}]*\)}/\n\1\n\n/g

# remove any leading text before the first wanted string
s/^[^\n]*\n//

# replace text between wanted strings with comma + space
s/\n\n[^\n]*\n/, /g

# remove any trailing unwanted text
s/\n.*//

This makes use of the fact that sed can match and substitute the newline character, even though reading a new line of input does not put a newline into the pattern space. The newline is the one character that we can be certain will appear in the pattern space (or in the hold space) only if the script puts it there intentionally.

The initial substitution is purely to make the problem manageable by simplifying the target delimiters. In principle, the remaining steps could be performed without that simplification, but the regular expressions involved would be horrendous.

This does assume that the string in every \citep{string} contains at least one character; if the empty string must be accommodated too, then this approach needs a bit more refinement.

Of course, I can't imagine why anyone would prefer this to @Lev's straight grep approach, but the question does ask specifically for a sed solution.
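One practical wrinkle: run on a whole file, the four substitutions leave any line without a \citep{...} untouched, so such lines are printed as they are. A possible way around that, sketched here under the assumption of GNU sed (for \n in the replacement and inside brackets), is to guard the script with an address and print explicitly. The function name extract_citep is made up for this example.

```shell
# Print the \citep keys of each line that has any, as a comma-separated list.
# -n suppresses automatic printing; the /\\citep{/ address restricts the four
# substitutions (unchanged from the script above) to lines that contain a
# citation, and p prints the result for those lines only.
extract_citep() {
    sed -n '/\\citep{/{
        s/\\citep{\([^}]*\)}/\n\1\n\n/g
        s/^[^\n]*\n//
        s/\n\n[^\n]*\n/, /g
        s/\n.*//
        p
    }' "$@"
}
```

With this wrapper the separate grep citep step from the question is no longer needed, for example `extract_citep file.tex`.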

f.awk

BEGIN {
    pat = "\\citep"
    latex_tok = "\\\\[A-Za-z_][A-Za-z_]*" # match \aBcD
}

{
    f = f $0 # store the content of the input file as one string
}

function store(args,   n, k, i) { # store `keys' in `d'
    gsub("[ \t]", "", args) # remove spaces
    n = split(args, keys, ",")
    for (i=1; i<=n; i++) {
      k = keys[i]
      d[k]
    }
}

function ntok() { # next token
    if (match(f, latex_tok)) {
      tok = substr(f, RSTART          ,RLENGTH)
      f   = substr(f, RSTART+RLENGTH) # continue right after the token
      return 1
    }
    return 0
}

function parse(    i, rc, args) {
    for (;;) { # infinite loop
      while ( (rc = ntok()) && tok != pat ) ;
      if (!rc) return

      i = index(f, "{")
      if (!i) return # see `pat' but no '{'
      f = substr(f, i+1)

      i = index(f, "}")
      if (!i) return # no closing '}'

      # extract `args' from \citep{`args'}
      args = substr(f, 1, i-1)
      store(args)
    }
}

END {
    parse()
    for (k in d)
      print k
}

f.example

of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and 
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic 
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}. 
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple 
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction 
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh 
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography, 
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study 
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of

Usage:

awk -f f.awk f.example

Expected output (awk does not define the iteration order of for (k in d), so the order may differ):

BendickB2001
Arunetal2010
Pauletal2015
Mitraetal2005
BilhamE01
Mukuletal2009
TapponnierM76
WangLiu2009
BhattacharyaM2009
Mitraetal2010
Actonetal2011
Vernantetal2014
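If the order of first appearance matters, a shorter awk variant might print each key as soon as it is first seen instead of collecting everything into an array. This is a sketch rather than a replacement for f.awk: the function name citep_keys_in_order is hypothetical, it works line by line (so a \citep{...} split across two lines is missed), and like f.awk it looks at \citep only.

```shell
# Print each \citep key once, in order of first appearance.
citep_keys_in_order() {
    awk '
    {
        line = $0
        # Consume one \citep{...} at a time from the current line.
        while (match(line, /\\citep[{][^}]*[}]/)) {
            # Skip the 7 characters of \citep{ and drop the closing brace.
            args = substr(line, RSTART + 7, RLENGTH - 8)
            line = substr(line, RSTART + RLENGTH)
            n = split(args, keys, ",")
            for (i = 1; i <= n; i++) {
                k = keys[i]
                gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim blanks around the key
                if (k != "" && !(k in seen)) {
                    seen[k] = 1
                    print k
                }
            }
        }
    }' "$@"
}
```

It would be run as `citep_keys_in_order f.example`.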
