Removing multi-line texts using regex with sed

I have the following sample text file containing all my references, which I use for citations in another program (LaTeX). I want to remove the "abstract" field and its contents to reduce the file size and make the content more relevant.

The sample text is given below:

    doi = {10.3389/fsufs.2021.575056},
    abstract = {Agriculture has come under pressure to meet global food demands, whilst having to meet economic and ecological targets. This has opened newer avenues for investigation in unconventional protein sources. Current agricultural practises manage marginal lands mostly through animal husbandry, which; although effective in land utilisation for food production, largely contributes to global green-house gas (GHG) emissions. Assessing the revalorisation potential of invasive plant species growing on these lands may help encourage their utilisation as an alternate protein source and partially shift the burden from livestock production; the current dominant source of dietary protein, and offer alternate means of income from such lands. Six globally recognised invasive plant species found extensively on marginal lands; Gorse (
              Ulex europaeus
              ), Vetch (
              Vicia sativa
              ), Broom (
              Cytisus scoparius
              ), Fireweed (
              Chamaenerion angustifolium
              ), Bracken (
              Pteridium aquilinum
              ), and Buddleia (
              Buddleja davidii
              ) were collected and characterised to assess their potential as alternate protein sources. Amino acid profiling revealed appreciable levels of essential amino acids totalling 33.05 ± 0.04 41.43 ± 0.05, 33.05 ± 0.11, 32.63 ± 0.04, 48.71 ± 0.02 and 21.48 ± 0.05 mg/g dry plant mass for Gorse, Vetch, Broom Fireweed, Bracken, and Buddleia, respectively. The availability of essential amino acids was limited by protein solubility, and Gorse was found to have the highest soluble protein content. It was also high in bioactive phenolic compounds including cinnamic- phenyl-, pyruvic-, and benzoic acid derivatives. Databases generated using satellite imagery were used to locate the spread of invasive plants. Total biomass was estimated to be roughly 52 Tg with a protein content of 5.2 Tg with a total essential amino acid content of 1.25 Tg ({\textasciitilde}24\%). Globally, Fabaceae was the second most abundant family of invasive plants. Much of the spread was found within marginal lands and shrublands. Analysis of intrinsic agricultural factors revealed economic status as the emergent factor, driven predominantly by land use allocation, with shrublands playing a pivotal role in the model. Diverting resources from invasive plant removal through herbicides and burning to leaf protein extraction may contribute toward sustainable protein, effective land use, and achieving emission targets, while simultaneously maintaining conservation of native plant species.},

    doi = {10.1186/s12864-016-3367-x},
    abstract = {Background: Propionibacterium freudenreichii is an Actinobacterium widely used in the dairy industry as a ripening culture for Swiss-type cheeses, for vitamin B12 production and some strains display probiotic properties. It is reportedly a hardy bacterium, able to survive the cheese-making process and digestive stresses.
Results: During this study, P. freudenreichii CIRM-BIA 138 (alias ITG P9), which has a generation time of five hours in Yeast Extract Lactate medium at 30 °C under microaerophilic conditions, was incubated for 11 days (9 days after entry into stationary phase) in a culture medium, without any adjunct during the incubation. The carbon and free amino acids sources available in the medium, and the organic acids produced by the strain, were monitored throughout growth and survival. Although lactate (the preferred carbon source for P. freudenreichii) was exhausted three days after inoculation, the strain sustained a high population level of 9.3 log10 CFU/mL. Its physiological adaptation was investigated by RNA-seq analysis and revealed a complete disruption of metabolism at the entry into stationary phase as compared to exponential phase.
Conclusions: P. freudenreichii adapts its metabolism during entry into stationary phase by down-regulating oxidative phosphorylation, glycolysis, and the Wood-Werkman cycle by exploiting new nitrogen (glutamate, glycine, alanine) sources, by down-regulating the transcription, translation and secretion of protein. Utilization of polyphosphates was suggested.},
    language = {en},

I want to prune out the abstract and all its contents, so the corresponding output should look like this:

doi = {10.3389/fsufs.2021.575056},

doi = {10.1186/s12864-016-3367-x},
language = {en},

I am trying to achieve this using the following sed command:

    sed 's/\s*abstract.*(\n*.*)*.*[$}]// gm' Test.txt

But it does not seem to work. I have checked the expression with online tools such as https://regex101.com/ , where it seems to select the relevant text, but when I run it on my laptop it does not work properly.

I am running this on a Lenovo ThinkPad with MX Linux.

With GNU awk, you could try the following code (written and tested in GNU awk). It sets the RS variable to a regex that matches the doi and language lines, then prints the matched text through RT, which gives the output the OP asked for.

awk -v RS='(^[[:space:]]*|\n[[:space:]]*)doi = {[^}]*},|[[:space:]]+language = {en},' '
RT{ print RT }
' Input_file

Here is the online demo for the above code (note: the online regex uses a non-capturing group, which awk does not support; it appears there only to aid understanding).

This might work for you (GNU sed):

 sed -n '/abstract = {/{:a;/},$/b;n;ba};p' file

Turn off implicit printing with -n.

If a line contains abstract = { then, as long as the current line does not end in }, replace it with the next line. Once a line does end in }, branch to the end of the script, which effectively deletes it.

Otherwise, print the line.
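The loop described above can be tried on a small file. The following is only a sketch: the sample entries and the DOIs in them are made up for illustration, and the one-liner is written out across several lines so that each step can carry a comment.

```shell
#!/bin/sh
# Hypothetical sample standing in for the real bibliography file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
    doi = {10.1000/example-a},
    abstract = {First line (
      middle
      ) last line.},

    doi = {10.1000/example-b},
    abstract = {One line only.},
    language = {en},
EOF

result=$(sed -n '
# On a line that opens an abstract field, enter the skip loop.
/abstract = {/{
  :a
  # A line ending in "}," closes the field: branch to the end, printing nothing.
  /},$/b
  # Otherwise load the next line without printing and test again.
  n
  ba
}
# Any line outside an abstract field is printed.
p
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Note that the loop relies on the abstract's closing line being the first one that ends in }, so an abstract with such a line in the middle would be cut short.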

Using GNU sed

$ sed -Ez 's/abstract =[^}]*}([^}]*\.})?,\n +?//g' input_file
    doi = {10.3389/fsufs.2021.575056},

    doi = {10.1186/s12864-016-3367-x},
    language = {en},

With extended regular expressions enabled by -E and records separated by NUL characters instead of newlines by -z (so the whole file is treated as one record and \n can be matched), you can then find the match starting from abstract =

  • [^}]*} - Match up to the next occurrence of } and include the curly brace.
  • ([^}]*\.})? - An optional group: as above, match up to the next curly brace, but this time require a full stop immediately before the curly brace.
  • ,\n - Include the trailing comma and the newline in the match to be removed.
  • +? - Another optional condition: if there are one or more spaces after the newline, remove them as well.

The g flag at the end repeats the removal for every match found.
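The role of -z can be checked in isolation. This is a minimal sketch assuming GNU sed and a made-up two-line input: by default sed loads one line at a time, so a pattern containing \n has nothing to match, whereas with -z the whole input is one record.

```shell
#!/bin/sh
# Default mode: each line is processed separately, so a\nb never matches
# and the input passes through unchanged.
without_z=$(printf 'a\nb\n' | sed 's/a\nb/X/')

# With -z the input is split on NUL characters only, so both lines sit in
# the pattern space together and the newline can be matched.
with_z=$(printf 'a\nb\n' | sed -z 's/a\nb/X/')

printf 'without -z: %s\n' "$without_z"
printf 'with -z:    %s\n' "$with_z"
```

The same reasoning would explain why an expression that matches across lines in an online regex tester can fail to match when sed runs it line by line.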
