
Extract lines matching result from text file

I need to extract the filenames from a text file for the entries whose output lists no fonts.

As you can see from the output file below, I need to print the results where no fonts are listed after the header lines. In this output only the last result has fonts.

Does this make sense? Would grep, sed or awk be the answer?

So I need output from the text file below showing which PDFs have no fonts listed between the START and END markers:

******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp1.pdf
******************END***********************
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp2.pdf
******************END***********************
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
BAAAAA+TimesNewRomanPS-BoldMT        TrueType          yes yes yes     14  0
CAAAAA+TimesNewRomanPSMT             TrueType          yes yes yes      9  0
/home/user3/Documents/temp file.pdf
******************END***********************

This prints any line containing ".pdf" if the previous line starts with -.

[me@home]$ awk '{if (st && match($0,/\.pdf/)){print $0}; st=match($0,/^-/)}' in.txt
/home/user1/Documents/temp1.pdf
/home/user1/Documents/temp2.pdf

It is not a generic solution, but it will work with the input data you've given. I can imagine several edge cases where it might fail, but that all comes down to the specification of your input file.
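If the input always follows the START/END layout shown in the question, a slightly more defensive sketch is to track each block explicitly and count the font rows instead of relying on the previous line. The file name sample.txt and the variable names fonts, in_table and path are assumptions made for this illustration; the sample is a shortened copy of the question's data.

```shell
# Recreate a shortened copy of the sample input (sample.txt is an assumed name).
cat > sample.txt <<'EOF'
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp1.pdf
******************END***********************
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
BAAAAA+TimesNewRomanPS-BoldMT        TrueType          yes yes yes     14  0
/home/user3/Documents/temp file.pdf
******************END***********************
EOF

no_fonts=$(awk '
    /^\*+START\*+$/ { fonts = 0; in_table = 0; path = ""; next }  # new block: reset state
    /^-+ /          { in_table = 1; next }                          # dashed rule: font rows may follow
    /^\*+END\*+$/   { if (path != "" && fonts == 0) print path; next }
    /^\//           { path = $0; next }                             # absolute path of the PDF
    in_table        { fonts++ }                                     # anything else is a font row
' sample.txt)

printf '%s\n' "$no_fonts"
```

This still assumes that every path is absolute (starts with /) and that no font name starts with / or -, so it is no more general than the input format allows.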


Update

(Based on the script you've posted in the comments below.) If all you're trying to do is identify PDF files that have no embedded fonts, this might work:

MAGNUM="/mnt/network/User 1 PDF 06.12.11/"
has_no_fonts() {
    # pdffonts prints 2 header lines, then one line per font
    COUNT=$(pdffonts "$1" 2> /dev/null | wc -l)
    exit $(( $COUNT - 2 ))
}
export -f has_no_fonts
find "$MAGNUM" -type f -name "*.pdf" -exec bash -c 'has_no_fonts "{}"' \; -print

Here's a breakdown of the script:

  • Detecting embedded font count. It would have been simple if pdffonts returned a specific value when no fonts are embedded, but it does not. We therefore count the number of output lines and subtract 2 (the header lines) to get the number of embedded fonts.

     COUNT=$(pdffonts "$1" 2> /dev/null | wc -l)  # number of output lines
                                                  #   exactly 2 if no fonts
                                                  #   exactly 0 if there are errors
     exit $(( $COUNT - 2 ))  # exit 0 (success) if and only if PDF has no fonts

  • The bash function is exported so it can be used in a subshell.

     export -f has_no_fonts

  • Locate PDF files and print a name only if the PDF is valid and has no fonts.

     find ..... -exec bash -c 'has_no_fonts "{}"' \; -print
                -----                                ------
                  |                                     |
     -exec cannot run bash functions,      Will only print the filename if
     so run in a bash subshell             the previous command exits with 0
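To see the line-count arithmetic in isolation, here is a small sketch that can be run without pdffonts installed. fake_pdffonts is a hypothetical stand-in that only mimics the shape of the output (two header lines, then one row per font), and no_fonts_status is an illustrative name; it uses return instead of exit so the calling shell survives.

```shell
# Hypothetical stand-in for pdffonts: two header lines, then one row
# per font name passed as an argument.
fake_pdffonts() {
    printf 'name type emb sub uni object ID\n'
    printf -- '---- ---- --- --- --- ---------\n'
    for font in "$@"; do
        printf '%s TrueType yes yes yes 9 0\n' "$font"
    done
}

# Same arithmetic as has_no_fonts: status 0 only when nothing follows the header.
no_fonts_status() {
    count=$(fake_pdffonts "$@" | wc -l)
    return $(( count - 2 ))
}

status_empty=0
no_fonts_status || status_empty=$?               # no font rows

status_two=0
no_fonts_status FontA FontB || status_two=$?     # two font rows

echo "no fonts: $status_empty, two fonts: $status_two"
```

With the real pdffonts, a failed run produces no output on stdout, so the count is 0 and the status is again non-zero, which is why unreadable files are not printed either.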

If you prefer a one-liner, the whole script can be written as:

find "$MAGNUM" -name "*.pdf" \
    -exec bash -c 'exit $(($(pdffonts "{}" 2> /dev/null |wc -l) - 2))' \; -print
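One possible hardening of the one-liner is to pass the filename as a positional argument rather than embedding {} in the command string, so that names containing quotes or $ are not re-parsed by the shell. The sketch below is self-contained only because it installs a hypothetical stub named pdffonts in a temporary directory; the stub, the directory layout and the file names are all assumptions for the demonstration, and with the real tool only the find command is needed.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/bin" "$workdir/pdfs"

# Hypothetical stub standing in for pdffonts: prints two header lines,
# then treats every line of the given file as one font row.
cat > "$workdir/bin/pdffonts" <<'EOF'
#!/bin/sh
printf 'name type\n'
printf -- '---- ----\n'
cat "$1"
EOF
chmod +x "$workdir/bin/pdffonts"
PATH="$workdir/bin:$PATH"

# Two fake "PDFs": one with no font rows, one with a single font row.
: > "$workdir/pdfs/no fonts.pdf"
printf 'FontA TrueType\n' > "$workdir/pdfs/with fonts.pdf"

# The filename arrives in the inner shell as $1 ("_" fills $0).
found=$(find "$workdir/pdfs" -type f -name '*.pdf' \
    -exec sh -c 'exit $(( $(pdffonts "$1" 2> /dev/null | wc -l) - 2 ))' _ {} \; -print)

printf '%s\n' "$found"
rm -rf "$workdir"
```

Only the file without font rows is printed, because -print is evaluated only when the -exec command exits with status 0.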

This might work for you:

sed -n '/^\*/,//{H;/\*END\*/{x;s/\n/&/6;t;s|[^/]*\([^\n]*\).*|\1|p}}' in.txt
/home/user1/Documents/temp1.pdf
/home/user1/Documents/temp2.pdf

Explanation:

  1. Focus on the lines between lines beginning with *.
  2. Store those lines in the hold space (HS).
  3. When we reach the closing delimiter, swap to the HS.
  4. Check for 6 or more newlines, i.e. entries that must have fonts, and if found, bail out.
  5. Delete all non-essential text and print what remains.
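The steps above can be sketched as a multi-line script with one comment per step. This is an expanded variant rather than the exact command: it assumes GNU sed (for \n inside a bracket expression), spells out the START and END patterns instead of reusing the empty regex, and works on a shortened copy of the question's data saved under the assumed name sample.txt.

```shell
# Shortened copy of the sample input (sample.txt is an assumed name).
cat > sample.txt <<'EOF'
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp1.pdf
******************END***********************
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
BAAAAA+TimesNewRomanPS-BoldMT        TrueType          yes yes yes     14  0
/home/user3/Documents/temp file.pdf
******************END***********************
EOF

result=$(sed -n '
# step 1: only act on lines from a START marker to the next END marker
/^\*.*START/,/^\*.*END/ {
    # step 2: append every line of the block to the hold space
    H
    /^\*.*END/ {
        # step 3: block complete, swap the collected lines into the pattern space
        x
        # step 4: a 6th newline exists only when font rows are present;
        # if so, the substitution succeeds and t skips the rest
        s/\n/&/6
        t
        # step 5: drop everything except the path line and print it
        s|[^/]*\([^\n]*\).*|\1|p
    }
}
' sample.txt)

printf '%s\n' "$result"
```

After the swap in step 3 the hold space contains only the END line, which is why every block after the first still yields exactly 5 newlines when no fonts are listed.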

Or at a pinch:

sed -n '/^\*/,//{H;/\*END\*/{x;s|[^/]*-\n\(/[^\n]*\).*|\1|p}}' in.txt
