Regex query: how can I search PDFs for a phrase where words in that phrase appear on more than one line?

I am trying to set up an index page for the weekly magazine I work on. It is to show readers the names of companies mentioned in that week's issue, plus the page numbers they appear on.

I want to search all the PDF files for the week, where one PDF = one magazine page (originally made in Adobe InDesign CS3 and Adobe InCopy CS3).

I have set up a list of companies I want to search for and, using PowerGREP with delimited regular expressions, I am able to find most page numbers where a company is mentioned. However, where a company name contains two or more words, the search I am running will not pick up instances where the name is split across more than one line.

For example, when looking for "CB Richard Ellis" and "Cushman & Wakefield", I got no result when the text appeared like this:

DTZ beat BNP PRE, CB [line break here]

Richard Ellis and Cushman & [line break here]

Wakefield to secure the contract. [line end here]

Could someone advise me on how to write a regular expression that will ignore white space between words and ignore line endings, OR one that will look for the words including all types of white space (i.e. uneven spaces between words; spaces at the end of lines or line endings; and tabs)? I am guessing that this info is embedded somehow in PDF files.

Here is a sample of the set of terms I have asked PowerGREP to search for:

\bCB Richard Ellis\b
\bCB Richard Ellis Hotels\b
\bCentaur Services\b
\bChapman Herbert\b
\bCharities Property Fund\b
\bChetwoods Architects\b
\bChurch Commissioners\b
\bClive Emson\b
\bClothworkers’ Company\b
\bColliers CRE\b
\bCombined English Stores Group\b
\bCommercial Estates Group\b
\bConnells\b
\bCooke & Powell\b 
\bCordea Savills\b
\bCrown Estate\b
\bCushman & Wakefield\b
\bCWM Retail Property Advisors\b

[Note that there is a delimited hard return between each \b at the end of each phrase and the beginning of the next phrase.]

By the way, I am a production journalist and not usually involved in finding IT-type solutions, and I am finding it difficult to get to grips with the technical language on the PowerGREP site.

Thanks for assistance

Alison

You have hard-coded spaces in your names. Replace them with \s+ and you should be OK.

E.g.:

CB\s+Richard\s+Ellis

What's happening is that when you have a forced line break, the space (" ") character is no longer there. Instead there is \n or \r\n. Using \s+ means that you are looking for one or more whitespace characters of any kind, including carriage returns and line feeds.
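As an illustration only, the same substitution can be applied to a whole list of names programmatically. The sketch below uses Python's re module rather than PowerGREP, and it assumes the page text has already been extracted from the PDF into a plain string. The helper name phrase_to_pattern and the sample page_text are hypothetical, made up for this example.

```python
import re

def phrase_to_pattern(phrase: str) -> re.Pattern:
    """Compile a regex that matches `phrase` with any run of whitespace between its words."""
    words = phrase.split()
    # re.escape keeps characters such as "&" literal inside the pattern.
    body = r"\s+".join(re.escape(word) for word in words)
    # \b on both ends, as in the original search terms.
    return re.compile(r"\b" + body + r"\b")

# Hypothetical extracted text, with line breaks where the magazine column wraps.
page_text = (
    "DTZ beat BNP PRE, CB\n"
    "Richard Ellis and Cushman &\n"
    "Wakefield to secure the contract.\n"
)

companies = ["CB Richard Ellis", "Cushman & Wakefield", "Crown Estate"]
found = [name for name in companies if phrase_to_pattern(name).search(page_text)]
print(found)
```

With this sample text, the two names that wrap across lines are found and "Crown Estate" is not, since it does not appear on the page.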

The regex for matching whitespace is \s, so it would be

\bCB\s+Richard\s+Ellis\b

(\s+ = match at least one whitespace character.) Line breaks are \n (newline) and \r (carriage return), depending on your OS. Listing them explicitly in a character class using [], as in [\r\n\s], would result in the following, although \s already covers both \r and \n:

\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b
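A quick way to check how the two forms behave is to run both against a few line-break variants. This is a minimal sketch in Python's re module, not PowerGREP; the sample strings are invented for the test.

```python
import re

short_form = re.compile(r"\bCB\s+Richard\s+Ellis\b")
explicit_form = re.compile(r"\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b")

samples = [
    "CB Richard Ellis",        # single spaces
    "CB\nRichard Ellis",       # Unix-style line break
    "CB \r\nRichard\t Ellis",  # trailing space, Windows-style line break, tab
    "CBRichard Ellis",         # no whitespace after CB, so neither form should match
]

# One (short_form matched, explicit_form matched) pair per sample.
results = [
    (bool(short_form.search(text)), bool(explicit_form.search(text)))
    for text in samples
]

for text, (short_hit, explicit_hit) in zip(samples, results):
    print(repr(text), short_hit, explicit_hit)
```

In this engine both patterns agree on every sample, because \s already includes \r and \n; the explicit character class mainly documents the intent.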
