Extracting data from a large string with a regular expression

Question

Consider the following string, which is a table of contents extracted from a PDF. As in the example, two topics can appear on one line, and each line ends with a single line break.

A — N° 1 2 janvier 2013

TABLE OF CONTENT

Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34

I want to extract the section names: 'Topic à one', 'Second Topic', 'Third - one', 'Topic.with.dots', 'One more line' and 'last topic'.

Any insights for a matching regex?

Answer 1

# -*- coding: utf-8 -*-
string = "A — N° 1 2 janvier 2013

TABLE OF CONTENT

Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34"
puts string.scan(/(\p{l}[\p{l} \.-]*)\s+\.+\s+\d+/i).flatten

This does what you want. It also matches single-letter titles.
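To make the pieces of that pattern easier to follow, here is a minimal annotated sketch using Ruby's extended (`/x`) regex syntax. The names `TITLE_RE`, `SAMPLE_TOC` and `extract_titles` are made up for illustration and are not part of the answer; the pattern itself is meant to be equivalent to the one above, minus the `/i` flag, which `\p{L}` does not need.

```ruby
# Illustrative names only: TITLE_RE, SAMPLE_TOC and extract_titles are
# assumptions for this sketch, not part of the original answer.
TITLE_RE = /
  (               # group 1: the section title
    \p{L}         #   first character is a letter of any script
    [\p{L}\ .-]*  #   then letters, spaces, dots or hyphens
  )
  \s+ \.+         # whitespace followed by the dot leader
  \s+ \d+         # whitespace followed by the page number
/x

SAMPLE_TOC = <<~TEXT
  A — N° 1 2 janvier 2013

  TABLE OF CONTENT

  Topic à one ......... 30 Second Topic .......... 33
  Third - one ......... 3 Topic.with.dots .......... 33
  One more line ......................... 27 last topic ...... 34
TEXT

# Returns the captured titles as a flat array of strings.
def extract_titles(text)
  text.scan(TITLE_RE).flatten
end

puts extract_titles(SAMPLE_TOC)
```

The header lines are skipped because they contain no dot leader followed by a number, so only the six titles should be printed.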

Answer 2

The following (not yet optimized) regex works on your example:

(?i)(?=[A-Z])(?:\.[A-Z-]+|[A-Z -]+)+\b

It needs improvements, though: for example, it does not yet match non-ASCII letters, and some possible performance optimizations depend on the exact regex flavor being used.

See it on regex101.

For Ruby 2, I would suggest /(?=\p{L})(?:\.[\p{L}-]++|[\p{L} -]+)+\b/
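As a rough check of the Ruby variant, the sketch below runs it over the table rows only. `ROWS` and `LOOKAHEAD_RE` are illustrative names, and the single backslashes assume the pattern is written as a Ruby regex literal. Since the pattern never looks at the dot leader or the page number, it would presumably also match ordinary words in the header lines, which the second call is meant to show.

```ruby
# ROWS and LOOKAHEAD_RE are illustrative names for this sketch.
ROWS = <<~TEXT
  Topic à one ......... 30 Second Topic .......... 33
  Third - one ......... 3 Topic.with.dots .......... 33
  One more line ......................... 27 last topic ...... 34
TEXT

# Starts at a letter, then consumes runs of letters, spaces and hyphens,
# or a dot that is immediately followed by letters, up to a word boundary.
LOOKAHEAD_RE = /(?=\p{L})(?:\.[\p{L}-]++|[\p{L} -]+)+\b/

# No capture groups, so scan returns the matched strings directly.
puts ROWS.scan(LOOKAHEAD_RE)

# A header line matches as well, because nothing ties a match to a dot leader.
p "TABLE OF CONTENT\n".scan(LOOKAHEAD_RE)
```

If the whole extracted page is scanned, some filtering of the header lines would therefore be needed first.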

Answer 3

string.scan(/(\S.*?)\s+\.{2,}\s+\d+/).flatten
# =>
[
  "Topic one",
  "Second Topic",
  "Third one",
  "Topic.with.dots",
  "One more line",
  "last topic"
]
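The output listed above has no accent or hyphen, which suggests it was produced from a slightly different sample. As a sketch, here is the same pattern applied to the sample exactly as it appears in the question; `QUESTION_SAMPLE`, `LEADER_RE` and `leader_titles` are illustrative names. Because `.` matches any character except a newline, the accent and the hyphen should be kept.

```ruby
# QUESTION_SAMPLE, LEADER_RE and leader_titles are illustrative names.
QUESTION_SAMPLE = <<~TEXT
  A — N° 1 2 janvier 2013

  TABLE OF CONTENT

  Topic à one ......... 30 Second Topic .......... 33
  Third - one ......... 3 Topic.with.dots .......... 33
  One more line ......................... 27 last topic ...... 34
TEXT

# A title is the shortest run starting at a non-space character that is
# followed by whitespace, at least two dots, whitespace and a page number.
LEADER_RE = /(\S.*?)\s+\.{2,}\s+\d+/

def leader_titles(text)
  text.scan(LEADER_RE).flatten
end

puts leader_titles(QUESTION_SAMPLE)
```

Requiring at least two dots (`\.{2,}`) is what keeps the single dots inside `Topic.with.dots` from being mistaken for a leader.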

Answer 4

Similar to @sawa's:

puts text.scan(/([a-zA-Z .]+?) \.\.++ \d+/).flatten.map(&:strip)
# >> Topic one
# >> Second Topic
# >> Third one
# >> Topic.with.dots
# >> One more line
# >> last topic

(I like his pattern better, though.)
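One thing to watch with this pattern is that `[a-zA-Z .]` is ASCII-only, so on the sample as it appears in the question (with `à` and `-`) the affected titles would be cut short. The sketch below compares it with a possible Unicode-aware tweak. `UNICODE_RE` is a suggested adjustment, not the answer's code, and all names here are illustrative.

```ruby
# SAMPLE_ROWS, ASCII_RE, UNICODE_RE and titles_with are illustrative names.
SAMPLE_ROWS = <<~TEXT
  Topic à one ......... 30 Second Topic .......... 33
  Third - one ......... 3 Topic.with.dots .......... 33
  One more line ......................... 27 last topic ...... 34
TEXT

ASCII_RE   = /([a-zA-Z .]+?) \.\.++ \d+/   # pattern from the answer
UNICODE_RE = /([\p{L} .-]+?) \.\.++ \d+/   # assumed tweak: any letter, plus hyphen

def titles_with(pattern, text)
  text.scan(pattern).flatten.map(&:strip)
end

p titles_with(ASCII_RE, SAMPLE_ROWS)
# => ["one", "Second Topic", "one", "Topic.with.dots", "One more line", "last topic"]
p titles_with(UNICODE_RE, SAMPLE_ROWS)
# => ["Topic à one", "Second Topic", "Third - one", "Topic.with.dots", "One more line", "last topic"]
```

The `strip` call matters in both cases, since the character class includes the space and a capture can begin with the separator left over from the previous entry.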

Answer 5

Here is a solution in Perl:

$ cat tmp
Topic one ......... 30 Second Topic .......... 33 Third one ......... 3   Topic.with.dots ..........   33 One more line ......................... 27 last topic ...... 34

$ cat tmp | perl -ne 'while (m/((?:\w|[. ])+?) [.]+ \d+/g) { print "$1\n" }'
Topic one
Second Topic
Third one
 Topic.with.dots
One more line
last topic

A little explanation of what I am doing here: the inner set of parens (?:...) is non-capturing, so it is only for grouping; it groups a word character (\w) or a space or dot ([. ]). Then, since more dots follow, the match is non-greedy (+?), and the whole match goes into $1, which is printed.
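To illustrate why the non-greedy quantifier matters, here is a small Ruby sketch (Ruby rather than Perl, but the regex syntax is the same for this pattern). `LINE`, `GREEDY_RE` and `LAZY_RE` are illustrative names. Since the group accepts word characters, dots and spaces, a greedy `+` can swallow a whole entry, page number included, before backtracking to the last dot leader on the line.

```ruby
# LINE, GREEDY_RE and LAZY_RE are illustrative names for this sketch.
LINE = "Topic one ......... 30 Second Topic .......... 33"

GREEDY_RE = /((?:\w|[. ])+) [.]+ \d+/    # greedy: takes as much as it can
LAZY_RE   = /((?:\w|[. ])+?) [.]+ \d+/   # non-greedy, as in the answer

p LINE.scan(GREEDY_RE).flatten
# => ["Topic one ......... 30 Second Topic"]
p LINE.scan(LAZY_RE).flatten
# => ["Topic one", " Second Topic"]
```

Note the leading space on the second non-greedy capture: the group allows spaces, so the separator after the previous page number becomes part of the next title unless it is stripped.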

HTH

--EDIT--

Ruby has almost all the constructs of Perl, including regex, so it is a straightforward conversion! (Not sure why it had to be voted down!) FWIW, here it is in Ruby:

while ARGF.gets
  puts $_.scan(/((?:\w|[. ])+?) [.]+ \d+/)
end
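For trying this without a file or standard input, here is a self-contained sketch that reads from a string instead of ARGF. It makes two adjustments that are assumptions of mine rather than part of the answer: `\s+` replaces the single literal spaces, so irregular runs of spaces (like those in the tmp line, assuming its spacing is as shown) are tolerated, and each capture is stripped. `TMP_LINE`, `ENTRY_RE` and `titles_in` are illustrative names.

```ruby
# TMP_LINE, ENTRY_RE and titles_in are illustrative names for this sketch.
TMP_LINE = "Topic one ......... 30 Second Topic .......... 33 " \
           "Third one ......... 3   Topic.with.dots ..........   33 " \
           "One more line ......................... 27 last topic ...... 34\n"

# Same idea as the pattern above, with \s+ in place of single spaces.
ENTRY_RE = /((?:\w|[. ])+?)\s+[.]+\s+\d+/

# Scans each line separately and strips the separator that a capture
# can pick up from the end of the previous entry.
def titles_in(text)
  text.each_line.flat_map { |line| line.scan(ENTRY_RE).flatten.map(&:strip) }
end

puts titles_in(TMP_LINE)
```

Keep in mind that `\w` is ASCII-only in Ruby, so this variant would still need a wider character class for titles containing accented letters or hyphens.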

5 answers

Answer 1
2 accepted 2013-07-12 08:24:20

Answer 2
1 2013-07-12 08:20:39

Answer 3
1 2013-07-12 08:40:54

Answer 4
1 2013-07-12 08:51:00

Answer 5
-1 2013-07-12 08:26:27
