Extract data from one big string with regex
Consider the following string, which is a table of contents extracted from a PDF. As in the example below, two topics can share one line, and there is one line break at the end of each line.
A — N° 1 2 janvier 2013
TABLE OF CONTENT
Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34
I want to extract the section names: 'Topic à one', 'Second Topic', 'Third - one', 'Topic.with.dots', 'One more line' and 'last topic'.
Any insights for a matching regex?
# -*- coding: utf-8 -*-
string = "A — N° 1 2 janvier 2013
TABLE OF CONTENT
Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34"
puts string.scan(/(\p{l}[\p{l} \.-]*)\s+\.+\s+\d+/i).flatten
This does what you want. It also matches single-letter titles.
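To illustrate the single-letter case, here is a minimal check. The input line is made up for the example, and the property name is written as \p{L} rather than \p{l}; otherwise the pattern is the one above.

```ruby
# Hypothetical one-line input with a single-letter title (not from the question).
sample = "A ..... 5 Second Topic .......... 33"

# Same pattern as above: a letter, then any letters, spaces, dots or hyphens,
# followed by whitespace, a run of dots, whitespace, and a page number.
titles = sample.scan(/(\p{L}[\p{L} \.-]*)\s+\.+\s+\d+/i).flatten

p titles  # => ["A", "Second Topic"]
```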
The following (not yet optimized) regex works on your example:
(?i)(?=[A-Z])(?:\.[A-Z-]+|[A-Z -]+)+\b
It needs improvement, though, for example if non-ASCII letters should be matched, and some possible performance optimizations depend on the exact regex flavor being used.
See it on regex101.
For Ruby 2, I would suggest:
/(?=\p{L})(?:\.[\p{L}-]++|[\p{L} -]+)+\b/
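As a rough check of the \p{L} variant, the sketch below runs it over the three table lines only. Skipping the header lines is my assumption: the pattern is not anchored on the dot leaders, so it would also pick up words such as "TABLE OF CONTENT". The names TOC_LINES and TITLE_RE are made up for the example.

```ruby
# The three table-of-contents lines from the question (header lines left out on purpose).
TOC_LINES = [
  "Topic à one ......... 30 Second Topic .......... 33",
  "Third - one ......... 3 Topic.with.dots .......... 33",
  "One more line ......................... 27 last topic ...... 34"
]

# Starts at a letter, then alternates between ".word" chunks and runs of
# letters, spaces and hyphens, and must end on a word boundary.
TITLE_RE = /(?=\p{L})(?:\.[\p{L}-]++|[\p{L} -]+)+\b/

# The pattern has no capture groups, so scan returns plain strings.
titles = TOC_LINES.flat_map { |line| line.scan(TITLE_RE) }
puts titles
```

On these three lines, titles should end up holding the six section names in order, from 'Topic à one' to 'last topic'.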
string.scan(/(\S.*?)\s+\.{2,}\s+\d+/).flatten
# =>
[
"Topic one",
"Second Topic",
"Third one",
"Topic.with.dots",
"One more line",
"last topic"
]
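Since the pattern relies only on \S and ., it is not restricted to ASCII letters. Here is a quick sketch against the current sample from the question, with the accented and hyphenated titles; the variable names are mine.

```ruby
# Sample text from the question, including the two header lines.
text = <<~TOC
  A — N° 1 2 janvier 2013
  TABLE OF CONTENT
  Topic à one ......... 30 Second Topic .......... 33
  Third - one ......... 3 Topic.with.dots .......... 33
  One more line ......................... 27 last topic ...... 34
TOC

# A title starts at a non-space character and lazily extends up to
# whitespace, a run of at least two dots, whitespace, and a page number.
titles = text.scan(/(\S.*?)\s+\.{2,}\s+\d+/).flatten

puts titles
```

The header lines contain no dot leaders, so they are skipped, and titles should contain 'Topic à one', 'Second Topic', 'Third - one', 'Topic.with.dots', 'One more line' and 'last topic'.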
Similar to @sawa's:
puts text.scan(/([a-zA-Z .]+?) \.\.++ \d+/).flatten.map(&:strip)
# >> Topic one
# >> Second Topic
# >> Third one
# >> Topic.with.dots
# >> One more line
# >> last topic
(I like his pattern better though.)
Here is a solution in Perl:
$ cat tmp
Topic one ......... 30 Second Topic .......... 33 Third one ......... 3 Topic.with.dots .......... 33 One more line ......................... 27 last topic ...... 34
$ cat tmp | perl -ne 'while (m/((?:\w|[. ])+?) [.]+ \d+/g) { print "$1\n" }'
Topic one
Second Topic
Third one
Topic.with.dots
One more line
last topic
A little explanation of what I am doing here: the inner set of parens (?:...) is non-capturing, so it is only for grouping, and it groups a word character (\w) or a space or dot ([. ]). Then, since more dots follow the title, the match is non-greedy (+?), and the text captured by the outer parens goes into $1, which is printed.
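To see why the non-greedy +? matters, here is a small Ruby comparison on one line of the input shown above. The greedy variant is only there for contrast and is not part of the answer; strip removes the leading space that the second capture picks up.

```ruby
line = "Topic one ......... 30 Second Topic .......... 33"

# Non-greedy: the group stops at the first " ... <digits>" it can reach.
lazy = line.scan(/((?:\w|[. ])+?) [.]+ \d+/).flatten.map(&:strip)

# Greedy (for contrast): the group runs on to the last dot leader on the line.
greedy = line.scan(/((?:\w|[. ])+) [.]+ \d+/).flatten.map(&:strip)

p lazy    # => ["Topic one", "Second Topic"]
p greedy  # => ["Topic one ......... 30 Second Topic"]
```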
HTH
--EDIT--
Ruby has almost all the constructs of Perl, including regex, so it is a straightforward conversion! (Not sure why it had to be voted down!) FWIW, here it is in Ruby:
while ARGF.gets
  puts $_.scan(/((?:\w|[. ])+?) [.]+ \d+/)
end