I am trying to capture "Rio Grande Do Leste" from:
...
<h1>Rio Grande Do Leste<br />
...
using
var myregexp = /<h1>()<br/;
var nomeAldeiaDoAtaque = myregexp.exec(document);
What am I doing wrong?
Update:
2 questions remain:
1) Searching (document) didn't produce any result, but changing it to (document.body.innerHTML) worked. Why is that?
2) I had to change it to myregexp.exec(document.body.innerHTML)[1]; to get what I want; otherwise it would give me a result that includes <h1>. Why is that?
3) (answered) Why do I need to use ".*"? I thought it would collect anything between ()?
Try /<h1>(.*?)<br/ instead.
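The two remaining questions can be illustrated with a small sketch. It assumes a sample string standing in for document.body.innerHTML and a hypothetical stand-in object for document, so that it can run outside a browser. exec() converts its argument to a string first, and a document object stringifies to something like "[object HTMLDocument]" rather than to its markup. The array that exec() returns holds the whole match at index 0 and what group 1 captured at index 1.

```javascript
// Sample markup standing in for document.body.innerHTML (an assumption for this sketch).
const html = "<div><h1>Rio Grande Do Leste<br />\n</h1></div>";

// Hypothetical stand-in for `document`: exec() coerces its argument to a string,
// and a document object stringifies to something like this, not to its markup.
const fakeDocument = { toString() { return "[object HTMLDocument]"; } };

const myregexp = /<h1>(.*?)<br/;

// No <h1> appears in "[object HTMLDocument]", so there is nothing to match.
const missed = myregexp.exec(fakeDocument);

// Searching the markup itself succeeds.
const found = myregexp.exec(html);
const wholeMatch = found[0];         // everything the pattern matched, <h1> and <br included
const nomeAldeiaDoAtaque = found[1]; // only what group 1 captured

console.log(missed, wholeMatch, nomeAldeiaDoAtaque);
```

Index 0 is always the full match, which is why the result without [1] still includes <h1>.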
A capturing group attempts to capture what it matches. This has some important consequences, which the examples below work through.
Here's a simple pattern that contains 2 capturing groups:
(\d+) (cats|dogs)
\___/ \_________/
  1        2
Given "i have 16 cats, 20 dogs, and 13 turtles", there are 2 matches (as seen on rubular.com):
16 cats is a match: group 1 captures 16, group 2 captures cats
20 dogs is a match: group 1 captures 20, group 2 captures dogs
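The same result can be reproduced in JavaScript. This is a minimal sketch using String.prototype.matchAll, which requires the g flag; the variable names are chosen here for illustration only.

```javascript
const text = "i have 16 cats, 20 dogs, and 13 turtles";
const pattern = /(\d+) (cats|dogs)/g; // the + is inside group 1

// Collect the whole match and both groups for every match in the text.
const results = [];
for (const m of text.matchAll(pattern)) {
  results.push({ match: m[0], group1: m[1], group2: m[2] });
}

console.log(results);
```

"13 turtles" is not reported because group 2 only allows cats or dogs.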
Now consider this slight modification on the pattern:
(\d)+ (cats|dogs)
\__/  \_________/
 1         2
Now group 1 matches \d, i.e. a single digit. In most flavors, a group that matches repeatedly (thanks to the + in this case) only gets to keep the last match. Thus, in most flavors, only the last digit that was matched is captured by group 1 (as seen on rubular.com):
16 cats is a match: group 1 captures 6, group 2 captures cats
20 dogs is a match: group 1 captures 0, group 2 captures dogs
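JavaScript is one of the flavors that behaves this way. The sketch below runs the modified pattern over the same text; the variable names are illustrative.

```javascript
const sentence = "i have 16 cats, 20 dogs, and 13 turtles";
const repeatedGroup = /(\d)+ (cats|dogs)/g; // the + is now outside group 1

// The whole match still covers every digit, but group 1 keeps only
// the digit from its last repetition.
const lastDigitResults = [];
for (const m of sentence.matchAll(repeatedGroup)) {
  lastDigitResults.push({ match: m[0], group1: m[1], group2: m[2] });
}

console.log(lastDigitResults);
```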
Now let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.
We use the following as input:
eeAiiZooAuuZZeeeZZfff
We use 3 different patterns:
A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com); group 1 captures iiZooAuuZZeee
A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com); group 1 captures iiZooAuu
A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com); group 1 captures uu
Here's a visual representation of what they matched:
         ___n
        /   \               n = negated character class
eeAiiZooAuuZZeeeZZfff       r = reluctant
  \_________/r   /          g = greedy
   \____________/g
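The three patterns can be compared side by side in JavaScript. This is a small sketch; the labels greedy, reluctant, and negated are just keys chosen for this example.

```javascript
const input = "eeAiiZooAuuZZeeeZZfff";

const patterns = {
  greedy: /A(.*)ZZ/,       // takes as much as possible, then backs off
  reluctant: /A(.*?)ZZ/,   // takes as little as possible, then expands
  negated: /A([^Z]*)ZZ/,   // group 1 may not contain any Z at all
};

// Record the whole match and group 1 for each pattern.
const captures = {};
for (const [name, re] of Object.entries(patterns)) {
  const m = re.exec(input);
  captures[name] = { match: m[0], group1: m[1] };
}

console.log(captures);
```

The negated character class starts at the second A because, from the first A, group 1 stops at the single Z, which is not followed by another Z.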
See the related question on .*? and .* for regex for a more in-depth treatment of the difference between these 3 techniques.
So let's go back to the question and see what's wrong with the pattern:
<h1>()<br
    \/
    1
Group 1 matches the empty string; therefore the whole pattern can only match <h1><br, and group 1 can only capture the empty string.
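This is easy to confirm in JavaScript. The sketch below runs the original pattern against the text from the question and against a hypothetical input with nothing between the two tags.

```javascript
const broken = /<h1>()<br/;

// The text from the question: there are characters between <h1> and <br,
// and the empty group cannot consume them, so the match fails.
const noMatch = broken.exec("<h1>Rio Grande Do Leste<br />");

// Hypothetical input where <br follows <h1> immediately: the only shape
// this pattern can match.
const emptyMatch = broken.exec("<h1><br />");

console.log(noMatch, emptyMatch[0], JSON.stringify(emptyMatch[1]));
```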
One can try to "fix" this in many different ways. The 3 obvious ones to try are:
<h1>(.*)<br ; greedy
<h1>(.*?)<br ; reluctant
<h1>([^<]*)<br ; negated character class
You will find that none of the above "work" all the time; there will be problems with some HTML. This is to be expected: regex is the "wrong" tool for the job. You can try to make the pattern more and more complicated, to get it "right" more often and "wrong" less often. More than likely you'll end up with a horrible mess that no one can understand or maintain, and it still probably won't work "right" 100% of the time.
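To make "none of the above work all the time" concrete, here is a sketch that runs the 3 patterns over a few inputs. Only the first input comes from the question; the other three are hypothetical variations made up for this example, and captureGroup1 is a helper defined here.

```javascript
const fixes = {
  greedy: /<h1>(.*)<br/,
  reluctant: /<h1>(.*?)<br/,
  negated: /<h1>([^<]*)<br/,
};

// Only `simple` is from the question; the rest are hypothetical inputs
// chosen to stress each pattern differently.
const samples = {
  simple: "<h1>Rio Grande Do Leste<br />",
  twoBreaks: "<h1>Rio Grande Do Leste<br />subtitle<br />",
  nestedTag: "<h1>Rio <b>Grande</b> Do Leste<br />",
  lineBreak: "<h1>Rio Grande\nDo Leste<br />",
};

// Returns what group 1 captured, or null when the pattern does not match.
function captureGroup1(re, text) {
  const m = re.exec(text);
  return m ? m[1] : null;
}

const outcome = {};
for (const [sampleName, text] of Object.entries(samples)) {
  outcome[sampleName] = {};
  for (const [fixName, re] of Object.entries(fixes)) {
    outcome[sampleName][fixName] = captureGroup1(re, text);
  }
}

console.log(outcome);
```

Each pattern gives up or overshoots on at least one input: greedy runs through to the last <br on the line, the negated class fails as soon as another tag appears inside the heading, and . does not match a line break unless the s flag is set.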