The line is approximately 7915621 characters long and is actually the view state value of an ASPX website.
I get the original HTML of the site, then pass it line by line to the extract function; as soon as it reaches the view_state line containing that long string, the regex gets stuck.
Here is the regex pattern that gets stuck:
/[\w\.]+\@[\w]+(?:\.[\w]{3}|\.[\w]{2}\.[\w]{2})\b/gi
I thought about setting a maximum line length to skip this line and any others like it, but I can't think of an optimal size, as I care about false positives.
[\w\.]+
matches at so many positions in your document that processing all of them with your expression becomes a problem.
Reducing the number of places where a match can start is a possible solution, e.g. by using a word boundary.
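To get a feel for how much a word boundary narrows things down, here is a rough sketch that counts candidate start positions on a long run of word characters. The sample line is a made-up stand-in for the view state value, and the two lookahead patterns are only illustrative probes, not part of the suggested solution.

```javascript
// Hypothetical stand-in for a long token-like line that contains no email.
const line = "x".repeat(1000);

// Positions where an unanchored [\w.]+ could begin consuming characters.
const unanchoredStarts = (line.match(/(?=[\w.])/g) || []).length;

// Positions where a pattern that begins with \b\w could begin.
const boundaryStarts = (line.match(/\b(?=\w)/g) || []).length;

console.log(unanchoredStarts, boundaryStarts); // 1000 1
```

Without the boundary, every one of those positions starts a scan over the rest of the run before failing at the missing @, which is where the time goes on a line with millions of characters.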
(?:\.\w{3}|\.\w{2}\.\w{2})
can be streamlined to \.\w{2}(?:\w|\.\w{2}).
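As a quick sanity check that the streamlined group accepts the same suffixes as the original alternation, the following sketch compares both on a handful of sample strings. The ^ and $ anchors and the sample list are additions for testing only.

```javascript
// Both groups wrapped in anchors so that each sample must match in full.
const original = /^(?:\.\w{3}|\.\w{2}\.\w{2})$/;
const streamlined = /^\.\w{2}(?:\w|\.\w{2})$/;

// Illustrative suffixes: two that should match, three that should not.
const samples = [".com", ".co.uk", ".io", ".info", ".c.uk"];

// true when both patterns give the same verdict for every sample.
const agree = samples.every((s) => original.test(s) === streamlined.test(s));

console.log(agree); // true
```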
Use
/\b[\w.]+@\w+\.\w{2}(?:\w|\.\w{2})\b/gi
Or get rid of the character class (the square brackets):
/\b\w+(?:\.\w+)*@\w+\.\w{2}(?:\w|\.\w{2})\b/gi
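A minimal usage sketch of the second pattern follows. The helper name extractEmails and the sample strings are hypothetical, and the long line is only a small stand-in for the real view state value.

```javascript
// Second pattern from above; extractEmails is a hypothetical helper name.
const EMAIL_RE = /\b\w+(?:\.\w+)*@\w+\.\w{2}(?:\w|\.\w{2})\b/gi;

function extractEmails(line) {
  // With the g flag, match() returns every match, or null when there is none.
  return line.match(EMAIL_RE) || [];
}

const sample = "Contact john.doe@example.com or sales@shop.co.uk for details.";
const found = extractEmails(sample);
console.log(found); // [ 'john.doe@example.com', 'sales@shop.co.uk' ]

// A long run of word characters with no @ in it yields no matches.
const longLine = "A".repeat(20000);
const none = extractEmails(longLine);
console.log(none.length); // 0
```

On the long run, only the first position passes the \b check, so the engine makes one pass over the run instead of restarting a scan at every character.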
EXPLANATION
--------------------------------------------------------------------------------
\b        the boundary between a word char (\w) and something that is not a word char
--------------------------------------------------------------------------------
\w+       word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
(?:       group, but do not capture (0 or more times (matching the most amount possible)):
--------------------------------------------------------------------------------
  \.      '.'
--------------------------------------------------------------------------------
  \w+     word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
)*        end of grouping
--------------------------------------------------------------------------------
@         '@'
--------------------------------------------------------------------------------
\w+       word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\.        '.'
--------------------------------------------------------------------------------
\w{2}     word characters (a-z, A-Z, 0-9, _) (2 times)
--------------------------------------------------------------------------------
(?:       group, but do not capture:
--------------------------------------------------------------------------------
  \w      word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
 |        OR
--------------------------------------------------------------------------------
  \.      '.'
--------------------------------------------------------------------------------
  \w{2}   word characters (a-z, A-Z, 0-9, _) (2 times)
--------------------------------------------------------------------------------
)         end of grouping
--------------------------------------------------------------------------------
\b        the boundary between a word char (\w) and something that is not a word char