How can I verify URLs are correct, and/or extract valid URLs from arbitrary text?

Sometimes I've had a text entry form where I want to disable the "Accept" button until the user has entered a valid URL. Searching here or on the web turns up a huge number of regular expressions, but given the complexity of the URL specification (RFC-3986), it's nigh impossible to write your own verification test suite for them. Once my app is in the App Store, how would I ever know how many false negatives I got due to defects in the regular expression?

Other times I've needed to extract all the valid URLs from a web site or some other text into an array, so I can filter it down to, say, just those that point to an image file. Faulty regular expressions are less of a problem in this case: if I miss an image or two, or get a bogus URL, it's not a major problem. Still, the better the regular expression, the more correct the returned list of images.

So, how can I validate a presented string as a proper URL with virtual certainty? It would also be nice to have the means to extract valid URLs from arbitrary text.

There are a huge number of regular expressions on the web that claim to verify URLs. The problem with most of them is that, while they may work, they have no credentials: there is no way to prove their correctness one way or the other.

The reference spec on URLs is RFC-3986, and during a long search for the best regular expression I tripped over Jeff Roberson's regular expression page. He started from the spec, constructing small regular expressions to match the low-level parts of the RFC, and gradually built them up into a full expression.

For instance, this is how one gets the full scheme:

# From http://jmrware.com/articles/2009/uri_regexp/URI_regex.html Copyright @ Jeff Roberson
(⌽[A-Za-z][A-Za-z0-9+\-.]*)
# DFH Addition: change ⌽ from "?:" to "" to get capture groups of the various components

The Unicode character after the first "(" gets changed to either "?:", making it a non-capturing group, or "" to turn it into a capturing group. Note that this matches a single letter followed by zero or more of the characters contained in the second "[]" group.
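The placeholder substitution can be sketched in a few lines. Python is used here purely for illustration (the app described in this answer is Objective-C), and the names `SCHEME_TEMPLATE` and `build_scheme_regex` are hypothetical, not part of the project:

```python
import re

# Scheme template copied from the expression above; the ⌽ placeholder is
# replaced with "?:" (non-capturing) or "" (capturing) before compiling.
SCHEME_TEMPLATE = r"(⌽[A-Za-z][A-Za-z0-9+\-.]*)"


def build_scheme_regex(capture: bool) -> re.Pattern:
    """Return the compiled scheme pattern with or without a capture group."""
    return re.compile(SCHEME_TEMPLATE.replace("⌽", "" if capture else "?:"))


capturing = build_scheme_regex(capture=True)
non_capturing = build_scheme_regex(capture=False)

# The capturing variant exposes the scheme as group 1.
print(capturing.match("https://example.com").group(1))  # https

# The non-capturing variant matches the same text but has no groups.
print(non_capturing.groups)  # 0
```

A scheme must start with a letter, so the pattern accepts "a" on its own but rejects "1http".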

The full authority is found using this expression:

# RFC-3986 URI component:  relative-part
(?: //                                                          # ( "//"
  (?: (⌽(?:[A-Za-z0-9\-._~!$&'()*+,;=:]|%[0-9A-Fa-f]{2}☯)* ) @)?     # userinfo; DFH modified to grab the userinfo without '@'
  (⌽
    \[
    (?:
      (?:
        (?:                                                    (?:[0-9A-Fa-f]{1,4}:){6}
        |                                                   :: (?:[0-9A-Fa-f]{1,4}:){5}
        | (?:                            [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:){4}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,1} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:){3}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,2} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:){2}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,3} [0-9A-Fa-f]{1,4})? ::    [0-9A-Fa-f]{1,4}:
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,4} [0-9A-Fa-f]{1,4})? ::
        ) (?:
            [0-9A-Fa-f]{1,4} : [0-9A-Fa-f]{1,4}
          | (?: (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) \.){3}
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
          )
      |   (?: (?:[0-9A-Fa-f]{1,4}:){0,5} [0-9A-Fa-f]{1,4})? ::    [0-9A-Fa-f]{1,4}
      |   (?: (?:[0-9A-Fa-f]{1,4}:){0,6} [0-9A-Fa-f]{1,4})? ::
      )
    | [Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+
    )
    \]
  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
  | 
  (?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2}☯)*
  )

  (?: : (⌽[0-9]*) )? # DFH addition to grab just the port

 (⌽   # DFH addition to get one capture group
  (⌽ / (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)* )*    # path-abempty
| /                                                             # / path-absolute
  (⌽     (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)+
    (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)* )*
  )?
| (⌽        (?:[A-Za-z0-9\-._~!$&'()*+,;=@] |%[0-9A-Fa-f]{2}☯)+     # / path-noscheme
    (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)* )*
   ) # DFH Wrapper
|                                                            # / path-empty
      (⌽) # DFH addition so constant number of capture groups
 )
)                                                               # )

# DFH Addition: change ☯ to "|[\u0080-\U0010ffff]" to get inline Unicode detection (making this an IRI, not a URI, but you can later hex encode it), or "" for standard behavior
# DFH Addition: change ⌽ from "?:" to "" to get capture groups of the various components

If you read the above, you can see that this expression can be extended to accept Unicode characters by adding "|[\u0080-\U0010ffff]" in just a few places.

Because he actually started with the RFC, and all portions of his expression fully reference the ABNF specification, I have great confidence in them.

However, when I started testing, I found that a URL verifier accepted a bare "http://"! It turns out that the spec allows virtually every component to be an empty string, which makes a spec-exact expression hard to use as a UI form verifier.

So I took his expressions, and made some small additions. First, I found I could change the path specifier from a '*' to a '?', so that in form entry, a user would be forced to type at least one '/' after 'http://'. This makes the validator more stringent than it needs to be, but a more realistic one.
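The difference can be seen in a deliberately simplified stand-in for the full expression. This is only a sketch: it handles just http/https with a registered-name host, the names `rfc_like` and `form_entry` are hypothetical, and the stricter variant shown here (requiring at least one "/"-prefixed segment) is an assumption about the intent, not the project's actual change. Python is used for illustration only.

```python
import re

# Character classes taken from the expressions above (ASCII-only variant).
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
REG_NAME = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"  # may be empty per the RFC

# Spec-shaped: host and path may both be empty, so "http://" is accepted.
rfc_like = re.compile(rf"https?://{REG_NAME}(?:/{PCHAR}*)*")

# Form-entry variant: at least one "/" must follow the authority.
form_entry = re.compile(rf"https?://{REG_NAME}(?:/{PCHAR}*)+")

for text in ("http://", "http://example.com", "http://example.com/"):
    print(
        text,
        rfc_like.fullmatch(text) is not None,
        form_entry.fullmatch(text) is not None,
    )
```

With the form-entry variant, an "Accept" button would stay disabled until the user has typed the slash after the host.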

Jeff's regular expressions just use non-capturing groups, so I looked at ways to support capturing groups, so all components of a URL could be extracted if need be.

Also, think of users outside the USA, who often need to enter non-ASCII characters such as accented letters into a URL; the normal validator would reject any Unicode character. It would be nice to validate a string containing Unicode characters, then convert them to '%'-encoded hex before actually using the URL. This requires extending the expressions to accept Unicode characters by adding |[\u0080-\U0010ffff] to the sections accepting ASCII.
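As a rough sketch of that idea, the snippet below takes one path-segment fragment of the expression, fills in the ☯ placeholder both ways, and then percent-encodes the accepted text. Python and its `urllib.parse.quote` are used only for illustration; the names `SEGMENT_TEMPLATE` and `UNICODE_ALTERNATIVE` are hypothetical.

```python
import re
from urllib.parse import quote

# One path-segment fragment from the expression above, with ☯ left in place.
SEGMENT_TEMPLATE = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)*"

# Text substituted for ☯ to accept any code point at or above U+0080.
UNICODE_ALTERNATIVE = r"|[\u0080-\U0010ffff]"

ascii_only = re.compile(SEGMENT_TEMPLATE.replace("☯", ""))
with_unicode = re.compile(SEGMENT_TEMPLATE.replace("☯", UNICODE_ALTERNATIVE))

segment = "café"
print(ascii_only.fullmatch(segment) is not None)    # False: "é" is rejected
print(with_unicode.fullmatch(segment) is not None)  # True

# After validation, convert the non-ASCII characters to %-encoded UTF-8 bytes.
encoded = quote(segment)
print(encoded)                                      # caf%C3%A9
print(ascii_only.fullmatch(encoded) is not None)    # True
```

Once encoded, the string passes the stricter ASCII-only expression, which is the form a URL loader ultimately needs.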

The whole problem begged for a test harness that could construct one or more regular expressions with the options a given app might need, and that could test those against various test strings; thus was born URLFinderAndVerifier.

The test harness uses extended expression strings taken from Jeff's page, with all their spaces and comments intact, plus additional comments of my own. Those make the expressions much easier to read and understand. The test app reads the text files, removes all comments and spaces, preprocesses the expressions based on the options selected in the UI, and then makes the result available for use or for pasting (so you can use it in your app). The test app also has an interactive mode, where it validates as you modify the input text.
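The preprocessing step might look roughly like the sketch below. It is not the project's code: the `EXTENDED` sample and the `compact` helper are hypothetical, Python is used only for illustration, and the approach assumes the expression never contains a literal "#" or a literal space, which holds for the fragments shown above.

```python
import re

# A small extended-format expression in the style of the files described above.
EXTENDED = r"""
# scheme followed by the separator
(⌽ [A-Za-z] [A-Za-z0-9+\-.]* )   # scheme
: //                              # literal separator
"""


def compact(extended: str, capture: bool) -> str:
    """Strip comments and whitespace, then resolve the ⌽ placeholder."""
    without_comments = re.sub(r"#.*", "", extended)   # drop "#" to end of line
    without_spaces = re.sub(r"\s+", "", without_comments)
    return without_spaces.replace("⌽", "" if capture else "?:")


pattern = compact(EXTENDED, capture=True)
print(pattern)                                        # ([A-Za-z][A-Za-z0-9+\-.]*)://
print(re.match(pattern, "ftp://host/file").group(1))  # ftp
```

The compacted string is what would be pasted into an app as a single-line pattern.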

Options:

  • look for http/https, http/https/ftp, or any scheme

  • for form entry, require a "/" after "scheme://", which makes the toggling of an "Accept" button more realistic (this also requires at least one character after the query's "?" and the fragment's "#")

  • enable capture groups, so for each URL you can extract the scheme, userinfo, host, port, path, and optionally the query and/or the fragment

  • in extract mode, include or exclude the query and/or fragment

USAGE:

  • clone the project, and determine what regular expression you want, then paste it into the results window and use it in your app (suitable for a text file or NSString in code)

  • copy the URLFinder interface and implementation files into your project

  • instantiate a URLFinder and supply it the regular expression from the first step.

Surely the easiest way to validate a URL is by constructing an NSURL object.

NSURL *url = [NSURL URLWithString:urlString];

According to the documentation :

Must be a URL that conforms to RFC 2396.

If the string was malformed, returns nil.

Ultimately you'll likely want to convert the URL into an NSURL object anyway, so NSURL is probably in the best position to decide whether your string is valid or not.

Then, to find URLs in a block of text, you can perform a very simple regex search that just looks for potential candidates. For example, something like this:

[^\s]+://[^\s]+

Then use the NSURL construction technique described above to validate whether those candidates are genuine matches or not.
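The two-step idea (a loose candidate search, then a real parser as the judge) can be sketched as follows. Note the substitution: this sketch uses Python's `urllib.parse.urlsplit` in place of NSURL, purely for illustration, and the two parsers do not accept exactly the same strings. The helper names `looks_valid` and `extract_urls` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Loose candidate pattern from the answer above.
CANDIDATE = re.compile(r"[^\s]+://[^\s]+")

# RFC-3986 scheme syntax: a letter, then letters, digits, "+", "-" or ".".
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")


def looks_valid(candidate: str) -> bool:
    """Stand-in for the NSURL check: parse, then require a scheme and a host part."""
    try:
        parts = urlsplit(candidate)
    except ValueError:  # raised for malformed input such as unbalanced "[" in the host
        return False
    return bool(SCHEME.fullmatch(parts.scheme)) and bool(parts.netloc)


def extract_urls(text: str) -> list:
    """Return the candidates in text that survive the parser check."""
    return [c for c in CANDIDATE.findall(text) if looks_valid(c)]


sample = "See https://example.com/a.png and 1bad://nope or ftp://host/file.txt today"
print(extract_urls(sample))
```

Because the candidate pattern grabs everything up to the next whitespace, trailing punctuation such as a closing parenthesis or a full stop stays attached to a match, so real text may need an extra trimming step before validation.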
