简体   繁体   中英

Is it possible to detect links within an NSString that have spaces in them with NSDataDetector?

First off, I have no control over the text I am getting. Just wanted to put that out there so you know that I can't change the links.

The text I am trying to find links in using NSDataDetector contains the following:

<h1>My main item</h1>
<img src="http://www.blah.com/My First Image Here.jpg">
<h2>Some extra data</h2>

The detection code I am using is this, but it will not find this link:

NSDataDetector *linkDetector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:nil];
NSArray *matches = [linkDetector matchesInString:myHTML options:0 range:NSMakeRange(0, [myHTML length])];

for (NSTextCheckingResult *match in matches) 
{
   if ([match resultType] == NSTextCheckingTypeLink)
   {
      NSURL *url = [match URL];
      // does some stuff
   }
}

Is this a bug with Apple's link detection here, where it can't detect links with spaces, or am I doing something wrong?

Does anyone have a more reliable way to detect links regardless of whether they have spaces or special characters or whatever in them?

I just got this response from Apple for a bug I filed on this:

We believe this issue has been addressed in the latest iOS 9 beta. This is a pre-release iOS 9 update.

Please refer to the release notes for complete installation instructions.

Please test with this release. If you still have issues, please provide any relevant logs or information that could help us investigate.

iOS 9 https://developer.apple.com/ios/download/

I will test and let you all know if this is fixed with iOS 9.

You could split the strings into pieces using the spaces so that you have an array of strings with no spaces. Then you could feed each of those strings into your data detector.

// assume str = <img src="http://www.blah.com/My First Image Here.jpg">
NSArray *components = [str componentsSeparatedByString:@" "];
for (NSString *strWithNoSpace in components) {
    // feed strings into data detector
}

Another alternative is to look specifically for that HTML tag. This is a less generic solution, though.

// assume that those 3 HTML strings are in a string array called strArray
for (NSString *htmlLine in strArray) {
    if ([[htmlLine substringWithRange:NSMakeRange(0, 8)] isEqualToString:@"<img src"]) {
        // Get the url from the img src tag
        NSString *urlString = [htmlLine substringWithRange:NSMakeRange(10, htmlLine.length - 12)];
    }
}

I've found a very hacky way to solve my issue. If someone comes up with a better solution that can be applied to all URLs, please do so.

Because I only care about URLs ending in .jpg that have this problem, I was able to come up with a narrow way to track this down.

Essentially, I break out the string into components based off of them beginning with "http:// into an array. Then I loop through that array doing another break out looking for .jpg"> . The count of the inner array will only be > 1 when the .jpg"> string is found. I then keep both the string I find, and the string I fix with %20 replacements, and use them to do a final string replacement on the original string.

It's not perfect and probably inefficient, but it gets the job done for what I need.

- (NSString *)replaceSpacesInJpegURLs:(NSString *)htmlString
{
    NSString *newString = htmlString;

    NSArray *array = [htmlString componentsSeparatedByString:@"\"http://"];
    for (NSString *str in array)
    {
        NSArray *array2 = [str componentsSeparatedByString:@".jpg\""];

        if ([array2 count] > 1)
        {
            NSString *stringToFix = [array2 objectAtIndex:0];
            NSString *fixedString = [stringToFix stringByReplacingOccurrencesOfString:@" " withString:@"%20"];

            newString = [newString stringByReplacingOccurrencesOfString:stringToFix withString:fixedString];
        }
    }

    return newString;
}

You can use NSRegularExpression to fix all URLs by using a simple regex to detect the links and then just encode the spaces (if you need more complex encoding you can look into CFURLCreateStringByAddingPercentEscapes and there are plenty of examples out there). The only thing that might take you some time if you haven't worked with NSRegularExpression before is how to iterate the results and do the replacing, the following code should do the trick:

NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"src=\".*\"" options:NSRegularExpressionCaseInsensitive error:&error];
if (!error)
{
    NSInteger offset = 0;
    NSArray *matches = [regex matchesInString:myHTML options:0 range:NSMakeRange(0, [myHTML length])];
    for (NSTextCheckingResult *result in matches)
    {
        NSRange resultRange = [result range];
        resultRange.location += offset;

        NSString *match = [regex replacementStringForResult:result inString:myHTML offset:offset template:@"$0"];
        NSString *replacement = [match stringByReplacingOccurrencesOfString:@" " withString:@"%20"];

        myHTML = [myHTML  stringByReplacingCharactersInRange:resultRange withString:replacement];
        offset += ([replacement length] - resultRange.length);
    }
}

Try this regex pattern: @"<img[^>]+src=(\\"|')([^\\"']+)(\\"|')[^>]*>" with ignore case ... Match index=2 for source url.

regex demo in javascript: (Try for any help)

Demo

Give this snippet a try (I got the regexp from your first commentator user3584460) :

NSError *error = NULL;
NSString *myHTML = @"<http><h1>My main item</h1><img src=\"http://www.blah.com/My First Image Here.jpg\"><h2>Some extra data</h2><img src=\"http://www.bloh.com/My Second Image Here.jpg\"><h3>Some extra data</h3><img src=\"http://www.bluh.com/My Third-Image Here.jpg\"></http>";
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"src=[\"'](.+?)[\"'].*?>" options:NSRegularExpressionCaseInsensitive error:&error];

NSArray *arrayOfAllMatches = [regex matchesInString:myHTML options:0 range:NSMakeRange(0, [myHTML length])];

NSTextCheckingResult *match = [regex firstMatchInString:myHTML options:0 range:NSMakeRange(0, myHTML.length)];



for (NSTextCheckingResult *match in arrayOfAllMatches) {
    NSRange  range = [match rangeAtIndex:1];

    NSString* substringForMatch = [myHTML substringWithRange:range];
    NSLog(@"Extracted URL : %@",substringForMatch);

}

In my log, I have :

Extracted URL  : http://www.blah.com/My First Image Here.jpg
Extracted URL  : http://www.bloh.com/My Second Image Here.jpg
Extracted URL  : http://www.bluh.com/My Third-Image Here.jpg

You should not use NSDataDetector with HTML. It is intended for parsing normal text (entered by an user), not computer-generated data (in fact, it has many heuristics to actually make sure it does not detect computer-generated things which are probably not relevant to the user).

If your string is HTML, then you should use an HTML parsing library. There are a number of open-source kits to help you do that. Then just grab the href attributes of your anchors, or run NSDataDetector on the text nodes to find things not marked up without polluting the string with tags.

URLs really shouldn't contain spaces. I'd remove all spaces from the string before doing anything URL-related with it, something like the following

// Custom function which cleans up strings ready to be used for URLs
func cleanStringForURL(string: NSString) -> NSString {
    var temp = string
    var clean = string.stringByReplacingOccurrencesOfString(" ", withString: "")
    return clean
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM