Parse NSString with regex in iOS

I have the following input:

<table class="fiche_table_caracter"><tbody>
<tr>
    <td class="caracteristique"><strong>Design</strong></td>
    <td>Classique (full tactile)</td>
</tr>

<tr>
    <td class="caracteristique"><strong>Système d'exploitation (OS)</strong></td>
    <td>iOS</td>
</tr>
<tr>
    <td class="caracteristique"><strong>Ecran</strong></td>
    <td>4,7'' (1334 x 750 pixels)<br />16 millions de couleurs</td>
</tr>
<tr>
    <td class="caracteristique"><strong>Mémoire interne</strong></td>
    <td>128 Go, 1 Go RAM</td>
</tr>
<tr>
    <td class="caracteristique"><strong>Appareil photo</strong></td>
    <td>8 mégapixels</td>
</tr>
</tbody>
</table>

I need to extract only the content of the <td> tags. This is what I did:

NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"<tr*>(.*?)</tr>" options:NSRegularExpressionCaseInsensitive error:NULL];

            NSArray *myArray = [regex matchesInString:str options:0 range:NSMakeRange(0, [str length])] ;
            UA_log(@"counttt: %d", [myArray count]);
            NSMutableArray *matches = [NSMutableArray arrayWithCapacity:[myArray count]];

            for (NSTextCheckingResult *match in myArray) {
                NSRange matchRange = [match rangeAtIndex:1];
                [matches addObject:[str substringWithRange:matchRange]];
                NSLog(@"Regex output:%@", [matches lastObject]);
                NSString * str2 = [matches lastObject];
                NSRegularExpression *regex2 = [NSRegularExpression regularExpressionWithPattern:@"<td*>(<strong>)?(.*?)(</strong>)?</td>" options:NSRegularExpressionCaseInsensitive error:NULL];

                NSArray *myArray2 = [regex2 matchesInString:str2 options:0 range:NSMakeRange(0, [str2 length])] ;
                UA_log(@"counttt: %d", [myArray2 count]);
                NSMutableArray *matches2 = [NSMutableArray arrayWithCapacity:[myArray2 count]];

                for (NSTextCheckingResult *match2 in myArray2) {
                    NSRange matchRange2 = [match2 rangeAtIndex:1];
                    [matches2 addObject:[str2 substringWithRange:matchRange2]];
                    NSLog(@"Regex2 output:%@", [matches2 lastObject]);
                    NSString * lastObject2 = [matches2 lastObject];

                }

            }

The problem is that I want the <strong> tag to be optional, but that doesn't work. With this code I can extract the <tr> rows, but not the content of the <td> cells.

Please help!

I would like to extract:

1-

Design

Classique (full tactile)

2-

Système d'exploitation (OS)

iOS

3-

Ecran

16 millions de couleurs

4-

Mémoire interne

128 Go, 1 Go RAM

Use XMLReader to parse the string into a dictionary:

#import "XMLReader.h"

NSData *data = [str dataUsingEncoding:NSUTF8StringEncoding];
NSError *error = nil;
NSDictionary *dict = [XMLReader dictionaryForXMLData:data error:&error];
NSArray *trArray = [dict valueForKeyPath:@"table.tbody.tr"];
NSArray *tdArray = [trArray valueForKey:@"td"];
NSInteger i = 1;
for (NSArray *tdItems in tdArray) {
    NSString *stringValue = @"";
    for (NSDictionary *td in tdItems) {
        if ([td valueForKey:@"strong"]) {
            NSDictionary *strong = [td valueForKey:@"strong"];
            if ([strong valueForKey:@"text"]) {
                stringValue = [stringValue stringByAppendingString:[NSString stringWithFormat:@"\n %@", [strong valueForKey:@"text"]]];
            }
        } else if ([td valueForKey:@"text"]) {
            stringValue = [stringValue stringByAppendingString:[NSString stringWithFormat:@"\n %@", [td valueForKey:@"text"]]];
        }
    }
    NSLog(@"%ld- %@", (long)i, stringValue); // NSInteger needs %ld with a (long) cast
    i++;
}

THE "RIGHT WAY" WITH AN HTML PARSER

Whenever you have arbitrary HTML, you need an HTML parser to extract information from it, e.g. the Hpple parser (TFHpple) covered in Ray Wenderlich's tutorial. The example below uses the XPath @"//tr/td[@class='caracteristique']", which selects the td nodes whose class attribute is caracteristique, i.e. the label cells (@"//tr/td" would select every cell, values included):

- (void)loadDataFromHtml {
    NSURL *url = [NSURL URLWithString:stringUrl];
    NSData *data = [NSData dataWithContentsOfURL:url];
    TFHpple *parser = [TFHpple hppleWithHTMLData:data];
    NSString *XpathQueryString = @"//tr/td[@class='caracteristique']"; // Here, we use the XPath
    NSArray *nodes = [parser searchWithXPathQuery:XpathQueryString];
    for (TFHppleElement *element in nodes) {
        NSLog(@"%@", [element content]);
    }
}

See more on this at "Parse HTML in objective C" and "How to Parse HTML on iOS".

REGEX FIX (SINCE THE OP REQUIRES IT)

Here are fixes for your regular expressions:

The first one should be

(?s)<tr[^<]*>(.*?)</tr>

[^<]* keeps the match inside the opening <tr> tag and consumes any attributes it has.
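
As a minimal sketch of how this first pattern could be applied (the helper name RowBodies is hypothetical and not part of the question's code), the following collects capture group 1 for every row. The (?s) flag matters here because the rows span several lines, and . does not match a newline by default.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the inner HTML of every <tr>...</tr> found in `html`.
static NSArray<NSString *> *RowBodies(NSString *html) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"(?s)<tr[^<]*>(.*?)</tr>"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *rows = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *results =
        [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in results) {
        // Group 1 is everything between the opening <tr ...> and the closing </tr>.
        [rows addObject:[html substringWithRange:[match rangeAtIndex:1]]];
    }
    return rows;
}
```

Passing the table string from the question (str) to this helper should give one entry per row, which can then be fed to the second pattern.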

The second regex:

(?s)(?:<td\\b[^<]*>|\\G(?!^))(?:<[^<]+>)?(?!\\s+)([^<]*)(?:<[^<]+>)?

It matches each run of text inside the cells while skipping the tags. See demo.

Explanation:

  • (?s) - enables single-line (DOTALL) mode, in which . also matches newline characters
  • (?:<td\\b[^<]*>|\\G(?!^)) - starts the match either at an opening <td...> tag or at the end of the previous match ( \\G(?!^) )
  • (?:<[^<]+>)? - optionally matches a tag of the form <...>
  • (?!\\s+)([^<]*) - captures the text up to the next tag, provided it does not start with whitespace
  • (?:<[^<]+>)? - optionally matches a tag of the form <...>
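
A rough sketch of using the second pattern follows, assuming the input is a row body (or the whole table) held in an NSString. The helper name CellTexts is hypothetical. One detail worth noting: the pattern can produce matches whose group 1 is empty, for example just before a closing </td>, so the sketch skips those.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the non-empty text runs found inside <td> cells.
static NSArray<NSString *> *CellTexts(NSString *html) {
    // The second pattern from this answer, escaped for an Objective-C string literal.
    NSString *pattern =
        @"(?s)(?:<td\\b[^<]*>|\\G(?!^))(?:<[^<]+>)?(?!\\s+)([^<]*)(?:<[^<]+>)?";
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *texts = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *results =
        [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in results) {
        NSRange textRange = [match rangeAtIndex:1];
        // Skip matches where the captured text is missing or empty.
        if (textRange.location == NSNotFound || textRange.length == 0) {
            continue;
        }
        [texts addObject:[html substringWithRange:textRange]];
    }
    return texts;
}
```

Called on the first row from the question, this should yield the label and the value as separate entries. A cell that contains <br /> should yield one entry per text run, since the \\G branch resumes matching right after the previous match.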
