
Using strstr to find all instances of substring results in weird string formatting

I'm making a web scraper and I'm at the point where I need to parse the incoming data. Everything was going fine until I had to find all instances of a substring in a string. I was able to get something working, but it doesn't give me the full string I want (which is a full <p></p> tag).

done = 0;

while (done == 0) {
    if ((findSpan = strstr(serverResp, "<p")) != NULL) {
        printf("%s\n", findSpan);
        if ((findSpanEnd = strstr(findSpan, "</p>")) != NULL) {
            strcpy(serverResp, findSpanEnd);
            strcpy(findSpanEnd+4, "");
            printf("after end tag formatting %s\n", findSpan);
        }
    } else {
        done = 1;
    }
}

After the end tag formatting, I should get a result along the lines of <p>insert text here</p>, but instead I get something like this:

        <p>This should be printed</p>
        <h3>ignore</h3>
        <p>and so should this</p>
    </body>
</html>

after end tag formatting <p>This should be printed</p>
        <h3>ignore</h3>
        <p>and so should this</p>
    </body>
</html>

after end tag formatting dy>
</html>

The site's code looks like this:

<!DOCTYPE html>
<html>
    <head></head>
    <body>
        <h1>ignore this</h1>
        <p>This should be printed</p>
        <h3>ignore</h3>
        <p>and so should this</p>
    </body>
</html>
        if ((findSpanEnd = strstr(findSpan, "</p>")) != NULL) {
            strcpy(serverResp, findSpanEnd);

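This makes no sense. strstr finds "</p>" as requested; however, you can't pass that to strcpy like that. strstr doesn't allocate a new string at all; it only returns the location within the old one. Both findSpan and findSpanEnd therefore point into serverResp, so the strcpy copies within a single buffer (undefined behavior when source and destination overlap) and shifts the text out from under findSpan.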
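To make the "location within the old one" point concrete, here is a minimal sketch. The helper name offset_of is hypothetical (it is not part of the question or of the C library); it only reports where strstr landed relative to the start of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of needle within haystack, or -1 if absent. */
static ptrdiff_t offset_of(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    /* hit is either NULL or an address inside haystack itself, never a copy. */
    return hit ? hit - haystack : -1;
}
```

Because the result is just haystack plus an offset, writing through it (as the strcpy calls in the question do) modifies the original response buffer.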

A routine to print out the contents of all <p> tags would look like this (note that this assumes no nested <p> tags):

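    for (char *ptr = serverResp; (ptr = strstr(ptr, "<p")) != NULL;)
    {
        char *finger = strchr(ptr, '>');    /* end of the opening tag */
        if (!finger) break;
        ++finger;                           /* first character of the content */
        ptr = strstr(finger, "</p>");
        if (!ptr) {
            /* No closing tag: print the rest and stop, because ptr is NULL
               and must not be passed to strstr again. */
            fwrite(finger, 1, strlen(finger), stdout);
            fputs("\r\n", stdout);
            break;
        }
        fwrite(finger, 1, ptr - finger, stdout);
        fputs("\r\n", stdout);
    }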

The technique: the call to strstr in the for loop locates the next <p> tag, strchr finds the end of it, then another strstr finds the closing </p>. Because the returned pointers point into the original string, which is not null-terminated at the end of each tag, we use fwrite with an explicit length instead of printf to produce the output.
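Since the question asked for the full <p></p> tag rather than only its contents, the same technique can be sketched as a pair of helpers. The names next_p_element and print_p_elements are hypothetical, not from the original answer. The sketch assumes no nested <p> tags and that an opening tag is written as "<p>" or "<p " followed by attributes. It uses printf's "%.*s" precision, which prints a bounded number of bytes, as an alternative to fwrite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: finds the next <p ...>...</p> element at or after
 * `from`. On success, returns a pointer to the '<' of the opening tag and
 * stores the element's length, closing tag included, in *len.
 * Returns NULL when no complete element remains.
 */
static const char *next_p_element(const char *from, size_t *len)
{
    assert(from != NULL && len != NULL);
    for (const char *open = from; (open = strstr(open, "<p")) != NULL; open += 2) {
        /* Accept only "<p>" or "<p " so that "<pre>" and similar are skipped. */
        if (open[2] != '>' && open[2] != ' ')
            continue;
        const char *close = strstr(open, "</p>");
        if (close == NULL)
            return NULL;
        *len = (size_t)(close - open) + 4;   /* 4 == strlen("</p>") */
        return open;
    }
    return NULL;
}

/* Prints every complete <p> element in html, one per line; returns the count. */
static int print_p_elements(const char *html, FILE *out)
{
    int count = 0;
    size_t len;
    for (const char *p = html; (p = next_p_element(p, &len)) != NULL; p += len) {
        /* "%.*s" prints exactly len bytes, so the buffer is never modified. */
        fprintf(out, "%.*s\n", (int)len, p);
        ++count;
    }
    return count;
}
```

Calling print_p_elements(serverResp, stdout) on the page from the question should print the two <p> elements and nothing else. Neither helper writes into the response buffer, so the search can be repeated on the same data.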


 