
Parsing big string (HTML code)

I'm looking to parse some information in my application. Let's say that somewhere in a string we have:

<tr class="tablelist_bg1">

<td>Beja</td>

<td class="text_center">---</td>

<td class="text_center">19.1</td>

<td class="text_center">10.8</td>

<td class="text_center">NW</td>

<td class="text_center">50.9</td>

<td class="text_center">0</td>

<td class="text_center">1016.6</td>

<td class="text_center">---</td>

<td class="text_center">---</td>

</tr>

Everything above or below this doesn't matter. Remember this is all inside a string. I want to get the values inside the td tags: ---, 19.1, 10.8, etc. It's worth knowing that there are many entries like this on the page. It's probably also a good idea to link the page here.

As you probably guessed, I have absolutely no idea how to do this... none of the functions I know I can perform on the string (split etc.) help.

Thanks in advance

Just use String.IndexOf(string, int) to find a "<td", again to find the next ">", and again to find "</td>". Then use String.Substring to pull out the value between them. Put this in a loop.

    public static List<string> ParseTds(string input)
    {
        List<string> results = new List<string>();

        int index = 0;

        while (true)
        {
            string next = ParseTd(input, ref index);

            if (next == null)
                return results;

            results.Add(next);
        }
    }

    private static string ParseTd(string input, ref int index)
    {
        int tdIndex = input.IndexOf("<td", index);
        if (tdIndex == -1)
            return null;
        int gtIndex = input.IndexOf(">", tdIndex);
        if (gtIndex == -1)
            return null;
        int endIndex = input.IndexOf("</td>", gtIndex);
        if (endIndex == -1)
            return null;

        index = endIndex;

        return input.Substring(gtIndex + 1, endIndex - gtIndex - 1);
    }
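Since the question mentions many rows like this on the page, one possible extension (not part of the answer above) is to apply the same IndexOf/Substring technique once per `<tr>` block, so the cells stay grouped by row. This is only a sketch: the names `TdRowSketch`, `ParseRows` and `ParseCells` are made up for illustration, and it assumes rows are not nested and that no other tag name starts with `<tr` or `<td`.

```csharp
using System;
using System.Collections.Generic;

public static class TdRowSketch
{
    // Returns one list of cell values per <tr>...</tr> block found in the input.
    public static List<List<string>> ParseRows(string input)
    {
        var rows = new List<List<string>>();
        int index = 0;

        while (true)
        {
            int rowStart = input.IndexOf("<tr", index, StringComparison.Ordinal);
            if (rowStart == -1)
                return rows;

            int rowEnd = input.IndexOf("</tr>", rowStart, StringComparison.Ordinal);
            if (rowEnd == -1)
                return rows;

            // Parse the cells of this row only, then continue after its closing tag.
            rows.Add(ParseCells(input.Substring(rowStart, rowEnd - rowStart)));
            index = rowEnd + "</tr>".Length;
        }
    }

    // Same technique as the answer: find "<td", the next ">", then "</td>".
    private static List<string> ParseCells(string row)
    {
        var cells = new List<string>();
        int index = 0;

        while (true)
        {
            int tdIndex = row.IndexOf("<td", index, StringComparison.Ordinal);
            if (tdIndex == -1)
                return cells;

            int gtIndex = row.IndexOf(">", tdIndex, StringComparison.Ordinal);
            if (gtIndex == -1)
                return cells;

            int endIndex = row.IndexOf("</td>", gtIndex, StringComparison.Ordinal);
            if (endIndex == -1)
                return cells;

            cells.Add(row.Substring(gtIndex + 1, endIndex - gtIndex - 1));
            index = endIndex + "</td>".Length;
        }
    }
}
```

With the row from the question, the first cell of each returned list would be the place name (`Beja`) and the remaining cells the readings, so each entry can be looked up by position.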

Assuming your string is valid XHTML, you can use an XML parser to get the content you want. There's a simple example here that shows how to use XmlTextReader to parse XML content. The example reads from a file, but you can change it to read from a string:

new XmlTextReader(new StringReader(someString));

Specifically, you want to keep track of the td element nodes; the text node that follows each of them will contain the value you want.
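The idea of tracking td elements and reading the text node that follows can be sketched roughly as below. Note that this swaps in `XmlReader.Create` with `ConformanceLevel.Fragment` instead of the `XmlTextReader` constructor shown above, so that a string containing several top-level rows is accepted. The class and method names are hypothetical, and the input is assumed to be well-formed XML with no HTML-only entities such as `&nbsp;`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TdXmlReaderSketch
{
    // Collects the text of every non-empty <td> element.
    // Assumption: the input is a well-formed XML fragment (valid XHTML).
    public static List<string> ReadTdValues(string xhtml)
    {
        var values = new List<string>();
        var settings = new XmlReaderSettings
        {
            // Allow more than one top-level element in the string.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            bool insideTd = false;

            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Remember whether the element just opened is a td.
                        insideTd = reader.Name == "td" && !reader.IsEmptyElement;
                        break;

                    case XmlNodeType.Text:
                        // The text node directly inside a td holds the value.
                        if (insideTd)
                            values.Add(reader.Value);
                        break;

                    case XmlNodeType.EndElement:
                        insideTd = false;
                        break;
                }
            }
        }

        return values;
    }
}
```

A td with no text at all (`<td></td>`) produces no entry in this sketch, and malformed markup makes the reader throw an `XmlException`, which is the main trade-off compared with plain string searching.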

  • Use a loop to load each non-empty line from the file into a String
  • Process the string character by character
    • Check for the characters indicating the beginning of a td tag
    • Use a substring function, or just build a new string character by character, to get all the content until the </td> tag begins.
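The steps above could look something like the following sketch. It reads lines from the string with a `StringReader` rather than from a file, since the question says the HTML is already in a string. The names `TdCharScanSketch`, `ScanTdValues`, `ScanLine` and `StartsWithAt` are invented for illustration, and each td is assumed to open and close on the same line, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TdCharScanSketch
{
    public static List<string> ScanTdValues(string html)
    {
        var values = new List<string>();

        using (var reader = new StringReader(html))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Skip empty (or whitespace-only) lines.
                if (line.Trim().Length == 0)
                    continue;

                ScanLine(line, values);
            }
        }

        return values;
    }

    private static void ScanLine(string line, List<string> values)
    {
        int i = 0;

        while (i < line.Length)
        {
            // Look for the characters that begin a td tag.
            if (!StartsWithAt(line, i, "<td"))
            {
                i++;
                continue;
            }

            // Skip the rest of the opening tag, including its attributes.
            while (i < line.Length && line[i] != '>')
                i++;
            i++; // step past '>'

            // Build the cell content one character at a time.
            var content = new StringBuilder();
            while (i < line.Length && !StartsWithAt(line, i, "</td>"))
            {
                content.Append(line[i]);
                i++;
            }

            // Only keep the value if the closing tag was found on this line.
            if (i < line.Length)
            {
                values.Add(content.ToString());
                i += "</td>".Length;
            }
        }
    }

    // True if token occurs in text starting exactly at position.
    private static bool StartsWithAt(string text, int position, string token)
    {
        return string.CompareOrdinal(text, position, token, 0, token.Length) == 0;
    }
}
```

Compared with the IndexOf approach this is more code for the same result, but it makes each step of the scan explicit and is easy to extend, for example to trim whitespace around each value.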
