Parse HTML code or use a regular expression in Java?

Question

I'm trying to extract the values from this piece of HTML code:

<ul id="tree-dotlrn_class_instance">
<li>
      <a href="/dotlrn/classes/c033/13000/c12c033a13000gA/">**2011-12 Ampl.Arquit.Computadors Gr.A  (13000)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/">**2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13036/c12c033a13036gA/c12c033a13036gAsT00/">**2011-12 Eng.Serv.Telemàtics Gr.A  Sgr.T00 (13036)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13038/c12c033a13038gA/">**2011-12 Intel·lig.Artif.Enginyer.Coneixem. Gr.A  (13038)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/">**2011-12 Processad.Llenguatge Gr.A  (13048)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsL01/">**2011-12 Processad.Llenguatge Gr.A  Sgr.L01 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsT00/">**2011-12 Processad.Llenguatge Gr.A  Sgr.T00 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13052/c12c033a13052gA/c12c033a13052gAsL02/">**2011-12 Sist.Basats Microprocessadors Gr.A  Sgr.L02 (13052)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13055/c12c033a13055gAA/">**2011-12 Sist.Informàtics Gr.AA (13055)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/14009/c12c033a14009gA/">**2011-12 Administrac. Gestió de Xarxes Gr.A  (14009)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/15656/c12c033a15656gA/">**2011-12 Transmissió de Dades Gr.A**  (15656)</a>        
</li>
</ul>

I want to put everything shown in bold (between the ** markers), together with its href value, into a HashMap. First I tried the Jericho HTML parser, but I found it too complicated. Then I tried a regex, but I don't know exactly how to write it. Can you help me?

Thanks!

Update: I'm trying the following, but it's not the right approach.

Source s = new Source(answer);
List<Element> Form1 = s.getAllElements(HTMLElementName.UL);
int tam1 = Form1.size();
for (int j = 0; j < tam1; j++) {
    Element e1 = Form1.get(j);
    if ("tree-dotlrn_class_instance".equals(e1.getAttributeValue("id"))) {
        List<Element> L1 = e1.getAllElements(HTMLElementName.UL);
        for (int k = 0; k < L1.size(); k++) {
            Element e2 = L1.get(k);
            System.out.println("Elemento de la lista L1: " + e2.getContent());
            List<Element> L2 = e2.getAllElements(HTMLElementName.LI);
            for (int m = 0; m < L2.size(); m++) {
                Element e3 = L2.get(m);
                System.out.println("Elemento de la lista L2: " + e3.getContent());
                asignaturas.add(e3.getContent().toString());
                System.out.println("Lista de asignaturas " + m + " " + asignaturas.get(0));
            }
        }
    }
}

Answer 1

Take a look at JSoup's selector syntax.

If you are looking for all a elements with an href attribute, you can find them like this:

String theHtmlInYourExample = "...";
Document doc = Jsoup.parse(theHtmlInYourExample);
Elements links = doc.select("a[href]");

From there, you should be able to extract the text of each element and the value of its href attribute to build your HashMap.
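To make that last step concrete, here is a minimal sketch of how the map could be built. It assumes jsoup is on the classpath; the class name JsoupLinkMap, the method name extractLinks, the choice of href as the key, and the abbreviated sample markup are illustrative assumptions, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupLinkMap {

    // Maps each link's href to its visible text, in document order.
    static Map<String, String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>();
        // Restrict the search to the list from the question via its id.
        for (Element a : doc.select("ul#tree-dotlrn_class_instance a[href]")) {
            // text() returns the element's text with whitespace normalized.
            links.put(a.attr("href"), a.text());
        }
        return links;
    }

    public static void main(String[] args) {
        // Abbreviated sample of the question's markup (an assumption for the demo).
        String html = "<ul id=\"tree-dotlrn_class_instance\">"
                + "<li><a href=\"/dotlrn/classes/c033/13000/c12c033a13000gA/\">"
                + "2011-12 Ampl.Arquit.Computadors Gr.A (13000)</a></li>"
                + "<li><a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/\">"
                + "2011-12 Processad.Llenguatge Gr.A (13048)</a></li>"
                + "</ul>";
        extractLinks(html).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```

The href is used as the key because it is unique per link, while the same course title could in principle appear more than once. Swap key and value if you need to look up by title.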

Answer 2

Regex:

\<a\s+href\s*\=\s*["']/dotlrn/classes/c033.+\>(.*)\(\d+\)\</a\>

Java String:

"\\<a\\s+href\\s*\\=\\s*[\"']/dotlrn/classes/c033.+\\>(.*)\\(\\d+\\)\\</a\\>"

You probably won't find it reliable, but the first matching group will be your desired string if the pages match what you supplied.

Here is a place to test Java regular expressions
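If you also need the href, a variant of the pattern with two capture groups can fill the map directly. The sketch below is not the pattern from this answer: it swaps the greedy .+ and .* for negated character classes so that each match stays inside a single tag, and it drops the requirement that the text ends in a number in parentheses. The class and method names are made up for illustration, and it only holds as long as the links contain plain text and no nested tags.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkMap {

    // Group 1: the href value; group 2: the link text up to the closing tag.
    // [^"']* and [^<]* cannot run past the end of the attribute or the tag.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href\\s*=\\s*[\"'](/dotlrn/classes/[^\"']*)[\"'][^>]*>([^<]*)</a>");

    // Maps each matching href to its trimmed link text, in document order.
    static Map<String, String> extractLinks(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.put(m.group(1), m.group(2).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        // Abbreviated sample of the question's markup (an assumption for the demo).
        String html = "<li><a href=\"/dotlrn/classes/c033/13000/c12c033a13000gA/\">"
                + "2011-12 Ampl.Arquit.Computadors Gr.A (13000)</a></li>\n"
                + "<li><a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/\">"
                + "2011-12 Processad.Llenguatge Gr.A (13048) </a></li>";
        extractLinks(html).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```

As with the original pattern, this breaks as soon as the markup changes shape (extra attributes before href, nested tags inside the link), which is the usual argument for a real parser.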

Answer 3

Why not use the DOM API? You can get attributes and values fairly trivially with it.
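As a rough illustration of the DOM route, the sketch below uses the JDK's built-in javax.xml.parsers API. It assumes the fragment is well-formed XML, which the snippet in the question appears to be but real-world HTML often is not. The class name DomLinkMap and the method name extractLinks are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomLinkMap {

    // Parses a well-formed fragment and maps each a element's href to its text.
    static Map<String, String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> links = new LinkedHashMap<>();
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (a.hasAttribute("href")) {
                    links.put(a.getAttribute("href"), a.getTextContent().trim());
                }
            }
            return links;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed markup ends up here; an HTML-aware parser would tolerate it.
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated sample of the question's markup (an assumption for the demo).
        String xhtml = "<ul id=\"tree-dotlrn_class_instance\">"
                + "<li><a href=\"/dotlrn/classes/c033/13000/c12c033a13000gA/\">"
                + "2011-12 Ampl.Arquit.Computadors Gr.A (13000)</a></li>"
                + "</ul>";
        extractLinks(xhtml).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```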

Answer 4

Assuming the input HTML is well-formed, you could certainly try XML pull parsing or DOM.
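For the pull-parsing option, one possibility is StAX (javax.xml.stream), the pull parser that ships with the JDK. The answer does not say which pull parser it has in mind, so treat that choice as an assumption. The sketch below also assumes the input is well-formed and that each link contains only text; the names StaxLinkMap and extractLinks are made up.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxLinkMap {

    // Streams through the markup and maps each a element's href to its text.
    static Map<String, String> extractLinks(String xhtml) {
        Map<String, String> links = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xhtml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "a".equals(reader.getLocalName())) {
                    String href = reader.getAttributeValue(null, "href");
                    // getElementText() reads up to the matching end tag and
                    // fails if the element contains child elements.
                    String text = reader.getElementText();
                    if (href != null) {
                        links.put(href, text.trim());
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Abbreviated sample of the question's markup (an assumption for the demo).
        String xhtml = "<ul id=\"tree-dotlrn_class_instance\">"
                + "<li><a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/\">"
                + "2011-12 Processad.Llenguatge Gr.A (13048)</a></li>"
                + "</ul>";
        extractLinks(xhtml).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```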

4 answers

Answer 1
5 2013-01-08 16:39:55

Answer 2
0 2013-01-08 16:52:57

Answer 3
0 2013-01-08 16:54:58

Answer 4
0 2013-01-08 16:57:51
