Parsing HTML table rows using regular expressions

Question

I have gone through this post on why not to use regular expressions for HTML. As part of the task given to me, I have no choice but to use regular expressions for HTML.

I have HTML code and tried the pieces separately. For example, from

 <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

I have been able to get the 13 using the following regular expression:

<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>
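For illustration, the pattern above can be tried against the snippet with Python's `re` module. This is only a minimal sketch: Python is an assumption (the question does not name a language), and the variable names `html`, `count_pattern`, and `count` are made up for the example.

```python
import re

# The single-cell snippet from the question.
html = """<td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>"""

# Same pattern as above, split over two lines for readability.
count_pattern = re.compile(
    r'<td class="a-nowrap">\s*<span class="a-letter-space"></span>'
    r'<span>(\d*)</span>\s*</td>'
)

match = count_pattern.search(html)
count = match.group(1) if match else None
print(count)  # 13
```

Since `\s` also matches newlines, the line breaks and indentation around the spans need no special flag.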

Similarly, from

<td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

I got 5 star using the regular expression

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(.*)</a>\s*</td>

But when both pieces of HTML are combined, like this:

<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">

  <tr class="a-histogram-row">



        <td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

        <td class="a-span10">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>

        </td>

        <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

  </tr>
  <td class="a-nowrap">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>          

    </td>

    <td class="a-span10">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>

    </td>

    <td class="a-nowrap">

      <span class="a-letter-space"></span><span>2</span>

    </td>


</table>

How can I extract 5 star and 13 using a regular expression?

Answer 1

If you don't want to use an HTML parser, either apply one regex after another or join the two patterns with .* between them. I have modified your star regex slightly, as it didn't work properly.

First enable the dotall flag (s), then use this:

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star).*?<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>

Output:

Group 1: 5 star

Group 2: 13
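As a rough sketch of the combined approach in Python (the names `html`, `row_pattern`, and `first_row` are illustrative, and the table is a condensed copy of the one in the question with blank lines removed): the star part stops at `(\d star)` instead of requiring `</a>\s*</td>`, because in the sample a `<span>` sits between `</a>` and `</td>`. The sketch joins the two parts with a lazy `.*?`; with the dotall flag on, a greedy `.*` would run to the last matching cell in the table, so on this two-row input it would pair 5 star with 2.

```python
import re

# Condensed copy of the table from the question (blank lines removed).
html = """
<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">
  <tr class="a-histogram-row">
        <td class="a-nowrap">
          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
        </td>
        <td class="a-span10">
          <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
        </td>
        <td class="a-nowrap">
          <span class="a-letter-space"></span><span>13</span>
        </td>
  </tr>
  <td class="a-nowrap">
      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
    </td>
    <td class="a-span10">
      <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
    </td>
    <td class="a-nowrap">
      <span class="a-letter-space"></span><span>2</span>
    </td>
</table>
"""

# Star-label pattern and count pattern, joined by a lazy wildcard.
# re.DOTALL lets "." cross the newlines between the cells.
row_pattern = re.compile(
    r'<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star)'
    r'.*?'
    r'<td class="a-nowrap">\s*<span class="a-letter-space"></span>'
    r'<span>(\d*)</span>\s*</td>',
    re.DOTALL,
)

match = row_pattern.search(html)
first_row = match.groups() if match else None
print(first_row)  # ('5 star', '13')
```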

EDIT:

I have made a shorter regex:

REGEX:

>(\d star)<.+?>(\d+?)<

Used on pythonregex.com with the edited input you have provided, this gives:

OUTPUT:

>>> regex.findall(string)
[(u'5 star', u'13'), (u'1 star', u'2')]
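The same result can be reproduced locally. The following is a sketch under a few assumptions: Python 3 (so the strings print without the `u` prefix), the dotall flag passed as `re.DOTALL` so that `.+?` can cross line breaks, and illustrative names `html`, `pair_pattern`, and `pairs`. The table is a condensed copy of the edited input from the question.

```python
import re

# Condensed copy of the edited table from the question (blank lines removed).
html = """
<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">
  <tr class="a-histogram-row">
        <td class="a-nowrap">
          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
        </td>
        <td class="a-span10">
          <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
        </td>
        <td class="a-nowrap">
          <span class="a-letter-space"></span><span>13</span>
        </td>
  </tr>
  <td class="a-nowrap">
      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
    </td>
    <td class="a-span10">
      <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
    </td>
    <td class="a-nowrap">
      <span class="a-letter-space"></span><span>2</span>
    </td>
</table>
"""

# ">N star<" captures the label; the lazy ".+?" then skips ahead to the
# first ">digits<" that follows, which is the review count of that row.
pair_pattern = re.compile(r'>(\d star)<.+?>(\d+?)<', re.DOTALL)

pairs = pair_pattern.findall(html)
print(pairs)  # [('5 star', '13'), ('1 star', '2')]
```

The short pattern relies on the count being the first text node made up only of digits after each label, so it is tied closely to this particular markup.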


Answer 1: accepted, 2013-11-08 12:11:21
