Parsing HTML with lxml

Question

I have the following HTML code:

<a name="Audio-Encoders"></a>
<h1 class="chapter"><a href="ffmpeg.html#toc-Audio-Encoders">14. Audio Encoders</a></h1>

<p>A description of some of the currently available audio encoders
follows.
</p>
<a name="ac3-and-ac3_005ffixed"></a>
<h2 class="section"><a href="ffmpeg.html#toc-ac3-and-ac3_005ffixed">14.1 ac3 and     ac3_fixed</a></h2>

<p>AC-3 audio encoders.
</p>
<p>These encoders implement part of ATSC A/52:2010 and ETSI TS 102 366, as well as
the undocumented RealAudio 3 (a.k.a. dnet).
</p>
<p>The <var>ac3</var> encoder uses floating-point math, while the <var>ac3_fixed</var>
encoder only uses fixed-point integer math. This does not mean that one is
always faster, just that one or the other may be better suited to a
particular system. The floating-point encoder will generally produce better
quality audio for a given bitrate. The <var>ac3_fixed</var> encoder is not the
default codec for any of the output formats, so it must be specified explicitly
using the option <code>-acodec ac3_fixed</code> in order to use it.
</p>
<a name="AC_002d3-Metadata"></a>
<h3 class="subsection"><a href="ffmpeg.html#toc-AC_002d3-Metadata">14.1.1 AC-3     Metadata</a></h3>

<p>The AC-3 metadata options are used to set parameters that describe the audio,
but in most cases do not affect the audio encoding itself. Some of the options
do directly affect or influence the decoding and playback of the resulting
bitstream, while others are just for informational purposes. A few of the
options will add bits to the output stream that could otherwise be used for
audio data, and will thus affect the quality of the output. Those will be
indicated accordingly with a note in the option list below.
</p>
<p>These parameters are described in detail in several publicly-available
documents.
</p><ul>

How can the text that follows each <hX class="foobar"> heading be extracted?

For example, for <h1 class="chapter">, the content after the h1 tag is "A description of some of the currently available audio encoders follows."

Answer 1

Use:

/*/h1/following-sibling::p
         [name(preceding-sibling::*[starts-with(name(),'h')][1])
         = 'h1'
         ]//text()

Evaluate this XPath expression against the following XML document (obtained from the provided non-well-formed fragment by deleting the trailing unclosed ul and wrapping the result in a single top element):

<html>
    <a name="Audio-Encoders"></a>
    <h1 class="chapter">
        <a href="ffmpeg.html#toc-Audio-Encoders">14. Audio Encoders</a>
    </h1>
    <p>A description of some of the currently available audio encoders follows. </p>
    <a name="ac3-and-ac3_005ffixed"></a>
    <h2 class="section">
        <a href="ffmpeg.html#toc-ac3-and-ac3_005ffixed">14.1 ac3 and     ac3_fixed</a>
    </h2>
    <p>AC-3 audio encoders. </p>
    <p>These encoders implement part of ATSC A/52:2010 and ETSI TS 102 366, as well as the undocumented RealAudio 3 (a.k.a. dnet). </p>
    <p>The 
        <var>ac3</var> encoder uses floating-point math, while the 
        <var>ac3_fixed</var> encoder only uses fixed-point integer math. This does not mean that one is always faster, just that one or the other may be better suited to a particular system. The floating-point encoder will generally produce better quality audio for a given bitrate. The 
        <var>ac3_fixed</var> encoder is not the default codec for any of the output formats, so it must be specified explicitly using the option 
        <code>-acodec ac3_fixed</code> in order to use it. 
    </p>
    <a name="AC_002d3-Metadata"></a>
    <h3 class="subsection">
        <a href="ffmpeg.html#toc-AC_002d3-Metadata">14.1.1 AC-3     Metadata</a>
    </h3>
    <p>The AC-3 metadata options are used to set parameters that describe the audio, but in most cases do not affect the audio encoding itself. Some of the options do directly affect or influence the decoding and playback of the resulting bitstream, while others are just for informational purposes. A few of the options will add bits to the output stream that could otherwise be used for audio data, and will thus affect the quality of the output. Those will be indicated accordingly with a note in the option list below. </p>
    <p>These parameters are described in detail in several publicly-available documents. </p>
</html>

It selects the wanted text nodes (just one in this case):

A description of some of the currently available audio encoders follows.
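
Since the question is about lxml, here is a minimal sketch of how the expression above could be evaluated from Python. It assumes a shortened copy of the wrapped document (only the first two headings are kept), and the variable names are illustrative rather than part of the original answer.

```python
from lxml import etree

# Shortened copy of the wrapped, well-formed document from the answer.
doc = etree.fromstring("""
<html>
    <a name="Audio-Encoders"></a>
    <h1 class="chapter"><a href="ffmpeg.html#toc-Audio-Encoders">14. Audio Encoders</a></h1>
    <p>A description of some of the currently available audio encoders follows. </p>
    <a name="ac3-and-ac3_005ffixed"></a>
    <h2 class="section"><a href="ffmpeg.html#toc-ac3-and-ac3_005ffixed">14.1 ac3 and ac3_fixed</a></h2>
    <p>AC-3 audio encoders. </p>
</html>
""")

# The XPath expression from the answer, unchanged.
xpath = """/*/h1/following-sibling::p
         [name(preceding-sibling::*[starts-with(name(),'h')][1])
         = 'h1'
         ]//text()"""

# xpath() returns the matching text nodes as strings; strip the padding.
texts = [t.strip() for t in doc.xpath(xpath)]
print(texts)
```

The predicate keeps only those p elements whose nearest preceding sibling with a name starting with 'h' is the h1, so the paragraph under the h2 is excluded.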

Similarly, this XPath expression:

/*/h2/following-sibling::p
         [name(preceding-sibling::*[starts-with(name(),'h')][1])
         = 'h2'
         ]//text()

selects these text nodes:

AC-3 audio encoders. These encoders implement part of ATSC A/52:2010 and ETSI TS 102 366, as well as the undocumented RealAudio 3 (a.k.a. dnet). The 
        ac3 encoder uses floating-point math, while the 
        ac3_fixed encoder only uses fixed-point integer math. This does not mean that one is always faster, just that one or the other may be better suited to a particular system. The floating-point encoder will generally produce better quality audio for a given bitrate. The 
        ac3_fixed encoder is not the default codec for any of the output formats, so it must be specified explicitly using the option 
        -acodec ac3_fixed in order to use it.
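
The odd line breaks in this output come from the fact that `//text()` returns every text node separately, including the ones inside and around nested `<var>` elements. The following sketch, which assumes a shortened version of the document and an illustrative whitespace-normalizing step that is not part of the original answer, shows one way the nodes might be joined into a single string with lxml.

```python
from lxml import etree

# Shortened copy of the h2 section; the long paragraph is trimmed for brevity.
doc = etree.fromstring("""
<html>
    <h2 class="section"><a href="ffmpeg.html#toc-ac3-and-ac3_005ffixed">14.1 ac3 and ac3_fixed</a></h2>
    <p>AC-3 audio encoders. </p>
    <p>The
        <var>ac3</var> encoder uses floating-point math, while the
        <var>ac3_fixed</var> encoder only uses fixed-point integer math. </p>
    <h3 class="subsection"><a href="ffmpeg.html#toc-AC_002d3-Metadata">14.1.1 AC-3 Metadata</a></h3>
    <p>These parameters are described in detail in several publicly-available documents. </p>
</html>
""")

xpath = """/*/h2/following-sibling::p
         [name(preceding-sibling::*[starts-with(name(),'h')][1])
         = 'h2'
         ]//text()"""

# Each piece of text around a nested element is its own text node.
nodes = doc.xpath(xpath)

# Concatenate the nodes, then collapse runs of whitespace into single spaces.
joined = " ".join("".join(nodes).split())
print(len(nodes))
print(joined)
```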

And finally, this XPath expression:

/*/h3/following-sibling::p
         [name(preceding-sibling::*[starts-with(name(),'h')][1])
         = 'h3'
         ]//text()

selects these text nodes:

The AC-3 metadata options are used to set parameters that describe the audio, but in most cases do not affect the audio encoding itself. Some of the options do directly affect or influence the decoding and playback of the resulting bitstream, while others are just for informational purposes. A few of the options will add bits to the output stream that could otherwise be used for audio data, and will thus affect the quality of the output. Those will be indicated accordingly with a note in the option list below. These parameters are described in detail in several publicly-available documents.
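
The three expressions differ only in the heading name, so they can be generated from one template. The sketch below is one possible way to do that with `lxml.html`, whose parser tolerates the original non-well-formed fragment (including the unclosed ul). Two things are assumptions rather than part of the answer: the helper name `paragraphs_after` is hypothetical, and the path starts with `//` instead of `/*/` because `document_fromstring` places the fragment inside html/body.

```python
import lxml.html

# Shortened copy of the fragment from the question, unclosed <ul> included.
FRAGMENT = """
<a name="Audio-Encoders"></a>
<h1 class="chapter"><a href="ffmpeg.html#toc-Audio-Encoders">14. Audio Encoders</a></h1>

<p>A description of some of the currently available audio encoders
follows.
</p>
<a name="ac3-and-ac3_005ffixed"></a>
<h2 class="section"><a href="ffmpeg.html#toc-ac3-and-ac3_005ffixed">14.1 ac3 and     ac3_fixed</a></h2>

<p>AC-3 audio encoders.
</p>
<p>These encoders implement part of ATSC A/52:2010 and ETSI TS 102 366, as well as
the undocumented RealAudio 3 (a.k.a. dnet).
</p>
<a name="AC_002d3-Metadata"></a>
<h3 class="subsection"><a href="ffmpeg.html#toc-AC_002d3-Metadata">14.1.1 AC-3     Metadata</a></h3>

<p>These parameters are described in detail in several publicly-available
documents.
</p><ul>
"""


def paragraphs_after(root, tag):
    """Return the text of each <p> whose nearest preceding heading sibling is `tag`.

    `tag` is interpolated into the XPath, so pass only trusted literals
    such as "h1", "h2" or "h3".
    """
    xpath = (
        f"//{tag}/following-sibling::p"
        f"[name(preceding-sibling::*[starts-with(name(),'h')][1]) = '{tag}']"
    )
    # text_content() gathers the text of nested elements such as <var>;
    # split/join collapses the line breaks inside each paragraph.
    return [" ".join(p.text_content().split()) for p in root.xpath(xpath)]


root = lxml.html.document_fromstring(FRAGMENT)
for tag in ("h1", "h2", "h3"):
    print(tag, paragraphs_after(root, tag))
```

Note that `starts-with(name(),'h')` also matches non-heading elements such as hr, so a document that contains those between paragraphs may need a stricter test.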

Answer 1
0 2012-09-23 14:58:55
