I want to extract the director's name (such as tom) from the following HTML with Python 3 and XPath. This is only a sample of my HTML; the full page is at http://movie.walkerplus.com/list/2015/12/. Any help solving this would be appreciated. Thanks in advance!
<title> ufffff</title>
<div class="hiragana">2015<br>Dec 1st</br></div>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">007</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">bruce</a>
</dd>
</dl>
</div>
</div>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">wind love</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">tom</a>
</dd>
</dl>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">river war</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">July</a>
</dd>
</dl>
</div>
</div>
<div class="mwb">
<div class="hiraganaLocalNavi">
<ul class="page_12">
<li class="text">o</li>
<li><a class="m01" href="/list/2015/01/">1月</a></li>
<li><a class="m02" href="/list/2015/02/">2月</a></li>
<li><a class="m03" href="/list/2015/03/">3月</a></li>
<li><a class="m04" href="/list/2015/04/">4月</a></li>
<li><a class="m05" href="/list/2015/05/">5月</a></li>
<li><a class="m06" href="/list/2015/06/">6月</a></li>
<li><a class="m07" href="/list/2015/07/">7月</a></li>
<li><a class="m08" href="/list/2015/08/">8月</a></li>
<li><a class="m09" href="/list/2015/09/">9月</a></li>
<li><a class="m10" href="/list/2015/10/">10月</a></li>
<li><a class="m11" href="/list/2015/11/">11月</a></li>
<li><a class="m12" href="/list/2015/12/">12月</a></li>
</ul>
</div>
</div>
..................
Definitely use lxml for this instead, like this:
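from io import StringIO
from lxml import etree
# your_html_text is the page source as a str
f = StringIO(your_html_text)
# HTMLParser recovers from unclosed tags, which the default XML parser rejects
tree = etree.parse(f, etree.HTMLParser())
# double-quote the expression so the single quotes inside the XPath are legal Python
what_you_are_looking_for = tree.xpath("//*[contains(concat(' ', @class, ' '), ' movies ')]")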
This is a very robust way of getting the data you want, and it tolerates messy real-life HTML (missing tags, data moving around, etc.) much better than a regular expression.
You can read more about it here. Cheers!
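To get the director names themselves rather than the whole `movies` blocks, the XPath can target the link inside each `directorList`. The following is only a minimal sketch: `SAMPLE` is a shortened copy of the HTML from the question (including its unclosed tags), `director_names` is a hypothetical helper name, and it assumes the real page uses the same `dl class="directorList"` structure as the sample.

```python
from lxml import html

# Shortened copy of the question's sample; the last block is left
# unclosed on purpose, as in the original snippet.
SAMPLE = """
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">007</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">bruce</a>
</dd>
</dl>
</div>
</div>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">wind love</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">tom</a>
</dd>
</dl>
"""


def director_names(page_source):
    """Return the text of every link inside a directorList block."""
    # lxml's HTML parser repairs missing closing tags while parsing
    tree = html.fromstring(page_source)
    links = tree.xpath('//dl[@class="directorList"]/dd/a/text()')
    return [name.strip() for name in links]


names = director_names(SAMPLE)
print(names)
```

For the live page, the same helper could be given the downloaded page source instead of `SAMPLE`; whether the site's markup still matches this sample is not something the snippet can guarantee.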
Read the link provided by alecxe. That is exactly the issue you are running into.
Regex and HTML are a match destined for madness.