
How to use XPath to parse the director's name from HTML with Python 3

I want to extract each director's name (such as tom) from the following HTML using XPath in Python 3. This is only a sample; the full page is at http://movie.walkerplus.com/list/2015/12/. Any help solving this would be appreciated. Thanks in advance!

  <title> ufffff</title>
  <div class="hiragana">2015<br>Dec 1st</br></div>
  <div class="movies">
  <div class="movie">
  <h3><a href="/mv57512/">007</a></h3>
  <dl class="directorList">
  <dt>director</dt>
  <dd>
  <a href="/person/152394/" title="">bruce</a>
  </dd>
  </dl>
  </div>
  </div>
  <div class="movies">
  <div class="movie">
  <h3><a href="/mv57512/">wind love</a></h3>
  <dl class="directorList">
  <dt>director</dt>
   <dd>
   <a href="/person/152394/" title="">tom</a>
   </dd>
   </dl>
   <div class="movies">
   <div class="movie">
   <h3><a href="/mv57512/">river war</a></h3>
   <dl class="directorList">
   <dt>director</dt>
   <dd>
   <a href="/person/152394/" title="">July</a>
   </dd>
   </dl>
   </div>
   </div>
   <div class="mwb">
   <div class="hiraganaLocalNavi">
   <ul class="page_12">
   <li class="text">o</li>
   <li><a class="m01" href="/list/2015/01/">1月</a></li>
   <li><a class="m02" href="/list/2015/02/">2月</a></li>
   <li><a class="m03" href="/list/2015/03/">3月</a></li>
   <li><a class="m04" href="/list/2015/04/">4月</a></li>
   <li><a class="m05" href="/list/2015/05/">5月</a></li>
   <li><a class="m06" href="/list/2015/06/">6月</a></li>
   <li><a class="m07" href="/list/2015/07/">7月</a></li>
   <li><a class="m08" href="/list/2015/08/">8月</a></li>
   <li><a class="m09" href="/list/2015/09/">9月</a></li>
   <li><a class="m10" href="/list/2015/10/">10月</a></li>
   <li><a class="m11" href="/list/2015/11/">11月</a></li>
   <li><a class="m12" href="/list/2015/12/">12月</a></li>
   </ul>
    </div>
    </div>
..................

Definitely use lxml for this instead, like this:

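from io import StringIO
from lxml import etree

# HTMLParser tolerates unclosed tags; the default XML parser does not
tree = etree.parse(StringIO(your_html_text), etree.HTMLParser())
# Match elements whose class attribute contains the token "movies"
what_you_are_looking_for = tree.xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' movies ')]")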

This is a robust way of getting the data you want, and it tolerates messy real-world HTML (missing tags, data moving around, etc.) much better than a regular expression.
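
To get from the parsed tree to the director names themselves, something like the following may work. It is only a sketch: it assumes lxml is installed, that each film sits in a `div` with class `movie`, and that the director links live under `dl.directorList`, as in the sample from the question. `SAMPLE` and `extract_directors` are made-up names for illustration, and the real page may be structured differently.

```python
from lxml import html

# Trimmed copy of the question's sample markup (hypothetical test data)
SAMPLE = """
<div class="movies">
  <div class="movie">
    <h3><a href="/mv57512/">007</a></h3>
    <dl class="directorList">
      <dt>director</dt>
      <dd><a href="/person/152394/" title="">bruce</a></dd>
    </dl>
  </div>
</div>
<div class="movies">
  <div class="movie">
    <h3><a href="/mv57512/">wind love</a></h3>
    <dl class="directorList">
      <dt>director</dt>
      <dd><a href="/person/152394/" title="">tom</a></dd>
    </dl>
  </div>
</div>
"""


def extract_directors(markup):
    """Return a list of (title, [director names]) pairs, one per div.movie."""
    tree = html.document_fromstring(markup)
    results = []
    for movie in tree.xpath('//div[@class="movie"]'):
        # string(...) collapses the matched <a> element to its text content
        title = movie.xpath('string(./h3/a)').strip()
        # Relative path: only the director links inside this movie block
        names = movie.xpath('.//dl[@class="directorList"]/dd/a/text()')
        results.append((title, [name.strip() for name in names]))
    return results


print(extract_directors(SAMPLE))
```

Looping over each `div.movie` and using relative paths keeps every director tied to the right title, which a single flat `//dd/a/text()` query would not do.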

You can read more about it here. Cheers!

Read the link provided by alecxe. You are running into exactly that issue.

  1. Your raw string contains spaces that do not occur in the sample HTML.
  2. Quotes inside the pattern must be escaped in the Python string literal, or replaced by '.' in the pattern.
  3. You need to set the re.S (re.DOTALL) flag for multiline strings, because '.' does not match newlines by default. re.M only changes how '^' and '$' behave.
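
The effect of the flags can be checked with a small experiment. This is a sketch only: `snippet` and `pattern` are invented for illustration and are not the asker's original regular expression.

```python
import re

# One <dd> block from the sample, with the newlines it has in the page source
snippet = '<dd>\n<a href="/person/152394/" title="">tom</a>\n</dd>'

# Single-quoted raw string, so the double quotes in the HTML need no escaping
pattern = r'<dd>.*?<a href="[^"]*" title="">(.*?)</a>'

no_flag = re.search(pattern, snippet)        # '.' cannot cross the newline
with_m = re.search(pattern, snippet, re.M)   # re.M only affects ^ and $
with_s = re.search(pattern, snippet, re.S)   # re.S lets '.' match newlines

print(no_flag, with_m, with_s.group(1))
```

Only the `re.S` search finds the name. Even so, the pattern is tied to the exact attribute order and quoting of this one snippet, which is why a parser is the safer choice.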

Regex and HTML are a match destined for madness.


 