How to use XPath to parse the director part from HTML with Python 3

Question

I want to extract the director's name (such as tom) from the following HTML with Python 3 and XPath. This is only a sample of my HTML; the whole page is at http://movie.walkerplus.com/list/2015/12/. Please help me solve this issue. Thanks in advance!

  <title> ufffff</title>
  <div class="hiragana">2015<br>Dec 1st</br></div>
  <div class="movies">
  <div class="movie">
  <h3><a href="/mv57512/">007</a></h3>
  <dl class="directorList">
  <dt>director</dt>
  <dd>
  <a href="/person/152394/" title="">bruce</a>
  </dd>
  </dl>
  </div>
  </div>
  <div class="movies">
  <div class="movie">
  <h3><a href="/mv57512/">wind love</a></h3>
  <dl class="directorList">
  <dt>director</dt>
   <dd>
   <a href="/person/152394/" title="">tom</a>
   </dd>
   </dl>
   <div class="movies">
   <div class="movie">
   <h3><a href="/mv57512/">river war</a></h3>
   <dl class="directorList">
   <dt>director</dt>
   <dd>
   <a href="/person/152394/" title="">July</a>
   </dd>
   </dl>
   </div>
   </div>
   <div class="mwb">
   <div class="hiraganaLocalNavi">
   <ul class="page_12">
   <li class="text">o</li>
   <li><a class="m01" href="/list/2015/01/">1月</a></li>
   <li><a class="m02" href="/list/2015/02/">2月</a></li>
   <li><a class="m03" href="/list/2015/03/">3月</a></li>
   <li><a class="m04" href="/list/2015/04/">4月</a></li>
   <li><a class="m05" href="/list/2015/05/">5月</a></li>
   <li><a class="m06" href="/list/2015/06/">6月</a></li>
   <li><a class="m07" href="/list/2015/07/">7月</a></li>
   <li><a class="m08" href="/list/2015/08/">8月</a></li>
   <li><a class="m09" href="/list/2015/09/">9月</a></li>
   <li><a class="m10" href="/list/2015/10/">10月</a></li>
   <li><a class="m11" href="/list/2015/11/">11月</a></li>
   <li><a class="m12" href="/list/2015/12/">12月</a></li>
   </ul>
    </div>
    </div>
..................

Answer 1

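Definitely use lxml for this instead. Like this:

from io import StringIO
from lxml import etree

f = StringIO(your_html_text)  # your_html_text: the page source as a str
# HTMLParser tolerates unclosed tags that the default XML parser rejects
tree = etree.parse(f, etree.HTMLParser())
# every element whose class attribute contains the token "movies"
what_you_are_looking_for = tree.xpath("//*[contains(concat(' ', @class, ' '), ' movies ')]")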

This is a robust way of getting the data you want, and it tolerates messy real-world HTML (missing tags, data moving around, etc.) much better than a regular expression does.
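To make this concrete for the director names, here is a minimal sketch run against a trimmed copy of the question's sample, including its unclosed div tags. It assumes lxml is installed and that every director link sits inside a dl with class "directorList", as in the sample; the names SAMPLE, directors and pairs are made up for illustration.

```python
from lxml import html

# Trimmed copy of the sample from the question. The first "movies" block is
# left unclosed on purpose, as in the original snippet.
SAMPLE = """
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">wind love</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">tom</a>
</dd>
</dl>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">river war</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">July</a>
</dd>
</dl>
</div>
</div>
"""

# The HTML parser closes the missing tags itself instead of raising an error.
tree = html.fromstring(SAMPLE)

# Text of every link inside a director list.
directors = [name.strip()
             for name in tree.xpath('//dl[@class="directorList"]/dd/a/text()')]

# Pair each director with the title in the nearest preceding h3 sibling.
pairs = []
for dl in tree.xpath('//dl[@class="directorList"]'):
    title = dl.xpath('string(preceding-sibling::h3[1]/a)').strip()
    name = dl.xpath('string(dd/a)').strip()
    pairs.append((title, name))

print(directors)
print(pairs)
```

Anchoring the XPath on the directorList class rather than on the position of the surrounding div elements keeps the query working when the nesting is broken, which is the case in the sample.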

You can read more about it here. Cheers!

Answer 2

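Read the link provided by alecxe; you are running into the issue it describes.

You have spaces in your raw string that do not occur in the sample HTML.
Quotes inside the pattern must be escaped (or replaced by '.') when they match the quote character that delimits the Python string.
You need to set the re.S (re.DOTALL) flag for strings that span several lines, because '.' does not match newlines by default; re.M only changes how ^ and $ behave.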
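
The point about '.' and newlines can be checked with a small sketch that uses only the standard re module. The snippet and pattern below are illustrative assumptions, not the asker's original regex. Note that re.M only affects ^ and $, while re.S (re.DOTALL) is the flag that lets '.' cross line breaks.

```python
import re

# A director entry spread over several lines, as in the sample HTML.
snippet = '<dd>\n<a href="/person/152394/" title="">tom</a>\n</dd>'

# Raw string in single quotes, so the double quotes need no escaping.
pattern = r'<dd>.*?<a href="[^"]*" title="">(.*?)</a>'

no_flag = re.search(pattern, snippet)          # '.' stops at the newline
multiline = re.search(pattern, snippet, re.M)  # re.M does not change '.'
dotall = re.search(pattern, snippet, re.S)     # '.' now matches newlines

print(no_flag, multiline, dotall.group(1))
```

Even with the right flag, a pattern like this breaks as soon as attribute order or whitespace changes, which is why a parser is the safer choice.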

Regex and HTML are a match destined for madness.

solution1
2 ACCEPTED 2016-06-03 09:25:45

solution2
1 2016-06-03 05:16:17
