Splitting scraped data with PyQuery

Question

I have the following situation:

<div class="entry">
<p>one</p>
<p>two<br />three<br />four</p>
<p>five<br />six</p>
</div>

I want to produce ['one','two','three','four','five','six'].

So far I have:

from pyquery import PyQuery as pq
s = pq(html)
list = [i.text() for i in s('div.entry').find('p').items()]

This only splits on the <p> tags and completely ignores the <br /> tags. I also tried the following:

list = [i.text() for i in s(table).find('p').find('br').items()]
list = [i.text() for i in s(table).find('p').find('br').prevAll().items()]
list = [i.split('\n') for i in s(table).find('br').replaceWith('\n')]

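None of these work. Also, the PyQuery API lists .replaceWith() as a valid function, but when I run test = s(table).find('br').replaceWith('anytext'), nothing gets replaced. I get no error, just the same list of items with the <br /> tags still between them. Does .replaceWith() differentiate between <br> and <br />?

A more complex example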

<div class="entry">
<p>122 E. Washington St.<br />
734-665-8767</p>
<p>Amadeus is offering both pricing options.</p>
<p><strong>Lunch 2 for $15 </strong><br />
Choice of:<br />
<strong>Soup<br />
Green salad </strong></p>
<p>Choice of lunch dish:<br />
<strong>1 Golabek<br />
3 Piergies<br />
3 Placeki<br />
Kielbsa<br />
Kapusta salad<br />
Warsaw salad<br />
Artichoke salad<br />
Potato salad</strong></p>
<p><strong>Lunch $15</strong><br />
Three Course Meal<br />
Choice of lunch entrée with green salad and dessert</p>
<p><strong>Dinner 2 for $28</strong></p>
<p>Choice of:<br />
<strong>Cup of soup<br />
Green salad </strong></p>
<p>Choice of entrée:<br />
<strong>2 Potato Snitzel<br />
4 Potato Placeki<br />
6 Piergis<br />
2 Golabki<br />
Bigos<br />
Grilled Kielbsa<br />
Vegetarian combo<br />
Krakow Chicken </strong>(one breast)<br />
<strong>Tilapia<br />
Cold salad</strong></p>
<p><strong>Dinner $28</strong><br />
Four Course Meal<br />
Choice of soup + green salad + Dinner Entrée + Dessert </p>
<p><strong>Sunday Brunch $15</strong><br />

Expected result

['122 E. Washington St','734-665-8767','Amadeus is offering both pricing options.','Lunch 2 for $15','Choice of:','Soup','Green salad','Choice of lunch dish:','1 Golabek','3 Piergies','3 Placeki','Kielbsa','Kapusta salad','Warsaw salad','Artichoke salad','Potato salad','Lunch $15','Three Course Meal','Choice of lunch entrée with green salad and dessert','Dinner 2 for $28','Choice of:','Cup of soup','Green salad','Choice of entrée:','2 Potato Snitzel','4 Potato Placeki','6 Piergis','2 Golabki','Bigos','Grilled Kielbsa','Vegetarian combo','Krakow Chicken (one breast)','Tilapia','Cold salad','Dinner $28','Four Course Meal','Choice of soup + green salad + Dinner Entrée + Dessert','Sunday Brunch $15']

Answer 1

It looks like pyquery is not behaving as expected here. A workaround using .contents():

>>> import lxml

>>> [e for ptag in s('div.entry').find('p').items()
       for e in ptag.contents()
       if isinstance(e, lxml.etree._ElementStringResult)]
['one', 'two ', 'three', 'four', 'five', 'six']
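
For comparison, here is a minimal sketch that produces the same list using only the standard library's html.parser, so it can be run without PyQuery or lxml. It is a different technique from the .contents() workaround above, not a description of how PyQuery works: it starts a new fragment at every <p> or <br> boundary and collapses surrounding whitespace. The names LineSplitter and split_lines are made up for this illustration.

```python
from html.parser import HTMLParser


class LineSplitter(HTMLParser):
    """Collect text fragments, cutting at every <p> or <br> boundary."""

    BREAK_TAGS = {"p", "br"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # finished fragments
        self._buffer = []  # text pieces of the fragment being built

    def _flush(self):
        # Join buffered pieces, collapse whitespace, keep only non-empty text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAK_TAGS:
            self._flush()

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing forms such as <br />.
        if tag in self.BREAK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BREAK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <strong> do not cut, so their text is merged.
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def split_lines(html):
    parser = LineSplitter()
    parser.feed(html)
    parser.close()
    return parser.lines


html = """<div class="entry">
<p>one</p>
<p>two<br />three<br />four</p>
<p>five<br />six</p>
</div>"""

print(split_lines(html))
# ['one', 'two', 'three', 'four', 'five', 'six']
```

Because inline tags like <strong> are not treated as boundaries, a fragment such as "<strong>Krakow Chicken </strong>(one breast)" from the more complex example comes out as a single item, 'Krakow Chicken (one breast)', which matches the expected result.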

Answer 1
0 2015-01-21 15:16:26

通過PyQuery分割抓取的數據

問題描述

更復雜的例子

預期結果

1 個解決方案

解決方案1 0 2015-01-21 15:16:26

解決方案1
0 2015-01-21 15:16:26