How do I extract the text after a br tag in Python?

Question

I am using Python 2.7.8 and am a bit surprised: I get all of the text except the text that follows the last <br>. My HTML page looks like this:

<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>

<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>   <!--not getting-->

<p> more </p>

<p>Which of the following is true for variable names in C?<br>
a) They can contain alphanumeric characters as well as special characters<br>
b) It is not an error to declare a variable to be one of the keywords(like goto, static)<br>
c) Variable names cannot start with a digit<br>
d) Variable can be of any length</p> <!--not getting -->!

</div>
</body>
</html>

And my code:

import urllib2
from urllib2 import Request
from bs4 import BeautifulSoup, NavigableString, Tag

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    # the node right after this <br> must be a text node
    next = br.nextSibling
    if not (next and isinstance(next,NavigableString)):
        continue
    # only print the text if another <br> follows it
    next2 = next.nextSibling
    if next2 and isinstance(next2,Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')

Output:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit

But I am not getting the last piece of text in each block, for example:

 d) int $main
    and 
 d) Variable can be of any length

that is, the text after the last <br>.

And the output I want:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;
Found:
d) int $main

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit
d) Variable can be of any length

Answer 1

This is because BeautifulSoup forces the text into valid XML by closing the <br> tags just before </p>. The prettified version makes this clear:

<p>
 Which of the following is not a valid C variable name?
 <br>
  a) int number;
  <br>
   b) float rate;
   <br>
    c) int variable_count;
    <br>
     d) int $main;
    </br>
   </br>
  </br>
 </br>
</p>

So the text d) int $main; is not a sibling of the last <br> tag, but the text content of that tag.

The code could be:

...
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    if len(br.contents) > 0:  # avoid errors if a tag is correctly closed as <br/>
        print 'Found', br.contents[0]

It gives the expected result:

Found 
a) int number;
Found 
b) float rate;
Found 
c) int variable_count;
Found 
d) int $main;
Found 
a) They can contain alphanumeric characters as well as special characters
Found 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found 
c) Variable names cannot start with a digit
Found 
d) Variable can be of any length
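
Whether the text ends up nested inside the <br> tag, as in the prettified output above, depends on which parser BeautifulSoup uses. A parser that treats <br> as an empty element leaves the text as the next sibling of the tag instead. In that layout the loop from the question skips the last option, because no further <br> follows it. Below is a minimal sketch that handles both layouts. It assumes bs4 is installed, works on an inline snippet (SAMPLE) rather than the live page, and uses a hypothetical helper named text_after_brs:

```python
from bs4 import BeautifulSoup, NavigableString

# Inline stand-in for one question block from the page (an assumption for
# illustration; the real page is not fetched here).
SAMPLE = """<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>"""


def text_after_brs(html, parser="html.parser"):
    """Return the stripped text that directly follows each <br> tag."""
    soup = BeautifulSoup(html, parser)
    found = []
    for br in soup.find_all("br"):
        # Layout 1: the parser nested the text inside the <br> tag.
        # Layout 2: the <br> is empty and the text is its next sibling.
        node = br.contents[0] if br.contents else br.next_sibling
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:
                found.append(text)
    return found


for line in text_after_brs(SAMPLE):
    print("Found: " + line)
```

Because the helper does not require another <br> after the text, the last option of each block is no longer dropped.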

Answer 2

You can use Requests instead of urllib2 and parse the page with lxml's html module.

from lxml import html
import requests

#request page
page=requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/")

#get content in html format
page_content=html.fromstring(page.content)

#recover all text from <p> elements
items=page_content.xpath('//p/text()')

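The code above returns a list of all the text nodes that are direct children of <p> elements in the document.
You can then simply index into that list to print whatever you need.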
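
If only the text that follows each <br> is wanted, rather than every text node under <p>, the XPath can be narrowed. The sketch below is one possible variant, not part of the original answer. It assumes lxml is installed and parses an inline snippet (SAMPLE) instead of requesting the page:

```python
from lxml import html

# Inline stand-in for part of the page (an assumption for illustration).
SAMPLE = """<div class="entry-content">
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>
</div>"""

tree = html.fromstring(SAMPLE)

# For every <br>, select the first text node that follows it
# inside the same parent element.
raw = tree.xpath("//br/following-sibling::text()[1]")

# Strip surrounding whitespace and skip whitespace-only nodes.
options = [t.strip() for t in raw if t.strip()]

for option in options:
    print("Found: " + option)
```

The question stem is left out because it precedes the first <br> rather than following one.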

2 answers

Answer 1
1 2015-12-09 16:39:27

Answer 2
1 Accepted 2015-12-09 17:12:53
