Extracting tag information from HTML text

Question

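I am trying to scrape a web page and received the following text. How can I extract the src value from the string below? Can anyone explain the general process for extracting any key-value data from text like this?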

<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>

I also need the text inside the textarea tag.

  <textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>

Answer 1

Since you tagged the question with beautifulsoup, I assume you want to use it to parse the HTML content.

import bs4

content = """<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
<textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
"""

# 'lxml' needs the third-party lxml package; the built-in 'html.parser' also works here
soup = bs4.BeautifulSoup(content, 'lxml')

img = soup.find('img')  # locate the img tag
text_area = soup.find('textarea')  # locate the textarea tag

print(img['id'])  # value of the 'id' attribute of the img tag
print(img['src'])  # value of the 'src' attribute
print(text_area.text)  # text content of the textarea tag
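
If installing BeautifulSoup or lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not part of the original answer: the TagCollector class, its wanted parameter, and the shortened sample string are hypothetical names and data chosen for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they cannot contain text.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class TagCollector(HTMLParser):
    """Hypothetical helper: record attributes and inner text of chosen tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # tag names to record
        self.found = []            # one dict per matching tag
        self._open = None          # record of the wanted tag currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            # attrs arrives as a list of (name, value) pairs
            record = {"tag": tag, "attrs": dict(attrs), "text": ""}
            self.found.append(record)
            self._open = None if tag in VOID_TAGS else record

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so concatenate it.
        if self._open is not None:
            self._open["text"] += data


# Shortened version of the markup from the question (assumed sample data).
html_text = (
    '<img id="imgsglx2" '
    'src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg"/>'
    '<textarea id="sgmsbck" name="sgms">Mr. Song injured.</textarea>'
)

collector = TagCollector(["img", "textarea"])
collector.feed(html_text)
collector.close()

for record in collector.found:
    print(record["tag"], record["attrs"], repr(record["text"]))
```

Each record keeps the tag's attributes as an ordinary dict, so any key such as src or id can be looked up by name. For messy real-world pages BeautifulSoup is usually the more forgiving choice.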

Answer 2

BeautifulSoup can help:

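A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary: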

tag['class']
# ['boldest']  (bs4 returns a list because class can hold several values)

You can access that dictionary directly as .attrs:

tag.attrs
# {'class': ['boldest']}

You can get the text inside a tag with .text:

tag.text
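
The fragments above assume a tag object already exists. A self-contained sketch that ties them together might look like the following; the sample markup, the extra id attribute, and the variable names are assumptions made for illustration, and the built-in "html.parser" backend is used so that only the bs4 package is required.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the id attribute is added only for illustration.
html_text = '<b class="boldest" id="intro">Extremely bold</b>'

soup = BeautifulSoup(html_text, "html.parser")
tag = soup.find("b")

# Every attribute is available as a key-value pair through .attrs.
for name, value in tag.attrs.items():
    print(name, "=", value)

# .get() returns None instead of raising KeyError when the attribute is absent.
missing = tag.get("src")

# .text is shorthand for get_text().
text = tag.get_text()
print(missing, text)
```

Using tag.get("src") rather than tag["src"] is a reasonable default when scraping, since not every matched tag is guaranteed to carry the attribute.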

2 solutions

Solution 1
0 Accepted 2016-12-09 12:16:17

Solution 2
0 2016-12-09 12:27:03
