簡體   English   中英

從html文本中提取標簽信息

[英]extract tag info from html text

我正在嘗試抓取網頁。我收到了以下文字。 如何從下面的字符串中提取src信息。 誰能告訴我該過程我們如何從文本中提取任何鍵值數據

<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>

和textarea標記內的文本。

  <textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>

既然您在標記中提到了beautifulsoup ,我假設您想使用它來解析html內容。

import bs4

content = """<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
<textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
"""

soup = bs4.BeautifulSoup(content, 'lxml')

img = soup.find('img') # locate img tag
text_area = soup.find('textarea') # locate textarea tag

print img['id'] # print value of 'id' attribute in img tag
print img['src'] # print value of 'src' attribute
print text_area.text # print content in this tag

beautifulsoup可以幫助:

標簽可以具有任意數量的屬性。 標簽具有屬性“ class”,其值是“ boldest”。 您可以通過將標簽視為字典來訪問標簽的屬性:

tag['class']

# u'boldest'

您可以直接以.attrs身份訪問該詞典:

tag.attrs
# {u'class': u'boldest'}

你可以通過.text從標簽中刪除文本

tag.text

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM