
I need to extract some data from an HTML page using Python

This is part of an HTML page from which I need to extract the following items: the name from the strong tag, the classification types (Actor and Singer), and the born and died locations.

<li class="clearfix">
   <div style="margin-top:10px;">
      <div class="float-left" style="margin-bottom:10px;">
         <a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
         <strong>Mr. Elvis Presley</strong></a>
      </div>
      <div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
         <div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
      </div>
      <span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> &nbsp; (15 vots)</span>
      <div class="clear"></div>
      <a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
      <img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley"  />
      </a>
      <br/>
      <p>
         <b>Classification:</b>
         <a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
         ,                      <a href="" title="Singer" name="Singer" class="underline">Singer</a>
         <br />
         <b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
         <b>Died:</b>
         Memphis,
         <!--<b>City:</b>-->
         <a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
      </p>
      <div class="clk"></div>
   </div>
</li>

I tried using BeautifulSoup, but I'm new to Python:

data2 = soup.find_all('li',{'class':'clearfix'})

for container in data2:
    if container.find('a', {'class':'float-left'}):
        name = container.a.text
        print (name)

    if container.find('a', {'class':'underline'}):
        classification=container.div.p.a.text
        print (classification)



The script runs without errors, but it extracts only the name and the first classification. How do I target the rest of the elements I need: the second classification ("Singer") and the born and died locations?

You can parse the HTML with Beautiful Soup. I'll show two approaches: first with Beautiful Soup, then with a regex that captures the results in groups.

First, with Beautiful Soup:

string_1="""<li class="clearfix">
   <div style="margin-top:10px;">
      <div class="float-left" style="margin-bottom:10px;">
         <a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
         <strong>Mr. Elvis Presley</strong></a>
      </div>
      <div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
         <div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
      </div>
      <span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> &nbsp; (15 vots)</span>
      <div class="clear"></div>
      <a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
      <img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley"  />
      </a>
      <br/>
      <p>
         <b>Classification:</b>
         <a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
         ,                      <a href="" title="Singer" name="Singer" class="underline">Singer</a>
         <br />
         <b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
         <b>Died:</b>
         Memphis,
         <!--<b>City:</b>-->
         <a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
      </p>
      <div class="clk"></div>
   </div>
</li>"""

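from bs4 import BeautifulSoup

# Parse the snippet with Python's built-in HTML parser
soup=BeautifulSoup(string_1,"html.parser")
# Every <a> in this snippet carries a name attribute, so print it for each link
for a in soup.find_all('a'):
    print(a['name'])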

Output:

Elvis Presley
Mr. Elvis Presley
Actor 
Singer
Tupelo
Memphis
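
The loop above prints every name attribute but does not say which field each value belongs to. The original attempt stopped at the first classification because container.div.p.a is shorthand for find('a'), which returns only the first matching tag. As a minimal sketch (not the only way to do it), you could treat each <b> label inside the <li> as a field name and collect the <a> tags that follow it until the next <b>. The helper name extract_person, the variable names, and the trimmed copy of the markup are my own, for illustration only, and the sketch assumes every entry follows the same label-then-links layout.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed copy of the markup from the question (assumed layout: <b>Label:</b> then <a> links)
html = """<li class="clearfix">
  <div>
    <a href="http://" name="Elvis Presley" class="float-left">
    <strong>Mr. Elvis Presley</strong></a>
    <p>
      <b>Classification:</b>
      <a href="http://" name="Actor " class="underline">Actor </a>
      , <a href="" name="Singer" class="underline">Singer</a>
      <br />
      <b>Born:</b> <a href="http://" name="Tupelo" class="underline">Tupelo</a><br />
      <b>Died:</b>
      Memphis,
      <!--<b>City:</b>-->
      <a href="http://" name="Memphis" class="underline">Memphis</a>
    </p>
  </div>
</li>"""


def extract_person(li):
    """Hypothetical helper: build one record from a single <li class="clearfix">."""
    record = {"name": li.find("strong").get_text(strip=True)}
    for label in li.find_all("b"):
        # "Classification:" -> "classification"
        key = label.get_text(strip=True).rstrip(":").lower()
        values = []
        # Walk the nodes after the label and stop at the next <b> label
        for sibling in label.next_siblings:
            if isinstance(sibling, Tag):
                if sibling.name == "b":
                    break
                if sibling.name == "a":
                    values.append(sibling.get_text(strip=True))
        record[key] = values
    return record


soup = BeautifulSoup(html, "html.parser")
people = [extract_person(li) for li in soup.find_all("li", class_="clearfix")]
print(people)
```

For the snippet above this prints [{'name': 'Mr. Elvis Presley', 'classification': ['Actor', 'Singer'], 'born': ['Tupelo'], 'died': ['Memphis']}]. The commented-out City label is skipped because the parser keeps it as a comment rather than a tag, and get_text(strip=True) removes the trailing space in "Actor ".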

Second, with regex:

Use this only if the markup has exactly the same layout as the snippet you posted:

import re
string_1="""<li class="clearfix">
   <div style="margin-top:10px;">
      <div class="float-left" style="margin-bottom:10px;">
         <a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
         <strong>Mr. Elvis Presley</strong></a>
      </div>
      <div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
         <div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
      </div>
      <span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> &nbsp; (15 vots)</span>
      <div class="clear"></div>
      <a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
      <img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley"  />
      </a>
      <br/>
      <p>
         <b>Classification:</b>
         <a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
         ,                      <a href="" title="Singer" name="Singer" class="underline">Singer</a>
         <br />
         <b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
         <b>Died:</b>
         Memphis,
         <!--<b>City:</b>-->
         <a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
      </p>
      <div class="clk"></div>
   </div>
</li>"""
# One alternative per field; the Classification, Born and Died alternatives
# capture the whole lines that hold the <a> tags
pattern=r'<strong>(\w.+)<\/strong>|<b>Classification:<\/b>(\s.+)(\s.+)|(Born:.+)|(Died:.+\s.+\s.+\s.+)'
# Pulls the value of the name="..." attribute out of a captured line
pattern_2=r'name=["](\w.+?)["]'


match=re.finditer(pattern,string_1,re.M)
for find in match:
    if find.group(1):
        print("Name {}".format(find.group(1)))
    if find.group(2):
        print("Classification first {}".format(re.search(pattern_2,str(find.group(2))).group(1)))
        print("Classification second {}".format(re.search(pattern_2,str(find.group(3))).group(1)))
    if find.group(4):
        print("Born {}".format(re.search(pattern_2, str(find.group(4))).group(1)))
    if find.group(5):
        print("Died {}".format(re.search(pattern_2, str(find.group(5))).group(1)))

Output:

Name Mr. Elvis Presley
Classification first Actor 
Classification second Singer
Born Tupelo
Died Memphis
