
How to automatically extract data from an HTML file with Python?

I'm beginning to learn Python (2.7) and would like to extract certain information from HTML code stored in a text file. The code below is just a snippet of the whole file. In the full file, the data for every other firm has the same structure, and these HTML "blocks" are positioned one after another (in case that helps).

The HTML snippet:

  <body><div class="tab_content-wrapper noPrint"><div class="tab_content_card"> <div class="card-header"> <strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong> <span class="tel" title="Phone contacts">Phone contacts</span> </div> <div class="card-content"> <table> <tbody> <tr> <td colspan="4"> <label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label> </td> </tr> <tr> <td width="20">&nbsp;</td> <td width="245">&nbsp;</td> <td width="50">&nbsp;</td> <td width="80">&nbsp;</td> </tr> <tr> <td colspan="2"> 59 Wall St</td> <td></td> <td></td> </tr> <tr> <td colspan="2">NJ 07105&nbsp;&nbsp; <label class="downdrill-sbi" title="New York">New York</label> </td> <td></td> <td></td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td>&nbsp;</td> <td>&nbsp;</td> </tr> <tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr> <tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr> <tr> <td colspan="2"> <a href="http://www.liberty.edu/" target="_blank">www.liberty.edu</a> </td> <td>Active:</td> <td>Yes</td> </tr> </tbody> </table> </div> </div></div></body> 

How it looks on a webpage: [image]

Right now I'm using the following script to extract the desired information:

from lxml import html

str = open('html1.txt', 'r').read()
tree = html.fromstring(str)

for variable in tree.xpath('/html/body/div/div'):
    company_name = variable.xpath('/html/body/div/div/div[1]/strong/text()')
    location = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()')
    website = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()')
    print(company_name, location, website)

Printed result:

('"Liberty Associates LLC"', 'New York', 'www.liberty.edu')

So far so good. However, when I use the script above to scrape the whole HTML file, the results for all firms are printed right after each other on one single line. I would like to print the data for each HTML "block" on its own line, like this:

Liberty Associates LLC | New York    | +1 973-344-8300 | www.liberty.edu
Company B              | Los Angeles | +1 213-802-1770 | perchla.com 

I know I can use [0], [1], [2], etc. to get the data on separate lines the way I want, but doing this manually for thousands of HTML "blocks" is not feasible.

So my question: how can I automatically extract the data "block by block" from the html code and print the results under each other like illustrated above?

I think what you want is:

print(company_name, location, website,'\n')
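A note on why the results pile up: in lxml, an XPath that starts with `/` is evaluated from the document root even when it is called on a sub-element, so every pass of the loop in the question returns the matches for all blocks at once. A minimal sketch of a block-by-block version follows, using the question's own paths made relative to each card. The names `BLOCK`, `SAMPLE`, `first_text`, `extract_rows` and `format_table` are made up for illustration, the sample markup is a simplified stand-in for the real file, and the phone path (`tr[6]/td[2]`) is inferred from the snippet's row order.

```python
from lxml import html

# Simplified stand-in for one firm "block"; it keeps the same row order as the
# snippet in the question (row 4 = city, row 6 = phone, row 8 = website).
BLOCK = (
    '<div class="tab_content_card">'
    '<div class="card-header"><strong>"{name}"</strong></div>'
    '<div class="card-content"><table><tbody>'
    '<tr><td colspan="4"><label class="downdrill-sbi">Industry</label></td></tr>'
    '<tr><td>&nbsp;</td></tr>'
    '<tr><td colspan="2">Street address</td></tr>'
    '<tr><td colspan="2">ZIP <label class="downdrill-sbi">{city}</label></td></tr>'
    '<tr><td>&nbsp;</td></tr>'
    '<tr><td>Phone:</td><td>{phone}</td></tr>'
    '<tr><td>Fax:</td><td>{phone}</td></tr>'
    '<tr><td colspan="2"><a href="#">{site}</a></td></tr>'
    '</tbody></table></div></div>'
)

# Two blocks under one wrapper, using the values from the desired output.
SAMPLE = (
    '<body><div class="tab_content-wrapper noPrint">'
    + BLOCK.format(name="Liberty Associates LLC", city="New York",
                   phone="+1 973-344-8300", site="www.liberty.edu")
    + BLOCK.format(name="Company B", city="Los Angeles",
                   phone="+1 213-802-1770", site="perchla.com")
    + '</div></body>'
)


def first_text(node, path):
    """Return the first text result of a relative XPath, or '' if none matched."""
    found = node.xpath(path)
    return found[0].strip() if found else ""


def extract_rows(page_source):
    """Return one (name, city, phone, website) tuple per firm block."""
    tree = html.fromstring(page_source)
    rows = []
    # Select each card once, then query *relative* to it (no leading slash).
    for card in tree.xpath('//div[@class="tab_content_card"]'):
        rows.append((
            first_text(card, 'div[1]/strong/text()').strip('"'),
            first_text(card, 'div[2]/table/tbody/tr[4]/td[1]/label/text()'),
            first_text(card, 'div[2]/table/tbody/tr[6]/td[2]/text()'),
            first_text(card, 'div[2]/table/tbody/tr[8]/td[1]/a/text()'),
        ))
    return rows


def format_table(rows):
    """Pad every column to its widest cell and join the cells with ' | '."""
    if not rows:
        return ""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


print(format_table(extract_rows(SAMPLE)))
```

For the real file, the same functions could be called as `extract_rows(open('html1.txt').read())`. If some blocks lack a row, the positional `tr[N]` paths will shift, so matching on the cell label (for example the `Phone:` cell) may be the safer choice there.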
