Scraping values from HTML header and saving as a CSV file in Python

All,

I've just started using Python (v 2.7.1), and one of my first programs tries to scrape information from a website containing power station data, using the Standard Library and BeautifulSoup to handle the HTML elements.

The data I'd like to access is available either in the 'head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.

Using a couple of sources on this website, I've managed to cobble together the code below, which pulls the data out and saves it to a file, but the output still contains literal \n sequences. Try as I might, I can't get a correct CSV file to save out.

I am sure it's something simple, but I need a bit of help if possible!

from BeautifulSoup import BeautifulSoup

import urllib2,string,csv,sys,os
from string import replace

bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'

data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))

data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()

file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'

file = open(file_name,"wb")
file.write(data)
file.close()

Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!

Try starting like this:

scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]

Then you can use:

  1. partition('=')[2] to cut off the "var gs_csv" bit.
  2. strip(' \n"') to remove unwanted characters at each end (space, newline, ").
  3. replace("\\n","\n") to turn the literal backslash-n sequences into real new lines.
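To see what those three steps do, here is a minimal sketch run on a made-up string. The sample value below is only a guess at the general shape of the script contents (a `var gs_csv="..."` assignment with backslash-n sequences inside the quotes); the column names and numbers are invented, and the real page's data will differ.

```python
# Hypothetical stand-in for scripttag.contents[0]; the field values are invented.
javascriptdata = '\n var gs_csv="HDR,UNIT,VALUE\\nPN,T_COTPS-4,250\\nPN,T_COTPS-4,260"\n '

# 1. Keep only what follows the first '=' (drops the "var gs_csv" bit).
csv_text = javascriptdata.partition('=')[2]

# 2. Trim spaces, real newlines and the surrounding double quotes from both ends.
csv_text = csv_text.strip(' \n"')

# 3. Turn each two-character backslash + n sequence into a real newline.
csv_text = csv_text.replace("\\n", "\n")

print(csv_text)
```

Note that partition splits at the first '=' only, so any later '=' characters inside the data would be left alone.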

Incidentally, replace is a string method, so you don't have to import it separately; you can just do data.replace(...).

Finally, you need to split it up as CSV. You could save it and reopen it, then load it into a csv.reader. Or you could use the StringIO module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:

for line in data.splitlines():
    row = line.split(",")
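For comparison, here is a rough sketch of the StringIO route mentioned above. The sample data is invented, and the quoted field containing a comma is an assumption added purely to show where a plain split(",") and csv.reader would differ; if the real data never quotes its fields, both approaches give the same rows.

```python
import csv

try:
    from StringIO import StringIO  # Python 2, as used in this thread
except ImportError:
    from io import StringIO  # Python 3

# Invented sample; the second row has a quoted field with a comma inside it.
data = 'HDR,UNIT,VALUE\nPN,"COTTAM, UNIT 4",250\n'

# csv.reader accepts any iterable of lines, so a StringIO wrapper is enough.
rows = list(csv.reader(StringIO(data)))

# The simple approach: split every line on commas.
naive = [line.split(",") for line in data.splitlines()]

print(rows[1])
print(naive[1])
```

csv.reader keeps the quoted field together as one value, while the plain split breaks it into two pieces and leaves the quote characters in place.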

SOLUTION

from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time

bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"

bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6

data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()

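# Open the output in binary mode, as the Python 2 csv module expects,
# and let the with block close the file once every row is written.
with open("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb") as outfile:
    csvwriter = csv.writer(outfile)
    for line in javascriptdata.splitlines():
        row = line.split(",")
        csvwriter.writerow(row)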
