Scraping values from an HTML header and saving them as a CSV file in Python

Question

All,

I've just started using Python (v 2.7.1), and one of my first programs tries to scrape information from a website containing power station data, using the Standard Library and BeautifulSoup to handle the HTML elements.

The data I'd like to access is available either in the 'Head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.

Using a couple of sources on this website, I've managed to cobble together the code below, which pulls the data out and saves it to a file, but the output still contains literal \n sequences. Try as I might, I can't get a correct CSV file to save out.

I am sure it's something simple, but I need a bit of help if possible!

from BeautifulSoup import BeautifulSoup

import urllib2,string,csv,sys,os
from string import replace

bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'

data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))

data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()

file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'

file = open(file_name,"wb")
file.write(data)
file.close()

Answer 1

Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!

Try starting like this:

scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
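
For readers without BeautifulSoup installed, the same idea (pick the second script element and take its text) can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not what the answer uses, and it is written for Python 3. The ScriptCollector class and the sample page are made up for illustration; the page only imitates the tags quoted in the question.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append rather than assign.
        if self._in_script:
            self.scripts[-1] += data


# Made-up page modelled on the tags named in the question; not real site output.
page = (
    '<html><head>'
    '<script language="JavaScript" src="/bwx_generic.js"></script>'
    '<title>Historic Physical Balancing Mechanism Data</title>'
    '<script language="JavaScript"> var gs_csv="HDR,DEMO\\nFTR,1" </script>'
    '</head><body></body></html>'
)

collector = ScriptCollector()
collector.feed(page)
collector.close()

# Index 1 is the second <script>, matching the [1] used with BeautifulSoup.
javascriptdata = collector.scripts[1]
print(javascriptdata)
```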

Then you can use:

partition('=')[2] to cut off the "var gs_csv" bit.
strip(' \n"') to remove unwanted characters at each end (space, newline, ").
replace("\\n","\n") to turn the literal backslash-n sequences into real new lines.

Incidentally, replace is a string method, so you don't have to import it separately; you can just call data.replace(...).

Finally, you need to split it into CSV rows. You could save it and reopen it, then load it into a csv.reader. You could also use the StringIO module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:
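
Put together, the three string steps could look like the sketch below. It is written for Python 3 and runs on a made-up sample string, since the real page content is not shown in this thread. The function name extract_csv_text and the sample values are assumptions for illustration only.

```python
# Made-up sample of the script text; the real page's content may differ.
# Note the doubled backslashes: the text holds a literal backslash followed by n.
javascriptdata = ' var gs_csv="HDR,PHYSICAL BM DATA\\nFPN,T_COTPS-4,1,100.0\\nFTR,2"\n '


def extract_csv_text(script_text):
    """Return the CSV text embedded in a `var gs_csv="..."` assignment."""
    # Keep everything after the first '=' (drops the "var gs_csv" part).
    value = script_text.partition('=')[2]
    # Trim spaces, real newlines and the surrounding double quotes.
    value = value.strip(' \n"')
    # Turn the two-character sequence backslash + n into a real newline.
    return value.replace('\\n', '\n')


csv_text = extract_csv_text(javascriptdata)
print(csv_text)
```

The order matters here: strip runs before replace, so only the outer quotes and whitespace are trimmed and the embedded line breaks are left for the final step.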

for line in data.splitlines():
    row = line.split(",")
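
If the data ever contains quoted fields with commas inside them, a plain split(",") would break those fields apart. A minimal sketch of the csv.reader route mentioned above follows, assuming Python 3, where the in-memory file object is io.StringIO rather than the Python 2 StringIO module. The sample text is made up and includes one quoted field to show the difference.

```python
import csv
import io

# Made-up sample standing in for the cleaned-up data; one field is quoted.
csv_text = 'HDR,PHYSICAL BM DATA\nFPN,T_COTPS-4,"1,5",100.0\nFTR,2'

# io.StringIO wraps the string in a file-like object that csv.reader accepts.
rows = list(csv.reader(io.StringIO(csv_text)))

for row in rows:
    print(row)
```

For data without quoted fields, as the answer suggests this is, both approaches give the same rows.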

Answer 2

SOLUTION

from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time

bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"

bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6

data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()

csvwriter = csv.writer(file("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb"))

for line in javascriptdata.splitlines():
    row = line.split(",")
    csvwriter.writerow(row)

del csvwriter
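
The code above relies on del csvwriter to release the output file. One alternative is to open the file in a with block so it is closed as soon as the rows are written. The sketch below assumes Python 3, where open with newline="" replaces the Python 2 "wb" mode for CSV output. The function name write_rows, the sample text and the temporary output path are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, csv_text):
    """Write each line of csv_text to path as one CSV row."""
    # newline="" lets the csv module control line endings (Python 3).
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for line in csv_text.splitlines():
            writer.writerow(line.split(","))


# Made-up sample text and a temporary output path, for illustration only.
sample_text = "HDR,PHYSICAL BM DATA\nFTR,2"
out_path = os.path.join(tempfile.gettempdir(), "T_COTPS-3_2011-02-04.csv")
write_rows(out_path, sample_text)
```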

Answer 1: score 4, accepted, 2011-02-06 17:32:01

Answer 2: score 1, 2011-02-06 21:34:29
