I wrote a Bash script (which calls a Python program) that downloads http://www.wsj.com/mdc/public/page/2_3021-activnyse-actives.html to an .html file, converts that file to .xhtml with TagSoup, and then converts the .xhtml to .csv. It runs in a loop so that the process repeats about once a minute for an hour. Below is the Bash code:
#!/bin/bash
n=0
while [ $n -lt 60 ]
do
    filename="$(date +"%Y-%m-%d-%H-%M-%S")"
    wget -O - http://www.wsj.com/mdc/public/page/2_3021-activnyse-actives.html > "$filename.html"
    java -jar tagsoup-1.2.1.jar --files "$filename.html"
    python xhtmlToCsv.py "$filename.xhtml" > "$filename.csv"
    ((n++))
    sleep 60
done
And here is the Python program it calls:
import sys
import xml.dom.minidom

document = xml.dom.minidom.parse(sys.argv[1])
tableElements = document.getElementsByTagName('table')
print("exchange,symbol,company,volume,price,change")
# The third table on the page holds the data; each row has 6 <td> cells.
lines = tableElements[2].getElementsByTagName('td')
n = 0
data = [None] * 6
for i in lines:
    # Start a fresh record at the cell that holds the company link.
    if n % 6 == 1:
        del data[:]
        data = [None] * 6
    for node in i.childNodes:
        if n % 6 + 1 < 6:
            data[n%6+1] = node.nodeValue
        # Strip thousands separators from the volume column.
        if n%6+1 == 3:
            data[n%6+1] = data[n%6+1].replace(",", "")
    # The link text looks like "Company Name (SYMBOL)".
    for items in i.getElementsByTagName('a'):
        j = i.getElementsByTagName('a')[0]
        for node in j.childNodes:
            data[0] = 'NYSE'
            data[1] = node.nodeValue[node.nodeValue.index('(')+1:node.nodeValue.index(')')]
            data[2] = node.nodeValue[0:node.nodeValue.index(" (")]
    # Print at the last cell of each row, skipping the table's header row.
    if n % 6 == 5 and n > 6:
        print(data[0] + "," + data[1] + "," + data[2] + "," + data[3] + "," + data[4] + "," + data[5])
    n+=1
What I don't get, though, is why approximately every third .html file generated by the code is empty. Is there something wrong with the code, or is it just my connection? If it is just the connection, is there a way I can throw out the empty file and try again?
Update: I figured it out. All I had to do was count the lines in the resulting .csv file. If there is only 1 line (the header), no data got transferred, so the files for that iteration are thrown out and the loop tries again. This takes the place of the ((n++)) and sleep 60 lines at the end of the loop:
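# Count the lines in the CSV; a header-only file has exactly 1 line.
x=$(wc -l < "$filename.csv")
if [ "$x" -eq 1 ]
then
    # No data this time: discard the files and retry without counting the iteration.
    rm "$filename.html" "$filename.xhtml" "$filename.csv"
else
    ((n++))
    sleep 60
fi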
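For anyone who would rather do the check in Python, here is a minimal sketch of the same idea. It is not part of my actual script: the names has_data_rows and fetch_with_retry are made up for illustration, and fetch is a placeholder for whatever runs the download-and-convert step and returns the CSV text.

```python
import time


def has_data_rows(csv_text):
    """Return True if the CSV text has at least one row besides the header."""
    lines = [line for line in csv_text.splitlines() if line.strip()]
    return len(lines) > 1


def fetch_with_retry(fetch, max_attempts=3, delay_seconds=0):
    """Call fetch() until it returns CSV text with data rows, or give up.

    `fetch` is a hypothetical zero-argument callable standing in for the
    wget -> TagSoup -> xhtmlToCsv.py pipeline; it returns the CSV as a string.
    """
    for attempt in range(max_attempts):
        csv_text = fetch()
        if has_data_rows(csv_text):
            return csv_text
        # Header-only (or empty) result: discard it and try again.
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None
```

Capping the number of attempts is a choice I added here so that a long outage cannot make the loop spin forever; the Bash version above simply retries until it gets data.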
Thanks everyone for your input.