Cleaning Up HTML Parse in Python

Question

My code below scrapes the td elements inside each tr tag with align='center' from ( http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ), joins the elements of each row with commas, and writes the results to a text file:

import bs4
import requests 

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')

soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")

for tr in soup.find_all('tr', align='center'):    
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.strip())

    acct.write(", ".join(stack))

However, the resulting text file contains many blank lines (which I would like to eliminate), and each line does not start with the proper element.

Here is what my .txt file looks like with my current code: (screenshot not included)

Here is what I would like it to look like: (screenshot not included)

How can I alter my code to get rid of all the blank lines and have each line starting with "OPEN", and so on?

Answer 1

The problem is that td.text contains newline characters. Replace them with an empty string and append a single newline after each row. Tab characters can be removed the same way to match your desired output:

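for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.find_all('td'):
        # Drop embedded newlines and tabs, then trim surrounding spaces
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())

    # End each row with a newline so every record starts on its own line
    acct.write(", ".join(stack) + '\n')

# Close the file so all buffered rows are flushed to disk
acct.close()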

Produces:

STATUS, CRN, SUBJECT, SECT, COURSE, CREDIT, INSTR., BLDG/RM, DAY/TIME, FROM / TO, 
OPEN, 41552, ACCY 2001, 10, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 251, TR09:35AM - 10:50AM, 01/12/15 - 04/27/15, 
OPEN, 40002, ACCY 2001, 11, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 353, TR11:10AM - 12:25PM, 01/12/15 - 04/27/15, 
...
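
As a possible variation on the replace calls above, the whitespace cleanup can be pulled into small helpers. This is only a sketch, not part of the original answer: the names clean_cell, row_to_line, and raw_rows are hypothetical, and the sample strings are made up to resemble what td.text might return. Note that it collapses each run of whitespace into a single space instead of deleting it, and it skips rows that contain no visible text.

```python
def clean_cell(text):
    """Collapse any run of whitespace (newlines, tabs, spaces) into one space."""
    # str.split() with no arguments splits on all whitespace and drops empties
    return " ".join(text.split())


def row_to_line(cells):
    """Join cleaned cells with ', '; return None for rows with no visible text."""
    cleaned = [clean_cell(c) for c in cells]
    if not any(cleaned):
        return None
    return ", ".join(cleaned)


# Hypothetical raw cell texts, standing in for the td.text values of two rows
raw_rows = [
    ["\n\tOPEN\n", "41552", "ACCY\n\t2001", "10"],
    ["\n", "\t", "  "],  # a row containing only whitespace
]

# Keep only the rows that produced some text
lines = [line for line in (row_to_line(r) for r in raw_rows) if line is not None]
print("\n".join(lines))
```

In the scraping loop, the list passed to row_to_line would be the td.text values of one tr, and a line would be written to the file only when the result is not None.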
