
Cleaning Up HTML Parse in Python

My code below scrapes the td elements inside every tr tag with align='center' from http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY, joins the cells of each row with commas, and writes the results to a text file:

import bs4
import requests 

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')

soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")

for tr in soup.find_all('tr', align='center'):    
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.strip())

    acct.write(", ".join(stack))

However, the resulting text file contains plenty of blank lines (which I would like to eliminate), and the lines do not start with the proper element.

Here is what my .txt file looks like with my current code:

Here is what I would like it to look like: [image]

How can I alter my code to get rid of all the blank lines and have each line starting with "OPEN", and so on?

The problem is that td.text contains newline characters. Replace them with an empty string and append a single newline after each row is written. Tab characters can be removed the same way to match your desired output:

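for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.find_all('td'):
        # Drop embedded newlines and tabs, then trim surrounding whitespace.
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())

    # Write one table row per line.
    acct.write(", ".join(stack) + '\n')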

Produces:

STATUS, CRN, SUBJECT, SECT, COURSE, CREDIT, INSTR., BLDG/RM, DAY/TIME, FROM / TO, 
OPEN, 41552, ACCY 2001, 10, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 251, TR09:35AM - 10:50AM, 01/12/15 - 04/27/15, 
OPEN, 40002, ACCY 2001, 11, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 353, TR11:10AM - 12:25PM, 01/12/15 - 04/27/15, 
...
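
As a variation on the same idea, here is a minimal sketch that uses only the standard library and works on plain strings, so it can be tried without fetching the page. The helper names clean_cell and write_rows and the sample data are hypothetical, not part of the original code. Note one deliberate difference from the replace() calls above: str.split() with no arguments collapses each run of whitespace into a single space instead of deleting it, so words separated only by a newline are not glued together. Rows whose cells are all blank are skipped.

```python
import io

def clean_cell(text):
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return " ".join(text.split())

def write_rows(rows, out):
    """Write each non-empty row as one comma-separated line to a file-like object."""
    for cells in rows:
        cleaned = [clean_cell(c) for c in cells]
        if any(cleaned):  # skip rows whose cells are all blank
            out.write(", ".join(cleaned) + "\n")

# Hypothetical cell text, shaped like what td.text might return for one row.
sample_rows = [
    ["\n\tOPEN\n", "41552", "ACCY\n\t2001", "10"],
    ["\n", "\t", ""],  # a whitespace-only row produces no line
]

buffer = io.StringIO()
write_rows(sample_rows, buffer)
print(buffer.getvalue(), end="")
```

In the scraping loop, each row would be something like [td.text for td in tr.find_all('td')], and out would be a real file opened with a with open(path, "w") as out: block, which also closes the file once writing is done.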
