My code below scrapes the td elements inside each tr tag with align='center' from ( http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ), joins the elements of each row with commas, and writes the results to a text file:
import bs4
import requests

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()
acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.strip())
    acct.write(", ".join(stack))
However, the resulting text file contains many blank lines (which I would like to eliminate), and the lines do not each start with the proper element.
Here is what my .txt file looks like with my current code:
Here is what I would like it to look like:
How can I alter my code to get rid of all the blank lines and have each line starting with "OPEN", and so on?
The problem is that td.text contains newline characters. Replace them with empty strings and append a newline at the end of each row. Tab characters can be removed the same way to match your desired output:
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())
    acct.write(", ".join(stack) + '\n')
Produces:
STATUS, CRN, SUBJECT, SECT, COURSE, CREDIT, INSTR., BLDG/RM, DAY/TIME, FROM / TO,
OPEN, 41552, ACCY 2001, 10, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 251, TR09:35AM - 10:50AM, 01/12/15 - 04/27/15,
OPEN, 40002, ACCY 2001, 11, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 353, TR11:10AM - 12:25PM, 01/12/15 - 04/27/15,
...
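
As a possible variation on the same idea, the whitespace cleanup can be done with str.split() and " ".join(), which collapses any run of spaces, tabs, and newlines into a single space instead of deleting it. This is only a sketch: the helper names clean_cell and format_row are made up for illustration, and raw_rows is invented sample data standing in for the td.text values of each row, not the real page content. Note that this keeps one space where the replace() approach removes the whitespace entirely, so the output can differ slightly from the listing above.

```python
def clean_cell(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


def format_row(cells):
    """Join cleaned cells with ', '; return None when the row has no text at all."""
    cleaned = [clean_cell(c) for c in cells]
    if not any(cleaned):
        return None
    return ", ".join(cleaned)


# Hypothetical raw cell texts, standing in for [td.text for td in tr.findAll('td')]
raw_rows = [
    ["\n\tOPEN\n", "41552", "\nACCY\n\t2001\n", "10"],
    ["\n", "\t\n", "", "  "],  # a row with no text, which would become a blank line
]

# Keep only rows that contain some text, one output line per row
lines = [line for line in (format_row(r) for r in raw_rows) if line is not None]
print("\n".join(lines))
```

In the scraping loop, the same helpers could be applied as format_row([td.text for td in tr.findAll('td')]), writing the result plus '\n' only when it is not None, so that rows without any text do not produce blank lines.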