I wrote a script that pulls paragraphs from articles and writes them to a file. For some articles, it doesn't pull every paragraph, and this is where I am lost. Any guidance would be deeply appreciated. I have included a link to one article where it isn't pulling all of the text: it scrapes everything up to the first quoted sentence.
URL: http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306
import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")
# Open txt document for output
txt = open('ctp_output.txt', 'w')
# Parse HTML of article
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Retrieve all of the paragraph tags
tags = soup('p')
for tag in tags:
    txt.write(tag.get_text() + '\n' + '\n')
This is what works for me:
import urllib2
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"
soup = BeautifulSoup(urllib2.urlopen(url))
with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')
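The important change is the explicit .encode('utf-8'). In Python 2, writing a unicode string to a file opened with open() implicitly encodes it as ASCII, which raises UnicodeEncodeError on non-ASCII characters. That is most likely why your script stopped at the first quoted sentence, since the quotation marks in the article are probably typographic (curly) quotes.
Note also that you should use the with context manager when working with files, so the file is closed even if an error occurs. You can also pass urllib2.urlopen(url) directly to the BeautifulSoup constructor, since urlopen returns a file-like object.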
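As an alternative to calling .encode('utf-8') on every string, the file itself can be opened with an explicit encoding. The following is a minimal sketch and not part of the code above: write_paragraphs, sample, and ctp_output_demo.txt are hypothetical names, and the sample list stands in for the tag.text values from soup.find_all('p') so that the snippet runs without network access. It relies on io.open, which accepts an encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_paragraphs(paragraphs, path):
    """Write each paragraph on its own line, encoded as UTF-8."""
    # The file object encodes text on the way out, so no per-string
    # .encode('utf-8') call is needed.
    with io.open(path, 'w', encoding='utf-8') as f:
        for text in paragraphs:
            f.write(text + u'\n')


# Hypothetical sample data; in the real script these would be the
# tag.text values returned by soup.find_all('p').
sample = [
    u'Plain ASCII paragraph.',
    u'\u201cA quoted sentence,\u201d he said.',  # curly quotes are non-ASCII
]

out_path = os.path.join(tempfile.gettempdir(), 'ctp_output_demo.txt')
write_paragraphs(sample, out_path)

# Read the file back to confirm the non-ASCII characters survived.
with io.open(out_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
```

With this approach the strings passed to write must be text (unicode in Python 2, str in Python 3), which is what tag.text already returns.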
Hope that helps.