I am trying to extract text from a website using regex, but it does not extract all of the text. I am following this tutorial: https://pythonprogramming.net/parse-website-using-regular-expressions-urllib but I don't know where I am going wrong. The website I am extracting text from is http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/ along with its relevant sublinks.
Code:
import urllib2
from urllib2 import Request
import re
#url = "http://www.tutorialspoint.com/cplusplus/cpp_basic_syntax.htm"
url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
req = Request(url)
resp = urllib2.urlopen(req)
respData = resp.read()
regex = '<p.*?>(.*?)<\/p>'
paragraphs = re.findall(regex,str(respData))
for eachP in paragraphs:
    print(eachP)
Any ideas?
You should use BeautifulSoup for this. It is easier and more reliable than regex.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
# htmls is assumed to hold the page HTML, e.g. respData from the question's code
soup = BeautifulSoup(htmls, 'html.parser')
for p in soup.find_all('p'):
    print p.get_text().encode('utf-8') + "\n\n"
The utf-8 encoding is used because the page at your source URL contains Unicode text.
Here you'll find how to install BeautifulSoup.
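For what it's worth, one likely reason the regex in the question misses text (an assumption, since the page's exact markup is not shown here) is that `.` does not match newlines by default, so any <p> whose content spans several lines is skipped. A minimal sketch on a made-up HTML string, using only the standard library:

```python
import re

# Made-up sample: the second paragraph spans two lines.
html = '<p>one line</p>\n<p class="x">spans\ntwo lines</p>'

# Same pattern as in the question (the backslash before / is not needed).
pattern = r"<p.*?>(.*?)</p>"

# By default "." stops at newlines, so the multi-line paragraph is missed.
without_dotall = re.findall(pattern, html)

# With re.DOTALL, "." also matches newlines.
with_dotall = re.findall(pattern, html, flags=re.DOTALL)

print(without_dotall)
print(with_dotall)
```

Even with re.DOTALL, `<p.*?>` also matches tags such as <pre>, which is one reason a real HTML parser is usually the safer choice.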
I found a good approach for extracting paragraphs in "How to Scrape Paragraphs using Python?"
With example:
# import module
from bs4 import BeautifulSoup
# Html doc
html_doc = """
<html>
<head>
<title>Geeks</title>
</head>
<body>
<h2>paragraphs</h2>
<p>Welcome geeks.</p>
<p>Hello geeks.</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# traverse paragraphs from soup
for data in soup.find_all("p"):
    print(data.get_text())
Output:
Welcome geeks.
Hello geeks.
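If installing BeautifulSoup is not an option, roughly the same result can be sketched with the standard library's html.parser module. This is a swapped-in technique, not what the example above uses, and the class name ParagraphExtractor is hypothetical:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p>...</p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._depth = 0        # how many <p> tags are currently open
        self._buffer = []      # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Outermost <p> closed: store its text and reset.
                self.paragraphs.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a paragraph.
        if self._depth > 0:
            self._buffer.append(data)


html_doc = """
<html>
<head>
<title>Geeks</title>
</head>
<body>
<h2>paragraphs</h2>
<p>Welcome geeks.</p>
<p>Hello geeks.</p>
</body>
</html>
"""

extractor = ParagraphExtractor()
extractor.feed(html_doc)
extractor.close()
for text in extractor.paragraphs:
    print(text)
```

For messy real-world pages BeautifulSoup is still more forgiving; this sketch only handles properly closed <p> tags.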
Extract Paragraphs from the given URL:
# import modules
import requests
from bs4 import BeautifulSoup
# fetch the HTML of the page at the given URL
def getdata(url):
    r = requests.get(url)
    return r.text
htmldata = getdata("https://www.geeksforgeeks.org/")
soup = BeautifulSoup(htmldata, 'html.parser')
# print the text of every paragraph
for data in soup.find_all("p"):
    print(data.get_text())
Output:
Multiple URLs:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
# matches a line break followed by one or more whitespace characters
Newlines = re.compile(r'[\r\n]\s+')
def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # kill javascript and css content
    for s in soup(["script", "style"]):
        s.replace_with('')
    # collect the text of every paragraph
    txt = '\n'.join(p.get_text() for p in soup.find_all('p'))
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)
def main():
    urls = [
        'https://www.geeksforgeeks.org/how-to-download-install-nltk-on-windows/',
        'https://www.geeksforgeeks.org/how-to-scrape-paragraphs-using-python/'
    ]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)
if __name__=="__main__":
    main()
Output:
NLTK is Natural Language Tool Kit. It is used to build python programming. It helps to work with human languages data. It gives a very easy user interface. It supports classification, steaming, tagging, etc.Installing NLTK on Windows using PIP:In windows, we first have to install the python current version. Then we have to install pip with it. Without pip, NLTK can not be installed.Step 1: Browse to the official site of python by clicking this link.Step 2: Move the cursor to the Download button & then click on the latest python version.Step 3: Open the downloaded file. Click on the checkbox & Click on Customize installation.Step 4: Click on Next.Step 5: Click on Install.Step 6: Wait till installation finish.Step 7: Click on Close.Step 8: Open Command Prompt & execute the following commands:python --version
pip --version
pip install nltkHence, NLTK installation will start.Step 9: Then you can see the successfully installed message.Hence NLTK installation is successfulMy Personal Notes
arrow_drop_upSave
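A note on repeated text: when <p> elements end up nested (html.parser does not close an open <p> implicitly), find_all('p') returns both the outer and the inner elements, so the same passage can be printed several times. A minimal sketch of one way to avoid that, assuming it is acceptable to keep only the outermost paragraphs; the sample markup is made up:

```python
from bs4 import BeautifulSoup

# Made-up markup with one <p> nested inside another.
html = "<div><p>first<p>second</p></p></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns the outer and the inner paragraph.
all_paragraphs = soup.find_all("p")

# Keep only paragraphs that are not inside another <p>.
outer = [p for p in all_paragraphs if p.find_parent("p") is None]

# get_text on the outer element already includes the inner text.
texts = [p.get_text(" ", strip=True) for p in outer]
for t in texts:
    print(t)
```

Another option is to try a different parser such as lxml, which handles unclosed <p> tags differently, though that is an extra dependency.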