
Extract text data from website using Python:

I am trying to extract text data from a website using regex, but the problem is that it does not extract everything. I am following this tutorial: https://pythonprogramming.net/parse-website-using-regular-expressions-urllib but I don't know where I am going wrong. The website I am extracting text from is http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/ along with its relevant sublinks.

Code:

import urllib2
from urllib2 import Request
import re
#url = "http://www.tutorialspoint.com/cplusplus/cpp_basic_syntax.htm"
url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"

req = Request(url)
resp = urllib2.urlopen(req)
respData = resp.read()

regex = '<p.*?>(.*?)<\/p>'

paragraphs = re.findall(regex,str(respData))


for eachP in paragraphs:
    print(eachP)

Any ideas?

You should use BeautifulSoup for this. This is easy and better than regex.

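# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# htmls is the page's HTML source, e.g. respData from the question's code
soup = BeautifulSoup(htmls, 'html.parser')
for p in soup.find_all('p'):
    # Python 2 print statement; encode the Unicode text before printing
    print p.get_text().encode('utf-8') + "\n\n"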

The output is encoded as UTF-8 because the page at your source URL contains Unicode text.

Here you'll find how to install BeautifulSoup.
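
As for why the regex in the question returns only part of the text: one likely cause is that "." does not match newline characters by default, so a <p> element whose content spans several lines is skipped. A minimal sketch of that behaviour follows; the html string is a made-up snippet for illustration, not taken from the target page:

```python
import re

# Hypothetical HTML snippet: the second paragraph spans two lines.
html = "<p>One line.</p>\n<p class='x'>Two\nlines.</p>"

pattern = r'<p.*?>(.*?)</p>'

# Without re.DOTALL, "." stops at a newline, so the second paragraph is missed.
single_line = re.findall(pattern, html)

# With re.DOTALL, "." also matches newlines and both paragraphs are found.
multi_line = re.findall(pattern, html, re.DOTALL)

print(single_line)
print(multi_line)
```

Even with re.DOTALL the pattern stays fragile (for example, "<p.*?>" also matches a <pre> tag), which is one reason an HTML parser is usually the safer choice.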

I found a good approach for extracting paragraphs in the article "How to Scrape Paragraphs using Python?"

With example:

# import module
from bs4 import BeautifulSoup
  
# Html doc
html_doc = """
<html>
<head>
<title>Geeks</title>
</head>
<body>
<h2>paragraphs</h2>
  
<p>Welcome geeks.</p>
  
  
<p>Hello geeks.</p>
  
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
  
# traverse paragraphs from soup
for data in soup.find_all("p"):
    print(data.get_text())

Output:

Welcome geeks.
Hello geeks.

Extract Paragraphs from the given URL:

# import modules
import requests
from bs4 import BeautifulSoup

# fetch the HTML of the page at the given URL
def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata("https://www.geeksforgeeks.org/")
soup = BeautifulSoup(htmldata, 'html.parser')

# print the text of every paragraph
for data in soup.find_all("p"):
    print(data.get_text())

Output:

[Output image]

Multiple URLs:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # remove script and style elements
    for s in soup(["script", "style"]):
        s.decompose()
    # collect the text of every paragraph (print() returns None, so join the strings instead)
    txt = '\n'.join(p.get_text() for p in soup.find_all('p'))
    # collapse runs of line breaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'https://www.geeksforgeeks.org/how-to-download-install-nltk-on-windows/',
        'https://www.geeksforgeeks.org/how-to-scrape-paragraphs-using-python/'
    ]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)
    
if __name__=="__main__":
    main()

Output:

NLTK is Natural Language Tool Kit. It is used to build python programming. It helps to work with human languages data. It gives a very easy user interface. It supports classification, steaming, tagging, etc.Installing NLTK on Windows using PIP:In windows, we first have to install the python current version. Then we have to install pip with it. Without pip, NLTK can not be installed.Step 1: Browse to the official site of python by clicking this link.Step 2: Move the cursor to the Download button & then click on the latest python version.Step 3: Open the downloaded file. Click on the checkbox & Click on Customize installation.Step 4: Click on Next.Step 5: Click on Install.Step 6: Wait till installation finish.Step 7: Click on Close.Step 8: Open Command Prompt & execute the following commands:python --version
pip --version
pip install nltkHence, NLTK installation will start.Step 9: Then you can see the successfully installed message.Hence NLTK installation is successfulMy Personal Notes
arrow_drop_upSave
[...]
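
When find_all("p") is run on a page whose parsed tree has <p> elements nested inside one another, the same text can be printed several times, because get_text() on an outer element already includes the text of every inner one. Text from adjacent elements can also run together without spaces, as in "etc.Installing" above. A minimal sketch of one way to handle both, assuming nesting is the cause; the html string and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with paragraphs nested inside another paragraph.
html = "<p>Intro.<p>Step 1.</p><p>Step 2.</p></p>"
soup = BeautifulSoup(html, "html.parser")

# Every <p> is returned, so the inner text shows up more than once.
all_texts = [p.get_text() for p in soup.find_all("p")]

# Keep only paragraphs that are not inside another <p>, and put a space
# between the text fragments so they do not run together.
outer_texts = [
    p.get_text(" ", strip=True)
    for p in soup.find_all("p")
    if p.find_parent("p") is None
]

print(all_texts)
print(outer_texts)
```

Whether paragraphs end up nested depends on the page's markup and on the parser passed to BeautifulSoup, so it is worth inspecting the parsed tree before relying on this filter.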

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 