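Extract text data from website using Python
I am trying to extract text data from a website using regular expressions, but the text is not extracted completely. I am following this tutorial: https://pythonprogramming.net/parse-website-using-regular-expressions-urllib but I don't know where I went wrong. The website I am extracting text from is http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/ and its related sublinks.
Code: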
import urllib2
from urllib2 import Request
import re
#url = "http://www.tutorialspoint.com/cplusplus/cpp_basic_syntax.htm"
url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
req = Request(url)
resp = urllib2.urlopen(req)
respData = resp.read()
regex = '<p.*?>(.*?)<\/p>'
paragraphs = re.findall(regex,str(respData))
for eachP in paragraphs:
    print(eachP)
Any ideas?
You should use BeautifulSoup for this. It is easier and better than regex.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
# htmls is assumed to hold the page HTML, e.g. respData from the question's code
soup = BeautifulSoup(htmls, 'html.parser')
for p in soup.find_all('p'):
    print p.get_text().encode('utf-8') + "\n\n"
The utf-8 encoding is used for the Unicode text at the source URL.
Here you will find how to install BeautifulSoup.
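One possible reason the regex in the question misses text is that "." does not match a newline by default, so any paragraph that spans several lines is skipped. The sketch below illustrates this on a made-up HTML string (the sample markup and variable names are assumptions for illustration, not taken from the sanfoundry page):

```python
import re

# Hypothetical sample page: the second paragraph spans two lines.
html = "<p>First paragraph.</p>\n<p class=\"x\">Second\nparagraph.</p>"

# Same pattern as in the question (the backslash before "/" is not needed).
pattern = r'<p.*?>(.*?)</p>'

# By default "." does not match a newline, so the second paragraph is skipped.
without_dotall = re.findall(pattern, html)

# With re.DOTALL, "." also matches newlines and both paragraphs are captured.
with_dotall = re.findall(pattern, html, flags=re.DOTALL)

print(without_dotall)
print(with_dotall)
```

Even with re.DOTALL the pattern remains fragile (for example, "<p.*?>" also matches a "<pre>" tag), which is part of why an HTML parser is usually the safer choice.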
I found a good way to extract paragraphs in the article "How to Scrape Paragraphs using Python?"
Example:
# import module
from bs4 import BeautifulSoup
# Html doc
html_doc = """
<html>
<head>
<title>Geeks</title>
</head>
<body>
<h2>paragraphs</h2>
<p>Welcome geeks.</p>
<p>Hello geeks.</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# traverse paragraphs from soup
for data in soup.find_all("p"):
    print(data.get_text())
Output:
Welcome geeks.
Hello geeks.
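By default get_text() joins the pieces of text inside an element with no separator, which can glue words together when a paragraph contains tags such as line breaks. A minimal sketch of the separator and strip arguments, using a made-up paragraph (the sample HTML is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two pieces of text separated only by a line break tag.
html_doc = "<p>Welcome<br/>geeks</p>"
p = BeautifulSoup(html_doc, 'html.parser').find("p")

# Default: text pieces are concatenated directly.
plain = p.get_text()

# With a separator and strip=True, each piece is trimmed and joined by a space.
spaced = p.get_text(" ", strip=True)

print(plain)
print(spaced)
```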
Extracting paragraphs from a given URL:
# import modules
import requests
from bs4 import BeautifulSoup

# fetch the html of a page as text
def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata("https://www.geeksforgeeks.org/")
soup = BeautifulSoup(htmldata, 'html.parser')
# print the text of every paragraph
for data in soup.find_all("p"):
    print(data.get_text())
Output:
Multiple URLs:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# matches a line break followed by further whitespace
Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # remove script and style elements
    for s in soup(["script", "style"]):
        s.decompose()
    # collect the text of every paragraph (print() returns None, so join instead)
    txt = '\n'.join(p.get_text() for p in soup.find_all('p'))
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'https://www.geeksforgeeks.org/how-to-download-install-nltk-on-windows/',
        'https://www.geeksforgeeks.org/how-to-scrape-paragraphs-using-python/'
    ]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)

if __name__ == "__main__":
    main()
Output:
NLTK is Natural Language Tool Kit. It is used to build python programming. It helps to work with human languages data. It gives a very easy user interface. It supports classification, steaming, tagging, etc.Installing NLTK on Windows using PIP:In windows, we first have to install the python current version. Then we have to install pip with it. Without pip, NLTK can not be installed.Step 1: Browse to the official site of python by clicking this link.Step 2: Move the cursor to the Download button & then click on the latest python version.Step 3: Open the downloaded file. Click on the checkbox & Click on Customize installation.Step 4: Click on Next.Step 5: Click on Install.Step 6: Wait till installation finish.Step 7: Click on Close.Step 8: Open Command Prompt & execute the following commands:python --version
pip --version
pip install nltkHence, NLTK installation will start.Step 9: Then you can see the successfully installed message.Hence NLTK installation is successfulMy Personal Notes
arrow_drop_upSave
Writing code in comment?
Please use ide.geeksforgeeks.org,
generate link and share the link here.
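When paragraphs are nested inside one another (which html.parser may produce if a page leaves its <p> tags unclosed), find_all('p') combined with get_text() can print the same text several times, because each outer paragraph also contains the text of the inner ones. A minimal sketch of one way around this, keeping only the text that belongs to each paragraph itself; the helper name own_text and the sample HTML are assumptions for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: three paragraphs nested inside each other.
html_doc = "<p>Step 1.<p>Step 2.<p>Step 3.</p></p></p>"
soup = BeautifulSoup(html_doc, 'html.parser')

def own_text(p):
    """Return the text of p, skipping any child that is or contains a <p>."""
    parts = []
    for child in p.children:
        if isinstance(child, Tag):
            if child.name == 'p' or child.find('p') is not None:
                continue  # that text is reported by the inner paragraph
            parts.append(child.get_text())
        else:
            parts.append(str(child))  # plain text directly inside p
    return ''.join(parts).strip()

# Naive approach: every outer paragraph repeats the text of the inner ones.
naive = [p.get_text() for p in soup.find_all('p')]

# Each paragraph now contributes only its own text.
deduped = [own_text(p) for p in soup.find_all('p')]

print(naive)
print(deduped)
```

Switching to a more lenient parser such as lxml or html5lib, if installed, is another option to try, since those parsers handle unclosed tags differently.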
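Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to reprint, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.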