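Extract text data from website using Python
I am trying to extract text data from a website using regular expressions, but the problem is that it does not extract everything. I am following this tutorial: https://pythonprogramming.net/parse-website-using-regular-expressions-urllib but I can't tell where I went wrong. The website I am extracting text from is http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/ and its related sub-links.
Code: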
import urllib2
from urllib2 import Request
import re
#url = "http://www.tutorialspoint.com/cplusplus/cpp_basic_syntax.htm"
url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
req = Request(url)
resp = urllib2.urlopen(req)
respData = resp.read()
regex = '<p.*?>(.*?)<\/p>'
paragraphs = re.findall(regex,str(respData))
for eachP in paragraphs:
    print(eachP)
Any ideas?
You should use BeautifulSoup for this. It is easier and better than regex.
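# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
# htmls is the HTML string of the page (presumably respData from the question)
soup = BeautifulSoup(htmls, 'html.parser')
for p in soup.find_all('p'):
    # Python 2 print statement, matching the urllib2 code in the question
    print p.get_text().encode('utf-8') + "\n\n"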
The utf-8 encoding is used for the Unicode text on the source URL.
Here you can find how to install BeautifulSoup.
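As for why the regex in the question misses text: one likely cause (I have not checked it against the actual page) is that `.` does not match line breaks by default, so any paragraph whose markup spans more than one line is skipped. A minimal sketch of the effect, where `sample_html` is a made-up snippet standing in for the downloaded page:

```python
import re

# Hypothetical HTML snippet standing in for the downloaded page.
sample_html = "<p>one line</p>\n<p class='x'>spans\ntwo lines</p>"

# Same pattern as in the question (the backslash before "/" is not needed).
pattern = r'<p.*?>(.*?)</p>'

# Without DOTALL, "." does not match "\n", so the second paragraph is skipped.
without_dotall = re.findall(pattern, sample_html)

# With DOTALL, "." also matches line breaks.
with_dotall = re.findall(pattern, sample_html, re.DOTALL)

print(without_dotall)
print(with_dotall)
```

Even with `re.DOTALL` the captured text still contains any nested tags, which is one more reason an HTML parser is the safer choice.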
I found a good way to extract paragraphs in "How to Scrape Paragraphs using Python?"
Example:
# import module
from bs4 import BeautifulSoup
# Html doc
html_doc = """
<html>
<head>
<title>Geeks</title>
</head>
<body>
<h2>paragraphs</h2>
<p>Welcome geeks.</p>
<p>Hello geeks.</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# traverse paragraphs from soup
for data in soup.find_all("p"):
    print(data.get_text())
Output:
Welcome geeks.
Hello geeks.
Extract paragraphs from a given URL:
# import modules
import requests
from bs4 import BeautifulSoup
# fetch the HTML of a page
def getdata(url):
    r = requests.get(url)
    return r.text
htmldata = getdata("https://www.geeksforgeeks.org/")
soup = BeautifulSoup(htmldata, 'html.parser')
# print the text of every paragraph
for data in soup.find_all("p"):
    print(data.get_text())
Output:
Multiple URLs:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
# a line break followed by further whitespace
Newlines = re.compile(r'[\r\n]\s+')
def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # drop javascript and css content
    for s in soup(["script", "style"]):
        s.decompose()
    # collect the text of every paragraph
    txt = '\n'.join(p.get_text() for p in soup.find_all('p'))
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)
def main():
    urls = [
        'https://www.geeksforgeeks.org/how-to-download-install-nltk-on-windows/',
        'https://www.geeksforgeeks.org/how-to-scrape-paragraphs-using-python/'
    ]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)
if __name__ == "__main__":
    main()
Output:
NLTK is Natural Language Tool Kit. It is used to build python programming. It helps to work with human languages data. It gives a very easy user interface. It supports classification, steaming, tagging, etc.
Installing NLTK on Windows using PIP:
In windows, we first have to install the python current version. Then we have to install pip with it. Without pip, NLTK can not be installed.
Step 1: Browse to the official site of python by clicking this link.
Step 2: Move the cursor to the Download button & then click on the latest python version.
Step 3: Open the downloaded file. Click on the checkbox & Click on Customize installation.
Step 4: Click on Next.
Step 5: Click on Install.
Step 6: Wait till installation finish.
Step 7: Click on Close.
Step 8: Open Command Prompt & execute the following commands:
python --version
pip --version
pip install nltk
Hence, NLTK installation will start.
Step 9: Then you can see the successfully installed message.
Hence NLTK installation is successful
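The question also mentions the page's related sub-links. A minimal sketch of collecting them, using only the standard library so it can be tried without installing anything. `LinkCollector`, `same_site_links` and the example.com URLs are made up for illustration and are not from any of the answers:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def same_site_links(html, base_url):
    """Return unique links on the same host as base_url, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.append(link)
    return seen


# Hypothetical snippet standing in for a downloaded page.
sample = ('<a href="/page-2/">next</a> '
          '<a href="http://other.example/">ad</a> '
          '<a href="/page-2/">again</a>')
links = same_site_links(sample, "http://www.example.com/page-1/")
print(links)
```

Each URL returned this way could then be passed to a function such as `getPageText`. With BeautifulSoup the same idea is `soup.find_all('a', href=True)` followed by `urljoin`.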
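Disclaimer: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.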