
Extract text data from a website using Python

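I am trying to extract text data from a website using regular expressions, but the problem is that it does not extract everything. I am following this tutorial: https://pythonprogramming.net/parse-website-using-regular-expressions-urllib but I can't tell where I am going wrong. The site I am extracting text from is http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/ and its related sub-links.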

Code:

import urllib2
from urllib2 import Request
import re
#url = "http://www.tutorialspoint.com/cplusplus/cpp_basic_syntax.htm"
url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"

req = Request(url)
resp = urllib2.urlopen(req)
respData = resp.read()

regex = '<p.*?>(.*?)<\/p>'

paragraphs = re.findall(regex,str(respData))


for eachP in paragraphs:
    print(eachP)

Any ideas?

You should use BeautifulSoup for this. It is easier and more reliable than regex.

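# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# htmls is the page source, e.g. respData from the question's code
htmls = respData
soup = BeautifulSoup(htmls, 'html.parser')
for p in soup.find_all('p'):
    # Python 2: encode to UTF-8 before printing non-ASCII text
    print p.get_text().encode('utf-8') + "\n\n"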

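The UTF-8 encoding is needed because the page at the source URL contains Unicode text.

Installation instructions for BeautifulSoup are available in its documentation.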
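
As a side note on why the regex in the question may miss text: by default `.` does not match a newline, so any paragraph whose content spans more than one line is skipped. The sketch below uses a made-up HTML snippet (not the actual page) to show the difference that `re.DOTALL` makes. It is only an illustration of the likely cause, not a recommendation to keep parsing HTML with regex.

```python
import re

# Made-up sample markup; the second paragraph spans two lines.
html = """<p class="intro">First paragraph.</p>
<p>Second paragraph
spans two lines.</p>"""

pattern = r'<p.*?>(.*?)</p>'

# Default flags: "." stops at a newline, so the two-line paragraph is missed.
default_hits = re.findall(pattern, html)

# re.DOTALL lets "." match newlines too, so both paragraphs are found.
dotall_hits = re.findall(pattern, html, flags=re.DOTALL)

print(default_hits)
print(dotall_hits)
```

Even with `re.DOTALL`, the pattern `<p.*?>` would also match tags such as `<pre>`, which is one reason an HTML parser is the safer choice.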

I found a good way to extract paragraphs in the article "How to Scrape Paragraphs using Python?"

Example:

# import module
from bs4 import BeautifulSoup
  
# Html doc
html_doc = """
<html>
<head>
<title>Geeks</title>
</head>
<body>
<h2>paragraphs</h2>
  
<p>Welcome geeks.</p>
  
  
<p>Hello geeks.</p>
  
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
  
# traverse paragraphs from soup
for data in soup.find_all("p"):
    print(data.get_text())

Output:

Welcome geeks.
Hello geeks.

Extracting paragraphs from a given URL:

# import modules
import requests
from bs4 import BeautifulSoup

# fetch the HTML source of a page
def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata("https://www.geeksforgeeks.org/")
soup = BeautifulSoup(htmldata, 'html.parser')
# print the text of every paragraph
for data in soup.find_all("p"):
    print(data.get_text())

Output:

[Output image]
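
A slightly more defensive variant might separate fetching from parsing, so the parsing step can be tried on a plain string first. This is only a sketch: the helper names `paragraphs_from_html` and `paragraphs_from_url` and the 10-second timeout are assumptions for illustration, not part of the original article.

```python
import requests
from bs4 import BeautifulSoup


def paragraphs_from_html(html):
    """Return the stripped text of every non-empty <p> element."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (p.get_text(strip=True) for p in soup.find_all("p"))
    return [t for t in texts if t]


def paragraphs_from_url(url, timeout=10):
    """Fetch url and return its paragraph texts (hypothetical helper)."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return paragraphs_from_html(r.text)


# Try the parsing step on an inline sample, no network needed.
sample = "<body><p> Welcome geeks. </p><p></p><p>Hello geeks.</p></body>"
print(paragraphs_from_html(sample))
```

Calling `paragraphs_from_url("https://www.geeksforgeeks.org/")` would then return the same kind of list for a live page.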

Multiple URLs:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # remove script and style elements
    for s in soup(["script", "style"]):
        s.decompose()
    # collect the text of every paragraph
    txt = '\n'.join(p.get_text() for p in soup.find_all('p'))
    # collapse line breaks that are followed by extra whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'https://www.geeksforgeeks.org/how-to-download-install-nltk-on-windows/',
        'https://www.geeksforgeeks.org/how-to-scrape-paragraphs-using-python/'
    ]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)

if __name__ == "__main__":
    main()

Output:

NLTK is Natural Language Tool Kit. It is used to build python programming. It helps to work with human languages data. It gives a very easy user interface. It supports classification, steaming, tagging, etc.Installing NLTK on Windows using PIP:In windows, we first have to install the python current version. Then we have to install pip with it. Without pip, NLTK can not be installed.Step 1: Browse to the official site of python by clicking this link.Step 2: Move the cursor to the Download button & then click on the latest python version.Step 3: Open the downloaded file. Click on the checkbox & Click on Customize installation.Step 4: Click on Next.Step 5: Click on Install.Step 6: Wait till installation finish.Step 7: Click on Close.Step 8: Open Command Prompt & execute the following commands:python --version
pip --version
pip install nltkHence, NLTK installation will start.Step 9: Then you can see the successfully installed message.Hence NLTK installation is successfulMy Personal Notes
arrow_drop_upSave

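The `Newlines` pattern in `getPageText` only matches a line break that is followed by more whitespace, so it collapses blank lines and indentation but leaves ordinary single line breaks alone. A small sketch with a made-up input string:

```python
import re

# Same pattern as in getPageText: a line break followed by more whitespace.
Newlines = re.compile(r'[\r\n]\s+')

# Made-up sample text with blank lines, indentation and a Windows line ending.
raw = "First line.\n\n   Second line.\r\n\tThird line.\nFourth line."
cleaned = Newlines.sub('\n', raw)

print(repr(cleaned))
```

The single `\n` before "Fourth line." is kept unchanged because no whitespace follows it.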