Python: scrape a part of source code and save it as html

Here is my case: I need to save a web page's source code as an HTML file. The page has many sections that I don't need; I only want to save the source code of the article itself.

Code:

from urllib.request import urlopen

page = urlopen('http://www.abcde.com')
page_content = page.read()

with open('page_content.html', 'wb') as f:
    f.write(page_content)

My code saves the whole source, but how can I save only the part I want?

To explain:

<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>

I need to save the source code of this tag and everything inside it, markup included, not just extract the text between the tags.

The result I want to save looks like this:

<div itemscope itemtype="http://schema.org/MedicalWebPage">

                    <div class="col-md-12 col-xs-12" style="padding-left:10px;">
                        <h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
                    </div>
                    <!--Article Start-->
                    <section class="page_article_div" id="print">
                        <article itemprop="text" class="page_article_content">
<p>
    <img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
    The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
    It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
    <strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
    <li>
        Germanic paganism</li>
    <li>
        Greek mythology</li>
</ol>
<p style="text-align: right;">
    【Jane】</p>
<p style="text-align: right;">
    Credit : Wiki</p>

                        </article>
                            <div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
                        <br />                  
                        <div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
                    </section>
                    <!--Article End-->
</div>

My own solution here:

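from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")

# Collect the outer HTML (tag included) of every matching div.
# The name "fragments" avoids shadowing the built-in list.
fragments = []
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    fragments.append(str(tag))

# Join with newlines; joining with ', ' would put stray commas between divs.
html_out = '\n'.join(fragments)

with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(html_out)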

I am a beginner, so I tried to keep it as simple as possible. This is my answer, and it works quite well at the moment :)
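A minimal sketch of the same idea for the case where only one matching div is expected, so no list or join is needed. It runs on an inline HTML string instead of the live page. The helper name `extract_fragment` and the sample markup are hypothetical, introduced here only for illustration, and the built-in `html.parser` backend is assumed in place of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: only the middle div should be kept.
SAMPLE_HTML = """
<html><body>
<nav>menu we do not want</nav>
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<h1 itemprop="name">Apple</h1>
<p>The apple tree is a deciduous tree.</p>
</div>
<footer>footer we do not want</footer>
</body></html>
"""


def extract_fragment(html, selector):
    """Return the outer HTML of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    # str(tag) serializes the tag itself plus everything inside it.
    return None if tag is None else str(tag)


fragment = extract_fragment(
    SAMPLE_HTML, 'div[itemtype="http://schema.org/MedicalWebPage"]'
)
```

`fragment` can then be written to disk with `open(path, 'w', encoding='UTF-8')`. Checking for `None` first avoids writing an empty file when the selector matches nothing.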

You can search for the tag by one of its properties, such as its class, tag name, or id, and then save it in whatever format you want, as in the example below.

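from bs4 import BeautifulSoup

# yoursavedfile: an open file object for the saved page
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
tag_for_me = soup.find_all(class_='class_name_of_your_tag')  # Beautiful Soup lookup by class
print(tag_for_me)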

tag_for_me will contain the code you need.
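The three kinds of lookup mentioned in this answer (class, id, tag name) can be sketched with Beautiful Soup's `find` on a small inline document. The markup below is a hypothetical cut-down version of the page in the question, and `html.parser` is assumed as the backend.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modeled on the question's page.
html = """
<div id="sidebar" class="ads">ads</div>
<section class="page_article_div" id="print">
  <article class="page_article_content"><p>Apple</p></article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find(class_="page_article_content")  # first tag with this class
by_id = soup.find(id="print")                        # tag with this id
by_name = soup.find("article")                       # first <article> tag

# str() on a tag gives its markup, including the tag itself.
article_html = str(by_class)
```

`find` returns the first match or `None`; `find_all` with the same arguments returns every match as a list.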

You can use Beautiful Soup to get any HTML source you need.

import requests
from bs4 import BeautifulSoup

target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")

for elem in soup.find_all(attrs={"class":target_class}):
    if elem.text == target_text:
        print(elem)

Output:

<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>

Use BeautifulSoup to find the tag you want to insert into, then fetch the HTML you want to insert. Call insert() to add the new tag, and overwrite the original file with the result.

from bs4 import BeautifulSoup
import requests

# Use Beautiful Soup to find the place where you want to insert.
# div_tag is the extracted div, e.g.
# <div itemscope itemtype="http://schema.org/MedicalWebPage"> ... </div>
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'itemtype': 'http://schema.org/MedicalWebPage'})

res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text, 'lxml')
insert_data = soup1.find('your div/data to insert')
# Insert the tag into div_tag at child position 3.
div_tag.insert(3, insert_data)
# div_tag now contains the desired output; write str(div_tag) back to the original file.
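To make the position argument of `insert()` concrete, here is a small sketch on two inline documents. The markup and the variable names `target` and `source` are hypothetical, and `html.parser` is assumed as the backend.

```python
from bs4 import BeautifulSoup

# Hypothetical documents: one to insert into, one to take a tag from.
target = BeautifulSoup(
    '<div id="box"><p>first</p><p>second</p></div>', "html.parser"
)
source = BeautifulSoup('<span class="new">inserted</span>', "html.parser")

div_tag = target.find("div", id="box")
insert_data = source.find("span")

# insert(position, tag): position is an index into div_tag.contents,
# so 1 places the new tag between the two paragraphs.
div_tag.insert(1, insert_data)
result = str(div_tag)
```

On a real page, whitespace between tags is stored as text nodes that also count as children, so the index needed there is often larger than the number of visible tags would suggest.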
