
Is there a way to erase or separate web scraping data in Python?

Hello, I'm scraping the latest news from the ABC News website. The HTML I'm scraping looks like this:

 <a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831" name="lpos=widget[A_3_freeformlite_4380645_homepage]&amp;lid=link[Headline_2]">Huckabee Draws Cheers at Fundraiser for West Bank Settlement<span class="metaH_timeDay">41 minutes ago</span></a>

As you can see, there is a span tag inside the a tag, so when I scrape it with BeautifulSoup I get the text like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement41 minutes ago

The time ends up directly attached to the headline. I would like the "41 minutes ago" part separated so that it looks like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement 41 minutes ago

Or at least remove it.

My code looks like this:

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

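# Headline links are named ...link[Headline_1] through ...link[Headline_9]
for x in range(1, 10):
    name = "lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_" + str(x) + "]"
    for link in soup.find_all("a", {"name": name}):
        print(link.text)  # headline with the time stuck to the end
        print(link.find_all("", {"class": "metaH_timeDay"})[0].text)  # the time on its own
        print("")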

Can someone help me?

Let's pull it out with extract(), which removes a tag from the tree and returns it:

>>> link.span.extract()     # remove the first `span` tag that we don't need
>>> time = link.span.extract()
>>> time
<span class="metaH_timeDay">2 hours, 45 minutes ago</span>
>>> link.text
' Obama Seeks to Remove Fear From ISIS Fight'
>>> time.text
'2 hours, 45 minutes ago'
>>> 
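The session above ran against the live page, which has since changed. As a minimal offline sketch, the same extract() call can be tried on the snippet from the question. The variable names (`time_tag`, `headline`, `timestamp`) are only illustrative, and the built-in "html.parser" is used here so that lxml is not required.

```python
from bs4 import BeautifulSoup

# The anchor from the question, trimmed to the parts that matter here.
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Find the time span; extract() detaches it from the tree, and the
# detached tag stays usable through time_tag.
time_tag = link.find("span", class_="metaH_timeDay")
if time_tag is not None:
    time_tag.extract()

headline = link.get_text(strip=True)
timestamp = time_tag.get_text(strip=True) if time_tag is not None else ""

print(headline)
print(timestamp)
```

Because the span is now detached, the headline and the time can be printed separately or joined with whatever separator you prefer.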

You can also use the decompose() function: run a while loop to remove every span tag from each link in that div:

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

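# Select every link inside <div class="h">
for j in soup.select("div.h a"):
    # Re-parse the link so the original soup is left unmodified
    f = BeautifulSoup(str(j), "html.parser")
    # Destroy span tags one at a time until none are left
    while f.span:
        f.span.decompose()
    print(f.text)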

Output:

 Obama Seeks to Remove Fear From ISIS Fight
Kerry off to Paris Again for Climate Conference
Huckabee Draws Cheers at Fundraiser for West Bank Settlement
Sanders Unveils Plan to Address Climate Change
 FBI Looking Into Blatter's Role in Bribery Case
Armed Bank Robbery Suspect Shot in Miami Had Escaped From Half-Way House
13 Injured in Attack on Government Office in Western China
Police Arrest Mother of Newborn Baby Who Was Buried Alive
Shooting Suspect's Neighbor Says He Became 'More Withdrawn'
 Justice Department to Investigate Chicago Police
Hillary Clinton Corrects Flub, Thanks to Justice Breyer
 Dashcam Must Be Working
Clinton Laughs Off Trump’s Claims That She Lacks ‘Stamina’
 Man Killed in Wisconsin Standoff Was a Hostage
 2 New York College Students Abducted, Held Hostage
Transgender Actress, Warhol Muse Holly Woodlawn Dies at 69
 Mood Dour Among Venezuelan Ruling Party Backers
Hillary Clinton Says ‘We’re Not Winning’ Fight Against ISIS
Jimmy Carter Says Latest Brain Scan Shows No Cancer
One Direction Leads the Way on Twitter's List of 2015 Tweets
Promises of Grocery Stores in Needy Areas Mostly Unfulfilled
McNabb Scores Tiebreaking Goal, Kings Beat Lightning 3-1
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
Medical Examiner Shortage: Facts About Death Investigations
Roethlisberger Throws 4 TD Passes, Steelers Roll Colts 45-10
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
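The decompose() approach can also be tried offline. The sketch below is one possible variation, not the answer's exact code: it works on a copy of each link made with copy.copy() instead of re-parsing str(j), and it also shows get_text() with a separator as a way to keep the time but put a space before it. The markup is made up for illustration (the hrefs and the second timestamp are placeholders), and `text_without_spans` is a hypothetical helper name.

```python
import copy

from bs4 import BeautifulSoup

# Made-up markup modeled on the question's snippet; hrefs and the second
# timestamp are placeholders, not real page data.
html = """
<div class="h">
  <a href="/a">Obama Seeks to Remove Fear From ISIS Fight<span class="metaH_timeDay">2 hours, 45 minutes ago</span></a>
  <a href="/b">Kerry off to Paris Again for Climate Conference<span class="metaH_timeDay">3 hours ago</span></a>
</div>
"""


def text_without_spans(tag):
    """Return the tag's text with every span removed (hypothetical helper)."""
    clone = copy.copy(tag)  # copy the tag so the parsed page is untouched
    while clone.span:       # same loop as in the answer above
        clone.span.decompose()
    return clone.get_text(strip=True)


soup = BeautifulSoup(html, "html.parser")
links = soup.select("div.h a")

# Option 1: drop the time entirely.
headlines = [text_without_spans(a) for a in links]

# Option 2: keep the time, separated from the headline by a space.
combined = [a.get_text(" ", strip=True) for a in links]

for line in headlines + combined:
    print(line)
```

Working on a copy keeps the span tags in `soup`, so the time is still available later if it turns out to be needed.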

