
Python 3, BeautifulSoup 4: scrape and print text of a specific parse tree

I have searched around here but haven't found a post that helps with what I need to accomplish.

Website: http://www.animefansftw.com/

I'm trying to get the h1 title of every post from a set date only. I was able to get the posts for the set date, but I'm stuck on how to get the h1 title of those posts.

import time
import requests
from bs4 import BeautifulSoup

# Today's date as month name plus zero-padded day, e.g. "March 05"
Aniday = time.strftime("%B %d")
r = requests.get("http://www.animefansftw.com")
soup = BeautifulSoup(r.content, "html.parser")
print("Today's Animu Crack:\n")

# Print the date of every post whose date block matches today
for div in soup.find_all("div", {"class": "date"}):
    get_date = div.text
    clean_date = " ".join(get_date.split())  # collapse runs of whitespace
    if clean_date == Aniday:
        print(clean_date)

To avoid confusion: I can get the h1 titles of all the posts just fine, but I don't want all of them, only those from posts that carry the date I set.

for item in soup.find_all("h1"):
    info = item.text
    clean_info = " ".join(info.split())
    print(clean_info) 

Glancing at the source, it looks like the h1 tag sits inside the date div's grandparent (its parent's parent).

Try:

import time
import requests
from bs4 import BeautifulSoup

Aniday = time.strftime("%B %d")
r = requests.get("http://www.animefansftw.com")
soup = BeautifulSoup(r.content, "html.parser")
print("Today's Animu Crack:\n")

for div in soup.find_all("div", {"class": "date"}):
    get_date = div.text
    clean_date = " ".join(get_date.split())
    if clean_date == Aniday:
        # The date div's grandparent wraps the whole post, including its h1
        post_div = div.parent.parent
        # Drop non-ASCII characters, then decode back to str;
        # on Python 3 a bare .encode() returns bytes and prints as b'...'
        title = post_div.h1.text.encode('ascii', 'ignore').decode('ascii')
        print("{title}\n{date}\n".format(title=title, date=clean_date))
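
The same idea can be tried offline against a small inline HTML sample, which makes the parent lookup easier to check. This is only a sketch: the markup in SAMPLE_HTML, the "post" class name, and the helper names normalize and titles_for_date are assumptions made for illustration, not the real page's structure. It uses find_parent instead of a fixed .parent.parent chain, so it does not depend on exactly how many levels separate the date div from the post container.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real site's structure and class names are assumed.
SAMPLE_HTML = """
<div class="post">
  <div class="meta"><div class="date"> March   05 </div></div>
  <h1> First   Post </h1>
</div>
<div class="post">
  <div class="meta"><div class="date">March 04</div></div>
  <h1>Older Post</h1>
</div>
<div class="post">
  <div class="meta"><div class="date">March 05</div></div>
  <h1>Second Post</h1>
</div>
"""


def normalize(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def titles_for_date(html, wanted_date):
    """Return the h1 titles of posts whose date block equals wanted_date."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for date_div in soup.find_all("div", class_="date"):
        if normalize(date_div.get_text()) != wanted_date:
            continue
        # Walk up to the nearest enclosing post container (assumed class name).
        post_div = date_div.find_parent("div", class_="post")
        if post_div is None or post_div.h1 is None:
            continue  # skip date blocks that have no post wrapper or no title
        titles.append(normalize(post_div.h1.get_text()))
    return titles


if __name__ == "__main__":
    for title in titles_for_date(SAMPLE_HTML, "March 05"):
        print(title)
```

Against the live site, r.content would be passed in place of SAMPLE_HTML and time.strftime("%B %d") in place of the literal date. Note that %d zero-pads the day, so the comparison only matches if the site writes its dates the same way.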
