
BeautifulSoup4: using conditional statements with a Tag object

I wrote the following code to scrape information from a bookshop's website and save the data to a JSON file. The code runs fine, but I want to use a conditional statement to keep only the books whose title contains a specific word (in this case 'Gardening'). No matter how I implement this, it returns all the books and not just the ones I specified.

import json

from bs4 import BeautifulSoup

bookArray = []
content = BeautifulSoup(open("mywebpage...", encoding="utf8"), "html.parser")
books = content.findAll('div', attrs={"class": "su-post"})
# each book is a bs4.element.Tag object

for book in books:
    titles = book.find('h2', attrs={"class": "su-post-title"})
    titleText= titles.find('a').contents[0]
    if 'Gardening' or 'gardening' in titleText:
        dateAdded = book.find('div', attrs={"class": "su-post-meta"}).text
        urls = titles.find('a').attrs['href'].split()
        year = getPublishDate(urls[0])  # getPublishDate is a helper defined elsewhere (not shown)
        bookObject = {
            "title": titleText,
            "url": urls,
            "year": year,
            "dateAdded": dateAdded.strip('\n\t').replace('Posted: ', ''),
        }
        bookArray.append(bookObject)

try:
    with open('bookList.json', 'w') as outfile:
        json.dump(bookArray, outfile)
except Exception:
    print("Write to file failed")

I also tried the following method, but got the same output, with all books written to the JSON file:

for book in books:
    if 'Gardening' or 'gardening' in book.text:
        # have also tried: if 'Gardening' in book.string:

        dateAdded = book.find('div', attrs={"class": "su-post-meta"}).text
        ...same as above

Finally, here is some sample output from the JSON file, showing that the conditional statements have no effect:

[{
    "title": "Of Mice and Men",
    "url": ["http://mysite...."],
    "year": "1937",
    "dateAdded": "2020-08-11"
},
{
    "title": "Wuthering Heights",
    "url": ["http://mysite...."],
    "year": "1847",
    "dateAdded": "2020-06-06"
},

Further details: If I modify the code to print out every book, they are displayed in the following HTML format:

for book in books:
    print(book)

Output:
<div class="su-post" id="su-post-6238">
    <h2 class="su-post-title"><a href="ref to local file.../">Wuthering Heights</a></h2>
    <div class="su-post-meta">Posted: 2020-06-06</div>
    <div class="su-post-excerpt"></div>
</div>
<div class="su-post" id="su-post-8990">
    <h2 class="su-post-title"><a href="ref to another local file...">Of Mice and Men</a></h2>
    <div class="su-post-meta">Posted: 2020-08-11</div>
    <div class="su-post-excerpt"></div>
</div>

The problem is this line:

if 'Gardening' or 'gardening' in titleText:

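Python parses this condition as 'Gardening' or ('gardening' in titleText). A non-empty string such as 'Gardening' is truthy, so the whole condition always evaluates as true, whatever the title is.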
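To make the grouping concrete, here is a minimal sketch using plain strings instead of the scraped tags. The function names are hypothetical, and "Gardening for Beginners" is a made-up title; the other two titles come from the sample output in the question.

```python
def buggy_condition(title_text):
    # Parsed as: 'Gardening' or ('gardening' in title_text).
    # `or` returns its first truthy operand, and a non-empty string is truthy,
    # so the membership test on the right is never evaluated.
    return 'Gardening' or 'gardening' in title_text


def fixed_condition(title_text):
    # Each side of `or` is now its own membership test.
    return 'Gardening' in title_text or 'gardening' in title_text


for title in ["Of Mice and Men", "Wuthering Heights", "Gardening for Beginners"]:
    print(title, "|", repr(buggy_condition(title)), "|", fixed_condition(title))
```

The buggy version returns the string 'Gardening' itself for every title, which is why every book passes the if statement.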

Solution: change it to

if "Gardening" in titleText or "gardening" in titleText:

or

if "gardening" in titleText.lower():

or

if any(x in titleText for x in ['Gardening', 'gardening']):
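As a rough comparison of the three variants, the sketch below runs each one against a few titles. The function names are hypothetical, and only the first two titles come from the question's sample output; the rest are made up for illustration. The three checks are not strictly equivalent: the .lower() version is the only one that also matches other capitalisations such as "GARDENING".

```python
def explicit_or(title_text):
    return "Gardening" in title_text or "gardening" in title_text


def lowercase_match(title_text):
    # Case-insensitive: also matches "GARDENING", "GarDening", and so on.
    return "gardening" in title_text.lower()


def any_match(title_text, keywords=("Gardening", "gardening")):
    # Easy to extend with more keywords, but still case-sensitive.
    return any(keyword in title_text for keyword in keywords)


titles = [
    "Of Mice and Men",           # from the sample output
    "Wuthering Heights",         # from the sample output
    "Gardening for Beginners",   # made-up title
    "Organic gardening",         # made-up title
    "GARDENING BASICS",          # made-up title
]

for check in (explicit_or, lowercase_match, any_match):
    matches = [title for title in titles if check(title)]
    print(check.__name__, matches)
```

If the site's capitalisation is not predictable, the .lower() form is probably the safest of the three.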
