
How to scrape content from HTML tree without attribute value

I'm having trouble scraping HTML data and extracting specific fields. Here is the HTML:

```

<li class="highlight">
                                                    Relationship Issues
                                            </li>
<li class="highlight">
                                                    Depression
                                            </li>
<li class="highlight">
                                                    Spirituality
                                            </li>
</ul>
</div>
</div>, <div class="spec-list attributes-issues">
<h5 class="spec-subcat">Issues</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
                                                            ADHD
                                                    </li>
<li class="">
                                                            Alcohol Use
                                                    </li>
<li class="">
                                                            Anger Management
                                                    </li>
<li class="">
                                                            Antisocial Personality
                                                    </li>
<li class="">
                                                            Anxiety
                                                    </li>
<li class="">
                                                            Behavioral Issues
                                                    </li>
<li class="">
                                                            Bipolar Disorder
                                                    </li>
<li class="">
                                                            Borderline Personality
                                                    </li>
<li class="">
                                                            Career Counseling
                                                    </li>
<li class="">
                                                            Child or Adolescent
                                                    </li>
<li class="">
                                                            Chronic Illness
                                                    </li>
<li class="">
                                                            Chronic Pain
                                                    </li>
<li class="">
                                                            Coping Skills
                                                    </li>
<li class="">
                                                            Divorce
                                                    </li>
<li class="">
                                                            Domestic Abuse
                                                    </li>
<li class="">
                                                            Domestic Violence
                                                    </li>
<li class="">
                                                            Eating Disorders
                                                    </li>
<li class="">
                                                            Emotional Disturbance
                                                    </li>
<li class="">
                                                            Family Conflict
                                                    </li>
<li class="">
                                                            Grief
                                                    </li>
<li class="">
                                                            Internet Addiction
                                                    </li>
<li class="">
                                                            Life Coaching
                                                    </li>
<li class="">
                                                            Life Transitions
                                                    </li>
<li class="">
                                                            Marital and Premarital
                                                    </li>
<li class="">
                                                            Men's Issues
                                                    </li>
<li class="">
                                                            Narcissistic Personality
                                                    </li>
<li class="">
                                                            Obsessive-Compulsive (OCD)
                                                    </li>
<li class="">
                                                            Parenting
                                                    </li>
<li class="">
                                                            School Issues
                                                    </li>
<li class="">
                                                            Self Esteem
                                                    </li>
<li class="">
                                                            Self-Harming
                                                    </li>
<li class="">
                                                            Stress
                                                    </li>
<li class="">
                                                            Suicidal Ideation
                                                    </li>
<li class="">
                                                            Transgender
                                                    </li>
<li class="">
                                                            Trauma and PTSD
                                                    </li>
<li class="">
                                                            Women's Issues
                                                    </li>
</ul>
</div>
</div>, <div class="spec-list attributes-mental-health">
<h5 class="spec-subcat">Mental Health</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
                                                            Dissociative Disorders
                                                    </li>
<li class="">
                                                            Elderly Persons Disorders
                                                    </li>
<li class="">
                                                            Impulse Control Disorders
                                                    </li>
<li class="">
                                                            Mood Disorders
                                                    </li>
<li class="">
                                                            Personality Disorders
                                                    </li>
<li class="">
                                                            Psychosis
                                                    </li>
<li class="">
                                                            Thinking Disorders
                                                    </li>
</ul>
</div>
</div>, <div class="spec-list attributes-sexuality">
<h5 class="spec-subcat">Sexuality</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
                                                            Bisexual
                                                    </li>
<li class="">
                                                            Lesbian
                                                    </li>
<li class="">
                                                            Gay
                                                    </li>
</ul>
</div>
</div>]

```

Here is my code:

```
import requests
from bs4 import BeautifulSoup
from lxml import html
import html5lib
import re
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
URL = "https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578"


page = requests.get(URL, headers=headers)

# BeautifulSoup takes a single parser name via `features`; `parser=` is not one of its arguments
soup = BeautifulSoup(page.content, features="lxml")

specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})
issues = soup.find_all('div', {'class': 'spec-list attributes-issues'})
mental_health = soup.find_all('div', {'class': 'spec-list attributes-mental-health'})
sexuality = soup.find_all('div', {'class': 'spec-list attributes-sexuality'})

```
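Each `find_all` call above returns the enclosing `<div>` elements, so the text of the `<li>` items still has to be pulled out of them. The following is a minimal sketch of one way to do that with a CSS selector, run against a trimmed inline copy of the markup from the question so it works offline. The helper name `list_items` is hypothetical, and the sketch assumes each section keeps its `attributes-*` class.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (only a few items kept).
HTML = """
<div class="spec-list attributes-issues">
<h5 class="spec-subcat">Issues</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
        ADHD
</li>
<li class="">
        Alcohol Use
</li>
</ul>
</div>
</div>
<div class="spec-list attributes-sexuality">
<h5 class="spec-subcat">Sexuality</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
        Bisexual
</li>
</ul>
</div>
</div>
"""


def list_items(soup, section_class):
    """Return the stripped text of every <li> inside <div class="... section_class">."""
    # The <li> elements have an empty class, so they are selected through their parent div.
    return [li.get_text(strip=True) for li in soup.select(f"div.{section_class} li")]


soup = BeautifulSoup(HTML, "html.parser")
issues = list_items(soup, "attributes-issues")
sexuality = list_items(soup, "attributes-sexuality")
print(issues)
print(sexuality)
```

Selecting through the parent `<div>` avoids relying on the `<li>` elements themselves, which carry no usable attribute value.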

Ideally, the result would be a CSV (or Excel) file containing the following output:

Name: {name}
Location: {location}
Phone Number: {Phone_number}
Specialties: {Specialities_{count}}
Issues: {Issues_{count}}
Mental Health Care: {Mental_Health_{count}}

I'd like to point it at a general directory site and have the code scrape the HTML for these fields. The URL is: https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578

Thanks!

To get the required information from the page, you can use the following example:

```
import requests
from bs4 import BeautifulSoup


url = 'https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

name = soup.select_one('h1[itemprop="name"]').get_text(strip=True)
location = soup.select_one('.address-data').get_text(strip=True, separator=' ')
phone_number = soup.select_one('.phone-number').get_text(strip=True, separator=' ')

# Find each section by its <h5> heading text, then take the <li> items of the <div> right after it.
# `:-soup-contains` needs soupsieve 2.1+; older versions spell it `:contains`.
specialties = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Specialties") + div li')]
issues = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Issues") + div li')]
mental_health = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Mental Health") + div li')]

print('Name:')
print(name)
print('Location:')
print(location)
print('Phone Number:')
print(phone_number)
print('Specialties:')
print(*specialties, sep=', ')
print('Issues:')
print(*issues, sep=', ')
print('Mental Health')
print(*mental_health, sep=', ')
```

Prints:

```
Name:
Gary L Phillips
Location:
550 Sunset Ridge Rd Northfield, IL 60093
Phone Number:
(847) 212-1496
Specialties:
Relationship Issues, Depression, Spirituality
Issues:
ADHD, Alcohol Use, Anger Management, Antisocial Personality, Anxiety, Behavioral Issues, Bipolar Disorder, Borderline Personality, Career Counseling, Child or Adolescent, Chronic Illness, Chronic Pain, Coping Skills, Divorce, Domestic Abuse, Domestic Violence, Eating Disorders, Emotional Disturbance, Family Conflict, Grief, Internet Addiction, Life Coaching, Life Transitions, Marital and Premarital, Men's Issues, Narcissistic Personality, Obsessive-Compulsive (OCD), Parenting, School Issues, Self Esteem, Self-Harming, Stress, Suicidal Ideation, Transgender, Trauma and PTSD, Women's Issues
Mental Health
Dissociative Disorders, Elderly Persons Disorders, Impulse Control Disorders, Mood Disorders, Personality Disorders, Psychosis, Thinking Disorders
```
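The question asked for a CSV file rather than printed output. As a rough sketch of that last step, the scraped values could be collected into a dict and written with the standard `csv` module. The column names follow the labels in the question; the helper names `to_row` and `write_csv`, the `"; "` separator, and the shortened lists are illustrative assumptions, not part of the original answer.

```python
import csv
import io

# Values copied from the printed output above (lists shortened for brevity);
# in practice they would come from the scraping code.
record = {
    "Name": "Gary L Phillips",
    "Location": "550 Sunset Ridge Rd Northfield, IL 60093",
    "Phone Number": "(847) 212-1496",
    "Specialties": ["Relationship Issues", "Depression", "Spirituality"],
    "Issues": ["ADHD", "Alcohol Use", "Anger Management"],
    "Mental Health Care": ["Mood Disorders", "Psychosis"],
}


def to_row(record, sep="; "):
    """Join list-valued fields into one delimited string so each fits in a single CSV cell."""
    return {key: sep.join(value) if isinstance(value, list) else value
            for key, value in record.items()}


def write_csv(records, fileobj):
    """Write one CSV row per record, using the first record's keys as the header."""
    rows = [to_row(r) for r in records]
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch side-effect free.
buffer = io.StringIO()
write_csv([record], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the buffer could be swapped for something like `open("therapists.csv", "w", newline="", encoding="utf-8")`, where the file name is only a placeholder. Scraping several profile URLs would then amount to building one record per page and passing the whole list to `write_csv`.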
