
How to extract text from <p> tag?

I would like to scrape reviews from Zomato with the BeautifulSoup library in Python.

However, the reviews are not wrapped in a div tag; each one sits directly in a p (paragraph) tag.

When I run this:

 review = soup.find_all("p", attrs={"class": "sc-1hez2tp-0 sc-kQsIoO cCvqWb"})

The output is:

review
[<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb"></p>,
 <p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Kalau kesini wajib bangetnih pesen Ikan Gurame, rasanya bener” enakk. Bumbu guramenya pun macem”, mulai dari bumbu spesial gurih7, asem manis, sambal manga, kecombrang, rica-rica, &amp; pecak! Kulit di udang mayonaisenya juga udah dicopotin, jadi lebih enak makannya. <br/>.<br/>Buat minuman, es kelapa nya seger banget, dagingnya juga gampang diambil, ga kaya es kelapa ditempat lainnya yang dagingnya susah dikerok. Tapi air kelapanya pakai gula, jadi rasanya terlalu manis. Tips kalau pesen es kelapa, request tanpa gula aja biar manisnya pas &amp; lebih segerr👌🏻</p>,
 <p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">This restaurant served a west java cuisine that adapt into local taste which i could say is not spicy compare the original recipe. The place is easy to access, 15 minutes go to highway nearby means that people are not difficult to find the location. But the parking area are not so good &amp; not convenience in some place not all</p>,
 <p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Waiters nya kak kikis ramah dan sopan</p>,
 <p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Pelayanan yg responsif, waiter AA Joko baik pisan euy sangat ramah dan responsif. Menunya enak2, gurame goreng kipas dan udang bakar galah madu mantaabbbb!!! Sayur asemnya endeusss hihihi anyway thx zomato gold 🥰🥰🥰 pasti balik lagi donggg 💃💃💃</p>]

I want the text of each paragraph to be inserted into a DataFrame with a single column named 'reviews'.

reviews

1. Kalau kesini wajib bangetnih pesen Ikan Gurame, rasanya bener” enakk. Bumbu guramenya pun macem”, mulai dari bumbu spesial gurih7, asem manis, sambal manga, kecombrang, rica-rica, &amp; pecak! Kulit di udang mayonaisenya juga udah dicopotin, jadi lebih enak makannya. <br/>.<br/>Buat minuman, es kelapa nya seger banget, dagingnya juga gampang diambil, ga kaya es kelapa ditempat lainnya yang dagingnya susah dikerok. Tapi air kelapanya pakai gula, jadi rasanya terlalu manis. Tips kalau pesen es kelapa, request tanpa gula aja biar manisnya pas &amp; lebih segerr👌🏻
2. This restaurant served a west java cuisine that adapt into local taste which i could say is not spicy compare the original recipe. The place is easy to access, 15 minutes go to highway nearby means that people are not difficult to find the location. But the parking area are not so good &amp; not convenience in some place not all
3. ...

I tried:

import pandas as pd
review_text = []
for el in soup.find_all('p', attrs={'class': 'sc-1hez2tp-0 sc-kQsIoO cCvqWb'}):
    komentar = print(el.get_text().encode('utf-8'))
    review_text.append(komentar)

reviews = {'komentar':review_text}
df = pd.DataFrame(reviews, columns=['reviews'])
df

but it returns an empty DataFrame.

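The line komentar = print(el.get_text().encode('utf-8')) is the problem: print() returns None, so komentar is always None and the list never receives the review text. Remove the print call and append the text itself.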
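To see why storing the result of print() loses the text, here is a minimal sketch using plain strings. The two strings are stand-ins for illustration, not real review data.

```python
# print() writes to stdout and returns None, so its result should not be stored.
collected = []
for text in ["first review", "second review"]:  # stand-in strings
    value = print(text)      # value is None
    collected.append(value)

print(collected)  # [None, None]

# Appending the string itself keeps the text.
fixed = []
for text in ["first review", "second review"]:
    fixed.append(text)

print(fixed)  # ['first review', 'second review']
```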

Here is a simple example:

from bs4 import BeautifulSoup

# Sample HTML standing in for the downloaded page source
html_doc = '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb"></p>' \
           '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Kalau kesini wajib bangetnih pesen Ikan Gurame </p>' \
           '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">This restaurant served a west java cuisine</p>' \
           '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Waiters nya kak kikis ramah dan sopan</p>' \
           '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Pelayanan yg responsif</p>'

soup = BeautifulSoup(html_doc, features='html.parser')
paragraphs = soup.find_all("p", {"class": "sc-1hez2tp-0 sc-kQsIoO cCvqWb"})

# Build a flat list of review strings, skipping the empty paragraph
review_text = [p.text for p in paragraphs if p.text]
print(review_text)
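To get from the list to the one-column DataFrame the question asks for, something like the following sketch might work. It assumes pandas is installed and uses a shortened stand-in markup string rather than the real Zomato page. Note that the question's code also passes a dict keyed 'komentar' together with columns=['reviews']; since no key matches that column name, pandas produces an empty frame even once the print call is removed, so the dict key here is 'reviews'.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in markup shaped like the output in the question (assumption:
# in the real script, `soup` comes from the downloaded page instead).
markup = (
    '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb"></p>'
    '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Enak banget.<br/>.<br/>Es kelapa nya seger</p>'
    '<p class="sc-1hez2tp-0 sc-kQsIoO cCvqWb">Waiters nya kak kikis ramah dan sopan</p>'
)
soup = BeautifulSoup(markup, "html.parser")

review_text = []
for el in soup.find_all("p", attrs={"class": "sc-1hez2tp-0 sc-kQsIoO cCvqWb"}):
    # separator=" " keeps text on either side of <br/> from running together
    text = el.get_text(separator=" ", strip=True)
    if text:  # skip the empty <p> that precedes the reviews
        review_text.append(text)

# The dict key becomes the column name, so no `columns=` argument is needed.
df = pd.DataFrame({"reviews": review_text})
print(df)
```

The text is kept as str rather than passed through .encode('utf-8'), since encoding would store bytes objects in the column instead of readable strings.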
