Scrape Span Text from Google

I'm new to scraping. I'm trying to scrape text from Google search results, but I keep getting empty results.

I have a list of names, and for each one I need the description text that Google shows in <span class="st">.

I've tried using

text_results = soup.find_all("span", attrs={'class':'st'})

but text_results is always []

It should be returning the description text.

Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

PREFIX = "https://www.google.com/search?q="
names = data['Names']  # list of names; `data` and `tags` (a query suffix string) are defined elsewhere

rows = []
for name in names:
    link = PREFIX + name + tags
    response = requests.get(link)
    soup = bs(response.text, "html.parser")

    # This is the call that keeps coming back empty
    text_results = soup.find_all("span", attrs={'class':'st'})
    print(text_results)

    # Keep the plain name, so no URL prefix or tags need stripping afterwards
    rows.append({'Name': name, 'Link': link, 'About': text_results})

# Build the frame once instead of appending row by row
soup_df = pd.DataFrame(rows)

Expected results:

About                           | Name       | Link
Joan Smith. Engineer at Apple...|JOAN S SMITH|https://www.google...
Joey Smith. Engineer at Apple...|JOEY S SMITH|https://www.google...
John Smith. Engineer at Apple...|JOHN S SMITH|https://www.google...
Josh Smith. Engineer at Apple...|JOSH S SMITH|https://www.google...

Actual results:

About | Name       | Link
[]    |JOAN S SMITH|https://www.google.com/search?q=JOAN S SMITH..
[]    |JOEY S SMITH|https://www.google.com/search?q=JOEY S SMITH..
[]    |JOHN S SMITH|https://www.google.com/search?q=JOHN S SMITH..
[]    |JOSH S SMITH|https://www.google.com/search?q=JOSH S SMITH..

It looks like Google returns different HTML to your script than what you see in the browser. You should change this selector:

 soup.find_all("span", attrs={'class':'st'})

to one that actually exists in the HTML your script receives.
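
One way to find a selector that does exist is to tally the class names in the HTML your script actually receives. Below is a minimal sketch using only the standard library; the `ClassCounter` and `class_counts` names and the sample markup are made up for illustration, and in practice you would pass in `url.text` from your own request.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Tally every (tag, class name) pair seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key == "class" and value:
                # A class attribute may hold several space-separated names.
                for name in value.split():
                    self.counts[(tag, name)] += 1


def class_counts(html):
    """Return a Counter of (tag, class name) pairs found in the HTML string."""
    parser = ClassCounter()
    parser.feed(html)
    return parser.counts


# Stand-in markup with invented class names; replace with the real response text.
sample = (
    '<div class="result">'
    '<span class="desc">first</span>'
    '<span class="desc wide">second</span>'
    '</div>'
)
counts = class_counts(sample)
print(counts.most_common())
print(counts[("span", "st")])  # 0 means the class you searched for is not in this HTML
```

If the count for the class you are searching on is zero, `find_all` has nothing to match, and the most common pairs are a reasonable place to start looking for a replacement.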

Make sure you send a user-agent header with the request. Without one, requests identifies itself as python-requests, and Google may serve different markup or eventually block the request, which could be why you're getting an empty result. Check what your own user-agent is, and see the post I answered on this some time ago.

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
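
Since the names in the question contain spaces, it may also help to encode the query instead of concatenating strings. Here is a rough sketch of both steps together. It uses the standard library's urllib rather than requests so it has no dependencies; the same `HEADERS` dict can be passed as `requests.get(url, headers=HEADERS)`. The `build_search_url` and `fetch_html` names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same idea as the headers dict above; any current browser user-agent string should do.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
    )
}


def build_search_url(name):
    """Return a search URL with the name safely encoded (spaces become '+')."""
    return "https://www.google.com/search?" + urlencode({"q": name})


def fetch_html(name, timeout=10):
    """Download the results page for one name, sending the browser-like header."""
    request = Request(build_search_url(name), headers=HEADERS)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Only the URL is built here; fetch_html is not called, so nothing is sent.
print(build_search_url("JOAN S SMITH"))
# https://www.google.com/search?q=JOAN+S+SMITH
```

Even with the header set, the markup may not contain `span.st`, so it is worth inspecting what comes back before relying on that class.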

Alternatively, you can use the Google Organic Results API from SerpApi to get this output. It's a paid API with a free plan.

The difference is that you only iterate over structured JSON and pick out the fields you want, rather than working out selectors and request headers yourself.

Part of JSON:

 {
  "position": 1,
  "title": "Bill Clinton - Wikipedia",
  "link": "https://en.wikipedia.org/wiki/Bill_Clinton",
  "displayed_link": "https://en.wikipedia.org › wiki › Bill_Clinton",
  "snippet": "William Jefferson Clinton is an American lawyer and politician who served as the 42nd president of the United States from 1993 to 2001. Prior to his presidency, ...",
  "sitelinks": {
    "inline": [
      {
        "title": "Presidency of Bill Clinton",
        "link": "https://en.wikipedia.org/wiki/Presidency_of_Bill_Clinton"
      }
    ]
  }
}
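
To show what picking fields out of that structure looks like, here is a small sketch that parses a trimmed copy of the fragment above with the standard `json` module. Treating `sitelinks` as optional is an assumption on my part, hence the `.get` fallbacks.

```python
import json

# Trimmed copy of the JSON fragment shown above (snippet text shortened).
raw = """
{
  "position": 1,
  "title": "Bill Clinton - Wikipedia",
  "link": "https://en.wikipedia.org/wiki/Bill_Clinton",
  "snippet": "William Jefferson Clinton is an American lawyer and politician ...",
  "sitelinks": {
    "inline": [
      {
        "title": "Presidency of Bill Clinton",
        "link": "https://en.wikipedia.org/wiki/Presidency_of_Bill_Clinton"
      }
    ]
  }
}
"""

result = json.loads(raw)

# Top-level fields are plain dictionary lookups.
about = result["snippet"]
link = result["link"]

# Nested fields are assumed to be optional, so fall back to empty containers.
inline_titles = [
    item["title"] for item in result.get("sitelinks", {}).get("inline", [])
]

print(about)
print(link)
print(inline_titles)
```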

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "bill clinton",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Description text: {result['snippet']}\n")
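
For a list of names like the one in the question, the same loop can feed the About / Name / Link table. The sketch below is only an outline: `collect_rows` and `fake_search` are hypothetical names, the search call is passed in as a function so the example runs offline, and taking just the first organic result per name is my assumption about what is wanted.

```python
def collect_rows(names, search):
    """Build one row per name from the first organic result, if there is one.

    `search` is any callable that takes a query string and returns the
    results dictionary.
    """
    rows = []
    for name in names:
        results = search(name)
        # Use .get in case a response has no organic results (assumption).
        organic = results.get("organic_results", [])
        first = organic[0] if organic else {}
        rows.append({
            "Name": name,
            "Link": first.get("link", ""),
            "About": first.get("snippet", ""),
        })
    return rows


def fake_search(query):
    """Stand-in for the real API call so the sketch runs offline."""
    if query == "NOBODY":
        return {}
    return {
        "organic_results": [
            {
                "link": "https://example.com/" + query.replace(" ", "_"),
                "snippet": "About " + query,
            }
        ]
    }


rows = collect_rows(["JOAN S SMITH", "NOBODY"], fake_search)
for row in rows:
    print(row)
```

With the real client, `search` could be something like `lambda q: GoogleSearch({"engine": "google", "q": q, "api_key": os.getenv("API_KEY")}).get_dict()`, and `pd.DataFrame(rows)` turns the rows into a table. Note that Link here is the result's own link rather than the Google search URL.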

Output from replit.com:

Description text: William Jefferson Clinton is an American lawyer and politician who served as the 42nd president of the United States from 1993 to 2001. Prior to his presidency, ...

Description text: Bill Clinton is an American politician from Arkansas who served as the 42nd President of the United States (1993-2001). He took office at the end of the Cold War ...

Description text: William Jefferson Clinton, the first Democratic president in six decades to be elected twice, led the U.S. to the longest economic expansion in American history, ...

Description text: Bill Clinton, byname of William Jefferson Clinton, original name William Jefferson Blythe III, (born August 19, 1946, Hope, Arkansas, U.S.), 42nd president of the ...

Description text: Bill Clinton was the 42nd president of the United States, serving from 1993 to 2001. In 1978 Clinton became the youngest governor in the ...

Description text: President Bill Clinton. 3834926 likes · 1078 talking about this. Founder, Clinton Foundation and 42nd President of the United States. Posts by Bill...

Description text: William Jefferson Clinton spent the first six years of his life in Hope, Arkansas, where he was born on August 19, 1946. His father, William Jefferson Blythe, had ...

Disclaimer: I work for SerpApi.
