
How to extract the href link inside a particular td in Python with BeautifulSoup

I have a snippet that extracts two tds from each table row. Every td has a link. I want to extract the link inside the token td. Any help or ideas would be appreciated.

import random

import requests
from bs4 import BeautifulSoup


url = 'https://bscscan.com/tokentxns'

# Each entry is a complete headers dict that can be passed to requests.
user_agent_list = [
    {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'},
]

# Pick a random User-Agent and pass it with the headers= keyword;
# the second positional argument of requests.get() is params, not headers.
header = random.choice(user_agent_list)
req = requests.get(url, headers=header, timeout=10)
soup = BeautifulSoup(req.content, 'html.parser')
rows = soup.find_all('table')[0].find_all('tr')

for row in rows[1:]:
    tds = row.find_all('td')
    txnhash = tds[1].text
    token = tds[8].text  # I want to extract the link inside this td
    print(txnhash + token)

Sample Output:

 0x58a4254f8dafffd846a15d32939e98b290e76c5b32dbf9ab453911c31340f84e Wrapped BNB (WBNB) TD-LINK 
 0x58a4254f8dafffd846a15d32939e98b290e76c5b32dbf9ab453911c31340f84e PathFund (PATH) TD-LINK 
 0x43aa8ad160bd7f6aa7a740dfd561abfece3b118a5fd2488f4c35b2edf1bec3ff SyrupBar Tok... (SYRUP) TD-LINK 

Since each token link starts with /token/, you can pass a filter function to find() to get the first tag in the row whose href attribute has that prefix:

for row in rows[1:]:
    # find() returns the first tag whose href starts with "/token/", or None
    tag = row.find(href=lambda h: h is not None and h.startswith("/token/"))
    if tag is not None:
        print(tag["href"])
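
To try the filter-function approach without hitting the live site, here is a minimal sketch that runs against a small inline table. The markup, the `SAMPLE_HTML` name and the `token_hrefs` helper are assumptions made up for illustration; the real bscscan.com table has more columns and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row and two data rows.
SAMPLE_HTML = """
<table>
  <tr><th>Txn Hash</th><th>Token</th></tr>
  <tr>
    <td><a href="/tx/0xabc">0xabc</a></td>
    <td><a href="/token/0x111">Wrapped BNB (WBNB)</a></td>
  </tr>
  <tr>
    <td><a href="/tx/0xdef">0xdef</a></td>
    <td>no link here</td>
  </tr>
</table>
"""


def token_hrefs(html):
    """Return the first /token/ href of each data row, or None if a row has none."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    hrefs = []
    for row in rows[1:]:  # skip the header row
        # The filter also sees tags without an href, so guard against None.
        tag = row.find(href=lambda h: h is not None and h.startswith("/token/"))
        hrefs.append(tag["href"] if tag is not None else None)
    return hrefs


print(token_hrefs(SAMPLE_HTML))
```

The second row has no /token/ link, so the helper records None for it instead of raising a TypeError on the subscript.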

To extract the link, get the href attribute of the a element inside the given td element:

from urllib.parse import urljoin

for row in rows[1:]:
    tds = row.find_all('td')
    txnhash = tds[1].text
    token = tds[8].text
    # retrieve the link
    link = urljoin(url, tds[8].find('a')['href'])
    print(txnhash + token + ' ' + link)

The href attribute contains a URL relative to the base URL, so urljoin() can be used to build the absolute URL.
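
A minimal sketch of the td-index approach with guards for rows that lack the cell or the link, again on inline markup so it runs offline. The `cell_link` helper, `SAMPLE_HTML` and the column index 1 are assumptions for illustration only; in the question the token sits in the td at index 8.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://bscscan.com/tokentxns"

# Hypothetical two-column markup standing in for the real table.
SAMPLE_HTML = """
<table>
  <tr><th>Txn Hash</th><th>Token</th></tr>
  <tr>
    <td><a href="/tx/0xabc">0xabc</a></td>
    <td><a href="/token/0x111">Wrapped BNB (WBNB)</a></td>
  </tr>
  <tr>
    <td><a href="/tx/0xdef">0xdef</a></td>
    <td>no link here</td>
  </tr>
</table>
"""


def cell_link(row, index, base_url):
    """Return the absolute URL of the first <a href> in the td at `index`, or None."""
    tds = row.find_all("td")
    if index >= len(tds):
        return None
    anchor = tds[index].find("a", href=True)  # only anchors that have an href
    if anchor is None:
        return None
    return urljoin(base_url, anchor["href"])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = [cell_link(row, 1, BASE_URL) for row in soup.find_all("tr")[1:]]
print(links)
```

urljoin() resolves a root-relative href such as /token/0x111 against the scheme and host of the base URL, and it returns an already absolute href unchanged, so it can be applied to every link without checking its form first.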
