
Python BeautifulSoup: extract HTML table cells that contain images and text

I want to extract a table from the URL below, but I got stuck. Here is what I have so far:

import requests
from bs4 import BeautifulSoup

url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"

headers = {'User-agent': 'Mozilla/5.0'}
raw_html = requests.get(url, headers=headers)

raw_data = raw_html.text
soup_data = BeautifulSoup(raw_data, "lxml")

td = soup_data.findAll('tr')[1:]

country = []

for data in td:
    col = data.find_all('td')
    country.append(col)

How do I get the text and URL of some of the columns (Country, Port Name, UN/LOCODE, Type, and Port's Map)?

I did some scraping for you. You can build one dictionary per row, keyed by the table's column names, as in the code below. Iterate over the individual td cells to pick the columns you need, then use find('tag_name')['attribute_name'] to read a URL attribute such as src or href, and .text to get the cell text. Hope this helps.

import requests
from bs4 import BeautifulSoup

base_url = "https://www.marinetraffic.com"
url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"

headers = {'User-agent': 'Mozilla/5.0'}
raw_html = requests.get(url, headers=headers)

raw_data = raw_html.text
soup_data = BeautifulSoup(raw_data, "lxml")

# Skip the first <tr>, which holds the table header
rows = soup_data.find_all('tr')[1:]

country = []

for row in rows:
    details = {}
    # Give each cell its own name so the list being iterated is not shadowed
    for i, cell in enumerate(row.find_all('td')):
        if i == 0:
            # Flag image: src is prefixed with the site root to form a full URL
            details['Img-src'] = base_url + cell.find('img')['src']
        elif i == 1:
            details['Port_name'] = cell.text.replace('\n', '')
        elif i == 2:
            details['UN/LOCODE'] = cell.text.replace('\r\n', '').replace(" ", "")
        elif i == 4:
            details['type'] = cell.text.replace('\r\n', '').replace(" ", "")
        elif i == 5:
            # Map link: href is prefixed with the site root in the same way
            details['map_url'] = base_url + cell.find('a')['href']
    country.append(details)

Output:

[{'Img-src': 'https://www.marinetraffic.com/img/flags/png40/CN.png',
  'Port_name': 'SHANGHAI',
  'UN/LOCODE': 'CNSHA',
  'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:9/centerx:121.614746/centery:31.3663635/showports:true/portid:1253',
  'type': 'Port'},
 {'Img-src': 'https://www.marinetraffic.com/img/flags/png40/CN.png',
  'Port_name': 'MAANSHAN',
  'UN/LOCODE': 'CNMAA',
  'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:118.459503/centery:31.7180004/showports:true/portid:2746',
  'type': 'Port'},
 {'Img-src': 'https://www.marinetraffic.com/img/flags/png40/HK.png',
  'Port_name': 'HONG KONG',
  'UN/LOCODE': 'HKHKG',
  'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:114.181366/centery:22.2879486/showports:true/portid:2429',
  'type': 'Port'}, 
  ...
  ]
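
If you would rather not hard-code column positions, the same idea can be sketched with the header row supplying the dictionary keys. This is only a rough, offline illustration: SAMPLE_HTML is made-up markup that imitates the shape of the ports table (the real page may be structured differently), and cell_to_dict and parse_table are hypothetical helper names, not part of BeautifulSoup. It also returns None instead of raising when a cell has no img or a tag, and uses html.parser so it runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.marinetraffic.com"

# Assumption: simplified, hand-written markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Port Name</th><th>UN/LOCODE</th><th>Map</th></tr>
  <tr>
    <td><img src="/img/flags/png40/CN.png"></td>
    <td>SHANGHAI</td>
    <td>
        CNSHA
    </td>
    <td><a href="/en/ais/home/zoom:9/centerx:121.614746/centery:31.3663635/showports:true/portid:1253">Show on map</a></td>
  </tr>
  <tr>
    <td></td>
    <td>UNKNOWN PORT</td>
    <td></td>
    <td></td>
  </tr>
</table>
"""


def cell_to_dict(cell):
    """Collect the text plus any image and link URL found in one <td>."""
    img = cell.find("img")
    link = cell.find("a")
    return {
        # get_text(strip=True) removes the surrounding newlines and spaces
        "text": cell.get_text(strip=True),
        # urljoin turns a site-relative path into an absolute URL
        "img": urljoin(BASE_URL, img["src"]) if img is not None and img.has_attr("src") else None,
        "href": urljoin(BASE_URL, link["href"]) if link is not None and link.has_attr("href") else None,
    }


def parse_table(html):
    """Return one dict per data row, keyed by the header text of each column."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = row.find_all("td")
        if not cells:
            continue  # skip rows that carry no data cells
        records.append(
            {name: cell_to_dict(cell) for name, cell in zip(column_names, cells)}
        )
    return records


ports = parse_table(SAMPLE_HTML)
print(ports[0]["Country"]["img"])
print(ports[0]["UN/LOCODE"]["text"])
```

Each column then carries its text, image URL and link URL together, so you can pick whichever one you need per column (for example ports[0]["Map"]["href"] for the map link) without tracking indices.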

