
HTML Parsing using bs4

I am parsing an HTML page and am having a hard time figuring out how to pull a certain 'p' tag that has no class or id. I am trying to reach the 'p' tag that holds the latitude and longitude. Here is my current code:

 from urllib.request import urlopen as uReq  # opens the URL (Python 3 import path)
 from bs4 import BeautifulSoup as soup  # parses the HTML

 my_url = 'http://www.fortwiki.com/Battery_Adair'
 print(my_url)
 uClient = uReq(my_url)  # opens the connection and stores it in uClient

 page_html = uClient.read()  # reads the page contents
 uClient.close()  # closes the connection

 page_soup = soup(page_html, "html.parser")  # parses the HTML
 containers = page_soup.find_all("table")
 for container in containers:
     title = container.tr.p.b.text.strip()
     history = container.tr.p.text.strip()
     lat_long = container.tr.table
     print(title)
     print(history)
     print(lat_long)

Link to website: http://www.fortwiki.com/Battery_Adair

The <p> tag you're looking for is very common in the document, and it doesn't have any unique attributes, so we can't select it directly.

A possible solution would be to select the tag by index, as in bloopiebloopie's answer.
However, that won't work unless you know the exact position of the tag.

Another possible solution would be to find a neighbouring tag that has distinguishing attributes/text and select our tag in relation to that.
In this case we can find the previous tag with text: "Maps & Images", and use find_next to select the next tag.

import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

b = soup.find('b', string='Maps & Images')  # `string=` is the newer name for `text=` (bs4 >= 4.4.0)
if b:
    lat_long = b.find_next().text

This method should find the coordinate data on any www.fortwiki.com page that has a map.
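To try the neighbour-based lookup without hitting the network, here is a minimal sketch that runs against a small inline snippet. The markup in `SAMPLE_HTML` and the helper name `coords_after_label` are made up for illustration, and the real page layout may differ. It also uses `find_next("p")`, a slightly narrower variant of the bare `find_next()` call above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><td>
    <p><b>Battery Adair</b></p>
    <p><b>Maps &amp; Images</b></p>
    <p>Lat: 24.5477038 Long: -81.8104541</p>
  </td></tr>
</table>
"""

def coords_after_label(html, label="Maps & Images"):
    """Return the text of the first <p> that follows the <b> tag holding `label`."""
    soup = BeautifulSoup(html, "html.parser")
    b = soup.find("b", string=label)
    if b is None:
        return None  # label not present, e.g. a page without a map section
    p = b.find_next("p")  # first <p> after the label in document order
    return p.get_text(strip=True) if p is not None else None

print(coords_after_label(SAMPLE_HTML))
```

Returning `None` when the label is missing keeps the caller from hitting an `AttributeError` on pages that have no map section.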

You can use re to match partial text inside a tag.

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

lat_long = soup.find('p', string=re.compile(r'Lat:\s\d+\.\d+\sLong:')).text  # raw string keeps \s and \d intact
print(lat_long)
# Lat: 24.5477038 Long: -81.8104541
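Once you have the matched text, you will probably want the two values as numbers. A minimal sketch, assuming the text always has the `Lat: <number> Long: <number>` shape shown above; the names `COORD_RE` and `parse_lat_long` are hypothetical:

```python
import re

# Capture an optionally negative decimal number after each label.
COORD_RE = re.compile(r"Lat:\s*(-?\d+(?:\.\d+)?)\s+Long:\s*(-?\d+(?:\.\d+)?)")

def parse_lat_long(text):
    """Return (lat, long) as floats, or None if the text does not match."""
    m = COORD_RE.search(text)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

print(parse_lat_long("Lat: 24.5477038 Long: -81.8104541"))
# (24.5477038, -81.8104541)
```

Using capture groups avoids the chain of `split` calls and makes the failure case explicit.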

I am not exactly sure what you want, but this works for me. There are probably neater ways of doing it. I am new to Python.

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.fortwiki.com/Battery_Adair").content, "html.parser")
x = soup.find("div", id="mw-content-text").find("table").find_all("p")[8]
x = x.get_text()
x = x.split("Long:")
lat = x[0].split(" ")[1]
long = x[1]
print("LAT = " + lat)
# LAT = 24.5477038 
print("LNG = " + long)
# LNG = -81.8104541

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, contact: yoyou2525@163.com.

 