Geopandas and beautiful soup - web scraping and writing to shapefile
With the code below, I get the desired column containing the title contents in the shapefile, but only when the shapefile has a single row/feature. When run on a shapefile with multiple features, no column is written at all. Any hints/help are much appreciated!
import geopandas as gpd
import requests
from bs4 import BeautifulSoup

gdf = gpd.read_file("Test404_PhotosMeta.shp", driver="ESRI Shapefile", encoding="utf8")
for url in gdf['url']:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for title in soup.find_all('title'):
        gdf['HTitle'] = title
gdf.to_file("HTitle.shp", driver="ESRI Shapefile")
Your loop assigns a single `title` to the entire `HTitle` column on every pass, so each iteration overwrites the previous one. Instead, move the work on the `response` object into a function and call it with `apply()` so that it runs for each row.

from bs4 import BeautifulSoup
import requests
import geopandas as gpd
import pandas as pd
from pathlib import Path

# use included sample geodataframe
gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# add a url column (rows where iso_a3 is "-99" are left as NaN)
gdf["url"] = (
    "https://simplemaps.com/data/"
    + gdf.loc[~gdf["iso_a3"].eq("-99"), "iso_a3"].str[0:2].str.lower()
    + "-cities"
)

# utility function to get titles for a URL
def get_titles(url):
    if pd.isna(url):
        return ""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # if there are multiple titles for a row, join them
    return ",".join([str(t) for t in soup.find_all("title")])

# get the titles
gdf["HTitle"] = gdf["url"].apply(get_titles)
gdf.to_file(Path.cwd().joinpath("HTitle.shp"), driver="ESRI Shapefile")
gpd.read_file(Path.cwd().joinpath("HTitle.shp")).drop(columns="geometry").sample(5)
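Note that `str(t)` keeps the surrounding `<title>…</title>` tags in the attribute value. If you want only the title text, BeautifulSoup's `get_text()` can be used instead. A minimal offline sketch parsing a literal HTML string (`get_title_text` is a hypothetical helper, not part of the answer above):

```python
from bs4 import BeautifulSoup

# hypothetical helper: join the text of all <title> tags, dropping the tags themselves
def get_title_text(html):
    soup = BeautifulSoup(html, "html.parser")
    return ",".join(t.get_text(strip=True) for t in soup.find_all("title"))

html = "<html><head><title> India Cities Database </title></head></html>"
print(get_title_text(html))  # India Cities Database
```

Trimming the tags also helps stay within the 254-character limit that shapefile attribute (DBF character) fields impose.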
| | pop_est | continent | name | iso_a3 | gdp_md_est | url | HTitle |
|---|---|---|---|---|---|---|---|
| 98 | 1281935911 | Asia | India | IND | 8.721e+06 | https://simplemaps.com/data/in-cities | India Cities Database |
| 157 | 28036829 | Asia | Yemen | YEM | 73450 | https://simplemaps.com/data/ye-cities | Yemen Cities Database |
| 129 | 11491346 | Europe | Belgium | BEL | 508600 | https://simplemaps.com/data/be-cities | Belgium Cities Database |
| 113 | 38476269 | Europe | Poland | POL | 1.052e+06 | https://simplemaps.com/data/po-cities | 404 |
| 57 | 24994885 | Africa | Cameroon | CMR | 77240 | https://simplemaps.com/data/cm-cities | Cameroon Cities Database |