How to convert multiple HTML tables into a pandas DataFrame using a customized function?
I am trying to convert multiple HTML tables to a pandas DataFrame. For this task I've defined a function that should return all these HTML tables as a pandas DataFrame. However, the function returns an empty list `[]` when the idea is for it to return a pandas DataFrame.

This is what I have tried so far:
```python
import requests
from bs4 import BeautifulSoup
import lxml
import html5lib
import pandas as pd
import string

### defining a list for all the needed links ###
first_url='https://www.salario.com.br/tabela-salarial/?cargos='
second_url='#listaSalarial'
allTheLetters = string.ascii_uppercase

links = []
for letter in allTheLetters:
    links.append(first_url+letter+second_url)

### defining function to parse html objects ###
def getUrlTables(links):
    for link in links:
        # requesting link, parsing and finding tag:table #
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        tab_div = soup.find_all('table', {'class':'listas'})
        # writing html files into directory #
        with open('listas_salariales.html', "w") as file:
            file.write(str(tab_div))
            file.close
    # reading html file as a pandas dataframe #
    tables=pd.read_html('listas_salariales.html')
    return tables

getUrlTables(links)
```

which returns:

```
[]
```
Am I missing something in `getUrlTables()`? Is there an easier way to accomplish this task?
The following code fetches the HTML from all the links, parses them to extract the table data, and constructs one large combined DataFrame (I have not stored the intermediate DataFrames to disk, which may be necessary if the tables become too large):
```python
import requests
from bs4 import BeautifulSoup
import lxml
import html5lib
import pandas as pd
import string

### defining a list for all the needed links ###
first_url='https://www.salario.com.br/tabela-salarial/?cargos='
second_url='#listaSalarial'
allTheLetters = string.ascii_uppercase

links = []
for letter in allTheLetters:
    links.append(first_url+letter+second_url)

### defining function to parse html objects ###
def getUrlTables(links, master_df):
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'lxml')  # using the lxml parser
        try:
            table = soup.find('table', attrs={'class':'listas'})
            # finding table headers
            heads = table.find('thead').find('tr').find_all('th')
            colnames = [hdr.text for hdr in heads]
            #print(colnames)
            # Now extracting the values
            data = {k:[] for k in colnames}
            rows = table.find('tbody').find_all('tr')
            for rw in rows:
                for col in colnames:
                    cell = rw.find('td', attrs={'data-label':'{}'.format(col)})
                    data[col].append(cell.text)
            # Constructing a pandas dataframe using the data just parsed
            df = pd.DataFrame.from_dict(data)
            master_df = pd.concat([master_df, df], ignore_index=True)
        except AttributeError as e:
            print('No data from the link: {}'.format(link))
    return master_df

master_df = pd.DataFrame()
master_df = getUrlTables(links, master_df)
print(master_df)
```
The output of the above code is as follows:

```
           CBO                    Cargo  ...  Teto Salarial Salário Hora
0       612510            Abacaxicultor  ...       2.116,16         6,86
1       263105                    Abade  ...       5.031,47        17,25
2       263105                 Abadessa  ...       5.031,47        17,25
3       622020  Abanador na Agricultura  ...       2.075,81         6,27
4       862120  Abastecedor de Caldeira  ...       3.793,98        11,65
...        ...                      ...  ...            ...          ...
9345    263110      Zenji (missionário)  ...       3.888,52        12,65
9346    723235                 Zincador  ...       2.583,20         7,78
9347    203010               Zoologista  ...       4.615,45        14,21
9348    203010                  Zoólogo  ...       4.615,45        14,21
9349    223310              Zootecnista  ...       5.369,59        16,50

[9350 rows x 8 columns]
```
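The per-cell lookup above relies on each `<td>` on the site carrying a `data-label` attribute that matches its column header. A minimal, self-contained sketch of that extraction step on a hypothetical inline HTML fragment (the two rows are taken from the output above; no network access needed):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment mimicking the site's table markup
html = """
<table class="listas">
  <thead><tr><th>CBO</th><th>Cargo</th></tr></thead>
  <tbody>
    <tr><td data-label="CBO">612510</td><td data-label="Cargo">Abacaxicultor</td></tr>
    <tr><td data-label="CBO">263105</td><td data-label="Cargo">Abade</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'listas'})

# Column names come from the <th> elements in <thead>
colnames = [th.text for th in table.find('thead').find_all('th')]

# Each column's values are the <td> cells whose data-label matches the header
data = {col: [td.text for td in table.find_all('td', attrs={'data-label': col})]
        for col in colnames}

df = pd.DataFrame(data)
print(df)
```

Matching on `data-label` keeps each value tied to the right column even if the cell order inside a row ever changes, which is why it is more robust here than positional indexing.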