Pandas read_html produces an empty df with tuple column names

Question

I want to retrieve the tables on the following website and store them in a pandas DataFrame: https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs

However, the third table on the page returns an empty DataFrame, with all of the table's data stored in tuples as the column headers:

Empty DataFrame
Columns: [(Service Providers, State of Colorado), (Cuban - Haitian Program, $0), (Refugee Preventive Health Program, $150,000.00), (Refugee School Impact, $450,000), (Services to Older Refugees Program, $0), (Targeted Assistance - Discretionary, $0), (Total FY, $600,000)]
Index: []

Is there a way to "flatten" the tuple headers into header + values, then append the result to a DataFrame made up of all four tables? My code is below. It has worked on other similar pages but keeps breaking because of this table's formatting. Thanks!

import pandas as pd
import requests
from bs4 import BeautifulSoup

funds_df = pd.DataFrame()
url = 'https://www.acf.hhs.gov/programs/orr/resource/ffy-2011-12-state-of-colorado-orr-funded-programs'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
year = url.split('ffy-')[1].split('-orr')[0]
tables = page.content
df_list = pd.read_html(tables)
for df in df_list:
    df['URL'] = url
    df['YEAR'] = year
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    funds_df = pd.concat([funds_df, df])

Answer 1

For this site, there's no need for BeautifulSoup or requests.
pandas.read_html returns a list of DataFrames, one for each <table> at the URL.

import pandas as pd

url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'

# read the url
dfl = pd.read_html(url)

# see each dataframe in the list; there are 4 in this case
for i, d in enumerate(dfl):
    print(i)
    display(d)  # display works in Jupyter; otherwise use print
    print('\n')

dfl[0]

   Service Providers Cash and Medical Assistance* Refugee Social Services Program Targeted Assistance Program       TOTAL
0  State of Colorado                   $7,140,000                      $1,896,854                    $503,424  $9,540,278

dfl[1]

     WF-CMA 2         RSS     TAG-F CMA Mandatory 3       TOTAL
0  $3,309,953  $1,896,854  $503,424      $7,140,000  $9,540,278

dfl[2]

   Service Providers Refugee School Impact Targeted Assistance - Discretionary Services to Older Refugees Program Refugee Preventive Health Program Cuban - Haitian Program     Total
0  State of Colorado              $430,000                                  $0                           $100,000                          $150,000                      $0  $680,000

dfl[3]

  Volag                             Affiliate Name Projected ORR  MG Funding                                                                     Director
0   CWS  Ecumenical Refugee & Immigration Services                  $127,600   Ferdi Mevlani  1600 Downing St., Suite 400  Denver, CO 80218  303-860-0128
1  ECDC              ECDC African Community Center                  $308,000      Jennifer Guddiche  5250 Leetsdale Drive  Denver, CO 80246  303-399-4500
2   EMM                Ecumenical Refugee Services                  $191,400   Ferdi Mevlani  1600 Downing St., Suite 400  Denver, CO 80218  303-860-0128
3  LIRS   Lutheran Family Services Rocky Mountains                  $121,000    Floyd Preston  132 E Las Animas  Colorado Springs, CO 80903  719-314-0223
4  LIRS   Lutheran Family Services Rocky Mountains                  $365,200  James Horan  1600 Downing Street, Suite 600  Denver, CO 80218  303-980-5400
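
If read_html does return an empty frame with tuple headers, as in the question, the columns are a MultiIndex (presumably because both table rows were parsed as header rows). The following is a minimal sketch, not part of the original answer, of one way to flatten such a frame into a single data row and then combine all tables with the URL and YEAR columns from the question. The helper names flatten_tuple_header and combine_tables and the sample frames are hypothetical stand-ins, used so that the example runs without network access.

```python
import pandas as pd


def flatten_tuple_header(df: pd.DataFrame) -> pd.DataFrame:
    """Rebuild a one-row frame from an empty frame with (header, value) columns."""
    if not df.empty or not isinstance(df.columns, pd.MultiIndex):
        return df  # already a normal table; leave it unchanged
    headers = df.columns.get_level_values(0)  # first tuple element: real header
    values = df.columns.get_level_values(1)   # second tuple element: cell value
    return pd.DataFrame([list(values)], columns=list(headers))


def combine_tables(tables, url):
    """Flatten each table, tag it with URL and YEAR, and stack the results."""
    year = url.split('ffy-')[1].split('-orr')[0]  # same expression as the question
    frames = [flatten_tuple_header(t).assign(URL=url, YEAR=year) for t in tables]
    return pd.concat(frames, ignore_index=True)


# Hypothetical stand-ins for two entries of the list returned by pd.read_html
normal = pd.DataFrame({'Service Providers': ['State of Colorado'],
                       'TOTAL': ['$9,540,278']})
broken = pd.DataFrame(columns=pd.MultiIndex.from_tuples([
    ('Service Providers', 'State of Colorado'),
    ('Refugee School Impact', '$450,000'),
    ('Total FY', '$600,000'),
]))

url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'
combined = combine_tables([normal, broken], url)
print(combined)
```

With real data, combine_tables(pd.read_html(url), url) would be the equivalent call. Columns that exist in only one table are filled with NaN by pd.concat.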
