How to convert an HTML table to a Python dictionary

Question

I have the following HTML excerpt, stored in a Python list, that I'd like to turn into a dictionary. It is a timetable for every day of the week.

[u'
<table class="hours table">\n
    <tbody>\n
        <tr>\n
            <th scope="row">Mon</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Tue</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Wed</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n <span class="nowrap open">Open now</span>\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Thu</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Fri</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sat</th>\n
            <td>\n <span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sun</th>\n
            <td>\n Closed\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n </tbody>\n </table>']

The desired output is:

{
'Mon': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Tue': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Wed': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Thu': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Fri': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Sat': '5:00pm - 10:00pm', 
'Sun': 'Closed'
}

How would you achieve this in Python 3.x? I would not mind if the 'Sat' and 'Sun' keys had their values in list format too, if that helps at all. Thank you in advance for your thoughts.

Answer 1

Here's a solution that first reads the table into a pandas DataFrame and then converts it to a dictionary matching your desired output:

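from io import StringIO

import pandas as pd

# html_string is the single string held in the list from the question.
# Wrapping it in StringIO avoids the deprecation warning newer pandas raises for literal HTML.
dfs = pd.read_html(StringIO(html_string))
df = dfs[0]  # pd.read_html reads in all tables and returns a list of DataFrames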

Giving:

     0                                      1         2
0  Mon  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
1  Tue  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
2  Wed  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm  Open now
3  Thu  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
4  Fri  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
5  Sat                     5:00 pm - 10:00 pm       NaN
6  Sun                                 Closed       NaN

Then use groupby and a dictionary comprehension:

summary = {k: v.iloc[0, 1].split('  ') for k, v in df.groupby(0)}

Giving:

{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Sat': ['5:00 pm - 10:00 pm'],
 'Sun': ['Closed'],
 'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}

You may need to adjust this slightly if splitting on exactly two spaces won't always work for your opening-times data format.
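
If the two-space separator proves fragile, one alternative is to pull each time range out of the cell text with a regular expression. This is only a minimal sketch, assuming every range looks like `2:00 pm - 3:00 pm`. The helper name `split_hours` and the pattern are illustrative and not part of the answer above.

```python
import re

# Assumed layout: "H:MM am/pm - H:MM am/pm". Adjust the pattern for other formats.
RANGE = re.compile(r"\d{1,2}:\d{2}\s*[ap]m\s*-\s*\d{1,2}:\d{2}\s*[ap]m", re.IGNORECASE)


def split_hours(cell):
    """Return every time range found in cell, or the stripped text if none match."""
    ranges = RANGE.findall(cell)
    return ranges if ranges else [cell.strip()]


print(split_hours("2:00 pm - 3:00 pm  5:00 pm - 10:00 pm"))
print(split_hours("Closed"))
```

With the DataFrame from this answer, it could be dropped into the comprehension in place of the split, for example `{k: split_hours(v.iloc[0, 1]) for k, v in df.groupby(0)}`.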

Answer 2

from bs4 import BeautifulSoup
from collections import OrderedDict
from pprint import pprint

soup = BeautifulSoup(data, 'lxml')

d = OrderedDict()
for th, td in zip(soup.select('th'), soup.select('td')[::2]):
    d[th.text.strip()] = td.text.strip().splitlines()

pprint(d)

Prints:

OrderedDict([('Mon', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Tue', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Wed', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Thu', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Fri', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Sat', ['5:00 pm - 10:00 pm']),
             ('Sun', ['Closed'])])
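
The same row-by-row idea (take the `th` as the key and the first `td` as the value) can also be sketched with only the standard library, which may help when installing BeautifulSoup is not an option. This swaps in `html.parser.HTMLParser` for BeautifulSoup. The class name `HoursParser` and the shortened `sample` string are assumptions made for illustration, and `<br>` is treated as the separator between time ranges.

```python
from html.parser import HTMLParser


class HoursParser(HTMLParser):
    """Collect {row header: [lines of the first <td>]} for each <tr>."""

    def __init__(self):
        super().__init__()
        self.hours = {}
        self._key = None     # text of the current row's <th>
        self._cells = 0      # number of <td> elements seen in the current row
        self._target = None  # "th", "td", or None: the tag whose text is being collected
        self._lines = [[]]   # text fragments, one inner list per <br>-separated line

    def _start(self, target):
        self._target = target
        self._lines = [[]]

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._key, self._cells = None, 0
        elif tag == "th":
            self._start("th")
        elif tag == "td":
            self._cells += 1
            if self._cells == 1:  # only the first <td> holds the hours
                self._start("td")
        elif tag == "br" and self._target == "td":
            self._lines.append([])  # <br> starts a new line of hours

    def handle_data(self, data):
        if self._target:
            self._lines[-1].append(data)

    def handle_endtag(self, tag):
        if tag != self._target:
            return
        # Join the fragments of each line and collapse runs of whitespace
        lines = [" ".join("".join(parts).split()) for parts in self._lines]
        lines = [line for line in lines if line]
        if tag == "th":
            self._key = " ".join(lines)
        elif self._key is not None:
            self.hours[self._key] = lines
        self._target = None


# Shortened stand-in for the HTML string from the question
sample = (
    '<table><tbody>'
    '<tr><th scope="row">Mon</th>'
    '<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>'
    '<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>'
    '<td class="extra">\n <span class="nowrap open">Open now</span>\n </td></tr>'
    '<tr><th scope="row">Sun</th><td>\n Closed\n </td><td class="extra">\n </td></tr>'
    '</tbody></table>'
)

parser = HoursParser()
parser.feed(sample)
print(parser.hours)
```

Counting cells per row avoids relying on every row having exactly two `td` elements, which the `[::2]` slice above assumes.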

Answer 3

Use a library to parse the HTML, something like this:

import pandas as pd

url = r'https://en.wikipedia.org/wiki/List_of_sovereign_states'
tables = pd.read_html(url)  # returns one DataFrame per table found on the page
first_table = tables[0]  # selecting the first table (for example)

Answer 4

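from bs4 import BeautifulSoup

def tables(file):
    """Return every <table> element in an HTML file, keyed as table_0, table_1, ..."""
    data = {}
    with open(file, "r") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for index, table in enumerate(soup.find_all('table')):
        # enumerate yields an int, so convert it before concatenating
        data["table_" + str(index)] = table
    return data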

Answer 5

Try this one-liner:

from bs4 import BeautifulSoup as b
yourdict={e.strip("\n").split("\n\n")[0]:e.strip().strip("\n").split("\n\n")[1].split("\n") for e in b(a,"lxml").text.split("\n\n\n\n")}

Output:

{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Sat': ['5:00 pm - 10:00 pm'],
 'Sun': [' Closed'],
 'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}
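
None of the outputs above exactly matches the format requested in the question, which has no space before `pm` and uses plain strings for days with a single entry. A small post-processing step could close that gap. This is a sketch only: the function name `tidy` and the `parsed` sample are illustrative, and it assumes times always end in `am` or `pm`.

```python
import re


def tidy(hours):
    """Strip whitespace, drop the space before am/pm, and unwrap single-item lists."""
    result = {}
    for day, entries in hours.items():
        cleaned = [re.sub(r"\s+([ap]m)\b", r"\1", entry.strip()) for entry in entries]
        result[day] = cleaned[0] if len(cleaned) == 1 else cleaned
    return result


# Sample shaped like the dictionaries printed in the answers
parsed = {
    "Mon": ["2:00 pm - 3:00 pm", "5:00 pm - 10:00 pm"],
    "Sat": ["5:00 pm - 10:00 pm"],
    "Sun": [" Closed"],
}
print(tidy(parsed))
```

Stripping each entry also removes the leading space visible in `' Closed'` above.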

5 answers

Answer 1
5 2018-08-02 15:25:16

Answer 2
3 Accepted 2018-08-02 15:22:55

Answer 3
2 2018-08-02 15:17:44

Answer 4
1 2018-08-02 15:17:52

Answer 5
0 2018-08-02 15:21:32
