How to convert an HTML table into a Python dictionary

I have the following HTML excerpt, held as a single string in a Python list, that I'd like to turn into a dictionary. It is a timetable covering every day of the week.

[u'
<table class="hours table">\n
    <tbody>\n
        <tr>\n
            <th scope="row">Mon</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Tue</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Wed</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n <span class="nowrap open">Open now</span>\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Thu</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Fri</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sat</th>\n
            <td>\n <span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sun</th>\n
            <td>\n Closed\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n </tbody>\n </table>']

The desired output is:

{
'Mon': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Tue': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Wed': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Thu': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Fri': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Sat': '5:00pm - 10:00pm', 
'Sun': 'Closed'
}

How would you achieve this in Python 3.x? I would not mind if the 'Sat' and 'Sun' values were also lists, if that helps at all. Thank you in advance for your thoughts.

Here's a solution that first reads the table into a pandas DataFrame and then converts it to a dictionary with the structure of your desired output:

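from io import StringIO

import pandas as pd

# html_string is the HTML itself, e.g. the single element of the list above.
# Recent pandas versions expect a file-like object rather than a literal string.
dfs = pd.read_html(StringIO(html_string))
df = dfs[0]  # pd.read_html reads in all tables and returns a list of DataFrames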

Giving:

     0                                      1         2
0  Mon  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
1  Tue  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
2  Wed  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm  Open now
3  Thu  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
4  Fri  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
5  Sat                     5:00 pm - 10:00 pm       NaN
6  Sun                                 Closed       NaN

Then use groupby and a dictionary comprehension:

summary = {k: v.iloc[0, 1].split('  ') for k, v in df.groupby(0)}

Giving (groupby sorts its keys, so the days appear in alphabetical order):

{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Sat': ['5:00 pm - 10:00 pm'],
 'Sun': ['Closed'],
 'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}

You may need to adjust this slightly if splitting on exactly two spaces won't always work for your opening-times data.
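
If the gap between two time ranges is not always exactly two spaces, one option is to split on any run of two or more whitespace characters instead. This is only a minimal sketch, assuming the cell text looks like the column shown above; `split_ranges` is a hypothetical helper name, not something pandas provides.

```python
import re

def split_ranges(cell):
    """Split a cell such as '2:00 pm - 3:00 pm  5:00 pm - 10:00 pm' into ranges."""
    # Two or more whitespace characters separate the ranges; the single
    # spaces inside '2:00 pm - 3:00 pm' are left alone.
    return re.split(r'\s{2,}', cell.strip())

print(split_ranges('2:00 pm - 3:00 pm  5:00 pm - 10:00 pm'))
print(split_ranges('Closed'))
```

It could stand in for `.split('  ')` in the comprehension, for example `split_ranges(v.iloc[0, 1])`.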

from bs4 import BeautifulSoup
from collections import OrderedDict
from pprint import pprint

# data is the HTML string from the question (the single element of the list)
soup = BeautifulSoup(data, 'lxml')

d = OrderedDict()
for th, td in zip(soup.select('th'), soup.select('td')[::2]):
    d[th.text.strip()] = td.text.strip().splitlines()

pprint(d)

Prints:

OrderedDict([('Mon', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Tue', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Wed', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Thu', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Fri', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Sat', ['5:00 pm - 10:00 pm']),
             ('Sun', ['Closed'])])
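
The zip above pairs each `th` with every second `td`, which relies on every row having exactly two `td` cells. As a hedged alternative, here is a row-by-row sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The class name `HoursParser` and the shortened `sample` string are assumptions made for illustration, and the sketch assumes the hours always sit in the first `td` of a row.

```python
from html.parser import HTMLParser

class HoursParser(HTMLParser):
    """Collect {day: [time ranges]} from rows shaped like the question's table."""

    def __init__(self):
        super().__init__()
        self.hours = {}      # result, in document order
        self._day = None     # text of the current row's <th>
        self._cell = None    # 'th' or 'td' while inside such a cell
        self._td_index = 0   # how many <td> cells the current row has closed
        self._parts = [[]]   # text chunks of the current cell, split at <br>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._day, self._td_index = None, 0
        elif tag in ('th', 'td'):
            self._cell, self._parts = tag, [[]]
        elif tag == 'br' and self._cell:
            self._parts.append([])   # a <br> starts a new time range

    def handle_data(self, data):
        if self._cell:
            self._parts[-1].append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        # Join each part and collapse runs of whitespace to single spaces.
        texts = [' '.join(''.join(chunks).split()) for chunks in self._parts]
        texts = [t for t in texts if t]
        if tag == 'th':
            self._day = texts[0] if texts else None
        else:
            self._td_index += 1
            # Only the first <td> holds the hours; the second is the "extra" cell.
            if self._td_index == 1 and self._day is not None:
                self.hours[self._day] = texts
        self._cell = None

# Shortened, hypothetical sample in the same shape as the question's HTML.
sample = (
    '<table><tbody>'
    '<tr><th scope="row">Mon</th>'
    '<td>\n <span>2:00 pm</span> - <span>3:00 pm</span>\n'
    ' <br><span>5:00 pm</span> - <span>10:00 pm</span>\n </td>'
    '<td class="extra">\n </td></tr>'
    '<tr><th scope="row">Sun</th><td>\n Closed\n </td>'
    '<td class="extra">\n </td></tr>'
    '</tbody></table>'
)

parser = HoursParser()
parser.feed(sample)
parser.close()
print(parser.hours)
```

Since Python 3.7 a plain `dict` keeps insertion order, so the days stay in document order here without `OrderedDict`.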

Use a library to parse the HTML, something like this:

import pandas as pd

# read_html also accepts a URL and returns a list of DataFrames, one per <table>
url = r'https://en.wikipedia.org/wiki/List_of_sovereign_states'
tables = pd.read_html(url)
first_table = tables[0]  # Selecting the first table (for example)

# A BeautifulSoup variant that collects every <table> element of a local file
from bs4 import BeautifulSoup

def read_tables(file):
    data = {}
    with open(file, "r") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
        for key, value in enumerate(soup.find_all('table')):
            data["table_" + str(key)] = value  # key is an int, so convert it
    return data

Try this one-liner:

from bs4 import BeautifulSoup as b
# a is the HTML string from the question
yourdict={e.strip("\n").split("\n\n")[0]:e.strip().strip("\n").split("\n\n")[1].split("\n") for e in b(a,"lxml").text.split("\n\n\n\n")}

Output:

{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Sat': ['5:00 pm - 10:00 pm'],
 'Sun': [' Closed'],
 'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}
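
None of the outputs above matches the format in the question exactly: the times keep a space before "pm", single entries stay wrapped in lists, and 'Sun' carries a leading space here. A small post-processing step could close that gap. This is a sketch under the assumption that the input is a `{day: [ranges]}` dictionary like the ones printed above; `normalize` is a hypothetical helper name.

```python
import re

def normalize(hours):
    """Reshape {day: [ranges]} into the format asked for in the question."""
    result = {}
    for day, ranges in hours.items():
        # Trim stray whitespace and drop the space before 'am'/'pm'.
        cleaned = [re.sub(r'\s+(?=[ap]m\b)', '', r.strip()) for r in ranges]
        # A single entry becomes a plain string, as for 'Sat' and 'Sun'.
        result[day] = cleaned[0] if len(cleaned) == 1 else cleaned
    return result

print(normalize({
    'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
    'Sat': ['5:00 pm - 10:00 pm'],
    'Sun': [' Closed'],
}))
```

If lists are acceptable for every day, as the question allows, the unwrapping line can simply assign `cleaned`.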
