How do I export this data, split across multiple columns in a single row, to .csv or .xls using Python with BeautifulSoup?

Question

I currently have this data stored in the variable result.

['Draw Date:']
['Draw Date:']
['']
['']
['']
['Draw Date:  2019-01-15']
['']
['Perdana Lottery']
[]
['F', '2771', 'M', '0133', 'A', '6215']
[]
['A', '----', 'B', '1859', 'C', '3006', 'D', '3327']
['E', '5699', 'F', '----', 'G', '1123', 'H', '9193']
['I', '9076', 'J', '0573', 'K', '0950', 'L', '7258']
['', 'M', '-----', '', '', '']
['N', '1226', 'O', '0565', 'P', '1563', 'Q', '1420']
['R', '5265', 'S', '9345', 'T', '0483', 'U', '0933']
['', 'V', '6468', 'W', '3247', '']
['']
['']
['']
['']

I want to export this data to a table in .csv or .xls format, like this:

+------------+----------+----------+----------+----------+-------------+
| Date       | First    | Second   | Third    | Special  | Consolation |
+------------+---+------+---+------+---+------+---+------+---+---------+
| 2019-01-15 | F | 2771 | M | 0133 | A | 6215 | A | ---- | N | 1226    |
|            |   |      |   |      |   |      | B | 1859 | O | 0565    |
|            |   |      |   |      |   |      | C | 3006 | P | 1563    |
|            |   |      |   |      |   |      | ... etc  | ... etc     |
+------------+---+------+---+------+---+------+----------+-------------+

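"... etc" stands for the remaining data from the result variable above. I have not written it out here to avoid clutter.

So, which modules should I use, and how? Please note that I am a complete Python newbie. I only know some PHP, but honestly I am starting to like Python.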

Answer 1

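The first problem is that you need to know where the split between the prizes falls. Without seeing the Special Prize text, this would be difficult. An alternative approach is to use find_all() to pick out just the td and th elements. A list comprehension then keeps only the non-empty cells. This gives a single list holding all the data you need.

cols holds a list of the required columns. The first, second and third prize columns are filled in manually, as these entries should always be in the same position. A loop is then used to add the respective letters and prizes to the last four columns.

The Python groupby() function can be used to split the list into subgroups separated by the elements of split_on.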

from itertools import groupby, zip_longest, islice
from bs4 import BeautifulSoup
import requests
import csv


def grouper(iterable, n):
    "Collect data into fixed-length chunks, dropping any incomplete final chunk"
    # grouper('ABCDEFG', 3) --> ABC DEF
    args = [iter(iterable)] * n
    return zip(*args)


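# Fetch the results page and keep the text of every non-empty td/th cell
response = requests.get("http://perdana4d.com/resulten.php")
soup = BeautifulSoup(response.content, 'lxml')
rows = [cell.get_text(strip=True) for cell in soup.find_all(['td', 'th']) if len(cell.get_text(strip=True))]
# The date is the last space-separated token of the "Draw Date:" cell
draw_date = rows[2].split(' ')[-1]
# Section headings that separate the special and consolation prizes
split_on = ['Special Prize', 'Consolation Prize']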

cols = [
    ['Date', draw_date], 
    ['FirstL', rows[7]], 
    ['FirstP', rows[8]], 
    ['SecondL', rows[9]], 
    ['SecondP', rows[10]], 
    ['ThirdL', rows[11]], 
    ['ThirdP', rows[12]], 
    ['SpecialL'], 
    ['SpecialP'], 
    ['ConsolationL'], 
    ['ConsolationP']
    ]

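# Iterators over the letter columns (indexes 7, 9) and prize columns (indexes 8, 10)
col_l = islice(cols, 7, None, 2)
col_p = islice(cols, 8, None, 2)

# Split the remaining cells into runs separated by the section headings
for k, g in groupby(rows[13:], lambda x: x not in split_on):
    if k:  # a run of data cells rather than a heading
        l = next(col_l)
        p = next(col_p)

        # Cells alternate between letter and prize
        for letter, prize in grouper(g, 2):
            l.append(letter)
            p.append(prize)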

with open('output.csv', 'w', newline='') as f_output:
    csv.writer(f_output).writerows(zip_longest(*cols, fillvalue=''))

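This produces a CSV file which, when loaded into a spreadsheet package, has one column per entry in cols: the header (Date, FirstL, FirstP and so on) in the first row and the values beneath it.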

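Quite a few Python techniques are used here, and they will take a while to understand. grouper, for example, is an itertools recipe. islice() is a way to iterate over an object without having to start from the first position.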
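To make those two helpers more concrete, here is a minimal sketch that runs them on a small hand-written list instead of the scraped page. The sample values in cells and headers are made up for illustration; grouper is the same zip-based version used in the answer.

```python
from itertools import islice


def grouper(iterable, n):
    """Collect data into tuples of length n, dropping any incomplete tail."""
    args = [iter(iterable)] * n   # n references to the SAME iterator
    return zip(*args)


# Illustrative cells only: letters alternate with prize numbers.
cells = ['A', '----', 'B', '1859', 'C', '3006']
pairs = list(grouper(cells, 2))
print(pairs)

# islice(iterable, start, stop, step): begin at index 1 (or 2) and
# take every second item, so letter and prize columns are separated.
headers = ['Date', 'SpecialL', 'SpecialP', 'ConsolationL', 'ConsolationP']
letter_cols = list(islice(headers, 1, None, 2))
prize_cols = list(islice(headers, 2, None, 2))
print(letter_cols)
print(prize_cols)
```

Because every argument passed to zip() is the same iterator, each tuple consumes n consecutive items, which is what pairs each letter with the prize that follows it.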

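The final output to the CSV file is done with Python's csv library, which converts a list of rows into correctly formatted output lines. As the data is held as columns, a trick is needed to turn the list of columns into a list of rows, and this is done with zip_longest().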
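The column-to-row trick can be tried on its own. The sketch below uses three short, made-up columns and writes to an in-memory io.StringIO buffer rather than a file, purely so that the result can be printed; the names table_rows, buffer and csv_text are illustrative.

```python
import csv
import io
from itertools import zip_longest

# Illustrative columns of unequal length: header first, then values.
cols = [
    ['Date', '2019-01-15'],
    ['SpecialL', 'A', 'B', 'C'],
    ['SpecialP', '----', '1859', '3006'],
]

# zip_longest(*cols) takes the i-th item of every column to build row i,
# padding the shorter columns with '' so that no data is dropped.
table_rows = list(zip_longest(*cols, fillvalue=''))

buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(table_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Plain zip() would stop at the shortest column, so only the first two rows would be written; zip_longest() keeps going until the longest column is exhausted.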

It may help to add print statements to the code to see what the data looks like at each step.
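As one way of doing that, the following sketch prints what groupby() yields for a short, hand-written list of cells. The sample data is an assumption about how the cells are ordered (a heading followed by alternating letters and prizes), not a copy of the live page.

```python
from itertools import groupby

split_on = ['Special Prize', 'Consolation Prize']

# Illustrative cells: each heading is followed by letter/prize pairs.
cells = [
    'Special Prize', 'A', '----', 'B', '1859',
    'Consolation Prize', 'N', '1226', 'O', '0565',
]

groups = []
# The key is False for a heading and True for a data cell, so consecutive
# data cells are gathered into one group between two headings.
for is_data, group in groupby(cells, lambda x: x not in split_on):
    items = list(group)
    print(is_data, items)
    if is_data:
        groups.append(items)
```

Each group is a one-shot iterator, which is why it is copied into a list before being printed and stored.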

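Note that to save the data directly in an Excel format you would need to install another library: for example, openpyxl for .xlsx files or xlwt for the older .xls format.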
