
Saving the content of a webpage using BeautifulSoup and extracting data

I'm new to Python, and I'm trying to extract data from a web page containing the following information and export it to a CSV file for further processing:

SYNOPS from 65528, Odienne (Côte d'Ivoire)

202211070600 AAXX 07064 65528 42958 51202 10213 20208 39654 40126 85030 333 20209 58014 79999 85360=

202211061800 AAXX 06184 65528 11458 61404 10237 20214 39640 40108 69902 70296 83970 333 10326 58015 83815 81920 86360=

SYNOPS from 65536, Korhogo (Côte d'Ivoire)

202211070600 AAXX 07064 65536 42960 23402 10204 20204 39708 40122 80002 333 20203 58002 79999 82076=

202211061800 AAXX 06184 65536 11458 70402 10268 20204 39688 40095 69902 70162 83932 333 10340 58008 82613 82920 87076=

SYNOPS from 65545, Bondoukou (Côte d'Ivoire)

202211070600 AAXX 07064 65545 32958 ///// 10215 20206 39706 40124 80002 333 20212 59001 85076=

202211061800 AAXX 06184 65545 32958 ///// 10260 20213 39696 40107 80072 333 10325 58015 83360 88076=

SYNOPS from 65548, Man (Côte d'Ivoire)

202211070600 AAXX 07064 65548 42458 70000 10215 20215 39753 40121 86530 333 20215 59001 70140 86613=

202211061800 AAXX 06184 65548 31458 60000 10285 20234 39738 40097 71799 84933 333 10313 58011 84813 81913=

SYNOPS from 65555, Bouake (Côte d'Ivoire)

202211070600 AAXX 07064 65555 11440 32002 10210 20210 39698 40125 60084 71011 83502 333 20208 58003 83610=

202211061800 AAXX 06184 65555 11458 71402 10247 20237 39688 40109 60082 70196 8297/ 333 10315 58013 82813 81920 87360=

SYNOPS from 65557, Gagnoa (Côte d'Ivoire)

202211070600 AAXX 07064 65557 41458 41602 10211 20208 39881 40115 70296 84500 333 20211 59004 70250 84613=

202211061800 AAXX 06184 65557 13460 62204 10276 20234 39866 40094 69902 333 10320 58007 86613 81920=

SYNOPS from 65560, Daloa (Côte d'Ivoire)

202211070600 AAXX 07064 65560 32460 ///// 10220 20213 39811 40124 81572 333 20220 58002 81613 84360=

202211061800 AAXX 06184 65560 32460 ///// 10285 20219 39794 40100 83902 333 10324 58012 83815 81920 85076=

SYNOPS from 65562, Dimbokro (Côte d'Ivoire)

202211070600 AAXX 07064 65562 41209 60802 10230 20229 30017 40143 74442 86500 333 20222 58003 70090 83705 84613=

202211061800 AAXX 06184 65562 32460 53102 10305 20256 30003 40126 85900 333 10340 58016 85815 81920=

SYNOPS from 65563, Yamoussoukro (Côte d'Ivoire)

202211070600 AAXX 07064 65563 42956 30000 10208 20206 39882 40115 80001 333 20202 58004 70040 83076=

202211061800 AAXX 06184 65563 11458 70000 10234 20229 39868 40099 60042 79596 85903 333 10331 58014 85810 81920 87076=

SYNOPS from 65578, Abidjan (Côte d'Ivoire)

202211070600 AAXX 07064 65578 41358 60404 10245 20245 30112 40120 70362 83530 333 20244 58000 70010 81707 83620 86360=

202211061800 AAXX 06184 65578 32460 62006 10285 20254 30104 40112 85932 333 10297 58011 81917 85620 85075=

SYNOPS from 65585, Adiake (Côte d'Ivoire)

202211070600 AAXX 07064 65585 42458 ///// 10236 20232 30079 40117 82202 333 20232 58002 70080 82813 83076=

202211061800 AAXX 06184 65585 11460 ///// 10280 20247 30070 40108 60082 70262 85270 333 10315 58008 85813=

SYNOPS from 65592, Tabou (Côte d'Ivoire)

202211070600 AAXX 07064 65592 41458 71804 10239 20236 30097 40119 76266 8527/ 333 20239 58003 70150 85813 87360=

202211061800 AAXX 06184 65592 12460 71804 10276 20242 30083 40105 60052 83902 333 10300 58004 83813 81920 87076=

SYNOPS from 65594, San Pedro (Côte d'Ivoire)

202211070600 AAXX 07064 65594 42458 43604 10221 20221 30086 40120 82202 333 20220 58002 70100 82813 84080=

202211061800 AAXX 06184 65594 11458 53602 10271 20259 30071 40104 60102 70296 83202 333 10290 58002 83813 85080=

SYNOPS from 65599, Sassandra (Côte d'Ivoire)

202211070600 AAXX 07064 65599 41456 53202 10228 20224 30043 40118 70196 84202 333 20224 58000 70030 84813 84076=

202211061800 AAXX 06184 65599 31460 61802 10280 20252 30031 40104 70392 85902 333 10298 58001 85815 81920 84076=

It's not entirely clear how you want the data structured in the CSV, but any pandas DataFrame can be saved with .to_csv, for example:

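import pandas

# soup is assumed to be the BeautifulSoup object already built from the page
csvRows = []
for pCont in soup.find_all('pre'):
    # blocks are separated by a line of 80 '#' followed by '#  '
    csvRows += [(
        # first block: the three lines that describe the query
        ('[About Query]', '\n'.join([
            ql[1:].strip() for ql in pb.split('\n')[2:5]
        ])) if pi == 0 else tuple(pb.split(f"\n{'#'*80}\n")[:2])  # (header, lines)
    ) for pi, pb in enumerate(pCont.get_text().split(f"{'#'*80}\n#  "))]
pandas.DataFrame(csvRows, columns=[
    'Result Header', 'Result Lines'
]).to_csv('Cotey_synopsc.csv', index=False)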

This produces a CSV file with two columns, Result Header and Result Lines, and one row per block.
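
To make the splitting step easier to follow, here is a minimal standard-library sketch of the same idea applied to a small string. The sample_text layout (a preamble, then blocks introduced by a line of 80 '#' followed by '#  ') is my assumption, inferred from the separators used in the code above, and the helper name split_blocks is hypothetical. The real text returned by get_text() may differ.

```python
SEP = "#" * 80

# Hypothetical sample that mimics the layout the separators above imply.
# The report lines are copied from the question; the preamble is a placeholder.
sample_text = (
    f"{SEP}\n# (hypothetical preamble)\n"
    f"{SEP}\n#  SYNOPS from 65528, Odienne\n{SEP}\n"
    "202211070600 AAXX 07064 65528 42958 51202 10213 20208 39654 40126 "
    "85030 333 20209 58014 79999 85360=\n"
    f"{SEP}\n#  SYNOPS from 65536, Korhogo\n{SEP}\n"
    "202211070600 AAXX 07064 65536 42960 23402 10204 20204 39708 40122 "
    "80002 333 20203 58002 79999 82076=\n"
)


def split_blocks(text):
    """Return a list of (header, body) pairs, one per result block."""
    parts = text.split(f"{SEP}\n#  ")
    pairs = []
    for block in parts[1:]:  # parts[0] is the preamble, not a result block
        if f"\n{SEP}\n" not in block:
            continue  # unexpected layout, skip instead of failing
        header, body = block.split(f"\n{SEP}\n")[:2]
        pairs.append((header.strip(), body.strip()))
    return pairs


for header, body in split_blocks(sample_text):
    print(header, "->", body[:12])
```

Skipping blocks that lack the inner separator keeps every row at exactly two values, which is what a two-column DataFrame needs.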



However, I think it would be better [visually] to split it up a bit more, like this:

aboutQuery, qresBlocks, bHeads = [], [], []
for pci, pCont in enumerate(soup.find_all('pre')):
    # split the <pre> text into the query preamble and the result blocks
    pBlocks = pCont.get_text().split(f"{'#'*80}\n#  ")

    # preamble: lines 2-4 hold the query details (leading '#' dropped)
    aboutQuery.append({k: v for v, k in zip([pci]+[
        ql[1:].strip() for ql in pBlocks[0].split('\n')[2:5]
    ], ['qIndex', 'Query Time', 'Query', 'Query Interval'])})

    for pbi, pb in enumerate(pBlocks[1:]):
        if f"\n{'#'*80}\n" not in pb: continue # unknown format
        bhead, blines = pb.split(f"\n{'#'*80}\n")[:2]
        # header fields are separated by ' | '; empty ones are dropped
        bhead = bhead.strip().split(' | ')
        bHeads.append({k: v for k, v in [
            ('qIndex', pci), ('bIndex', pbi), ('from', bhead[0]), 
            ('N', ''.join(bhead[1:2])), ('W', ''.join(bhead[2:3])),
            ('m', ''.join(bhead[3:4])), ('extra', ' | '.join(bhead[4:]))
        ] if v != ''})
        # each report ends with '='; its first token is kept as 'date?'
        for bl in blines.split('=\n')[:-1]:
            line1 = bl.split('\n')[0].strip()
            ldate = line1.split(' ')[0]
            line1 = line1.replace(ldate, '', 1)
            line2 = '\n'.join([l.strip() for l in bl.split('\n')[1:]])
            if line2 == '' and line1.endswith(' NIL'):
                line1, line2 = line1[:-4], 'NIL'
            qresBlocks.append({
                'qIndex': pci, 'bIndex': pbi, 'date?': ldate,
                'firstLine': line1, 'lastLine(/s)': f'{line2}='
            })

listsByName = [
    ('About Query', aboutQuery), 
    ('Result Blocks - Headers', bHeads), 
    ('Result Blocks - Lines', qresBlocks)
]

You can then save each list to a separate CSV file

for name, nList in listsByName:
    pandas.DataFrame(nList).to_csv(f'{name}.csv', index=False)

or to separate sheets of a single Excel file

# writing .xlsx requires an Excel engine such as openpyxl to be installed
with pandas.ExcelWriter('Cote_synopsc.xlsx') as w:
    for name, nList in listsByName:
        pandas.DataFrame(nList).to_excel(w, sheet_name=name, index=False)

The output consists of three tables: "About Query", "Result Blocks - Headers" and "Result Blocks - Lines".
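
If you want to process the reports further, it may help to break each report line into columns of its own. The following is only a sketch using the standard library: the function names and column labels (timestamp, marker, day_hour_group, station, groups) are mine and simply describe token positions in the lines shown in the question. They are not an official decoding of the SYNOP format.

```python
import csv
import io
from datetime import datetime

# Column labels are my own and only describe token positions.
FIELDS = ["timestamp", "marker", "day_hour_group", "station", "groups"]


def parse_reports(text):
    """Split '='-terminated report lines into one dict per report."""
    rows = []
    for chunk in text.split("="):  # also handles two reports on one line
        tokens = chunk.split()
        if len(tokens) < 4:
            continue  # empty trailing chunk or unexpected layout
        stamp, marker, day_hour, station = tokens[:4]
        rows.append({
            # first token is assumed to be YYYYMMDDHHMM
            "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M").isoformat(),
            "marker": marker,
            "day_hour_group": day_hour,
            "station": station,
            "groups": " ".join(tokens[4:]),  # remaining groups, undecoded
        })
    return rows


def rows_to_csv(rows):
    """Serialise the parsed rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Two report lines copied from the question.
sample = (
    "202211070600 AAXX 07064 65528 42958 51202 10213 20208 39654 40126 "
    "85030 333 20209 58014 79999 85360=\n"
    "202211061800 AAXX 06184 65528 11458 61404 10237 20214 39640 40108 "
    "69902 70296 83970 333 10326 58015 83815 81920 86360=\n"
)

print(rows_to_csv(parse_reports(sample)))
```

The same rows could equally be passed to pandas.DataFrame and saved with .to_csv, as in the code above.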

