
How to output parsed HTML into a file?

(Update) After asking for help, I now have the code below. I can write to a CSV file, but I can't get the CSV to have the right number of columns:

soup = BeautifulSoup(html_doc)
import csv
outfile=csv.writer(open('outputrows.csv','wb'),delimiter='\t')
#def get_movie_info(imdb):
tbl = soup.find('table')
rows = tbl.findAll('tr')
list=[]
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        if col.has_attr('class') and col['class'][0] == 'title':
            spans = col.find_all('span')
            for span in spans:
                if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                    ID = span.get('data-tconst')
                    list.append(ID)
        elif col.has_attr('class') and col['class'][0] == 'number':
            rank = col.text
            list.append(rank)            
        elif col.has_attr('class') and col['class'][0] == 'image':
            hrefs = col.find_all('a')
            for href in hrefs:
                moviename = href.get('title')
                list.append(moviename)

outfile.writerows(list)

print list

The problem is that it writes the output in this format, as a single column of data:

1.
The Shawshank Redemption (1994)
tt0111161
2.
The Dark Knight (2008)
tt0468569
3.
Inception (2010)
tt1375666

What I want is three columns of data, like this:

1.   The Shawshank Redemption (1994)   tt0111161
2.   The Dark Knight (2008)   tt0468569
3.   Inception (2010)   tt1375666

Sample HTML:

 <tr class="odd detailed">
     <td class="number">
      48.
     </td>
     <td class="image">
      <a href="/title/tt0082971/" title="Raiders of the Lost Ark (1981)">
       <img alt="Raiders of the Lost Ark (1981)" height="74" src="http://ia.media-imdb.com/images/M/MV5BMjA0ODEzMTc1Nl5BMl5BanBnXkFtZTcwODM2MjAxNA@@._V1._SX54_CR0,0,54,74_.jpg" title="Raiders of the Lost Ark (1981)" width="54"/>
      </a>
     </td>
     <td class="title">
      <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971">
      </span>
      <a href="/title/tt0082971/">
       Raiders of the Lost Ark
      </a>
      <span class="year_type">
       (1981)
      </span>
      <br/>

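Have you tried making get_movie_info return a list of rows instead of printing them? Each table row should become one inner list, so that writerows receives one record per movie: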

def get_movie_info():
    returnedRows = []
    tbl = soup.find('table')
    rows = tbl.find_all('tr')
    for row in rows:
        record = []                                               # one list per <tr>
        cols = row.find_all('td')
        for col in cols:
            if col.has_attr('class') and col['class'][0] == 'image':
                for href in col.find_all('a'):
                    record.append(href.get('title'))              # <-- append 'title'
            elif col.has_attr('class') and col['class'][0] == 'title':
                for span in col.find_all('span'):
                    if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                        record.append(span.get('data-tconst'))    # <-- append 'tconst'
            elif col.has_attr('class') and col['class'][0] == 'number':
                record.append(col.text.strip())                   # <-- append 'number'
        if record:                                                # skip rows with no matching cells
            returnedRows.append(record)
    return returnedRows                                           # <-- then return the list of rows

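Then run it like this:

import csv
# Python 3: text mode with newline='' (the original used 'wb', the Python 2 form)
with open('outputrows.tsv', 'w', newline='') as f:
    outfile = csv.writer(f, delimiter='\t')
    outfile.writerows(get_movie_info())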
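
What matters in this answer is the shape of the data: csv.writer.writerows expects an iterable of rows, each row being a sequence of fields, so a flat list produces one record per item instead of three fields per record. A minimal sketch of that point, assuming the scraped values arrive as one flat list in rank, title, ID order. The helper name chunk_rows is hypothetical, and io.StringIO stands in for the output file:

```python
import csv
import io

def chunk_rows(flat, width=3):
    """Group a flat list into consecutive rows of `width` fields (hypothetical helper)."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Sample values copied from the question's output.
flat = [
    '1.', 'The Shawshank Redemption (1994)', 'tt0111161',
    '2.', 'The Dark Knight (2008)', 'tt0468569',
    '3.', 'Inception (2010)', 'tt1375666',
]

rows = chunk_rows(flat)

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerows(rows)  # one record per inner list, three tab-separated fields each
tsv_text = buffer.getvalue()
print(tsv_text)
```

Chunking after the fact only works if every movie contributes exactly three values. Building one list per table row while parsing avoids that assumption.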

Could you try this? It is not an optimized solution, but it should get the job done:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

def get_movie_info():
    tbl = soup.find('table')
    rows = tbl.find_all('tr')
    for row in rows:
        (imageTitle, dataTConst, number) = ('', '', '')
        cols = row.find_all('td')
        if not cols:
            continue  # header rows have no <td> cells
        for col in cols:
            if col.has_attr('class') and col['class'][0] == 'image':
                href = col.find('a')
                if href is not None:
                    imageTitle = href.get('title')
            elif col.has_attr('class') and col['class'][0] == 'title':
                span = col.find('span', class_='wlb_wrapper')
                if span is not None:
                    dataTConst = span.get('data-tconst')
            elif col.has_attr('class') and col['class'][0] == 'number':
                number = col.text.strip()

        # column order follows the layout requested in the question
        yield (number, imageTitle, dataTConst)

#################################################
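import csv
# Python 3: text mode with newline='' (the original used 'wb', the Python 2 form)
with open('outputrows.csv', 'w', newline='') as f:
    outfile = csv.writer(f, delimiter='\t')
    for row in get_movie_info():
        outfile.writerow(row)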
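
The generator approach hands the writer one tuple per table row. A sketch of just the writing side under Python 3, assuming a stand-in generator fake_movie_rows (hypothetical, yielding the sample values from the question in place of parsed HTML) and a temporary file path:

```python
import csv
import os
import tempfile

def fake_movie_rows():
    """Hypothetical stand-in for get_movie_info(): yields (rank, title, id) tuples."""
    yield ('1.', 'The Shawshank Redemption (1994)', 'tt0111161')
    yield ('2.', 'The Dark Knight (2008)', 'tt0468569')
    yield ('3.', 'Inception (2010)', 'tt1375666')

path = os.path.join(tempfile.mkdtemp(), 'outputrows.tsv')

# Python 3: text mode plus newline='' lets the csv module control line endings.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    for row in fake_movie_rows():
        writer.writerow(row)

# Read the file back to check that each record has three fields.
with open(path, newline='', encoding='utf-8') as f:
    records = list(csv.reader(f, delimiter='\t'))
print(records)
```

Because rows are written as they are produced, the full table never has to be held in memory, which is the main reason to prefer yield over building a list here.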

Here is a simple approach:

#!/usr/bin/env python

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'some_url.html'
r = requests.get(url)

movie_id = []
title = []
year = []

bs = BeautifulSoup(r.text, 'html.parser')
# each <td class="title"> holds the link, the title text, and the year span
for movie in bs.find_all('td', 'title'):
    link = movie.find('a')
    movie_id.append(link.get('href').split('/')[2])  # '/title/tt0082971/' -> 'tt0082971'
    title.append(link.get_text(strip=True))
    year.append(movie.find('span', 'year_type').get_text(strip=True))

movie_dic = {'movie_id': movie_id, 'title': title, 'year': year}
movie_data = pd.DataFrame(movie_dic)

file_name = "~/movies.txt"
# index=False keeps the row index out of the file, leaving three columns
movie_data.to_csv(file_name, sep=',', header=True, index=False, encoding='utf-8', mode='w')
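
This answer collects three parallel lists and lets pandas line them up as columns. For comparison, here is a sketch of the same idea without pandas, using zip and the csv module (a swapped-in technique, not what the answer itself does). The href values are assumed from the link pattern in the sample HTML, and io.StringIO stands in for the output file:

```python
import csv
import io

# Assumed sample links, following the '/title/<id>/' pattern from the sample HTML.
hrefs = ['/title/tt0111161/', '/title/tt0468569/', '/title/tt1375666/']
titles = ['The Shawshank Redemption', 'The Dark Knight', 'Inception']
years = ['(1994)', '(2008)', '(2010)']

# '/title/tt0111161/'.split('/') gives ['', 'title', 'tt0111161', ''], so index 2 is the ID.
movie_ids = [href.split('/')[2] for href in hrefs]

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['movie_id', 'title', 'year'])   # header row
writer.writerows(zip(movie_ids, titles, years))  # zip turns three columns into rows
csv_text = buffer.getvalue()
print(csv_text)
```

zip stops at the shortest list, so a movie with a missing year would silently shift or drop data. Collecting all fields per table row, as in the earlier answers, is more robust when cells can be absent.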
