How do I write parsed HTML out to a file?

Question

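(Update) After asking for help, I now have the following code. I can write to a csv file, but I can't seem to get the csv to have the right number of columns: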

soup = BeautifulSoup(html_doc)
import csv
outfile=csv.writer(open('outputrows.csv','wb'),delimiter='\t')
#def get_movie_info(imdb):
tbl = soup.find('table')
rows = tbl.findAll('tr')
list=[]
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        if col.has_attr('class') and col['class'][0] == 'title':
            spans = col.find_all('span')
            for span in spans:
                if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                    ID = span.get('data-tconst')
                    list.append(ID)
        elif col.has_attr('class') and col['class'][0] == 'number':
            rank = col.text
            list.append(rank)            
        elif col.has_attr('class') and col['class'][0] == 'image':
            hrefs = col.find_all('a')
            for href in hrefs:
                moviename = href.get('title')
                list.append(moviename)

outfile.writerows(list)

print list

The problem is that it outputs the data in this format, as a single column:

1.
The Shawshank Redemption (1994)
tt0111161
2.
The Dark Knight (2008)
tt0468569
3.
Inception (2010)
tt1375666

whereas I want three columns of data, like this:

1.   The Shawshank Redemption (1994)   tt0111161
2.   The Dark Knight (2008)   tt0468569
3.   Inception (2010)   tt1375666

Sample HTML:

 <tr class="odd detailed">
     <td class="number">
      48.
     </td>
     <td class="image">
      <a href="/title/tt0082971/" title="Raiders of the Lost Ark (1981)">
       <img alt="Raiders of the Lost Ark (1981)" height="74" src="http://ia.media-imdb.com/images/M/MV5BMjA0ODEzMTc1Nl5BMl5BanBnXkFtZTcwODM2MjAxNA@@._V1._SX54_CR0,0,54,74_.jpg" title="Raiders of the Lost Ark (1981)" width="54"/>
      </a>
     </td>
     <td class="title">
      <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971">
      </span>
      <a href="/title/tt0082971/">
       Raiders of the Lost Ark
      </a>
      <span class="year_type">
       (1981)
      </span>
      <br/>

Answer 1

Have you tried making get_movie_info return a list of the rows it prints?

def get_movie_info():
    returnedRows = []
    tbl = soup.find('table')
    rows = tbl.find_all('tr')
    for row in rows:
        currentRow = []                                             # <-- one list per table row
        cols = row.find_all('td')
        for col in cols:
            if col.has_attr('class') and col['class'][0] == 'image':
                hrefs = col.find_all('a')
                for href in hrefs:
                    print(href.get('title'))
                    currentRow.append(href.get('title'))            # <-- append 'title'
            elif col.has_attr('class') and col['class'][0] == 'title':
                spans = col.find_all('span')
                for span in spans:
                    if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                        print(span.get('data-tconst'))
                        currentRow.append(span.get('data-tconst'))  # <-- append 'tconst'
            elif col.has_attr('class') and col['class'][0] == 'number':
                print(col.text.strip())
                currentRow.append(col.text.strip())                 # <-- append 'number'
        if currentRow:                                              # <-- skip rows with no matching cells
            returnedRows.append(currentRow)
    return returnedRows                                             # <-- then return the list of rows

Then run it like this:

import csv
# Python 3: open in text mode with newline=''; on Python 2 open with 'wb' instead.
with open('outputrows.tsv', 'w', newline='') as f:
    outfile = csv.writer(f, delimiter='\t')
    outfile.writerows(get_movie_info())
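
As background for the answer above: `csv.writer.writerows` treats each element of its argument as one row and iterates over that element to get the fields, so a flat list of strings never yields three columns. The following is only a minimal sketch, assuming the flat list from the question always holds exactly three values per movie in the order rank, title, ID; the helper name `chunk` is made up for illustration.

```python
import csv
import io

def chunk(values, size):
    """Split a flat list into consecutive rows of `size` items (hypothetical helper)."""
    return [values[i:i + size] for i in range(0, len(values), size)]

# Flat list in the order the question's loop produces it: rank, title, ID.
flat = [
    "1.", "The Shawshank Redemption (1994)", "tt0111161",
    "2.", "The Dark Knight (2008)", "tt0468569",
    "3.", "Inception (2010)", "tt1375666",
]

rows = chunk(flat, 3)

# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)  # each inner list becomes one tab-separated line
tsv_text = buffer.getvalue()
print(tsv_text)
```

Regrouping after the fact breaks as soon as one movie is missing a value, which is why building one list per table row while parsing is usually the safer choice.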

Answer 2

Could you try this (not an optimized solution, but it should get the job done):

from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(html_doc, 'html.parser')

def get_movie_info():
    tbl = soup.find('table')
    rows = tbl.find_all('tr')
    for row in rows:
        (number, imageTitle, dataTConst) = ('', '', '')
        cols = row.find_all('td')
        for col in cols:
            if col.has_attr('class') and col['class'][0] == 'image':
                href = col.find('a')
                if href is not None:
                    imageTitle = href.get('title')
            elif col.has_attr('class') and col['class'][0] == 'title':
                span = col.find('span', 'wlb_wrapper')
                if span is not None:
                    dataTConst = span.get('data-tconst')
            elif col.has_attr('class') and col['class'][0] == 'number':
                number = col.text.strip()

        # Yield one tuple per table row, in the order the question asks for: rank, title, ID.
        yield (number, imageTitle, dataTConst)

#################################################
# Python 3: open in text mode with newline=''; on Python 2 open with 'wb' instead.
with open('outputrows.csv', 'w', newline='') as f:
    outfile = csv.writer(f, delimiter='\t')
    for row in get_movie_info():
        outfile.writerow(row)
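
The same idea can be written with direct lookups per row instead of looping over every cell. This is only a sketch, assuming Beautiful Soup 4 is installed as `bs4` and that the markup follows the sample row from the question; the function name `extract_rows` and the trimmed `html_doc` below are made up for illustration.

```python
import csv
import io
from bs4 import BeautifulSoup

# Trimmed copy of the sample row from the question, wrapped in a table.
html_doc = """
<table>
 <tr class="odd detailed">
  <td class="number">48.</td>
  <td class="image">
   <a href="/title/tt0082971/" title="Raiders of the Lost Ark (1981)"></a>
  </td>
  <td class="title">
   <span class="wlb_wrapper" data-tconst="tt0082971"></span>
   <a href="/title/tt0082971/">Raiders of the Lost Ark</a>
  </td>
 </tr>
</table>
"""

def extract_rows(markup):
    """Yield (rank, title, imdb_id) for each table row that has all three cells."""
    soup = BeautifulSoup(markup, "html.parser")
    for tr in soup.find_all("tr"):
        number = tr.find("td", class_="number")
        link = tr.select_one("td.image a")
        span = tr.select_one("td.title span.wlb_wrapper")
        if number is None or link is None or span is None:
            continue  # skip header rows or rows with an unexpected layout
        yield (number.get_text(strip=True), link.get("title"), span.get("data-tconst"))

movies = list(extract_rows(html_doc))

# Write to an in-memory buffer; a real file opened with newline="" works the same way.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(movies)
print(buffer.getvalue())
```

Skipping rows that lack any of the three cells keeps header rows out of the output, at the cost of silently dropping rows whose layout differs from the sample.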

Answer 3

Here is a simple way:

#!/usr/bin/env python

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'some_url.html'
r = requests.get(url)

movie_id = []
title = []
year = []

bs = BeautifulSoup(r.text, 'html.parser')
# Each <td class="title"> cell holds the link (ID and title) and the year span.
for movie in bs.find_all('td', 'title'):
    movie_id.append(movie.find('a').get('href').split('/')[2])
    title.append(movie.find('a').contents[0].strip())
    year.append(movie.find('span', 'year_type').contents[0].strip())

movie_dic = {'movie_id': movie_id, 'title': title, 'year': year}
movie_data = pd.DataFrame(movie_dic, index=None)

file_name = "~/movies.txt"
movie_data.to_csv(file_name, sep=',', header=True, encoding='utf-8', mode='w')
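
If pandas is not available, three parallel lists like the ones built above can also be written with the standard `csv` module. A minimal sketch, using made-up sample values in place of the scraped ones and an in-memory buffer in place of a real file:

```python
import csv
import io

# Parallel lists shaped like the ones built in the loop above (sample values only).
movie_id = ["tt0111161", "tt0468569"]
title = ["The Shawshank Redemption", "The Dark Knight"]
year = ["(1994)", "(2008)"]

# Swap the buffer for open("movies.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["movie_id", "title", "year"])  # header row
writer.writerows(zip(movie_id, title, year))    # one row per movie
csv_text = buffer.getvalue()
print(csv_text)
```

`zip` stops at the shortest list, so a movie with a missing field would shorten the output rather than raise an error.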


Answer 1
1 2014-02-06 02:42:39

Answer 2
1 accepted 2014-02-06 03:16:12

Answer 3
1 2014-02-06 04:06:28
