Can't write more than one line to CSV

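I've built a web scraper that extracts all the images on a web page. My code is supposed to print each img URL to standard output and write all of them to a CSV file, but right now it only writes the last image found, together with its result number, to the CSV.

Here's the code I'm currently using: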

# This program prints a list of all images contained in a web page 
#imports library for url/html recognition
from urllib.request import urlopen
from HW_6_CSV import writeListToCSVFile
#imports library for regular expressions
import re
#imports for later csv writing
import csv
#gets user input
address = input("Input a url for a page to get your list of image urls       ex. https://www.python.org/:  ")
#opens Web Page for processing
webPage = urlopen(address)
#defines encoding
encoding = "utf-8"
#defines resultList variable
resultList=[]
#sets i for later printing
i=0
#defines logic flow
for line in webPage :
   line = str(line, encoding)
   #defines imgTag
   imgTag = '<img '
   #goes to next piece of logical flow
   if imgTag in line :
      i = i+1
      srcAttribute = 'src="'
      if srcAttribute in line:
      #parses the html retrieved from user input 
       m = re.search('src="(.+?)"', line)
       if m:
          reline = m.group(1)
          #prints results
          print("[ ",[i], reline , " ]")

data = [[i, reline]]

output_file = open('examp_output.csv', 'w')
datawriter = csv.writer(output_file)
datawriter.writerows(data)
output_file.close()
webPage.close()

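How can I get this program to write all of the images it finds to the CSV file?

You only see the last result in the CSV because data is never updated inside the for loop: it is assigned once, after the loop has finished, so it holds only the final values of i and reline. To collect every match, move that line into the loop (indent it) and use the list's append or extend method instead of reassigning the list.

So if you rewrite the loop as: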

img_nbr = 0  # try to avoid using `i` as the name of an index. It'll save you so much time if you ever find you need to replace this identifier with another one if you chose a better name
data = []
imgTag = '<img ' # no need to redefine this variable each time in the loop
srcAttribute = 'src="' # same comment applies here

for line in webPage:
   line = str(line, encoding)
   if imgTag in line :
      img_nbr += 1  # += saves you typing a few keystrokes and a possible future find-replace.
      #if srcAttribute in line:  # this check and the next do nearly the same: get rid of one
      m = re.search('src="(.+?)"', line)
      if m:
          reline = m.group(1)
          print("[{}: {}]".format(img_nbr, reline)) # `format` is the suggested way to build strings. It's been around since Python 2.6.
          data.append((img_nbr, reline)) # This is what you really missed.

you will get better results. I added some comments with a few suggestions on coding style, and removed your original comments so the new ones stand out.
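For completeness, here is a minimal sketch that puts the rewritten loop together with the CSV-writing step. The function names collect_image_rows and write_rows are hypothetical, chosen only for illustration. Opening the file with newline='' and a with block follows the csv module's documented recommendation; it is not part of the original code.

```python
import csv
import re

IMG_TAG = '<img '
SRC_PATTERN = re.compile(r'src="(.+?)"')


def collect_image_rows(lines, encoding="utf-8"):
    """Return one (image number, src URL) tuple per line containing an img tag with a src."""
    rows = []
    img_nbr = 0
    for raw in lines:
        # urlopen() yields bytes; plain str lines are accepted too (handy for testing).
        line = raw.decode(encoding) if isinstance(raw, bytes) else raw
        if IMG_TAG in line:
            img_nbr += 1
            m = SRC_PATTERN.search(line)
            if m:
                # Appending inside the loop keeps every match, not just the last one.
                rows.append((img_nbr, m.group(1)))
    return rows


def write_rows(rows, path):
    """Write all collected rows to a CSV file in one call."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='') as output_file:
        csv.writer(output_file).writerows(rows)
```

With the question's variables, this would be used as write_rows(collect_image_rows(webPage), 'examp_output.csv').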

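However, there are still a few issues with your code: you should not parse HTML with regular expressions unless the source is particularly well structured (and even then...). Since you are asking for user input, they can supply any URL, and web pages are more often than not poorly structured. If you want to build a more robust web scraper, I recommend taking a look at BeautifulSoup.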
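To illustrate what parser-based extraction looks like, here is a small sketch. BeautifulSoup is a third-party package, so this example deliberately swaps in the standard library's html.parser to stay self-contained. The names ImgSrcCollector and extract_img_sources are hypothetical. Unlike the line-by-line regex, it does not care whether a tag spans several lines, uses single quotes, or writes the attribute name in upper case.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_img_sources(html_text):
    """Return the src values of all img tags in html_text, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources
```

With the question's variables, the whole page would be decoded first, for example extract_img_sources(webPage.read().decode(encoding)). With BeautifulSoup installed, the rough equivalent is to iterate over soup.find_all('img') and read each tag's src attribute.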

