
Can't write more than one line to CSV

I've built a web-scraper that extracts all images on a website. My code is supposed to print each img URL to the standard output and write all of them to a CSV file, but right now it only writes the last image found (and that image's number) to the CSV.

Here's the code I'm currently using:

# This program prints a list of all images contained in a web page 
#imports library for url/html recognition
from urllib.request import urlopen
from HW_6_CSV import writeListToCSVFile
#imports library for regular expressions
import re
#imports for later csv writing
import csv
#gets user input
address = input("Input a url for a page to get your list of image urls       ex. https://www.python.org/:  ")
#opens Web Page for processing
webPage = urlopen(address)
#defines encoding
encoding = "utf-8"
#defines resultList variable
resultList=[]
#sets i for later printing
i=0
#defines logic flow
for line in webPage:
    line = str(line, encoding)
    #defines imgTag
    imgTag = '<img '
    #goes to next piece of logical flow
    if imgTag in line:
        i = i + 1
        srcAttribute = 'src="'
        if srcAttribute in line:
            #parses the html retrieved from user input
            m = re.search('src="(.+?)"', line)
            if m:
                reline = m.group(1)
                #prints results
                print("[ ", [i], reline, " ]")

data = [[i, reline]]

output_file = open('examp_output.csv', 'w')
datawriter = csv.writer(output_file)
datawriter.writerows(data)
output_file.close()
webPage.close()

How do I get this program to write all of the images found to a CSV file?

You're only seeing the last result in your csv because data is never updated within the scope of the for-loop: you only write to it once, after you've exited the loop. To get all the relevant pieces of the HTML added to your list data, you should move that line inside the loop and use the list's append or extend method.

So if you rewrite the loop as:

img_nbr = 0  # avoid `i` as an index name: a descriptive name saves a find-and-replace later
data = []
imgTag = '<img '        # no need to redefine this variable on every loop iteration
srcAttribute = 'src="'  # same comment applies here

for line in webPage:
    line = str(line, encoding)
    if imgTag in line:
        img_nbr += 1  # += saves a few keystrokes and a possible future find-and-replace
        # the srcAttribute check and the regex below do nearly the same thing: keep only one
        m = re.search('src="(.+?)"', line)
        if m:
            reline = m.group(1)
            print("[{}: {}]".format(img_nbr, reline))  # `format` is the suggested way to build strings; it has been around since Python 2.6
            data.append((img_nbr, reline))  # this is the line you were missing

you'll get better results. I've added a few comments with suggestions for your coding style, and removed your original comments so the new ones stand out.
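With data collected as a list of (number, url) tuples, the CSV-writing tail of the original program works essentially unchanged. A minimal sketch, with made-up sample data standing in for what the loop would have collected (the with statement and newline="" are small improvements over the original open/close pair):

```python
import csv

# hypothetical sample of what the loop above might have collected
data = [(1, "https://www.python.org/static/img/python-logo.png"),
        (2, "banner.jpg")]

# newline="" is the csv module's recommended way to open output files;
# it prevents blank rows from appearing between records on Windows.
# The with statement closes the file even if writing raises an error.
with open("examp_output.csv", "w", newline="") as output_file:
    csv.writer(output_file).writerows(data)
```

Each tuple in data becomes one row of the CSV, so every image found ends up in the file, not just the last one.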

However, your code still has a few problems: HTML should not be parsed with regular expressions unless the source is extremely well-structured (and even then...). Since you're asking the user for input, they could supply any url, and real-world pages are more often than not poorly structured. I suggest you have a look at BeautifulSoup if you'd like to build more robust web-scrapers.
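To illustrate what tag-aware parsing buys you over a regex, here is a rough sketch using only the standard library's html.parser (BeautifulSoup does the same job with less ceremony and handles malformed markup better, but it is a third-party package). The HTML string is made up for demonstration:

```python
import csv
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

# made-up HTML; note the <img> tags need not sit on separate lines,
# which is exactly what breaks a line-by-line regex approach
html = '<p>hello</p><img src="a.png"><img alt="logo" src="b.jpg">'

parser = ImgSrcCollector()
parser.feed(html)
data = list(enumerate(parser.srcs, start=1))  # [(1, 'a.png'), (2, 'b.jpg')]
```

With BeautifulSoup the collector class collapses to a single soup.find_all("img") call, which is why it is the usual recommendation for scraping.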
