
Need to write scraped data into csv file (threading)

Here is my code:

from download1 import download
import threading
import lxml.html

# Field names scraped from every country page; also the intended CSV header
Fields = ['country', 'area', 'population', 'iso', 'capital', 'continent', 'tld', 'currency_code',
          'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours']

def getInfo(initial, ending):
    for Number in range(initial, ending):
        url = 'http://example.webscraping.com/places/default/view/%d' % Number
        html = download(url)
        tree = lxml.html.fromstring(html)
        results = []
        for field in Fields:
            x = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
            results.append(x)  # should i start writing here?

downloadthreads = []
for i in range(1, 252, 63):  # create 4 threads
    # range() excludes its end value, so pass i + 63 to keep the chunks contiguous
    downloadThread = threading.Thread(target=getInfo, args=(i, i + 63))
    downloadthreads.append(downloadThread)
    downloadThread.start()

for threadobj in downloadthreads:
    threadobj.join()  # wait for each thread to finish

print("Done")

So results will hold the values for Fields. I need to write Fields as the top row (only once), followed by the values in results, to a CSV file. I am not sure I can open the file inside the function, because the threads would open it several times simultaneously.

Note: I know threading isn't desirable when crawling, but I am just testing.

I think you should consider using some kind of queuing or thread pools. Thread pools are really useful if you want to create several threads (I think you would use more than 4 in total, but run only 4 at a time).

An example of the Queue technique can be found here.
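As a rough illustration of the queue idea applied to this question, here is a minimal sketch written for Python 3 (the question's code is Python 2). The scraper threads only put rows on a queue, and a single writer thread is the only one that opens the CSV file, so the header is written exactly once. The names fetch_row, worker, writer and scrape_to_csv, the shortened field list, and the temporary output path are all assumptions made for the sketch; fetch_row is a stand-in for the real download() and cssselect() calls so that the example runs without network access.

```python
import csv
import os
import queue
import tempfile
import threading

FIELDS = ['country', 'area', 'population']  # shortened list, for the sketch only
SENTINEL = None  # tells the writer thread that no more rows are coming


def fetch_row(number):
    # Hypothetical stand-in for download() + cssselect(); returns one CSV row.
    return ['%s_%d' % (field, number) for field in FIELDS]


def worker(initial, ending, row_queue):
    # Scraper thread: produces rows and never touches the file.
    for number in range(initial, ending):
        row_queue.put(fetch_row(number))


def writer(path, row_queue):
    # The only thread that opens the file, so the header is written once.
    with open(path, 'w', newline='') as f:
        out = csv.writer(f)
        out.writerow(FIELDS)
        while True:
            row = row_queue.get()
            if row is SENTINEL:
                break
            out.writerow(row)


def scrape_to_csv(path, first=1, last=9, chunk=2):
    row_queue = queue.Queue()
    writer_thread = threading.Thread(target=writer, args=(path, row_queue))
    writer_thread.start()

    workers = []
    for start in range(first, last, chunk):
        end = min(start + chunk, last)
        t = threading.Thread(target=worker, args=(start, end, row_queue))
        workers.append(t)
        t.start()

    for t in workers:
        t.join()
    row_queue.put(SENTINEL)  # all workers are done; let the writer finish
    writer_thread.join()


csv_path = os.path.join(tempfile.mkdtemp(), 'countries.csv')
scrape_to_csv(csv_path)
```

Rows arrive in whatever order the threads finish, so the file is not sorted by page number; if order matters, the page number could be put on the queue along with the row and sorted afterwards.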

Of course, you can label each file with its thread's ID, for example: "results_1.txt", "results_2.txt" and so on. Then you can merge them after all threads have finished.
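The one-file-per-thread approach might look roughly like the following Python 3 sketch. Each thread writes only to its own part file, and the main thread writes the header once while merging. The helper names (fetch_row, scrape_to_part, merge_parts), the shortened field list, the .csv extension and the temporary directory are assumptions for illustration; fetch_row stands in for the real scraping code.

```python
import csv
import os
import tempfile
import threading

FIELDS = ['country', 'area', 'population']  # shortened list, for the sketch only


def fetch_row(number):
    # Hypothetical stand-in for download() + cssselect(); returns one CSV row.
    return ['%s_%d' % (field, number) for field in FIELDS]


def scrape_to_part(initial, ending, part_path):
    # Each thread writes only to its own file, so no locking is needed.
    with open(part_path, 'w', newline='') as f:
        out = csv.writer(f)
        for number in range(initial, ending):
            out.writerow(fetch_row(number))


def merge_parts(part_paths, merged_path):
    # Runs in the main thread after every worker has finished.
    with open(merged_path, 'w', newline='') as merged:
        out = csv.writer(merged)
        out.writerow(FIELDS)  # header written exactly once
        for part_path in part_paths:
            with open(part_path, newline='') as part:
                out.writerows(csv.reader(part))


work_dir = tempfile.mkdtemp()
part_paths = []
threads = []
for index, start in enumerate(range(1, 9, 2), 1):
    part_path = os.path.join(work_dir, 'results_%d.csv' % index)
    part_paths.append(part_path)
    t = threading.Thread(target=scrape_to_part,
                         args=(start, start + 2, part_path))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

merged_path = os.path.join(work_dir, 'results.csv')
merge_parts(part_paths, merged_path)
```

Because the parts are merged in thread order, the final file keeps the pages in ascending order, which the shared-file approaches do not guarantee.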

You can also use basic concepts such as Lock, Monitor, and so on; however, I am not the biggest fan of them. An example of locking can be found here.
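For completeness, a lock-based version could be sketched as follows in Python 3. The main thread opens the file and writes the header before any worker starts, and every worker shares the same csv writer but only calls writerow() while holding a threading.Lock. The names fetch_row, get_info and write_lock, the shortened field list and the temporary path are assumptions for the sketch; fetch_row stands in for the real scraping code.

```python
import csv
import os
import tempfile
import threading

FIELDS = ['country', 'area', 'population']  # shortened list, for the sketch only
write_lock = threading.Lock()  # guards every write to the shared CSV writer


def fetch_row(number):
    # Hypothetical stand-in for download() + cssselect(); returns one CSV row.
    return ['%s_%d' % (field, number) for field in FIELDS]


def get_info(initial, ending, out):
    for number in range(initial, ending):
        row = fetch_row(number)  # the slow part runs without holding the lock
        with write_lock:         # only one thread writes at a time
            out.writerow(row)


csv_path = os.path.join(tempfile.mkdtemp(), 'countries.csv')
with open(csv_path, 'w', newline='') as f:
    out = csv.writer(f)
    out.writerow(FIELDS)  # header written once, before any thread starts

    threads = [threading.Thread(target=get_info, args=(start, start + 2, out))
               for start in range(1, 9, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the file stays open until every worker is done
```

Holding the lock only around writerow() keeps the downloads running in parallel; the file is opened once by the main thread rather than inside the thread function.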
