
Run a python script on amazon EC2 or other server

I am working on a project in Python that is starting to overwhelm my low-end Windows laptop, and I wanted to ask for advice about how to find the additional computing power I think I need.

Here are some details about my project: I am processing and analyzing a fairly large database of text from the web, approximately 10,000 files, each averaging roughly 500 words (though with a lot of variance around that mean). The first step is pulling certain key phrases and using GenSim to do a fairly simple similarity analysis. This takes my computer a while, but it can handle it if I'm gentle. Second, once I have identified a short list of candidates, I fingerprint each candidate document to more closely assess similarity. Each file requires fingerprinting and comparison against 2-10 other files, so it's not really an n-to-n comparison of the sort that would require months of computer time, I don't think.
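(For reference, the GenSim part of that first step is just the standard bag-of-words similarity recipe. A minimal sketch, with made-up example documents standing in for my real word bags, looks roughly like this:)

from gensim import corpora, models, similarities

# Tokenized documents (placeholders for the real word bags).
texts = [["budget", "amendment", "vote"], ["tax", "amendment", "senate"]]

dictionary = corpora.Dictionary(texts)                # token -> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]       # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                     # TF-IDF weighting
index = similarities.MatrixSimilarity(tfidf[corpus])  # similarity index

query = dictionary.doc2bow(["senate", "amendment"])
sims = index[tfidf[query]]                            # cosine similarity to each doc
print(list(enumerate(sims)))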

It is this second step where my computer starts to struggle. I was considering running the script in an EC2 environment, but when I started reading about that on here, I saw a comment to the effect that doing so effectively requires a Linux sysadmin level of sophistication, and I am about as far from that level of sophistication as any member of this site can be.

So is there another option? Or is getting a fairly simple Python script running on EC2 not so hard?

The part of the script that seems the most resource-intensive is below. For each text file, it creates a list of fingerprints by selecting certain text files from amdt_word_bags_trim according to criteria in PossDupes_1 (both of which are lists). It uses the fingerprintgenerator module, which I found here: https://github.com/kailashbuki/fingerprint .

# Assuming the package from the linked repo exposes FingerprintGenerator
# this way; adjust the import to match how the module is installed.
from fingerprint import FingerprintGenerator

fingerprints_hold = []
counter = 0
error_count = 0
for amdt, sims in zip(amdt_word_bags_trim, PossDupes_1):
    counter += 1
    if counter % 100 == 0:
        print(counter)  # progress indicator
    if len(sims) > 1:
        # Candidate indices, excluding this document's own index.
        poss_sim = [sim for sim in sims if sim != (counter - 1)]
        fpg_orig = FingerprintGenerator(input_string=amdt)
        try:
            fpg_orig.generate_fingerprints()
            orig_prints = fpg_orig.fingerprints
        except IndexError as s:  # document too short to fingerprint
            orig_prints = ["small"]
            print(s)
            error_count += 1
            print(error_count)
        # Join each candidate's word bag back into a single string.
        cand_text = [''.join(amdt_word_bags_trim[num]) for num in poss_sim]
        fing_cands_hold = []
        for text in cand_text:
            fpg_cands = FingerprintGenerator(input_string=text)
            try:
                fpg_cands.generate_fingerprints()
                fing_cands_pre = [int(a[0]) for a in fpg_cands.fingerprints]
                fing_cands_hold.append(fing_cands_pre)
            except IndexError:  # candidate too short to fingerprint
                fing_cands_hold.append('small cand')
            except TypeError:   # no fingerprints returned
                fing_cands_hold.append("none")
        fingerprints_hold.append([orig_prints, fing_cands_hold])
    else:
        fingerprints_hold.append("no potential matches")

How about using Amazon's Elastic MapReduce (EMR)? This is Amazon's Hadoop service, which basically runs on top of EC2. You can copy your data files to Amazon S3 and have your EMR cluster pick up the data from there. You can also send your results to files in Amazon S3.

When you launch your cluster, you can customize how many EC2 instances you want to use and what size each instance should be. That way you can tailor how much CPU power you need. After you are done with your job, you can tear down the cluster so that you are not paying for it while it sits idle.

You can also do all of the above programmatically. For Python, I use the boto Amazon API, which is quite popular.
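As a rough sketch with the classic boto library (the bucket name, key names, and script paths below are placeholders, and the exact EMR call arguments may vary by boto version):

import boto
from boto.s3.key import Key
from boto.emr.step import StreamingStep

# Upload the input data to S3 (names are placeholders).
s3 = boto.connect_s3()
bucket = s3.create_bucket('my-text-analysis-bucket')
key = Key(bucket)
key.key = 'input/amdt_texts.txt'
key.set_contents_from_filename('amdt_texts.txt')

# A streaming step that runs Python mapper/reducer scripts stored in S3.
step = StreamingStep(name='Fingerprint step',
                     mapper='s3n://my-text-analysis-bucket/code/mapper.py',
                     reducer='s3n://my-text-analysis-bucket/code/reducer.py',
                     input='s3n://my-text-analysis-bucket/input/',
                     output='s3n://my-text-analysis-bucket/output/')

# Launch a small cluster; num_instances and the instance types are the
# knobs that control how much CPU power you get.
emr = boto.connect_emr()
jobflow_id = emr.run_jobflow(name='Text similarity job',
                             log_uri='s3n://my-text-analysis-bucket/logs/',
                             steps=[step],
                             num_instances=3,
                             master_instance_type='m1.small',
                             slave_instance_type='m1.small')
print(jobflow_id)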

For getting started on how to write Python MapReduce jobs, you can find several posts on the web explaining how to do it. Here's an example: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
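The basic pattern in that tutorial is a pair of scripts that read from stdin and write tab-separated key/value pairs to stdout. A minimal word-count mapper, just to show the shape of it:

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every token on stdin; Hadoop streaming
# feeds file contents in on stdin and collects stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))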

Hope this helps.
