
Creating a spam list with a web crawler in Python

Hey guys, I'm not trying to do anything malicious here, I just need to do some homework. I'm a fairly new programmer, I'm using Python 3.0, and I'm having difficulty using recursion for problem-solving. I've been stuck on this question for quite a while.

Assignment:

  1. Write a recursive method spam(url, n) that takes the URL of a web page and a non-negative integer n as input, collects all the email addresses contained in the web page and adds them to a global dictionary variable spam_dict, and then recursively calls itself on every http link contained in the web page.

  2. You will use a dictionary so only one copy of every email address is saved; your dictionary will store (key, value) pairs (email, email). The recursive call should use the parameter n-1 instead of n. If n = 0, you should collect the email addresses but no recursive calls should be made. The parameter n is used to limit the recursion to at most depth n.

You will need to use the solutions of the two above problems; your method spam() will call the methods links2() and emails() and possibly other functions as well.

Notes:

  1. Running spam() directly will produce no output on the screen; to find the collected addresses, you will need to read the value of spam_dict, and you will also need to reset it to the empty dictionary before every run of spam().
  2. Recall how global variables are used.

Usage:

>>> spam_dict = {}
>>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',0)
>>> spam_dict.keys()  
dict_keys([])  
>>> spam_dict = {}  
>>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',1)
>>> spam_dict.keys()
dict_keys(['lperkovic@cs.depaul.edu', 'nobody@xyz.com'])

So far, I've written a function that traverses a web page and puts all the links in a nice little list, and what I wanted to do was call that function. But why would I use recursion on a dictionary? And how? I don't understand how n ties into all of this.

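from urllib.request import urlopen
from urllib.parse import urljoin

def links2(url):
    # MyHTMLParser is the link-collecting parser from the earlier problem
    content = str(urlopen(url).read())
    myparser = MyHTMLParser()
    myparser.feed(content)
    lst = myparser.get()
    mergelst = []
    for link in lst:
        # resolve relative links against the page URL, not against lst[0]
        mergelst.append(urljoin(url, link))
    # return the list (instead of printing it) so spam() can use it
    return mergelst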

Any input (except why spam is bad) would be greatly appreciated. Also, I realize that the above function could probably look better; if you have a way to do it, I'm all ears. However, all I need is for the program to produce the proper output.

Added:

I wrote a function that collects emails from a page, but I'm not sure how to lump .com, .edu and .org all together.

from re import findall
def emails(url):
    links = str(links3(url))
    # how do I construct pattern?
    pattern = '[A-Za-z0-9_.]+\@[A-Za-z0-9_.]+.com\.edu\.org'

    lst = findall(pattern, links)
    print(lst)

How do I tell Python that? I can't find it in the documentation.
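One possible way to accept several endings is regex alternation: a group such as (?:com|edu|org) matches any one of the listed alternatives. The sketch below is only a guess at what emails() is meant to extract; the helper name find_emails and the sample text are made up for illustration, and the pattern is deliberately simple rather than a full email validator.

```python
from re import findall

# Hypothetical pattern: addresses ending in .com, .edu or .org.
# (?:com|edu|org) is a non-capturing group, so findall returns whole matches.
EMAIL_PATTERN = r'[A-Za-z0-9_.]+@[A-Za-z0-9_.]+\.(?:com|edu|org)\b'

def find_emails(text):
    """Return the email-like strings found in text, in order of appearance."""
    return findall(EMAIL_PATTERN, text)

sample = 'Write to lperkovic@cs.depaul.edu or nobody@xyz.com, not to me@host.net.'
print(find_emails(sample))
# ['lperkovic@cs.depaul.edu', 'nobody@xyz.com']
```

A plain (capturing) group would make findall return only the "com"/"edu"/"org" part, which is why the group starts with ?: here.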

Think about how recursion works. What you want is for your function to be able to call itself in some cases. Here, you need to add a parameter for the recursion level to your function, and then figure out what it should do in each case.

At the most basic level, what should it do with n=0? (hint: you've about got it already)

What should it do if n=1? You probably want to call your function again on each element of your existing list with n=0.

What about if n is greater than 1? You want to call your function again with n-1 on each element you've got so far.
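The three cases above can be sketched roughly as follows. This is a minimal sketch, not the assignment's reference solution: it assumes emails(url) and links2(url) each return a list, and it swaps the real pages for a made-up in-memory FAKE_WEB dictionary so that it runs without network access.

```python
# Hypothetical in-memory "web" used in place of real pages:
# page name -> (emails on the page, links on the page).
FAKE_WEB = {
    'test1': ([], ['test2', 'test3']),
    'test2': (['lperkovic@cs.depaul.edu'], ['test4']),
    'test3': (['nobody@xyz.com'], []),
    'test4': (['deep@example.org'], []),
}

def emails(url):
    """Stand-in for the asker's emails(); assumed to return a list."""
    return FAKE_WEB[url][0]

def links2(url):
    """Stand-in for the asker's links2(); assumed to return a list."""
    return FAKE_WEB[url][1]

spam_dict = {}

def spam(url, n):
    # Every call collects the addresses on the current page.
    for address in emails(url):
        spam_dict[address] = address  # (email, email) pair, duplicates collapse
    # n == 0: collect only, make no recursive calls.
    if n == 0:
        return
    # n >= 1: visit each linked page with the smaller depth n - 1.
    for link in links2(url):
        spam(link, n - 1)

spam('test1', 0)
print(sorted(spam_dict))   # []
spam_dict.clear()          # reset before the next run
spam('test1', 1)
print(sorted(spam_dict))   # ['lperkovic@cs.depaul.edu', 'nobody@xyz.com']
```

Because spam() only adds keys to the existing dictionary and never rebinds the name, it does not need a global statement here; resetting between runs is done from outside the function.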

n would play into it, as the problem states, by limiting the recursion to a maximum "call depth".

The idea is that, because each scan for emails is invoked recursively from an already-running scan, you build up a call stack that gets deeper and deeper as you keep calling the scanner recursively.

You don't want it to go on forever, so as one of the arguments you pass an integer that you decrement each time you make a call. When it reaches 0, you stop making recursive calls and let the sequence of recursions unwind itself.

call 1 (args...., n=3)
   call 2a (args...., n=2)
       call 3 (args...., n=1)
            call 4a (args..., n=0) <-- these calls won't call more scans
            call 4b (args..., n=0) <-- because n=0, so this is max depth
   call 2b (args...., n=2)
