Python re.findall hangs on certain websites

I have a Python script that loops through a list of websites/domains to scrape phone numbers and e-mail addresses from my clients' websites. For 99% of the sites the scrape works fine. Some sites, however, make the script hang, and I cannot even force-break the operation; it behaves as if it were stuck in an endless loop. An example is below. Could anyone help me improve or fix this?

import re
import requests

try:
    r = requests.Session()
    f = r.get('http://www.poffoconsultoria.com.br', verify=False, allow_redirects=False, timeout=(5, 5))
    s = f.text
    tels = set(re.findall(r"\s?\(?0?[1-9][1-9]\)?[-\.\s][2-5]\d{3}\.?-?\s?\d{4}", s))
    emails = set(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s))
    print(tels)
    print(emails)
except Exception as e:
    print(e)

You should remove the \s? from the first regex (you do not really need a whitespace at the start of the match), or replace it with (?<!\S) if you only want to match after whitespace or at the start of the string.
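A minimal sketch of that change to the phone pattern, assuming the rest of the expression stays exactly as in the question. The sample text and the names `PHONE` and `sample` are made up for illustration.

```python
import re

# Phone pattern from the question with the leading \s? replaced by (?<!\S),
# so a match may only start after whitespace or at the start of the string.
PHONE = re.compile(r"(?<!\S)\(?0?[1-9][1-9]\)?[-\.\s][2-5]\d{3}\.?-?\s?\d{4}")

# Hypothetical sample text, not taken from the real site.
sample = "Fone: (47) 3333-4444 ou 047 3222.1111"

tels = set(PHONE.findall(sample))
print(tels)

# A number glued to a preceding non-space character is rejected by the lookbehind.
glued = PHONE.findall("x47 3333-4444")
print(glued)
```

Because the lookbehind consumes no characters, the matches no longer carry a leading space the way they do with `\s?`.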

The real problem is with the second regex, where . sits in a character class that is quantified with +. The \. that follows it also matches a ., and that becomes a problem when no matching text appears in the string. This is catastrophic backtracking.
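A rough way to watch the slowdown is to time the original e-mail pattern on input that contains plenty of allowed characters but no match. This is only a sketch: the "a." run is an invented worst case, the helper `timed_findall` is hypothetical, and the timings depend on the machine and the Python version.

```python
import re
import time

# The e-mail pattern from the question, unchanged.
OLD_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

def timed_findall(pattern, text):
    """Return (matches, elapsed_seconds) for pattern.findall(text)."""
    start = time.perf_counter()
    matches = pattern.findall(text)
    return matches, time.perf_counter() - start

# Invented input: a long run of allowed characters with no '@' anywhere.
results = {}
for n in (1000, 2000, 4000):
    matches, elapsed = timed_findall(OLD_EMAIL, "a." * n)
    results[n] = matches
    print(f"{2 * n:>6} chars: {len(matches)} matches in {elapsed:.4f}s")
```

If the elapsed time grows much faster than the input length, the pattern is retrying the same stretch of text from many start positions.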

Since the matches you expect are whole words, I suggest enhancing the pattern by 1) adding word boundaries and 2) making all adjoining subpatterns match different types of characters.

Use

r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b'

to match emails.

See the (?:[A-Za-z0-9-]+\.)+ part: it matches one or more repetitions of one or more alphanumeric/hyphen characters followed by a dot. No \. follows this pattern; an alpha character class does, so the earlier problem should not occur.
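As a quick check, the suggested pattern can be tried on a short string. The addresses below are invented, and `EMAIL` and `sample` are illustrative names.

```python
import re

# The e-mail pattern suggested in this answer.
EMAIL = re.compile(r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b')

# Hypothetical sample text with two addresses, the second followed by a full stop.
sample = "Contato: vendas@exemplo.com.br, suporte@mail.exemplo.com."

# The pattern has only a non-capturing group, so findall returns whole matches.
emails = set(EMAIL.findall(sample))
print(emails)
```

Note that the trailing sentence dot is left out of the second match, because the pattern must end in two to four letters at a word boundary.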

So, I fetched the website data without trouble in Python 2.7:

>>> string = requests.get('http://www.poffoconsultoria.com.br').text

I then took the length of the string:

>>> len(string)
474038

That's a really high value.

So for problems like this, when a regex takes such a long time (especially once you have seen the length of the page), you should visit the page in your browser and inspect the page source.

When I inspected the page in my browser, I found these:

[Two screenshots from the page inspection; images not available]

The second regex, which starts with [A-Za-z0-9._%+-]+, will definitely hang (really, take a long time) because the + has no upper bound and the engine has to search through those ginormous portions.
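One way to locate such portions without reading the source by eye is to list the longest runs of characters that the first character class accepts. This is a sketch only: `longest_runs` is a hypothetical helper, and `page` is a made-up stand-in for the downloaded text.

```python
import re

# Same character class as the first part of the e-mail pattern.
LOCAL_PART_RUN = re.compile(r"[A-Za-z0-9._%+-]+")

def longest_runs(text, top=3):
    """Return the `top` longest runs as (length, start_offset) pairs."""
    runs = [(m.end() - m.start(), m.start()) for m in LOCAL_PART_RUN.finditer(text)]
    return sorted(runs, reverse=True)[:top]

# Made-up page standing in for the real response body.
page = "<p>info@example.com</p><script>var blob='" + "Q" * 5000 + "';</script>"

for length, offset in longest_runs(page):
    # Show the size, position and first characters of each long run.
    print(length, offset, repr(page[offset:offset + 40]))
```

Runs that are thousands of characters long and contain no @ are the places where the original pattern spends its time.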

You either need to chunk the page or limit your regex. Or, if you suspect that what you need won't appear inside them, you could write a function that discards the dictionary data. Basically, though, those huge dictionaries above are what cause the regex you posted to take a long time.
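A sketch of two of these options combined, under stated assumptions: that the bulky data sits inside script or style elements (the thread does not confirm this), and that local parts longer than 64 characters and domain labels longer than 63 can be ignored. The names `BULKY_BLOCKS`, `BOUNDED_EMAIL` and `extract_emails` are hypothetical.

```python
import re

# Assumption: the huge data blobs live inside <script> or <style> elements.
BULKY_BLOCKS = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                          re.DOTALL | re.IGNORECASE)

# Bounded e-mail pattern: the {1,64} and {1,63} limits are illustrative choices
# that cap how far the engine scans from any single start position.
BOUNDED_EMAIL = re.compile(
    r"\b[A-Za-z0-9._%+-]{1,64}@(?:[A-Za-z0-9-]{1,63}\.)+[A-Za-z]{2,4}\b"
)

def extract_emails(html):
    """Drop script/style blocks, then collect e-mail addresses from the rest."""
    remaining = BULKY_BLOCKS.sub(" ", html)
    return set(BOUNDED_EMAIL.findall(remaining))

# Made-up page: one large inline data blob plus one address in the body.
sample = (
    '<html><script>var data = {"k": "' + "A" * 50000 + '"};</script>'
    "<body>Contato: vendas@exemplo.com.br</body></html>"
)
emails = extract_emails(sample)
print(emails)
```

The trade-off is that any address that appears only inside a discarded block is lost, so this fits only when the contact details are expected in the visible markup.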

Use a regex for valid e-mail addresses:

(?i)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#$%&'*+/=?^`{}|~\w])*)?[0-9a-z]@))(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][-a-z0-9]{0,22}[a-z0-9]))
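A sketch of how this pattern could be applied in Python. It contains capturing groups, so re.findall would return tuples of groups; iterating with finditer and taking group(0) yields the whole match instead. The sample text and the name `VALID_EMAIL` are invented, and how this pattern performs on very large pages has not been measured here.

```python
import re

# The pattern from this answer, copied verbatim into a raw triple-quoted
# string so its single and double quotes need no extra escaping.
VALID_EMAIL = re.compile(
    r'''(?i)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#$%&'*+/=?^`{}|~\w])*)?[0-9a-z]@))(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][-a-z0-9]{0,22}[a-z0-9]))'''
)

# Hypothetical sample text.
sample = "Write to john.doe@example.com today."

# group(0) is the full match, regardless of which capture groups took part.
found = [m.group(0) for m in VALID_EMAIL.finditer(sample)]
print(found)
```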
