
Python: Memory Error while using large strings

Basically, I am building a web search engine, so I wrote a crawler to fetch web pages.

When read in, the web pages are in HTML format, so all the tags are there. I need to extract keywords from the body and title, so I'm trying to remove all the tags (anything between '<' and '>').

The code below works well for small HTML pages, but when I try to use it on a large scale (i.e. starting from http://www.google.com), I run out of memory.

0 def remove_tags(self, s):
1     while '<' in s:
2         start = s.index('<')
3         end = s.index('>')
4         s = s[:start] + " " + s[end+1:]
5     return s.split()

The memory error occurs at line 4. How do I fix my code so that taking the substrings of s doesn't consume excessive memory?

Your general approach is wrong. Firstly, use a real XML/HTML parser, such as BeautifulSoup, which is forgiving when it comes to bad HTML. Your approach of scanning for < and > won't survive for long.
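A minimal sketch of that idea, assuming the bs4 package is installed; the function name and separator argument are illustrative, not part of the question's code:

def extract_words(html):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    # get_text() returns the page text with every tag stripped,
    # even when the markup is malformed.
    text = soup.get_text(separator=" ")
    return text.split()

The title and body can be reached the same way (soup.title, soup.body) if you want to weight keywords from them differently.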

Secondly, you've read the whole thing into memory and are manipulating it there. That is memory-consuming, and some of the operations you're doing (such as slicing and concatenating the string) create fresh copies, which doesn't help either. Instead, iterate over the input stream and process the data as it arrives. Think of remove_tags as a filter on the input rather than a text-processing function.
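A rough sketch of that streaming idea using only the standard library; the class name, chunk size, and URL here are arbitrary choices for illustration:

from html.parser import HTMLParser
from urllib.request import urlopen

class TextFilter(HTMLParser):
    """Collects plain text while the parser streams through the markup."""
    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Called only for text between tags, so markup never reaches us.
        self.words.extend(data.split())

parser = TextFilter()
with urlopen("http://www.google.com") as response:
    # Feed the page in fixed-size chunks instead of holding it all in memory.
    for chunk in iter(lambda: response.read(8192), b""):
        parser.feed(chunk.decode("utf-8", errors="ignore"))
parser.close()
print(parser.words[:20])

The point is that at no time does the whole page, let alone several intermediate copies of it, have to live in memory at once.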
