
Python: Memory Error while using large strings

Basically, I am building a web search engine, so I wrote a crawler to fetch web pages.

When read in, the web pages are in HTML format, so all the tags are there. I need to extract keywords from the body and title, so I'm trying to remove all the tags (anything between '<' and '>').

The code below works well for small HTML pages, but when I try to use it on a large scale (i.e. starting from http://www.google.com), I run out of memory.

0 def remove_tags(self, s):
1     while '<' in s:
2         start = s.index('<')
3         end = s.index('>')
4         s = s[:start] + " " + s[end+1:]
5     return s.split()

The memory error occurs at line 4. How do I fix my code so that taking the substrings of s doesn't consume excessive memory?

Your general approach is wrong. Firstly, use a real XML/HTML parser, such as BeautifulSoup, which is forgiving when it comes to bad HTML. Your approach of scanning for < and > won't survive for long.
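A minimal sketch of that idea, assuming the bs4 package is installed; the function name and separator argument are illustrative, not part of the question's code:

def extract_words(html):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    # get_text() returns the page text with every tag stripped,
    # even when the markup is malformed.
    text = soup.get_text(separator=" ")
    return text.split()

The title and body can be reached the same way (soup.title, soup.body) if you want to weight keywords from them differently.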

Secondly, you've read the whole thing into memory and are manipulating it there. That is memory-consuming, and some of the operations you're doing (such as slicing and concatenating the string) create fresh copies, which doesn't help either. Instead, iterate over the input stream and process the data as it arrives. Think of remove_tags as a filter on the input rather than a text-processing function.
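A rough sketch of that streaming idea using only the standard library; the class name, chunk size, and URL here are arbitrary choices for illustration:

from html.parser import HTMLParser
from urllib.request import urlopen

class TextFilter(HTMLParser):
    """Collects plain text while the parser streams through the markup."""
    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Called only for text between tags, so markup never reaches us.
        self.words.extend(data.split())

parser = TextFilter()
with urlopen("http://www.google.com") as response:
    # Feed the page in fixed-size chunks instead of holding it all in memory.
    for chunk in iter(lambda: response.read(8192), b""):
        parser.feed(chunk.decode("utf-8", errors="ignore"))
parser.close()
print(parser.words[:20])

The point is that at no time does the whole page, let alone several intermediate copies of it, have to live in memory at once.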
