Cleaning and stripping of strings/HTML - Python

Question

I have a few questions that I don't have answers to.

1) Stripping a list of strings

input:
'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '

output:
['item1', 'item2', 'item3', 'item4', 'item5']

Anything more efficient than doing the following?

[x.strip() for x in l.split(',') if x.strip()]

2) Cleaning/Sanitizing HTML

Keeping basic tags, e.g. strong, p, br, ...

Removing malicious JavaScript, CSS and divs.

3) Unicode handling...

What would you recommend for dealing with Unicode parsed from documents?

Any ideas? :) Thanks guys!

Answer 1

For the first one you can use split then a list comprehension to trim the extra whitespace:

result = [x.strip() for x in i.split(',')]

And to remove the empty strings from the list:

result = [x for x in result if x]
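The two steps above can also be folded into one pass that calls strip only once per piece. This is a minimal sketch, not a benchmarked improvement; the function name split_and_strip and its sep parameter are made up for illustration:

```python
def split_and_strip(text, sep=","):
    """Split text on sep, strip each piece, and drop pieces that end up empty."""
    # map() strips every piece exactly once; filter(None, ...) then discards
    # the empty strings left behind by whitespace-only pieces.
    return list(filter(None, map(str.strip, text.split(sep))))


# The input string from the question.
sample = 'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '
print(split_and_strip(sample))
# ['item1', 'item2', 'item3', 'item4', 'item5']
```

Whether this is measurably faster than the list comprehension depends on the input, so it is worth timing on real data before switching.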

Answer 2

To clean HTML, use lxml.html:

import lxml.html
doc = lxml.html.fromstring("...")
doc.text_content()  # returns the text with all markup removed, so no tags are kept
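Since text_content() removes every tag, it does not cover the "keep strong, p, br" part of the question. As a rough illustration of the whitelist idea, here is a sketch that uses only the standard library's html.parser instead of lxml. The class name, the function name and the ALLOWED_TAGS set are assumptions made for this example, and the code is not a vetted security tool; for untrusted input, a maintained and audited sanitizing library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"strong", "em", "p", "br"}  # assumed whitelist, adjust as needed
DROP_CONTENT_TAGS = {"script", "style"}     # the text inside these is discarded too


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Attributes are dropped entirely, so onclick= and style= never survive.
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and tag != "br" and not self.skip_depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Escape text so stray < or & cannot form new markup.
            self.parts.append(escape(data))


def sanitize(html_text):
    """Return html_text with everything outside the whitelist removed."""
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


dirty = '<div><p onclick="x()">Hi <strong>there</strong></p><script>alert(1)</script><br/></div>'
print(sanitize(dirty))
# <p>Hi <strong>there</strong></p><br>
```

Tags that are not whitelisted (such as div) are dropped while their text is kept; script and style are dropped together with their contents.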

Answer 3

I am somewhat of a beginner at Python web development, but for cleaning/sanitizing HTML I have found that the markdown2 library has some very nice features. You can use it with the MarkItUp! jQuery-based editor. They may not solve all your problems, but they might help you get a lot done in a short time.

Answer 4

1) You can use the strip method.

2) You can use sanitize: http://wonko.com/post/sanitize

3) Some Unicode tips here: http://blog.trydionel.com/2010/03/23/some-unicode-tips-for-ruby/
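For the Unicode part in Python specifically, one common pattern is to decode bytes to str as early as possible and to normalize the result, so that visually identical strings compare equal. The sketch below assumes Python 3 and UTF-8 input; the helper name to_text is hypothetical.

```python
import unicodedata


def to_text(raw, encoding="utf-8"):
    """Decode bytes to str (if needed) and normalize the result to NFC."""
    if isinstance(raw, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        raw = raw.decode(encoding, errors="replace")
    return unicodedata.normalize("NFC", raw)


composed = "caf\u00e9"      # é stored as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(composed == decomposed)                    # False
print(to_text(composed) == to_text(decomposed))  # True
```

If the source encoding is unknown, it has to be detected or supplied by the caller; guessing UTF-8 is only an assumption of this example.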

Answer 5

1) [j.strip() for j in a.split(',') if j.strip()]

2) Check out tidy.

Answer 6

I tend to write multiple cascading generators, particularly if I want some output to be part of a test:

stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)

The inspiration is Beazley's presentation on coroutines.
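One thing to keep in mind with the generator version is that nothing is stripped or filtered until the pipeline is consumed, and it can be consumed only once. A small sketch of that behaviour, wrapping the two generators in a function whose name (non_empty_items) is made up for this example:

```python
def non_empty_items(text, sep=","):
    """Return a lazy iterator over the stripped, non-empty pieces of text."""
    # split() itself is eager; only the stripping and filtering are lazy.
    stripped_iter = (x.strip() for x in text.split(sep))
    non_empty_iter = (x for x in stripped_iter if x)
    return non_empty_iter


sample = 'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '
items = non_empty_items(sample)

print(next(items))  # item1
print(list(items))  # ['item2', 'item3', 'item4', 'item5']
print(list(items))  # [] because the generator is already exhausted
```

In a test, each intermediate stage can be wrapped in list() and compared against an expected value.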

6 answers

solution1
2 2010-10-28 21:38:26

solution2
2 2010-10-28 21:39:36

solution3
1 2010-10-28 21:40:26

solution4
1 2010-10-28 21:41:02

solution5
1 2010-10-28 21:47:14

solution6
1 ACCEPTED 2010-10-29 03:48:49
