Python duplicates in web crawler

Question

I am trying to build a web crawler to get specific values from a page. These values may be updated and I don't want to get the previous value in my output.

Here is a simplified example of my problem:

html_example=''' 
<value> this is the updated value 
Keyword "previous" that tell me I don't want the next value. 
<valueIdontwant> this is the previous value
<value> this value has not been updated
<value> this is the updated value 
Keyword "previous" that tell me I don't want the next value. 
<valueIdontwant> this is the previous value
<value> this value has not been updated 
'''

The code that I am using (based on Professor Dave's MOOC):

def get_values(content):
    values=[]
    while True:
        start_value=content.find('<')
        end_value=content.find('>', start_value+1)
        value=content[start_value+1:end_value]
        if value:
            values.append(value)
            content=content[end_value:]
        else:
            break
    return values

get_values(html_example)

The output that I get:

['value', 'valueIdontwant', 'value', 'value', 'valueIdontwant', 'value']

The output that I would like to get:

['value', 'value', 'value', 'value']

The only way to identify the value that I want to leave out is the keyword "previous", not the values themselves, which all vary (a "for value in values" kind of code will not work in my case).

I am fairly new to programming and I am really bad at it. I tried different "if" statements, but they did not work out. Thank you in advance if you have any idea how to solve this issue!

Answer 1

The code is convoluted and not very Pythonic, but look at enumerate() if you want indexed access on a list.
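
For readers who have not used enumerate() before, here is a minimal sketch of what it provides: (index, item) pairs, so a loop can reach the neighbouring line by index. The sample list is made up for illustration and is not part of the question's data.

```python
# Hypothetical sample lines, only for illustration.
lines = ['<value> a', 'Keyword "previous"', '<valueIdontwant> b']

# enumerate() yields (index, item) pairs, so lines[i + 1] is the line
# that follows the current one (None is used here for the last line).
pairs = []
for i, line in enumerate(lines):
    next_line = lines[i + 1] if i < len(lines) - 1 else None
    pairs.append((i, line, next_line))

print(pairs[1])
```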

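import re

def get_values_ignore_current_line(content, keyword):
    # Drop every line that contains the keyword, then collect the remaining tags.
    content = '\n'.join([x for x in content.splitlines() if keyword not in x])
    tags = re.findall('<.*?>', content)
    return tags

def get_values_ignore_next_line(content, keyword):
    lines = content.splitlines()
    new_content = [lines[0]]
    for i, line in enumerate(lines):
        # Keep the next line unless the current line contains the keyword
        # and does not itself start with a tag.
        if (keyword not in line) or (re.match('<.*?>', line) is not None):
            if i < len(lines) - 1:
                new_content.append(lines[i+1])
    new_content = '\n'.join(new_content)
    # findall returns the tags with their angle brackets, e.g. '<value>'.
    return re.findall('<.*?>', new_content)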
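
The functions above return the tags with their angle brackets, while the question asks for the bare names. As a possible variant (not the answer's method), here is a minimal sketch that uses a skip flag instead of indexing. It assumes one tag per line, placed at the start of the line, and the function name get_values_skip_after_keyword is hypothetical.

```python
import re

def get_values_skip_after_keyword(content, keyword):
    """Return tag names, skipping the first tagged line after a keyword line."""
    values = []
    skip_next = False
    for line in content.splitlines():
        # Assumption: a tag, if present, is the first thing on the line.
        match = re.match(r'\s*<(.*?)>', line)
        if match is None:
            # Untagged line: a keyword here marks the next tagged line as unwanted.
            if keyword in line:
                skip_next = True
            continue
        if skip_next:
            skip_next = False  # drop this tag and reset the flag
            continue
        values.append(match.group(1))
    return values

html_example = '''
<value> this is the updated value
Keyword "previous" that tell me I don't want the next value.
<valueIdontwant> this is the previous value
<value> this value has not been updated
<value> this is the updated value
Keyword "previous" that tell me I don't want the next value.
<valueIdontwant> this is the previous value
<value> this value has not been updated
'''

print(get_values_skip_after_keyword(html_example, 'previous'))
# ['value', 'value', 'value', 'value']
```

Checking the keyword only on untagged lines avoids a false trigger from the unwanted line itself, whose text also contains the word "previous".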

1 answer

solution1
0 ACCEPTED 2014-04-25 10:09:29
