
Python duplicates in web crawler

I am trying to build a web crawler to get specific values from a page. These values may be updated and I don't want to get the previous value in my output.

Here is a simplified example of my problem:

html_example=''' 
<value> this is the updated value 
Keyword "previous" that tell me I don't want the next value. 
<valueIdontwant> this is the previous value
<value> this value has not been updated
<value> this is the updated value 
Keyword "previous" that tell me I don't want the next value. 
<valueIdontwant> this is the previous value
<value> this value has not been updated 
'''

The code that I am using (based on Professor Dave's MOOC):

def get_values(content):
    values=[]
    while True:
        start_value=content.find('<')
        end_value=content.find('>', start_value+1)
        value=content[start_value+1:end_value]
        if value:
            values.append(value)
            content=content[end_value:]
        else:
            break
    return values

get_values(html_example)

The output that I get:

['value', 'valueIdontwant', 'value', 'value', 'valueIdontwant', 'value']

The output that I would like to get:

['value', 'value', 'value', 'value']

The only way to identify the value that I want to leave out is the keyword "previous", not the values themselves, which all vary (a "for value in values" kind of code will not work in my case).

I am fairly new to programming and not very good at it yet. I tried different "if" statements, but they did not work out. Thank you in advance if you have any idea how to solve this issue!

The code below is convoluted and not very Pythonic, but look into enumerate() if you want indexed access to a list.

import re

def get_values_ignore_current_line(content, keyword):
    # Drop every line that contains the keyword, then collect the tags.
    content = '\n'.join(x for x in content.splitlines() if keyword not in x)
    return re.findall('<.*?>', content)

def get_values_ignore_next_line(content, keyword):
    # Keep a line only if the line before it is free of the keyword
    # or starts with a tag; the results include the angle brackets.
    lines = content.splitlines()
    new_content = lines[:1]  # the first line has no predecessor, so keep it
    for i, line in enumerate(lines):
        if (keyword not in line) or (re.match('<.*?>', line) is not None):
            if i < len(lines) - 1:
                new_content.append(lines[i+1])
    new_content = '\n'.join(new_content)
    return re.findall('<.*?>', new_content)
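
For comparison, here is a minimal sketch of the same idea written with a flag instead of index arithmetic. It is only an illustration, not the answer's original code. The name get_values_skip_after_keyword is made up for this example. It assumes at most one tag per line and that the keyword sits on a line of its own without a tag, as in the question's sample. It returns the tag names without the angle brackets, which matches the output the question asks for.

```python
import re

TAG = re.compile(r'<(.*?)>')  # capture the tag name without the angle brackets


def get_values_skip_after_keyword(content, keyword="previous"):
    """Collect tag names, dropping the first tag that follows a keyword line.

    Hypothetical helper; assumes one tag per line and tag-free keyword lines.
    """
    values = []
    skip_next = False
    for line in content.splitlines():
        match = TAG.search(line)
        if match is None:
            # A line without a tag: remember whether it carries the keyword.
            if keyword in line:
                skip_next = True
            continue
        if skip_next:
            skip_next = False  # drop this tag, whatever its name or text
            continue
        values.append(match.group(1))
    return values


html_example = '''
<value> this is the updated value
Keyword "previous" that tell me I don't want the next value.
<valueIdontwant> this is the previous value
<value> this value has not been updated
'''

print(get_values_skip_after_keyword(html_example))  # ['value', 'value']
```

Because the decision depends only on the keyword line, the skipped tag does not need to contain the word "previous" itself, which matters if the unwanted values vary.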
