How to extract data from a web page using Python

Question

Can someone point out what I am doing wrong?

Enter Item name:Rockfish
Traceback (most recent call last):
  File "C:\\Users\\partn_000\\Desktop\\sarvesh\\Python Source Code\\working\\jellyneoscraper.py", line 45, in <module>
    search(br, ITEMNAME)
  File "C:\\Users\\partn_000\\Desktop\\sarvesh\\Python Source Code\\working\\jellyneoscraper.py", line 33, in search
    increment = increment[0]
IndexError: list index out of range

This is the code I wrote:

#Library Imports
import mechanize
import cookielib
import re
import sys
import time
import os.path
from operator import itemgetter
import ctypes
ctypes.windll.kernel32.SetConsoleTitleA("test")


def init_browser():
    br = mechanize.Browser()
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')]
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    return br


def search(br, ITEMNAME):
    datapage = br.open('http://items.jellyneo.net/index.php?go=show_items&name=' +ITEMNAME +'&name_type=exact&desc=&cat=0&specialcat=0&status=0&rarity=0&sortby=name&numitems=20')
    f = open('search.html', 'w')
    f.write(datapage.read())
    f.close()
    value = re.findall('style="font-weight:bold;">(.+) NP</a></td>"',datapage.read())  #(.+) is replaced in place of required value
    value = value[0].replace(",","")
    value = int(value)
    print value
#http://items.jellyneo.net/index.php?go=show_items&name=Rockfish&name_type=exact&desc=&cat=0&specialcat=0&status=0&rarity=0&sortby=name&numitems=20


#('style="font-weight:bold;"> (.+) NP</a>"',search.read())


ITEMNAME = raw_input('Enter Item name:eg. Rockfish')

br = init_browser()
search(br, ITEMNAME)

Answer 1

In your search function you read the entire page and save it to a file, then you call datapage.read() a second time to run your regex. By then you are already at the end of the response, so the second read returns an empty string, re.findall finds nothing, and indexing the first element of the empty list raises the IndexError. You should add datapage.seek(0) before reading it again, like this:

datapage = br.open('http://items.jellyneo.net/index.php?go=show_items&name=' +ITEMNAME +'&name_type=exact&desc=&cat=0&specialcat=0&status=0&rarity=0&sortby=name&numitems=20')
f = open('search.html', 'w')
f.write(datapage.read())
f.close()
datapage.seek(0)  # rewind the response so the next read() returns the page again
value = re.findall('style="font-weight:bold;">(.+) NP</a></td>"',datapage.read())  #(.+) is replaced in place of required value
value = value[0].replace(",","")
value = int(value)
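
An alternative to rewinding is to read the response once into a variable and reuse it for both the file and the regex. Below is a minimal sketch of that idea, not the asker's actual script: io.BytesIO stands in for the mechanize response, SAMPLE_HTML is made-up markup (the real page may look different), and extract_price is a hypothetical helper name. It also drops the trailing double quote from the pattern on the assumption that it is a typo, and returns None instead of raising when nothing matches.

```python
import io
import re

# Hypothetical stand-in for the page body; the real markup may differ.
SAMPLE_HTML = b'<td><a href="#" style="font-weight:bold;">1,250 NP</a></td>'

# Assumed pattern: digits and commas followed by " NP" inside the bold link.
PRICE_PATTERN = re.compile(rb'style="font-weight:bold;">([\d,]+) NP</a>')


def extract_price(html):
    """Return the first price found in html as an int, or None if none is found."""
    matches = PRICE_PATTERN.findall(html)
    if not matches:
        return None  # avoids IndexError on an empty result list
    return int(matches[0].decode("ascii").replace(",", ""))


# io.BytesIO behaves like a seekable, file-like response object.
response = io.BytesIO(SAMPLE_HTML)

first_read = response.read()    # consumes the whole stream
second_read = response.read()   # stream is at the end, so this is empty

response.seek(0)                # rewind, as suggested in the answer
after_seek = response.read()    # full content again

# Reading once and reusing the bytes avoids the problem entirely.
price = extract_price(first_read)
print(price)
```

In the original script this would mean storing datapage.read() in a variable, writing that variable to search.html, and passing the same variable to re.findall.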

Answer 1: accepted, 2015-09-06 18:28:20