Parsing a .js page with Python

Question

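I have a web page, http://timetable.ait.ie/js/filter.js, that I really need to parse. For the past few days I have been using BeautifulSoup to parse HTML pages and I have got the hang of that, but this .js file is killing me.

At the moment I am using the following code: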

import urllib
page = urllib.urlopen("http://timetable.ait.ie/js/filter.js")
pageInfo = page.read()

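It returns a string containing the entire file, all 18283 lines of code. In that code I am trying to get at the staff names near the bottom, which are stored in an array: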

staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";

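I need the values in [0] and [1], which I will then use to build a database for later use.

I have tried using regular expressions to find staffarray, but I am getting very frustrated trying to extract this information. Can anyone help me?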

Answer 1

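If you have trouble with regular expressions, use standard string functions and slicing.

First split the code into lines, then look for lines that contain staffarray[ together with [0] or [1]. Finally, use slicing to cut out the quoted value.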

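import urllib.request

# Python 3: urlopen() returns bytes, so decode before splitting into lines.
# The encoding is assumed to be UTF-8; undecodable bytes are replaced.
req = urllib.request.urlopen("http://timetable.ait.ie/js/filter.js")
lines = req.read().decode('utf-8', errors='replace').splitlines()

for x in lines:
    if 'staffarray[' in x:
        # The value sits between the first and the last double quote,
        # which is safer than a fixed end offset such as -3.
        start = x.find('"') + 1
        end = x.rfind('"')
        if '[0] = ' in x:
            print('0', x[start:end])
        elif '[1] = ' in x:
            print('1', x[start:end])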
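To try the slicing idea without downloading anything, the same steps can be wrapped in a small function and run on the three sample lines from the question. This is only a sketch: the name extract_staff_fields is made up for illustration, and it assumes every assignment sits on its own line in exactly the format shown in the question.

```python
def extract_staff_fields(js_text):
    """Return (field_index, value) pairs for staffarray[n][0] and [n][1] lines."""
    results = []
    for line in js_text.splitlines():
        if 'staffarray[' not in line:
            continue
        for field in ('[0] = ', '[1] = '):
            if field in line:
                # Slice between the first and the last double quote.
                start = line.find('"') + 1
                end = line.rfind('"')
                results.append((int(field[1]), line[start:end]))
                break
    return results


# Sample lines copied from the question.
sample = '''staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";'''

print(extract_staff_fields(sample))
# [(0, 'BRADY, DAMIEN'), (1, 'SCI')]
```

Because the end of the slice is found with rfind, the result does not depend on whether the file uses \n or \r\n line endings.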

Answer 2

You can write a regexp pattern with capture groups:

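import re

# Compile once; the groups capture both indices and the quoted value.
pattern = re.compile(
    r'staffarray\[(?P<first_index>\d+)\]\s*\[(?P<second_index>\d+)\]\s*=\s*"(?P<value>[^"]*)"'
)

# Assumes filter.js has already been saved locally.
with open('filter.js') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            first_index, second_index, value = match.groups()
            # do something with data (the asker needs second_index '0' and '1')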
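Since the question mentions building a database from the [0] and [1] values, here is one possible way to carry the regexp approach through to storage. It is a sketch under assumptions: the helper names parse_staff and store_staff, the table name staff and the column names name and code are all invented for illustration (the question does not say what the [1] value means), and an in-memory SQLite database stands in for whatever database is actually used.

```python
import re
import sqlite3

# Captures the staff index, the field index and the quoted value.
PATTERN = re.compile(r'staffarray\[(\d+)\]\s*\[(\d+)\]\s*=\s*"([^"]*)"')


def parse_staff(js_text):
    """Map each staff index to {field_index: value}, keeping fields 0 and 1."""
    staff = {}
    for match in PATTERN.finditer(js_text):
        row, field, value = int(match.group(1)), int(match.group(2)), match.group(3)
        if field in (0, 1):
            staff.setdefault(row, {})[field] = value
    return staff


def store_staff(staff, conn):
    """Write one row per staff index; the column names are hypothetical."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS staff '
        '(id INTEGER PRIMARY KEY, name TEXT, code TEXT)'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO staff VALUES (?, ?, ?)',
        [(row, fields.get(0), fields.get(1)) for row, fields in sorted(staff.items())],
    )
    conn.commit()


# Sample lines copied from the question.
sample = '''staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";'''

conn = sqlite3.connect(':memory:')
store_staff(parse_staff(sample), conn)
rows = conn.execute('SELECT id, name, code FROM staff').fetchall()
print(rows)
# [(373, 'BRADY, DAMIEN', 'SCI')]
```

Grouping by the first index before inserting keeps the [0] and [1] values of one staff member together in a single row, even though they appear on separate lines of the file.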
