Python bs4+lxml: parsing a table

Question

I want to parse the table at this URL: http://portal.ksada.org:8090/time-table/student?id=5598. What I need in the end is some kind of data structure. For example, this is what I am trying to get:

class Schedule():
   date='02.02.2022' # headdate class in html
   day='Ср' # headday class in html
   lessons=[['1 пара #span lesson', '09:00-10:35', 'КомпКн[Пз]', 'ауд. 304', 'Чайка Л.Е.'],
            [...],] # div with class lessons-1 or lessons-2
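
For illustration, the same structure could also be written with dataclasses, storing the date as a real `date` object so that filtering by week or month becomes a simple comparison. This is only a minimal sketch: the names `Lesson`, `parse_date`, `in_same_week` and `in_same_month` are hypothetical and not part of the original code, and it assumes the `headdate` text looks like '02.02.2022'.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class Lesson:
    slot: str      # e.g. '1 пара'
    time: str      # e.g. '09:00-10:35'
    subject: str
    room: str
    teacher: str


@dataclass
class Schedule:
    day_date: date                               # parsed from the headdate cell
    day: str                                     # text of the headday cell, e.g. 'Ср'
    lessons: list = field(default_factory=list)  # Lesson objects for that day


def parse_date(text):
    """Turn a 'dd.mm.yyyy' string into a date object."""
    return datetime.strptime(text, '%d.%m.%Y').date()


def in_same_week(schedules, ref):
    """Return the schedules that share an ISO year and week with `ref`."""
    key = tuple(ref.isocalendar())[:2]
    return [s for s in schedules if tuple(s.day_date.isocalendar())[:2] == key]


def in_same_month(schedules, ref):
    """Return the schedules that fall in the same calendar month as `ref`."""
    return [s for s in schedules
            if (s.day_date.year, s.day_date.month) == (ref.year, ref.month)]
```

Getting a single day is then a plain equality check on `day_date`, and a week or a month is one call to the matching helper.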

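With this I would know exactly how many lessons there are on a given day. Maybe it is not the best solution, and maybe that is why I am stuck. In general, what I want is to structure all of this so that I can get the lessons for a day, a week, or a month. I have tried a lot of solutions and got stuck. What I have now is this code: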

import requests
from bs4 import BeautifulSoup

url = 'http://portal.ksada.org:8090/time-table/student?id='
id = 5598

def get_data(url, id):
    # id is an int, so convert it before appending it to the URL
    page = requests.get(url + str(id))
    soup = BeautifulSoup(page.text, 'lxml')
    table = soup.select_one('table')
    items = []
    for tr in table.select('tr'):
        th_list = tr.select('th')
        td_list = tr.select('td')

        for th in th_list:
            print(th.text)
            for td in td_list:
                print(td.text.strip().replace('&nbsp', ''))

I also tried to find the "distance" between days, like this:

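def get_data(url, id):
    # id is an int, so convert it before appending it to the URL
    page = requests.get(url + str(id))
    soup = BeautifulSoup(page.text, "html.parser")

    table = soup.find('table')
    tbody = table.find_all('tr')

    days = []  # indices of the rows at which a new day starts
    for i, t in enumerate(tbody):
        if t.find('th', class_='headday'):
            days.append(i)
    return tbody, days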

and use it like this:

for i, d in enumerate(days[:-1]):
    for t in tbody[days[i]:days[i+1]]:
        ...  # t is one row belonging to day i

I just don't know how to do this in a clean way.
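
For reference, the index-slicing idea above might be completed roughly as follows. Note that iterating over `days[:-1]` never reaches the rows of the last day, so this sketch adds the total row count as a final boundary. `split_by_day` and `is_day_start` are hypothetical names, and the rows here are plain tuples rather than real `tr` tags, so the example runs without the page.

```python
def split_by_day(rows, is_day_start):
    """Group consecutive rows into one list per day.

    `rows` is any sequence of rows; `is_day_start(row)` returns True for a
    row that opens a new day. Rows before the first such row are ignored.
    """
    starts = [i for i, row in enumerate(rows) if is_day_start(row)]
    # Add len(rows) as the last boundary so the final day is not dropped.
    bounds = starts + [len(rows)]
    return [rows[a:b] for a, b in zip(bounds, bounds[1:])]


# Invented sample rows: a non-empty first field marks the start of a day.
rows = [('Пн', 'a'), ('', 'b'), ('Вт', 'c'), ('', 'd'), ('', 'e')]
per_day = split_by_day(rows, lambda row: row[0] != '')
```

With BeautifulSoup the same helper could presumably be called as `split_by_day(tbody, lambda tr: tr.find('th', class_='headday') is not None)`, assuming every day really does begin with a `th` of class `headday` as the code above suggests.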

Answer 1

I hope this helps you get to the final solution.

# Import the library - pandas
import pandas as pd

table_list = pd.read_html('http://portal.ksada.org:8090/time-table/student?id=5598', attrs={'id': 'timeTable'}, flavor='lxml')
df = table_list[0].replace(r'&nbsp', 'NoValue', regex=True)  # replace the value with NoValue, in case needed further
df_header = ['Day', 'W1', 'W2', 'W3', 'W4', 'W5']
df.columns = df_header  # logical header
df.head(2)  # this can be commented out as this is only for data viewing

Since pandas reads the first row as the header, convert that header back into a first row of data.

#converting header to first row data
df_t=pd.DataFrame(columns=df_header, data=[table_list[0].columns.tolist()])

This is the final DataFrame that will be used to serve the data needs.

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_final = pd.concat([df_t, df], ignore_index=True)
df_final.head(5)  # this can be commented out as this is only for data viewing

#setup for group weeks
week_days_notation=['Пн','Вт','Ср','Чт','Пт','Сб','Нд']
day_of_week=""
week_days=[]
for e in df_final['Day']:
    if e in week_days_notation:
        day_of_week=e
    week_days.append(day_of_week)
#week_days


# add the week_days to the dataframe
df_final.insert(0,'week_group',week_days)
df_final.head(2)

#group by week
df_final_grp=df_final.groupby('week_group')

# now can get week and  iterate in case needed
# give me only  'Wednesday':'Ср'
wed_classes=df_final_grp.get_group('Ср')
wed_classes.head(10)
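
As a side note, the loop that builds `week_days` could probably also be expressed with pandas alone, by blanking every cell that is not a weekday label and forward-filling the rest. This is a sketch on an invented toy frame (the values in `df` below are made up, not taken from the page), and it behaves slightly differently from the loop: rows before the first weekday label get NaN instead of an empty string, and `groupby` skips NaN keys by default.

```python
import pandas as pd

week_days_notation = ['Пн', 'Вт', 'Ср', 'Чт', 'Пт', 'Сб', 'Нд']

# Toy stand-in for df_final; the real values come from the timetable page.
df = pd.DataFrame({
    'Day': ['Пн', '31.01', '1 пара', 'Вт', '01.02', '1 пара'],
    'W1': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Keep a cell only where it is a weekday label, then carry that label down.
labels = df['Day'].where(df['Day'].isin(week_days_notation)).ffill()
df.insert(0, 'week_group', labels)

# Same grouping step as above, here for Tuesday ('Вт').
tue_classes = df.groupby('week_group').get_group('Вт')
```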

Answer 1
0 Accepted 2022-01-26 16:41:24
