
Iterate and remove HTML from list elements in Python

Disclaimer: total noob Python coder here; I've just started chapter 44 of "Learn Python the Hard Way" and am trying to do some side projects on my own to supplement my learning.

I'm trying to write a script that serves as a back-end "admin interface" for me, letting me enter a URL containing a team's football schedule, automatically pulling that schedule out of the page, and then saving it to a file for later access.

So far I've been able to enter a URL in the terminal, open that URL, iterate through each line of its HTML, and strip away enough HTML tags that two separate elements hold what I want (at least in terms of the strings they contain...): a list of games and a list of the dates of those games. They're kept in two separate lists, which I save as HTML files so I can view them in a browser and confirm the data I'm getting.

Note: these files get their file names by parsing the URL.

Here is the sample URL I'm working with: www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php

The problem I'm facing now is twofold:

1) Remove all the HTML from both lists, so that the only thing left at each of their indices is a string. I've tried BeautifulSoup, but I've been banging my head against a wall with it for the past day, combing through StackOverflow and trying different approaches.

No dice (user error, I'm positive).
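For stripping tags from simple, well-formed lines like the ones the scraping loop collects, one lightweight alternative to a full parser is a regular expression that drops everything between angle brackets. This is only a sketch (a real HTML parser is safer in general), and the sample line below is made up for illustration:

```python
import re

def strip_tags(html_line):
    # Remove every <...> tag, then trim surrounding whitespace
    return re.sub(r'<[^>]+>', '', html_line).strip()

# Hypothetical line, similar to what the scraping loop appends:
sample = '<td class="cfb2"><a href="#"><strong>Ball State Cardinals</strong></a></td>'
print(strip_tags(sample))  # Ball State Cardinals
```

Applying `strip_tags` over each list with a comprehension, e.g. `[strip_tags(x) for x in team_data]`, leaves only the strings.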

2) Then, in the list holding the dates, combine each pair of indices (i.e. 0 and 1, 2 and 3, 4 and 5, etc.) into a single string at a single list index.
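Combining each pair of adjacent indices can be done with extended slices: take the even-indexed items and the odd-indexed items and zip them together. A small sketch with made-up date fragments:

```python
# Hypothetical contents of the dates list: day name and date alternate
dates = ['Saturday', 'Sep. 5', 'Saturday', 'Sep. 12', 'Saturday', 'Sep. 19']

# dates[0::2] -> items 0, 2, 4...; dates[1::2] -> items 1, 3, 5...
combined = [day + ' ' + date for day, date in zip(dates[0::2], dates[1::2])]
print(combined)  # ['Saturday Sep. 5', 'Saturday Sep. 12', 'Saturday Sep. 19']
```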

From there, I believe I've found a way to merge the two lists into one ("Learn Python the Hard Way" has a lesson covering it, I believe, and there's plenty on StackOverflow), but these two issues are real blockers for me right now.
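For reference, merging the two cleaned-up lists index by index is exactly what zip() does; a sketch with placeholder values:

```python
dates = ['Saturday Sep. 5', 'Saturday Sep. 12']
teams = ['Arizona State Sun Devils', 'Ball State Cardinals']

# zip() stops at the shorter list, so both lists should be equal length
schedule = [d + ': ' + t for d, t in zip(dates, teams)]
print(schedule)
```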

Here is the code I've written, with comments for each step, including the remaining steps I don't have working code for:

# Import necessary modules

from urllib import urlopen
import sys
import urlparse

# Take user input to get the URL where schedule lives

team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")

# Parse the URL to grab the 'path' segment to whittle down and use as the file name

file_name = urlparse.urlsplit(team_url)

# Parse the URL to make the file name:

name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".html"
name_final_s = name_after[0] + "sched" + ".html"

# Create an empty list to hold our HTML data:

team_data = []
schedule_data = []

# Grab the HTML file to then be written & parsed down to just team names:

for line in urlopen(team_url).readlines():
    if "tr" in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())

# Grab the HTML file to then be written & parsed down to just schedules:

for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
            # schedule_data[0::1] = [','.join(schedule_data[0::1])]

# Save team's game list file with contents of HTML:

with open(name_final, 'w') as fout:
    fout.write(str(team_data))

# Save team's schedule file with contents of HTML:

with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))

# Remove all HTML tags from the game list file:



# Remove all HTML tags from the schedule list file:


# Combine necessary strings from the schedule list:


# Combine the two lists into a single list:

Any help would be greatly appreciated!

UPDATE: May 27, 2015, 9:42 AM

So I've played around with HTMLParser a bit, and I think I'm getting there. Here's the new code (still using this URL: http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php ):

# Import necessary modules

from HTMLParser import HTMLParser
from urllib import urlopen
import sys
import urlparse
import os

# Take user input to get the URL where schedule lives

team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")

# Parse the URL to grab the 'path' segment to whittle down and use as the file name

file_name = urlparse.urlsplit(team_url)

# Parse the URL to make the file name:

name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".txt"
name_final_s = name_after[0] + "-dates" + ".txt"

# Create an empty list to hold our HTML data:

team_data = []
schedule_data = []

# Grab the HTML file to then be written & parsed down to just team names:

for line in urlopen(team_url).readlines():
    if "tr" in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())

# Grab the HTML file to then be written & parsed down to just schedules:

for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
            # schedule_data[0::1] = [','.join(schedule_data[0::1])]

# Save team's game list file with contents of HTML:

with open(name_final, 'w') as fout:
    fout.write(str(team_data))

# Save team's schedule file with contents of HTML:

with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))

# Create file name path from pre-determined directory and added string:

game_file = open(os.path.join('/Users/jmatthicks/Documents/' + name_final))
schedule_file = open(os.path.join('/Users/jmatthicks/Documents/' + name_final_s))

# Utilize MyHTML Python HTML Parsing module via MyHTMLParser class

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data :", data

# Create a game instance of HTMLParser:

game_parser = MyHTMLParser()


# Create a schedule instance of HTMLParster:

sched_parser = MyHTMLParser()


# Create function that opens and reads each line in a file:

def open_game():
    run = open(os.path.join('/Users/jmatthicks/Documents/' + name_final)).readlines()
    for x in run:
        game_parser.feed(x)

def open_sched():
    run = open(os.path.join('/Users/jmatthicks/Documents/' + name_final_s)).readlines()
    for x in run:
        sched_parser.feed(x)


open_game()
open_sched()


# Combine necessary strings from the schedule list:



# Combine the two lists into a single list:


# Save again as .txt files

# with open(name_final, 'w') as fout:
#   fout.write(str(team_data))
#   
# with open(name_final_s, 'w') as fout:
#   fout.write(str(schedule_data))

So, now that I'm parsing it, I just need to completely remove all the HTML tags from the strings, so that only the opponents are left in one file and only the dates in the other.
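Instead of printing from the handler callbacks, the handle_data override can append the text it finds to a list, which leaves only the strings once feed() has run. A sketch of that idea (the import falls back to html.parser so the snippet also runs on Python 3; the sample input is made up):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class TextCollector(HTMLParser):
    """Collects non-empty text nodes instead of printing them."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pieces.append(text)

collector = TextCollector()
collector.feed('<td class="cfb1"><strong>Saturday</strong>Sep. 5</td>')
print(collector.pieces)  # ['Saturday', 'Sep. 5']
```

Feeding each line of the saved files through a collector like this would leave `pieces` holding just the strings, ready to write back out.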

I'll keep working at it in the meantime, and if no solution turns up here I'll post my results back.

Thanks for all the help and insight so far; this noob really appreciates it.

In case you're wondering how to use BeautifulSoup, here is an attempt at your part (1):

First, make sure you have the right version installed:

$ pip install beautifulsoup4

In your Python shell:

from bs4 import BeautifulSoup
from urllib import urlopen
team_url = "http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php"
text = urlopen(team_url).read()
soup = BeautifulSoup(text)
table = soup.find('table', attrs={"class": "cfb-sch"})
data = []

for row in table.find_all('tr'):
    data.append([cell.text.strip() for cell in row.find_all('td')])

print data

# should print out something like:
#[[u'2015 Texas A&M Aggies Football Schedule'],
# [u'Date', u'', u'Opponent', u'Time/TV', u'Tickets'],
# [u'SaturdaySep. 5',
#  u'',
#  u'Arizona State Sun Devils \r\n      NRG Stadium, Houston, TX',
#  u'7:00 p.m. CT\r\nESPN network',
#  u'Buy\r\nTickets'],
# [u'SaturdaySep. 12',
#  u'',
#  u'Ball State Cardinals \r\n      Kyle Field, College Station, TX',
#  u'TBA',
#  u'Buy\r\nTickets'],
# ...
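The cell text above still contains \r\n runs and layout padding; collapsing every whitespace run to a single space cleans that up. A small sketch using one of the strings from the output above:

```python
import re

def normalize(cell_text):
    # Collapse any run of whitespace (spaces, \r, \n, tabs) to one space
    return re.sub(r'\s+', ' ', cell_text).strip()

raw = u'Arizona State Sun Devils \r\n      NRG Stadium, Houston, TX'
print(normalize(raw))  # Arizona State Sun Devils NRG Stadium, Houston, TX
```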

As long as you identify the tags you need, it should be pretty straightforward using BeautifulSoup while looking at the page's HTML. Here is the code:

import urllib2
from bs4 import BeautifulSoup


def main():
    url = 'http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php'
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    table = soup.find("table",{"class" : "cfb-sch"})
    # Working on the teams
    teams_td = table.findAll("td",{"class" : "cfb2"})
    teams = []
    for t in teams_td:
        teams.append(t.text.split('\r\n')[0].strip())
    # Working on the dates
    dates_td = table.findAll("td",{"class" : "cfb1"})
    dates = []
    # In the HTML table, only 1 in every 3 "cfb1" cells is a date
    for i in range(0,len(dates_td),3):
        dates.append(dates_td[i].text)

    # Print everything
    for s in zip(dates, teams):
        print s

if __name__ == '__main__':
    main()

When you run it, you should get the following:

(u'SaturdaySep. 5', u'Arizona State Sun Devils')
(u'SaturdaySep. 12', u'Ball State Cardinals')
(u'SaturdaySep. 19', u'Nevada Wolf Pack')
(u'SaturdaySep. 26', u'at Arkansas Razorbacks')
(u'SaturdayOct. 3', u'Mississippi State Bulldogs')
(u'SaturdayOct. 10', u'Open Date')
(u'SaturdayOct. 17', u'Alabama Crimson Tide')
(u'SaturdayOct. 24', u'at Ole Miss Rebels')
(u'SaturdayOct. 31', u'South Carolina Gamecocks')
(u'SaturdayNov. 7', u'Auburn Tigers')
(u'SaturdayNov. 14', u'Western Carolina Catamounts')
(u'SaturdayNov. 21', u'at Vanderbilt Commodores')
(u'Saturday\r\n    Nov. 28', u'at LSU Tigers')
(u'SaturdayDec. 5', u'SEC Championship Game')

I hope this helps.
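Since the original goal was to save the schedule for later access, the zipped pairs can also be written out one per line. A sketch with placeholder data and a hypothetical file name (the question builds the real name from the URL):

```python
dates = ['Saturday Sep. 5', 'Saturday Sep. 12']
teams = ['Arizona State Sun Devils', 'Ball State Cardinals']

# Format each (date, team) pair as one line of text
lines = ['%s - %s' % pair for pair in zip(dates, teams)]

# 'schedule.txt' is just an example name
with open('schedule.txt', 'w') as fout:
    fout.write('\n'.join(lines))
```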
