I've been working on this small piece for hours now and can't find a solution, even though it should be simple. This time I'll post the actual code rather than a simplified example, because somehow I can't get the examples to work with the real code.
I'm trying to do this with built-in modules only (though if you have an answer using bs4, I'd like to see it as well).
I have two files; one is an HTML file that goes like this:
<b>Match #139</b></font></td></tr><tr bgcolor="#EEEEEE"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap=""> <font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3822pb01">3822pb01</a> </font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Left with 'POLICE' Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>: <a href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv"> </font></td></tr><tr bgcolor="#FFFFFF"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap=""> <font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3821pb01">3821pb01</a> </font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Right with 'POLICE' Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>: <a href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv"> </font></td></tr><tr bgcolor="#5E5A80"><td colspan="4"><font face="Tahoma,Arial" size="2" color="#FFFFFF"> <b>Match #140</b></font></td></tr><tr bgcolor="#EEEEEE"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap=""> <font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3822pb02">3822pb02</a> </font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Left with Classic Fire Logo Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>: <a 
href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv"> </font></td></tr><tr bgcolor="#FFFFFF"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap=""> <font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3821pb02">3821pb02</a> </font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Right with Classic Fire Logo Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>: <a href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv"> </font></td></tr><tr bgcolor="#5E5A80"><td colspan="4"><font face="Tahoma,Arial" size="2" color="#FFFFFF"> <b>
Please don't kill me: yes, it's all on one line. You can paste it into a code editor to see it split over multiple lines. The file continues with more "Matches".
I want to do two things.
First, I want to create a dictionary that uses the match number as its key. So, for example, it would be
matches = {'139' : 'etc', '140' : 'etc'}
Second, if you look at the HTML, the first link after each Match header contains a part number; for example, the first one is 3822pb01. There are usually two part numbers inside a match, and I want to store those two part numbers together as the value for that match number.
matches = {'139' : ['3822pb01', '3821pb01'], '140' : ['3822pb02', '3821pb02']}
So far, I have been able to extract the part numbers on their own, and the Match numbers on their own, but not to associate each part number with its Match number.
Could someone help me approach this? It's a little beyond my current knowledge.
Here's the full HTML file: http://pastebin.com/raw.php?i=eWWh4XfM (the HTML isn't well formatted).
Using BeautifulSoup:
import re
from bs4 import BeautifulSoup

matches = {}
_catalog_link = re.compile(r'^http://www\.bricklink\.com/catalogItem\.asp\?P=')

# htmlpage is assumed to hold the HTML source as a string
soup = BeautifulSoup(htmlpage)
for match in soup.find_all(text=re.compile(r'Match #\d+')):
    match_number = match.string.split('#', 1)[-1]
    matches[match_number] = matched_links = []
    # Find the parent table row
    row = next(p for p in match.parents if p.name == 'tr')
    # next rows hold the links
    for sibling in row.next_siblings:
        if sibling.name != 'tr':
            continue
        links = sibling.find_all('a', href=_catalog_link)
        if not links:
            # a row without catalog item links starts the next match
            break
        matched_links.extend(l.string for l in links)
This produces:
{u'139': [u'3822pb01', u'3821pb01'],
u'140': [u'3822pb02', u'3821pb02'],
u'141': [u'3822pb06', u'3821pb06'],
u'142': [u'3822p03', u'3821p03'],
u'143': [u'3822p24', u'3821p24'],
u'144': [u'3822pb05', u'3821pb05'],
u'145': [u'3822pb04', u'3821pb04'],
u'146': [u'3822px1', u'3821px1'],
u'147': [u'3822', u'3821'],
u'148': [u'3189', u'3188'],
u'149': [u'801a', u'802a'],
u'150': [u'801', u'802'],
u'151': [u'445', u'446'],
u'152': [u'825', u'826'],
u'153': [u'825p01', u'826p01'],
u'154': [u'825p02', u'826p02'],
u'155': [u'3195', u'3194'],
u'156': [u'30231pb02', u'30231pb01'],
u'158': [u'30230px1', u'30230px2'],
u'159': [u'3936', u'3935'],
u'160': [u'30355', u'30356'],
u'161': [u'3586', u'3585'],
u'162': [u'3933', u'3934'],
u'164': [u'981', u'982'],
u'165': [u'43369', u'43368'],
u'166': [u'972', u'971'],
u'167': [u'972pa2', u'971pa2'],
u'168': [u'972p4f', u'971p4f'],
u'169': [u'972p63', u'971p63'],
u'170': [u'30073', u'30074'],
u'171': [u'6128', u'6127'],
u'172': [u'4466', u'4467'],
u'173': [u'fabah1', u'fabah2'],
u'174': [u'x46', u'x48'],
u'175': [u'4181', u'4182'],
u'176': [u'4181p05', u'4182p05'],
u'177': [u'4181pb01', u'4182pb01'],
u'178': [u'4181p02', u'4182p02'],
u'179': [u'4181p06', u'4182p06'],
u'180': [u'4181p04', u'4182p04'],
u'181': [u'4181px1', u'4182px1'],
u'182': [u'4181p03', u'4182p03'],
u'183': [u'4181p01', u'4182p01'],
u'184': [u'4181p07', u'4182p07'],
u'185': [u'3195px1', u'3194px1'],
u'186': [u'32190', u'32191'],
u'187': [u'32188', u'32189'],
u'188': [u'32527', u'32528'],
u'189': [u'32534', u'32535'],
u'190': [u'44350', u'44351'],
u'191': [u'44352', u'44353'],
u'192': [u'47712', u'47713'],
u'193': [u'42061', u'42060'],
u'194': [u'43710', u'43711'],
u'195': [u'41765', u'41764'],
u'196': [u'41748', u'41747'],
u'197': [u'41750', u'41749'],
u'198': [u'6565', u'6564'],
u'199': [u'41770', u'41769'],
u'200': [u'43723', u'43722'],
u'201': [u'43721', u'43720'],
u'202': [u'41768', u'41767'],
u'203': [u'3069bps5', u'3069bps4'],
u'204': [u'42061pb03', u'42060pb03'],
u'205': [u'42061pb05', u'42060pb05'],
u'206': [u'3005pb001', u'3005pb002'],
u'207': [u'48288pb02', u'48288pb01'],
u'208': [u'2582pb03', u'2582pb04'],
u'209': [u'712', u'713'],
u'211': [u'3039px17', u'3039px18'],
u'212': [u'3037px5', u'3037px6'],
u'213': [u'3037px3', u'3037px4'],
u'214': [u'30249pb02', u'30249pb01'],
u'215': [u'42022pb09', u'42022pb08'],
u'216': [u'42022pb05', u'42022pb06'],
u'217': [u'30647pb05', u'30647pb04'],
u'218': [u'30647pb01', u'30647pb02'],
u'219': [u'30647pb07', u'30647pb06'],
u'220': [u'30647px1', u'30647px2'],
u'221': [u'2744pb02', u'2744pb01'],
u'222': [u'42061px5', u'42060px5'],
u'223': [u'42061pb01', u'42060pb01'],
u'224': [u'42061px1', u'42060px1'],
u'225': [u'41748pb05', u'41747pb05'],
u'226': [u'41748pb16', u'41747pb16'],
u'227': [u'41748pb12', u'41747pb12'],
u'228': [u'41748pb15', u'41747pb15'],
u'229': [u'41748pb07', u'41747pb07'],
u'230': [u'41748px1', u'41747px1'],
u'231': [u'41748pb06', u'41747pb06'],
u'232': [u'41748pb14', u'41747pb14'],
u'233': [u'41748pb02', u'41747pb02'],
u'234': [u'41748pb04', u'41747pb04'],
u'235': [u'41748pb09', u'41747pb09'],
u'236': [u'41748pb08', u'41747pb08'],
u'237': [u'41748pb11', u'41747pb11'],
u'238': [u'41748pb03', u'41747pb03'],
u'239': [u'41748pb13', u'41747pb13'],
u'240': [u'41748pb10', u'41747pb10'],
u'241': [u'41750px2', u'41749px2'],
u'242': [u'41750pb01', u'41749pb01'],
u'243': [u'6565pb01', u'6564pb01'],
u'244': [u'4864bp10', u'4864bp11'],
u'245': [u'4864pb006L', u'4864pb006R'],
u'246': [u'2362pb04', u'2362pb05'],
u'247': [u'4215ap06', u'4215ap04'],
u'248': [u'4215ap24', u'4215ap25'],
u'249': [u'4215pb021', u'4215pb022'],
u'250': [u'4215ap07', u'4215ap05'],
u'251': [u'30117pb02L', u'30117pb02R'],
u'252': [u'30117pb03L', u'30117pb03R'],
u'253': [u'30117pb04L', u'30117pb04R'],
u'254': [u'30117pb01', u'30117pb05'],
u'255': [u'30116pb01', u'30116pb02'],
u'256': [u'2468pb02', u'2468pb03'],
u'257': [u'3245apx2', u'3245apx1'],
u'258': [u'4070pb02', u'4070pb01'],
u'259': [u'41855pb09', u'41855pb10'],
u'401': [u'47847pb001L', u'47847pb001R'],
u'418': [u'4460pb01', u'4460pb02'],
u'419': [u'3010pb027', u'3010pb026'],
u'420': [u'3010pb025', u'3010pb024'],
u'421': [u'2341pb02', u'2341pb01'],
u'439': [u'4286pb03', u'4286pb02'],
u'440': [u'41748pb17', u'41747pb17'],
u'472': [u'43710pb01', u'43711pb01'],
u'473': [u'30363pb08', u'30363pb09'],
u'474': [u'50305', u'50304'],
u'475': [u'50955', u'50956'],
u'512': [u'4286pb04', u'4286pb01'],
u'546': [u'47397', u'47398'],
u'572': [u'3193', u'3192'],
u'598': [u'3933a', u'3934a'],
u'606': [u'3822pb07', u'3821pb07'],
u'620': [u'3939px1', u'3939px2'],
u'621': [u'2431px18', u'2431px19'],
u'622': [u'3069bpx57', u'3069bpx56'],
u'643': [u'4215pb015', u'4215pb016'],
u'678': [u'54384', u'54383'],
u'680': [u'42061pb06', u'42060pb06'],
u'681': [u'42061pb02', u'42060pb02'],
u'682': [u'41748pb18', u'41747pb18'],
u'683': [u'41768pb01', u'41767pb01'],
u'684': [u'42061pb07', u'42060pb07'],
u'685': [u'48933pb02', u'48933pb03'],
u'686': [u'3622pb011', u'3622pb012'],
u'687': [u'3010pb055L', u'3010pb055R'],
u'688': [u'3008pb038', u'3008pb039'],
u'689': [u'3822pb08', u'3821pb08'],
u'690': [u'3822pb09', u'3821pb09'],
u'691': [u'3822pb10', u'3821pb10'],
u'692': [u'3189pb01', u'3188pb01'],
u'693': [u'3193pb01', u'3192pb01'],
u'694': [u'3193pb02', u'3192pb02'],
u'695': [u'3195pb01', u'3194pb01'],
u'696': [u'4864apx10', u'4864apx11'],
u'697': [u'4215pb029', u'4215pb030'],
u'700': [u'2362pb10', u'2362pb11'],
u'701': [u'4286pb06', u'4286pb05'],
u'702': [u'3678apb05', u'3678apb06'],
u'703': [u'3678apb07', u'3678apb08'],
u'704': [u'4460pb04', u'4460pb03'],
u'705': [u'2340pb17L', u'2340pb17R'],
u'706': [u'2340pb21L', u'2340pb21R'],
u'707': [u'2340pb03', u'2340pb02'],
u'708': [u'2340pb11', u'2340pb10'],
u'709': [u'2340pb04', u'2340pb05'],
u'710': [u'2340pb16', u'2340pb15'],
u'711': [u'2340pb07', u'2340pb06'],
u'712': [u'2340pb09', u'2340pb08'],
u'714': [u'2431pb039', u'2431pb040'],
u'727': [u'2431pb025', u'2431pb026'],
u'728': [u'791pb01L', u'791pb01R'],
u'766': [u'3004pb031L', u'3004pb031R'],
u'768': [u'3010pb057L', u'3010pb057R'],
u'769': [u'3009pb071L', u'3009pb071R'],
u'770': [u'3009pb072L', u'3009pb072R'],
u'771': [u'2873pb08L', u'2873pb08R'],
u'772': [u'4286pb07L', u'4286pb07R'],
u'773': [u'4286pb08L', u'4286pb08R'],
u'774': [u'2340pb25L', u'2340pb25R'],
u'775': [u'2340pb23L', u'2340pb23R'],
u'776': [u'3004pb021L', u'3004pb021R'],
u'777': [u'3004pb017L', u'3004pb017R']}