Is there a more Pythonic way to merge two HTML header rows with colspans?

I am using BeautifulSoup in Python to parse some HTML. One of the problems I am dealing with is that the colspans differ across header rows. (In my jargon, header rows are the rows that need to be combined to get the column headings.) That is, one cell may span a number of columns above or below it, and the words need to be appended or prepended based on the spanning. Below is a routine to do this. I use BeautifulSoup to pull the colspans and the contents of each cell in each row. longHeader is the contents of the header row with the most items, and spanLong is a list of the colspans of each item in that row; shortHeader and spanShort hold the same for the shorter row. This works, but it does not look very Pythonic.

Also, it is not going to work if the diff is < 0; I can fix that with the same approach I used to get this to work. But before I do, I wonder if anyone can quickly look at this and suggest a more Pythonic approach. I am a long-time SAS programmer, so I struggle to break the mold: I tend to write code as if I were writing a SAS macro.

longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0

for each in range(len(shortHeader)):
    sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
    sumSpanShort=sumSpanShort+spanShort[each]
    spanDiff=sumSpanShort-sumSpanLong
    if spanDiff==0:
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        continue
    for i in range(0,spanDiff):
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
        spanDiff=sumSpanShort-sumSpanLong
        if spanDiff==0:
            combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
            longHeaderCount=longHeaderCount+1
            break

print combinedHeader
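
For readers following the running totals in the loop above, it may help to see them as column ranges: each cell covers a half-open range of grid columns given by the cumulative sum of the colspans. Below is a minimal Python 3 sketch of that idea. The helper name `column_ranges` is hypothetical and not part of the original code.

```python
from itertools import accumulate


def column_ranges(spans):
    """Return the (start, end) grid-column range of each cell, end exclusive."""
    ends = list(accumulate(spans))   # running total of colspans
    starts = [0] + ends[:-1]         # each cell starts where the previous one ended
    return list(zip(starts, ends))


span_short = [1, 1, 3, 1, 3]
print(column_ranges(span_short))
# [(0, 1), (1, 2), (2, 5), (5, 6), (6, 9)]
```

Two cells from different rows belong to the same combined heading exactly when their ranges overlap, which is the condition the `spanDiff` bookkeeping is tracking.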

Here is a modified version of your algorithm. zip is used to iterate over the short lengths and headers, and a class object is used to count and iterate over the long items, as well as to combine the headers. A while loop is more appropriate for the inner loop. (Forgive the too-short names.)

class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0
    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [ 
       '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
    '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader
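
As a possible variation on the code above (not the answerer's own code), the counter held by the class can be replaced with a plain iterator over the long row. This is a Python 3 sketch with some assumptions: the function name `combine_headers` is hypothetical, every long cell is assumed to fit inside a single short cell, both rows are assumed to cover the same total width, and the combined text is stripped of padding spaces, which the original does not do.

```python
def combine_headers(long_header, span_long, short_header, span_short):
    """Pair every long-row cell with the short-row cell that sits over it."""
    combined = []
    long_cells = iter(zip(long_header, span_long))
    covered_long = 0    # grid columns consumed from the long row so far
    covered_short = 0   # grid columns consumed from the short row so far
    for short_text, short_span in zip(short_header, span_short):
        covered_short += short_span
        # Pull long cells until they catch up with the current short cell.
        while covered_long < covered_short:
            long_text, long_span = next(long_cells)
            covered_long += long_span
            combined.append((long_text + ' ' + short_text).strip())
    return combined


long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]

result = combine_headers(long_header, span_long, short_header, span_short)
print(result)
```

If a long cell spans several short cells, this sketch emits only one entry for it, so it is not a general merge.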

You've actually got a lot going on in this example.

  1. You've "over-processed" the Beautiful Soup Tag objects to make lists. Leave them as Tags.

  2. All of these kinds of merge algorithms are hard. It helps to treat the two things being merged symmetrically.

Here's a version that should work directly with the Beautiful Soup Tag objects. Also, this version doesn't assume anything about the lengths of the two rows.

def merge3( row1, row2 ):
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
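        if i1 == len(row1):
            # row1 is exhausted: take the remaining cells from row2
            result.append( ' '.join(row2[i2].contents) )
            i2 += 1
        elif i2 == len(row2):
            # row2 is exhausted: take the remaining cells from row1
            result.append( ' '.join(row1[i1].contents) )
            i1 += 1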
        else:
            if row1[i1]['colspan'] < row2[i2]['colspan']:
                # Fill extra cols from row1
                c1= row1[i1]['colspan']
                while c1 != row2[i2]['colspan']:
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif row1[i1]['colspan'] > row2[i2]['colspan']:
                # Fill extra cols from row2
                c2= row2[i2]['colspan']
                while row1[i1]['colspan'] != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            else:
                assert row1[i1]['colspan'] == row2[i2]['colspan']
                pass
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result
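
To illustrate the "treat both rows symmetrically" idea on its own, here is a Python 3 sketch that works on plain `(text, colspan)` pairs instead of Tag objects. It is a different formulation from the function above, not a corrected copy of it. The name `merge_rows` is hypothetical, and the sketch assumes both rows cover the same total number of grid columns; it stops as soon as either row runs out.

```python
def merge_rows(row1, row2):
    """Merge two header rows given as lists of (text, colspan) pairs."""
    merged = []
    i1 = i2 = 0
    left1 = row1[0][1] if row1 else 0   # grid columns left in the current row1 cell
    left2 = row2[0][1] if row2 else 0   # grid columns left in the current row2 cell
    while i1 < len(row1) and i2 < len(row2):
        merged.append((row1[i1][0] + ' ' + row2[i2][0]).strip())
        # Advance both rows by the width they share, whichever cell is narrower.
        step = min(left1, left2)
        left1 -= step
        left2 -= step
        if left1 == 0:
            i1 += 1
            if i1 < len(row1):
                left1 = row1[i1][1]
        if left2 == 0:
            i2 += 1
            if i2 < len(row2):
                left2 = row2[i2][1]
    return merged


upper = [('', 1), ('SILLY STUFF', 6), ('', 1)]
lower = [('', 1), ('MONTY PYTHON', 2), ('', 2), ('CHERRYPY', 2), ('', 1)]
print(merge_rows(upper, lower))
# ['', 'SILLY STUFF MONTY PYTHON', 'SILLY STUFF', 'SILLY STUFF CHERRYPY', '']
```

Neither row is treated as the "long" or "short" one, so swapping the arguments only changes the word order in each combined heading.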

Maybe look at the zip function for parts of the problem:

>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]

>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>> 

enumerate can help avoid some of the complex indexing with counters:

>>> diff_list = []
>>> for place, header in enumerate(short_header):
    diff_list.append(abs(span_short[place] - span_long[place]))

>>> new_shortlist = []
>>> for place, num in enumerate(diff_list):
    if num:
        new_shortlist.extend(short_header[place] for item in range(num+1))
    else:
        new_shortlist.append(short_header[place])


>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',... 
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...
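
A related sketch, in Python 3: rather than computing a per-position difference, each row can be expanded by its own colspans so that both rows have one entry per grid column, and then zipped. The helper name `expand` and the small sample lists are made up for illustration, and both rows are assumed to cover the same total width.

```python
def expand(header, spans):
    """Repeat each cell's text once for every grid column it spans."""
    return [text for text, span in zip(header, spans) for _ in range(span)]


long_header = ['', 'bananas', '', '', '']
span_long = [1, 3, 1, 1, 1]
short_header = ['', 'bunches', 'cars']
span_short = [1, 3, 3]

pairs = zip(expand(long_header, span_long), expand(short_header, span_short))
combined = [(long_text + ' ' + short_text).strip() for long_text, short_text in pairs]
print(combined)
# ['', 'bananas bunches', 'bananas bunches', 'bananas bunches', 'cars', 'cars', 'cars']
```

Note that this yields one heading per grid column rather than one per cell, so a cell with colspan 3 appears three times.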

Also, more Pythonic naming may add clarity:

    for each in range(len(short_header)):
        sum_span_long += span_long[long_header_count]
        sum_span_short += span_short[each]
        span_diff = sum_span_short - sum_span_long
        if not span_diff:
            combined_header.append...

I guess I am going to answer my own question, but I did receive a lot of help. Thanks for all of it. I made S.LOTT's answer work after a few small corrections. (They may be so small as to not be visible (inside joke).) So now the question is: why is this more Pythonic? I think I see that it is less dense and works with the raw inputs instead of derivations. I cannot judge whether it is easier to read, though it is easy to read.

S.LOTT's Answer Corrected

row1=headerCells[0]
row2=headerCells[1]

i1= 0
i2= 0
result= []
while i1 != len(row1) or i2 != len(row2):
    if i1 == len(row1):
        result.append( ' '.join(row2[i2]) )  # row1 is exhausted, so read from row2
        i2 += 1
    elif i2 == len(row2):
        result.append( ' '.join(row1[i1]) )  # row2 is exhausted, so read from row1
        i1 += 1
    else:
        if int(row1[i1].get("colspan","1")) < int(row2[i2].get("colspan","1")):
            c1= int(row1[i1].get("colspan","1"))
            while c1 != int(row2[i2].get("colspan","1")): 
                txt1= ' '.join(row1[i1])  # needed to add when working adjust opposing case
                txt2= ' '.join(row2[i2])     # needed to add  when working adjust opposing case
                result.append( txt1 + " " + txt2 )  # needed to add when working adjust opposing case
                print 'stayed in middle', 'i1=',i1,'i2=',i2, ' c1=',c1
                c1 += 1
                i1 += 1    # Is this the problem it

        elif int(row1[i1].get("colspan","1"))> int(row2[i2].get("colspan","1")):
                # Fill extra cols from row2  Make same adjustment as above
            c2= int(row2[i2].get("colspan","1"))
            while int(row1[i1].get("colspan","1")) != c2:
                result.append( ' '.join(row1[i1]) )
                c2 += 1
                i2 += 1
        else:
            assert int(row1[i1].get("colspan","1")) == int(row2[i2].get("colspan","1"))
            pass


        txt1= ' '.join(row1[i1])
        txt2= ' '.join(row2[i2])
        result.append( txt1 + " " + txt2 )
        print 'went to bottom', 'i1=',i1,'i2=',i2
        i1 += 1
        i2 += 1
print result

Well, I have an answer now. I was thinking this through and decided that I needed to use parts of every answer. I still need to figure out whether I want a class or a function. But I have an algorithm that I think is probably more Pythonic than any of the others. It borrows heavily from the answers that some very generous people provided. I appreciate those a lot, because I have learned quite a bit.

To save the time of having to make test cases, I am going to paste the complete code I have been banging away at in IDLE and follow that with an HTML sample file. Other than making a decision about class/function (and I need to think about how I am using this code in my program), I would be happy to see any improvements that make the code more Pythonic.

from BeautifulSoup import BeautifulSoup

original=file(r"C:\testheaders.htm").read()

soupOriginal=BeautifulSoup(original)
all_Rows=soupOriginal.findAll('tr')


header_Rows=[]
for each in range(len(all_Rows)):
    header_Rows.append(all_Rows[each])


header_Cells=[]
for each in header_Rows:
    header_Cells.append(each.findAll('td'))

header=[]
for row in range(len(header_Cells)):
    temp_Header_Row=[]  # start a fresh list for every header row
    for column in range(len(header_Cells[row])):
        x=int(header_Cells[row][column].get("colspan","1"))
        # repeat the cell text once per spanned column so all rows end up the same width
        for item in range(x):
            temp_Header_Row.append( ' '.join(header_Cells[row][column]) )
    header.append(temp_Header_Row)
# transpose: each tuple holds one grid column's text from every header row
combined_Header=zip(*header)

for each in combined_Header:
    print each
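
For readers without the old BeautifulSoup 3 package, the same expand-then-transpose idea can be sketched in Python 3 with the standard library's `html.parser` in place of BeautifulSoup. This is an illustration under assumptions, not the code used above: the class name `HeaderRows` and the two-row sample table are made up, and the sketch assumes every `<td>` sits inside a `<tr>` and that all rows cover the same number of grid columns.

```python
from html.parser import HTMLParser


class HeaderRows(HTMLParser):
    """Collect each <tr> as a list of cell texts, repeated once per colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._span = None   # colspan of the <td> being read, None outside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td':
            self._span = int(dict(attrs).get('colspan') or 1)
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._span is not None:
            text = ''.join(self._text).strip()   # also strips the &nbsp; padding
            self.rows[-1].extend([text] * self._span)
            self._span = None


html = """
<TABLE>
<TR><TD>&nbsp;</TD><TD colspan="4">SILLY STUFF</TD></TR>
<TR><TD>Name</TD><TD colspan="2">SHOWS</TD><TD colspan="2">PROGRAMS</TD></TR>
</TABLE>
"""

parser = HeaderRows()
parser.feed(html)
parser.close()

# Transpose the rows and join the non-empty pieces of each grid column.
columns = [' '.join(filter(None, column)) for column in zip(*parser.rows)]
print(columns)
```

Because `zip` stops at the shortest row, rows of unequal total width would be silently truncated, so a length check may be worth adding in real use.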

Okay, the test file contents are below. Sorry, I tried to attach these but couldn't make it happen:

  <TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
  <TR valign="bottom">
  <TD width="40%">&nbsp;</TD>
  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">FOODS WE LIKE</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">&nbsp;</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="CENTER" colspan="6">SILLY STUFF</TD>

  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">OTHER THAN</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="CENTER" colspan="6">FAVORITE PEOPLE</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">MONTY PYTHON</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">CHERRYPY</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">APPLE PIE</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">MOTHERS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">FATHERS</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD nowrap align="left">Name</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">SHOWS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">PROGRAMS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">BANANAS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">PERFUME</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">TOOLS</TD>
  <TD>&nbsp;</TD>
  </TR>
  </TABLE>
