
How to properly print the whole result of a for loop (Python)?

I have a function that, given an input XML file, collects the child tags of a certain tag and stores their indexes and tags as key-value pairs in a dictionary. After filtering this dictionary, I take the values, extract their text, and then delete some characters from the text. The problem is that the text is not returned as one line, but "piece by piece". I use the "end" workaround to print it on one line, but that does not solve the problem, because I still do not have the whole text as a single value. Also, the print must be inside the loop if I want to see all of the text.

This is the code:

from xml.dom import minidom
import xml.etree.ElementTree as ET

def get_xml_by_tag_names(xml_path, tag_name_1, tag_name_2):

    data = {}
    xml_tree = minidom.parse(xml_path)
    item_group_nodes = xml_tree.getElementsByTagName(tag_name_1)
    for idx, item_group_node in enumerate(item_group_nodes):
        cl_compile_nodes = item_group_node.getElementsByTagName(tag_name_2)
        for _ in cl_compile_nodes:
            data[idx]=[item_group_node.toxml()]
    return data

def main():
    lista_prima = []
    uncinata1 = " < "
    uncinata2 = " >"
    punto = "."
    virgola = ","
    puntoevirgola = ";"
    dash = "-"
    puntoesclamativo = "!"
    duepunti = ":"
    apostrofo = "’"
    puntointerrogativo = "?"
    angolate = "<>"
    data = get_xml_by_tag_names('output2.xml', 'new_line', 'text')
    deletekeys = []
    for k in data:
        for v in data[k]:
            if "10.238" in v:
                # return k
                deletekeys.append(k)
    for item in deletekeys:
        del data[item]



    for value in data.values():
        myxml = ' '.join(value)
        # print(myxml)

        tree = ET.fromstring(myxml)
        lista = ([text.text for text in tree.findall('text')])
        testo = (' '.join(lista))
        testo = testo.replace(uncinata1, "")
        testo = testo.replace(uncinata2, "")
        testo = testo.replace(punto, "")
        testo = testo.replace(virgola, "")
        testo = testo.replace(puntoevirgola, "")
        testo = testo.replace(dash, "")
        testo = testo.replace(puntoesclamativo, "")
        testo = testo.replace(duepunti, "")
        testo = testo.replace(apostrofo, "")
        testo = testo.replace(puntointerrogativo, "")
        testo = testo.replace(angolate, "")


    print(testo)


if __name__ == "__main__":
    main()

My XML is:

<pages>
      <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="191.745,592.218,249.042,603.578">
    <textline>
         <new_line>
                  <text font="NUMPTY+ImprintMTnum" bbox="297.284,540.828,300.188,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">della quale non conosce che una parte;] </text>
                  <text font="PYNIYO+ImprintMTnum-Italic" bbox="322.455,540.839,328.251,553.566" colourspace="DeviceGray" ncolour="0" size="12.727">prima</text>
                  <text font="NUMPTY+ImprintMTnum" bbox="331.206,545.345,334.683,552.834" colourspace="DeviceGray" ncolour="0" size="7.489">1</text>
                  <text font="NUMPTY+ImprintMTnum" bbox="177.602,528.028,180.850,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che nonconosce ancora appieno;</text>
                  <text font="NUMPTY+ImprintMTnum" bbox="189.430,532.545,192.908,540.034" colourspace="DeviceGray" ncolour="0" size="7.489">2</text>
                  <text font="NUMPTY+ImprintMTnum" bbox="203.879,528.028,208.975,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che</text>
                </new_line>
    </textline>
<textline bbox="68.032,408.428,372.762,421.166">
<new_line>
          <text font="NUMPTY+ImprintMTnum" bbox="307.143,408.428,310.392,420.910" colourspace="DeviceGray" ncolour="0" size="12.482">viso] vi</text>
          <text font="NUMPTY+ImprintMTnum" bbox="310.280,408.808,313.243,419.046" colourspace="DeviceGray" ncolour="0" size="10.238">-</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="320.072,408.439,325.868,421.166" colourspace="DeviceGray" ncolour="0" size="12.727">su</text>
          <text font="NUMPTY+ImprintMTnum" bbox="328.829,408.428,338.452,420.910" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
        </new_line>
</textline>
    </textbox>
    </page>
    </pages>

Basically it returns the text of the XML one at a time like this:

piece of text
piece of text
piece of text

But I need the whole text together because I can't process it further otherwise. If I print outside the loop it prints just one line.

I tried print(testo, end = " ") but even though it prints it in one line, it still can't be processed.

When you exit the for value in data.values() loop, testo holds only the value from the last iteration: each pass overwrites it, so the earlier values are never preserved. Possible fix:

testo = []
for value in data.values():
    myxml = ' '.join(value)
    tree = ET.fromstring(myxml)
    tmpstring = ' '.join(text.text for text in tree.findall('text'))
    for to_remove in (" < ", " >", ".", ",", ";", "-", "!", ":", "’", "?", "<>"):
        tmpstring = tmpstring.replace(to_remove, "")
    testo.append(tmpstring)

testo = ''.join(testo)
print(testo)
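
For anyone who wants to try the idea without the original file, here is a minimal self-contained sketch of the same accumulate-then-join pattern. The FRAGMENTS list is made-up sample data shaped like the new_line elements in the question, and collect_text and STRIP_TABLE are illustrative names, not part of the original code. It also swaps the chain of replace calls for str.translate, which deletes single characters only, so it is not an exact equivalent of removing the multi-character strings " < " and " >".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input, shaped like the <new_line> fragments in the question.
FRAGMENTS = [
    "<new_line><text>della quale non conosce;] </text><text>prima</text></new_line>",
    "<new_line><text>viso] vi</text><text>su</text></new_line>",
]

# Characters to delete, built once; str.translate removes them in a single pass.
STRIP_TABLE = str.maketrans("", "", ".,;-!:’?<>")


def collect_text(fragments):
    """Return the cleaned text of every fragment as one string."""
    pieces = []  # created outside the loop, so earlier results are not overwritten
    for fragment in fragments:
        root = ET.fromstring(fragment)
        # `or ""` guards against empty <text/> elements, whose .text is None
        line = " ".join(node.text or "" for node in root.findall("text"))
        pieces.append(line.translate(STRIP_TABLE))
    # split() + join collapses the runs of whitespace left behind by the removals
    return " ".join(" ".join(pieces).split())


result = collect_text(FRAGMENTS)
print(result)
```

Because the function returns a string instead of printing inside the loop, the result can be passed on to further processing. Note that the pieces are joined with a space here; joining with '' as in the fix above glues the last word of one line to the first word of the next.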
