
XlsxWriter doesn't create the file for some dataframe inputs, only prints gibberish to the terminal

Intro

I have a rather interesting issue when creating files with xlsxwriter. In about 95% of cases the files are created correctly (both before and after the failing ones, so nothing crashes), but when I try to write the data of a specific BSI Baustein (German IT security baseline requirement), saving the file fails without an error message. I just get some gibberish in the terminal, as seen below. Has anyone experienced something similar? I would appreciate any pointers, as I don't know what else to look for.

What I know for sure

  1. The data is not malformed in any way that would explain the strange output. I checked the dataframe in the debugger just before it is written.
  2. It only happens with Bausteine in the CON section, not with the ones before or after it.
  3. Based on the visible logs, it happens after the data has been written, while the file is being closed, i.e. when it is created on the file system.
with pd.ExcelWriter(
    Path("output", "empty_templates", f"{module_title}.xlsx"),
    engine="xlsxwriter",
) as writer:
    log.info(f"Writing {module_title} data...")
    create_front_page(writer, "Deckblatt")
    write_file(writer, "Anforderungen", df_requirements)
    log.info("We returned, after writing data, closing and saving file...")
log.info("File saved!")
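
A silent failure like this one can be made visible by checking the target path once the writer has closed. The following is only a minimal sketch and not part of the original script: `confirm_saved` is a hypothetical helper, and the temporary directory merely stands in for `output/empty_templates`. How the check behaves for special device names may depend on the Python and Windows versions in use.

```python
import logging
import tempfile
from pathlib import Path

log = logging.getLogger(__name__)


def confirm_saved(path: Path) -> bool:
    """Hypothetical helper: report whether `path` is a non-empty regular file."""
    if path.is_file() and path.stat().st_size > 0:
        log.info("Verified %s (%d bytes)", path, path.stat().st_size)
        return True
    log.error("Expected %s on disk, but it is missing or empty", path)
    return False


# Usage sketch: a temporary directory stands in for the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "example.xlsx")
    target.write_bytes(b"PK")  # stand-in bytes, not a real workbook
    print(confirm_saved(target))                     # True
    print(confirm_saved(Path(tmp, "missing.xlsx")))  # False
```

Called right after the `with pd.ExcelWriter(...)` block, a check of this kind would turn the misleading "File saved!" message into an explicit error.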

What I see on the console

 INFO    : [comp_converter] : get_gs_bausteine : ['CON.2.A1 Umsetzung Standard-Datenschutzmodell (B)']
 INFO    : [comp_converter] : get_gs_bausteine : Writing CON.2 Datenschutz data...
 INFO    : [comp_converter] : write_file   : Nr. of entries: 3
 INFO    : [comp_converter] : write_file   : Data written into excel.
 INFO    : [comp_converter] : write_file   : Done writing the file
 INFO    : [comp_converter] : get_gs_bausteine : We returned, after writing data, closing and saving file...
[Binary output follows here: many lines of unreadable bytes. The readable fragments are ZIP "PK" markers and the member names of an .xlsx archive: [Content_Types].xml, _rels/.rels, xl/_rels/workbook.xml.rels, xl/worksheets/sheet1.xml, xl/workbook.xml, xl/sharedStrings.xml, xl/styles.xml, xl/theme/theme1.xml, docProps/core.xml, docProps/app.xml]

What you will need for a working test

  1. BSI Kompendium XML edition: save the content as "Kompendium2022.xml" in the directory where comp_converter.py is located. I know it's an awful format, but unfortunately that is a given.

  2. I use VSCode with Python 3.9.4. Additional packages beyond those in the imports: colorlog, xlsxwriter (as far as I remember).

  3. YAML config

<<comp_converter.yaml>>
SKIP_NR_COMPENDIUM_CHAPTERS: 7 # We have that number of chapters without Bausteine
INPUT_BSI_COMPENDIUM: "Kompendium2022.xml"
INPUT_PREV_MAPPING_COLUMNS:
  - "Kategorie"
  - "CIA"
  - "Anf. Nr."
  - "Titel"
  - "Verantwortung"
  - "Umsetzungsbeschreibung"
  - "Referenziertes Dokument"
  - "Status"
  - "Risiko"
  - "Risikobeschreibung"
  - "Gefährdungszuordnung"
ABBREV_REPLACEMENTS:
  "usw.": "usw|"
  "zB": "zB|"
  "z. B.": "z| B|"
  "bzw.": "bzw|"
  "ggf.": "ggf|"
  "idR.": "idR|"
  "idR.": "idR|"
  "engl.": "engl|"
  "inkl.": "inkl|"
  "Absch.": "Absch|"
  "o.Ä.": "o.Ä|"
  "dh": "dh|"
  "etc.": "etc|"
  "bspw.": "bspw|"
  "va": "va|"
  "vA": "vA|"

  4. Logger config

<<logging.conf>>
[loggers]
keys=root

[handlers]
keys=consoleHandler,fileHandler

[formatters]
keys=color_console,file

[logger_root]
level=DEBUG
handlers=consoleHandler,fileHandler

[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=color_console
args=(sys.stdout,)

[handler_fileHandler]
class=FileHandler
level=DEBUG
formatter=file
kwargs={"filename": "bsi_compendium_debug.log" ,"mode": "w"}

[formatter_file]
format=%(asctime)s %(levelname)-7s : [%(module)-12s] : %(funcName)-12s : %(message)s
datefmt=%Y-%m-%d %H:%M:%S

[formatter_color_console]
class=colorlog.ColoredFormatter
format=%(log_color)s %(levelname)-7s : [%(module)-12s] : %(funcName)-12s : %(message)s

  5. comp_converter.py with main

import logging
from logging import config
import yaml
import pandas as pd
from pathlib import Path
import re
from bs4 import BeautifulSoup


config.fileConfig(fname="logging.conf")
log = logging.getLogger(__name__)

with open("comp_converter.yaml", "r") as f:
    comp_conv_config = yaml.load(f, Loader=yaml.FullLoader)


def main():
    with open(comp_conv_config["INPUT_BSI_COMPENDIUM"], "r", encoding="utf-8") as f:
        compendium = f.read()

    compendium_tree = BeautifulSoup(compendium, "xml")
    get_gs_bausteine(compendium_tree)


def get_gs_bausteine(compendium_soup: BeautifulSoup) -> None:

    # Each BSI Baustein is grouped into a chapter
    bausteine = compendium_soup.find_all("chapter")
    # We get rid of unneeded chapters at the beginning, so we start only with the first Baustein
    for _ in range(comp_conv_config["SKIP_NR_COMPENDIUM_CHAPTERS"]):
        bausteine.pop(0)
    log.info(
        f"We found {len(bausteine)} Bausteine to work with: {[baustein.title.text for baustein in bausteine]} "
    )

    # We loop through each Baustein to get the necessary data
    for baustein in bausteine:
        module_titles = []
        modules = []
        log.info(f'Starting to process "{baustein.title.text}" Baustein')

        # first_module_title = baustein.find_next("section").title.text
        first_module = baustein.find_next("section")
        first_module_title = first_module.title.text
        module_titles.append(first_module_title)
        modules.append(first_module)
        # Append any additional module titles to the list of module titles
        # and the modules themselves to the list of modules
        for sibling in baustein.section.find_next_siblings("section"):
            module_titles.append(sibling.title.text)
            modules.append(sibling)

        log.info(
            f"We have {len(module_titles)} modules in the Baustein: {module_titles}"
        )
        for (module, module_title) in zip(modules, module_titles):
            requirements = []
            module_title_prefix = module_title.split(" ", 1)[0]

            req_title_matcher = rf"^{module_title_prefix}\.(\d+\.)*A\d+\s{{1}}"
            log.debug(req_title_matcher)
            requirements = module.find_all("title", text=re.compile(req_title_matcher))

            log.info(f"We have {len(requirements)} requirements in the module.")
            log.info([req.text for req in requirements])

            mod_requirements = get_requirements(requirements)
            df_requirements = pd.DataFrame.from_dict(
                mod_requirements,
                orient="index",
                columns=["Titel", "Verantwortung", "Kategorie"],
            )

            # Here we get the Anf. Nr. data from the index into a column
            df_requirements["Anf. Nr."] = df_requirements.index.values

            all_headers = comp_conv_config["INPUT_PREV_MAPPING_COLUMNS"]
            missing_headers = set(all_headers) - set(df_requirements.columns)

            for header in missing_headers:
                df_requirements[header] = ""

            # Final data format
            df_requirements = df_requirements[all_headers]

            # We drop the rows, where the subcriteria only contains text, that the main criteria is not there anymore,
            # overall we will compare against the ID of the main criteria.
            df_requirements.drop(
                df_requirements[
                    df_requirements["Titel"] == "Diese Anforderung ist entfallen."
                ].index,
                inplace=True,
            )
            # Create comparison with prev. edition file
            # TODO

            # Create empty template
            with pd.ExcelWriter(
                Path("output", "empty_templates", f"{module_title}.xlsx"),
                engine="xlsxwriter",
            ) as writer:
                log.info(f"Writing {module_title} data...")
                create_front_page(writer, "Deckblatt")
                write_file(writer, "Anforderungen", df_requirements)
                log.info("We returned, after writing data, closing and saving file...")
            log.info("File saved!")



def get_requirements(requirements: BeautifulSoup) -> dict[str, list[str]]:
    req_rows = {}
    for requirement in requirements:
        req_info = []
        # Here we want to prepare the data for the final dataframe format,
        # where each entry will be a row in the dataframe
        # Index will be the Anf.Nr. e.g. ISMS.1.A01 or ISMS.1.A01-1
        req_index, req_title = requirement.text.split(" ", 1)
        log.debug(f"Preparing data for {requirement.text}")
        req_info.append(req_title)
        # Here we make the single digit IDs two digit IDs
        if "." in req_index[-3]:
            req_index = f"{req_index[:-1]}0{req_index[-1:]}"
        # print(req_index)
        req_owner = get_req_owner(requirement)
        req_info.append(req_owner)

        req_category = get_req_category(requirement)
        req_info.append(req_category)

        # Here we are done with the high level Baustein
        req_rows[req_index] = req_info

        # Here we add the low level Baustein from the sentences
        req_sentences = get_req_paragraph_text(requirement)
        for ctr, sub_req in enumerate(req_sentences):
            sub_req_info = []
            sub_req_info.append(sub_req)
            sub_req_info.append(req_owner)
            sub_req_info.append(req_category)
            req_rows[f"{req_index}-{ctr+1}"] = sub_req_info
    # print(req_rows)
    return req_rows


def get_req_paragraph_text(requirement: BeautifulSoup) -> list[str]:

    # Here we get rid of the module title first, as it is included in the text
    sentences = re.split(r"\n+", requirement.parent.text.strip(), 1)[1]
    # Here we take care of the abbreviations, so they don't cause trouble when we look for sentences
    mapping = comp_conv_config["ABBREV_REPLACEMENTS"]
    for key, value in mapping.items():
        sentences = sentences.replace(key, value)

    # Here we try to separate the sentences from each other
    sentences = re.split(r"(?<=[^A-Z].[.?!])([\n\t]* |[\n\t]*)+(?=[A-Z])", sentences)
    # The regex split yields empty results, so we delete them
    while "" in sentences:
        sentences.remove("")

    # Here we get rid of unnecessary new lines to shorten the text
    sentences = [" ".join(sentence.split()) for sentence in sentences]

    # Here we reverse the dots for the abbreviations
    for key, value in mapping.items():
        sentences = [sentence.replace(value, key) for sentence in sentences]
    # print(requirement.parent.text)
    # print(sentences)
    return sentences


def get_req_category(requirement: BeautifulSoup) -> str:
    try:
        # We search for the parent node of the parent node of the requirement title node for the category
        req_category = requirement.parent.parent.find_next("title").text.split("-")[0]
    except AttributeError:
        log.error("We did not find the correct category, setting it to [DEFAULT]...")
        req_category = "[DEFAULT]"
    if " " in req_category:
        # Here we alter the category for elevated category as there is nothing to split there
        req_category = "erhöht"
    # print(req_category)
    return req_category


def get_req_owner(requirement: BeautifulSoup) -> str:
    # print(f"requirement parent: {requirement.parent.parent}")
    try:
        req_owner = (
            requirement.find_previous("informaltable")
            .tbody.find("para", text="Grundsätzlich zuständig")
            .find_next("para")
            .text
            # .parent.find_next_sibling()
            # .para.text
        )
    except AttributeError:
        log.debug(
            "We have the text in an emphasis in the XML-Format, so we search for the emphasis text instead of the paragraph."
        )
        try:
            req_owner = (
                requirement.find_previous("informaltable")
                .tbody.find("emphasis", text="Grundsätzlich zuständig")
                .find_next("emphasis")
                .text
                # .parent.parent.find_next_sibling()
                # .para.emphasis.text
            )
        except AttributeError:
            log.error(
                "We have something strange ongoing and can not find the responsible person, so we use [DEFAULT], set it manually later!"
            )
            req_owner = "[DEFAULT]"
    log.debug(f"owner: {req_owner} for {requirement.text}")
    return req_owner


def create_front_page(writer: pd.ExcelWriter, sheetname: str) -> None:
    # wb = writer.book
    # ws = writer.add_worksheet(sheetname)

    pass


def write_file(writer: pd.ExcelWriter, sheetname: str, data: pd.DataFrame) -> None:
    log.info(f"Nr. of entries: {data['Anf. Nr.'].size}")
    # Pass sheet_name by keyword; newer pandas versions deprecate passing it positionally
    data.to_excel(writer, sheet_name=sheetname, startrow=1, header=False, index=False)
    log.info("Data written into excel.")
    wb = writer.book
    ws = writer.sheets[sheetname]

    multiline_style = wb.add_format({"text_wrap": True})
    singleline_style = wb.add_format({"valign": "vcenter"})

    ws.set_column("A:A", 15, singleline_style)  # Kategorie
    ws.set_column("B:B", 5, singleline_style)  # CIA
    ws.set_column("C:C", 20, singleline_style)  # Anf. Nr.
    ws.set_column("D:D", 45, multiline_style)  # Titel
    ws.set_column("E:E", 30, multiline_style)  # Verantwortung
    ws.set_column("F:F", 50, multiline_style)  # Umsetzungsbeschreibung
    ws.set_column("G:G", 40, singleline_style)  # Referenziertes Dokument
    ws.set_column("H:H", 12, multiline_style)  # Status
    ws.set_column("I:I", 10, multiline_style)  # Risiko
    ws.set_column(
        "J:K", 25, multiline_style
    )  # Risikobeschreibung | Gefährdungszuordnung
    header_format = wb.add_format(
        {"bg_color": "#002060", "bold": True, "font_color": "white"}
    )
    for col_num, value in enumerate(data.columns.values):
        ws.write(0, col_num, value, header_format)

    ws.autofilter("A1:K1")
    ws.freeze_panes(1, 0)
    log.info("Done writing the file")


def create_threats_page(writer, sheetname: str) -> None:
    pass


def read_prev_version(filename: str) -> pd.DataFrame:
    """
    Reads in the previous version of the Grundschutz Baustein to check for any changes in the current version
    :param filename: Excelfile containing the Grundschutz Baustein, what we have to read in
    :return: Returns a dataframe with the relevant info from the file
    """
    pass


if __name__ == "__main__":
    main()

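CON is a reserved device name on Windows (see the Microsoft Forum entry), so the OS won't create a regular file named CON or CON followed by an extension. A name such as "CON.2 Datenschutz.xlsx" falls into that category, because everything after the first dot counts as the extension. Instead of creating the file, Windows sends its content to the console, which is why the raw bytes of the workbook show up in the terminal.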
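
One possible workaround is to check each file name against the reserved device names before handing it to `pd.ExcelWriter`. This is a minimal sketch using only the standard library; `is_windows_reserved`, `safe_filename` and the underscore prefix are hypothetical choices for illustration, not part of xlsxwriter or pandas. The list of names follows Microsoft's file-naming documentation.

```python
import re

# Device names that Windows reserves, with or without an extension:
# CON, PRN, AUX, NUL, COM1-COM9 and LPT1-LPT9.
WINDOWS_RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_windows_reserved(filename: str) -> bool:
    """Return True if the part before the first dot is a reserved device name."""
    stem = filename.partition(".")[0].rstrip(" ").upper()
    return stem in WINDOWS_RESERVED_NAMES


def safe_filename(title: str, prefix: str = "_") -> str:
    """Hypothetical helper: make a module title usable as a Windows file name."""
    # Replace characters Windows does not allow in file names,
    # then drop trailing spaces and dots.
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", title).rstrip(" .")
    # Prefix the name so its first component is no longer a device name.
    if is_windows_reserved(cleaned):
        cleaned = f"{prefix}{cleaned}"
    return cleaned


print(safe_filename("CON.2 Datenschutz"))             # _CON.2 Datenschutz
print(safe_filename("ISMS.1 Sicherheitsmanagement"))  # ISMS.1 Sicherheitsmanagement
```

In the script above, the path could then be built as `Path("output", "empty_templates", f"{safe_filename(module_title)}.xlsx")`. Any other renaming scheme works just as well, as long as the part before the first dot is no longer CON.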

@jmcnamara that explains why you had no issues on Mac. Again, thank you very much for your time and effort in checking for bugs, and a HUGE thank you for the great package you maintain!

