How to de-identify/redact Word documents using the GCP DLP API in Python

Question

I am using GCP's DLP API in Python to redact an image as follows, and it works fine:

def redact_image_all_text(
    project,
    filename,
    output_filename,
):
    """Uses the Data Loss Prevention API to redact all text in an image.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        filename: The path to the file to inspect.
        output_filename: The path to which the redacted image will be written.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct the image_redaction_configs, indicating to DLP that all text in
    # the input image should be redacted.
    image_redaction_configs = [{"redact_all_text": True}]

    # Construct the byte_item, containing the file's byte data.
    with open(filename, mode="rb") as f:
        byte_item = {"type_": google.cloud.dlp_v2.FileType.IMAGE, "data": f.read()}

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Call the API.
    response = dlp.redact_image(
        request={
            "parent": parent,
            "image_redaction_configs": image_redaction_configs,
            "byte_item": byte_item,
        }
    )

    # Write out the results.
    with open(output_filename, mode="wb") as f:
        f.write(response.redacted_image)

    print(
        "Wrote {byte_count} to {filename}".format(
            byte_count=len(response.redacted_image), filename=output_filename
        )
    )

Now I want to apply this to a Word document file. I have seen some examples that use dlp.deidentify_content, but it seems to accept only text input.

    # Call the API.
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "item": contentItem,
        }
    )

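So I would like to know whether Cloud DLP natively supports redacting/de-identifying Word documents. If so, how do I do it? If not, is there an elegant way to apply DLP redaction to a Word document?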

Answer 1

The others are right: although inspect_content does support inspecting docx files (not doc), de-identify does not.

If you want to split the document into paragraphs, using a record object and passing each paragraph as a row can reduce traffic.
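A minimal sketch of that paragraph-as-row idea, under several assumptions: the function names, the column name "paragraph", and the chosen info types are all hypothetical; the paragraphs are pulled out of the .docx with the standard library only (a .docx is a zip archive containing word/document.xml), which ignores formatting; and the de-identify config uses record_transformations with replace_with_info_type_config as one possible transformation. Writing the redacted text back into the Word file is not covered here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the plain text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def build_table_item(paragraphs, column_name="paragraph"):
    """Build a DLP content item: a one-column table, one row per paragraph."""
    return {
        "table": {
            "headers": [{"name": column_name}],
            "rows": [{"values": [{"string_value": p}]} for p in paragraphs],
        }
    }


def deidentify_paragraphs(project, paragraphs,
                          info_types=("PERSON_NAME", "EMAIL_ADDRESS"),
                          column_name="paragraph"):
    """Send all paragraphs in a single deidentify_content call."""
    import google.cloud.dlp  # needs the google-cloud-dlp package

    dlp = google.cloud.dlp_v2.DlpServiceClient()
    # Assumed config: replace each finding with the name of its info type.
    deidentify_config = {
        "record_transformations": {
            "field_transformations": [
                {
                    "fields": [{"name": column_name}],
                    "info_type_transformations": {
                        "transformations": [
                            {"primitive_transformation":
                                {"replace_with_info_type_config": {}}}
                        ]
                    },
                }
            ]
        }
    }
    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{project}",
            "inspect_config": {
                "info_types": [{"name": name} for name in info_types]
            },
            "deidentify_config": deidentify_config,
            "item": build_table_item(paragraphs, column_name),
        }
    )
    # Rows come back in the same order, so index i maps to paragraph i.
    return [row.values[0].string_value for row in response.item.table.rows]
```

Calling deidentify_paragraphs("my-project", read_docx_paragraphs("input.docx")) would then make one request for the whole document instead of one per paragraph; the project id and file name here are placeholders. Very large documents may still need to be split into several requests to stay within the API's request size limits.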

Solution 1
0 2022-08-19 16:14:31