How to de-identify/redact Word files using GCP's DLP API in Python
I am using GCP's DLP API in Python to redact images in the following way, and it works fine:
def redact_image_all_text(
    project,
    filename,
    output_filename,
):
    """Uses the Data Loss Prevention API to redact all text in an image.

    Args:
        project: The Google Cloud project id to use as a parent resource.
        filename: The path to the file to inspect.
        output_filename: The path to which the redacted image will be written.

    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct the image_redaction_configs, indicating to DLP that all text in
    # the input image should be redacted.
    image_redaction_configs = [{"redact_all_text": True}]

    # Construct the byte_item, containing the file's byte data.
    with open(filename, mode="rb") as f:
        byte_item = {"type_": google.cloud.dlp_v2.FileType.IMAGE, "data": f.read()}

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Call the API.
    response = dlp.redact_image(
        request={
            "parent": parent,
            "image_redaction_configs": image_redaction_configs,
            "byte_item": byte_item,
        }
    )

    # Write out the results.
    with open(output_filename, mode="wb") as f:
        f.write(response.redacted_image)

    print(
        "Wrote {byte_count} to {filename}".format(
            byte_count=len(response.redacted_image), filename=output_filename
        )
    )
Now I want to apply this to Word document files. I have seen some examples that use dlp.deidentify_content, but it seems to accept only text input.
# Call the API
response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "item": contentItem,
    }
)
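So I would like to know whether Cloud DLP natively supports redacting/de-identifying Word documents. If so, how do I do it? If not, is there an elegant way to implement DLP redaction on Word documents?
The others are right: although inspect_content does support inspecting docx files (not doc), de-identify does not.
If you want to split the document into paragraphs, using a record (table) object and passing each paragraph as one row can reduce traffic, since many paragraphs then travel in a single request.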
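To illustrate the inspection side of that statement, here is a minimal sketch, assuming the google-cloud-dlp client library is installed and credentials are configured. The helper names build_inspect_request and inspect_docx, the default info types, and the file path are illustrative and not from the original post. It only reports findings; it does not produce a redacted document.

```python
def build_inspect_request(project, docx_bytes, bytes_type, info_types):
    """Build an inspect_content request dict for a .docx payload (no API call)."""
    return {
        "parent": f"projects/{project}",
        "inspect_config": {
            "info_types": [{"name": name} for name in info_types],
            # Ask DLP to return the matched text with each finding.
            "include_quote": True,
        },
        "item": {"byte_item": {"type_": bytes_type, "data": docx_bytes}},
    }


def inspect_docx(project, filename, info_types=("PERSON_NAME", "EMAIL_ADDRESS")):
    """Inspect a .docx file with Cloud DLP and print each finding."""
    # Imported lazily so build_inspect_request works without the library.
    import google.cloud.dlp_v2 as dlp_v2

    dlp = dlp_v2.DlpServiceClient()
    with open(filename, mode="rb") as f:
        data = f.read()

    request = build_inspect_request(
        project,
        data,
        dlp_v2.ByteContentItem.BytesType.WORD_DOCUMENT,
        info_types,
    )
    response = dlp.inspect_content(request=request)

    for finding in response.result.findings:
        print(finding.info_type.name, finding.quote, finding.likelihood)
```

A call such as inspect_docx("my-project", "report.docx") would print one line per finding; both arguments are placeholders.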
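The paragraph-per-row idea could be sketched roughly as follows. This is one possible workaround, not something the answer spells out: it assumes the third-party python-docx package for reading and writing the .docx file, uses replace_with_info_type_config as an example transformation, and all helper names, the column name "paragraph", and the default info types are hypothetical.

```python
def paragraphs_to_table(paragraphs, column="paragraph"):
    """Pack paragraph strings into a one-column DLP table ContentItem dict."""
    return {
        "table": {
            "headers": [{"name": column}],
            "rows": [{"values": [{"string_value": text}]} for text in paragraphs],
        }
    }


def build_deidentify_config(column="paragraph"):
    """Replace every finding in the given column with its info type name."""
    return {
        "record_transformations": {
            "field_transformations": [
                {
                    "fields": [{"name": column}],
                    "info_type_transformations": {
                        "transformations": [
                            {
                                "primitive_transformation": {
                                    "replace_with_info_type_config": {}
                                }
                            }
                        ]
                    },
                }
            ]
        }
    }


def deidentify_docx(project, filename, output_filename,
                    info_types=("PERSON_NAME", "EMAIL_ADDRESS")):
    """De-identify the body paragraphs of a .docx file in one DLP request."""
    # Lazy imports: python-docx and google-cloud-dlp are assumed to be installed.
    import docx
    import google.cloud.dlp_v2 as dlp_v2

    document = docx.Document(filename)
    # Only send paragraphs that contain text; empty ones are left untouched.
    targets = [p for p in document.paragraphs if p.text.strip()]

    dlp = dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{project}",
            "inspect_config": {"info_types": [{"name": n} for n in info_types]},
            "deidentify_config": build_deidentify_config(),
            "item": paragraphs_to_table([p.text for p in targets]),
        }
    )

    # Rows are expected back in the order they were sent; write each one back.
    for paragraph, row in zip(targets, response.item.table.rows):
        paragraph.text = row.values[0].string_value
    document.save(output_filename)
```

Some caveats with this sketch: assigning paragraph.text in python-docx discards run-level formatting inside that paragraph, text in tables, headers, and footers is not covered, and large documents may need to be split across several requests because DLP limits request size.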