Upload only unique values to BigQuery

Question

I am trying to load a CSV file into BigQuery. My CSV file could contain duplicate rows. Is there a way to upload only the unique rows from the CSV file using the bq load command from the CLI?

For example, if my CSV file contains the values below:

emp_id, emp_name
1,a
2,b
3,c
1,a
4,d
5,e
3,c

I want only the unique rows to be uploaded to the BigQuery table:

emp_id, emp_name
1,a
2,b
3,c
4,d
5,e

Currently I remove the duplicate rows manually before uploading to the BigQuery table. I would like to know whether there is a switch or parameter I can use to upload only unique rows with the "bq load" command from Cloud Shell.

Answer 1

bq load cannot apply any transformation to your data. What you can do is preprocess the file before loading it with bq load. You can use the command below in Cloud Shell to remove duplicates:

head -n 1 test.csv > new_test.csv && tail -n +2 test.csv | sort -u >> new_test.csv

Assuming test.csv contains your data, the command above first writes the header line to new_test.csv. It then takes every line after the header, sorts those lines while dropping duplicates (sort -u), and appends the result to new_test.csv. The header is kept out of the sort so that it stays on the first line no matter how it compares with the data rows, and no temporary files are needed.
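If the original row order matters, a common alternative is to deduplicate with awk instead of sorting. The following is a minimal sketch, assuming awk is available in your shell and reusing the file names test.csv and new_test.csv from above; it recreates the sample data from the question so it can be run as-is.

```shell
# Recreate the sample file from the question.
cat > test.csv <<'EOF'
emp_id, emp_name
1,a
2,b
3,c
1,a
4,d
5,e
3,c
EOF

# Always print line 1 (the header); print any later line only
# the first time its exact text is seen.
awk 'NR == 1 || !seen[$0]++' test.csv > new_test.csv

cat new_test.csv
```

For the sample data this leaves the header followed by the rows 1,a through 5,e in their original order. Note that awk keeps every distinct row in memory, so for very large files sort -u may be the safer choice.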

You can now use new_test.csv as the input to bq load.
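For illustration, a load command for the cleaned file might look like the sketch below. The table name mydataset.employees and the schema are assumptions chosen to match the sample data, not values from the question; replace them with your own. The script only prints the command so it can be reviewed before it is run.

```shell
# Hypothetical destination table and schema; adjust to your project.
TABLE="mydataset.employees"
SCHEMA="emp_id:INTEGER,emp_name:STRING"

# --source_format=CSV marks the input as CSV;
# --skip_leading_rows=1 makes bq load ignore the header line.
LOAD_CMD="bq load --source_format=CSV --skip_leading_rows=1 $TABLE ./new_test.csv $SCHEMA"

# Print the command for review; copy and run it once it looks right.
echo "$LOAD_CMD"
```

Because the header is skipped during the load, the space in "emp_id, emp_name" does not affect the column names, which come from the schema argument.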

But if you have bigger data than this, I recommend using a programming language to implement the logic that cleans your data.

Answer 1
0 accepted 2022-11-29 20:07:16
