awk function to modify multiple columns in a CSV using regular expressions

Question

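Goal:

I need to modify URLs so that only the number they contain (latitude/longitude/ID) remains. In a .csv file there is a column with a "certain Titel in Header"; I need to find that column first. In the column found under that Titel, I need to strip the beginning and the end of each URL so that only the number, which is part of the URL, is left. This has to work on differently structured CSV files, which have several columns with different titles and different URL patterns. Is there a way to write this as a function in bash using awk?

This is what I tried, but it does not work because I am missing a lot of the necessary knowledge: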

#!/bin/bash
CSVFILE=$(find ./aufzubereiten -type f ! -name ".DS_Store") #only one file in this folder.
FILENAME=$(basename "$CSVFILE")

function modify_col() {
    COL= how to find the right column in the csv?
    awk -F',' OFS="," -v pat='"$PAT"' '{sub(/pat/,X,$${COL})} 1' "$CSVFILE" > "$CSVFILE".tmp1 && mv "$CSVFILE".tmp1 "$CSVFILE"
}

COLTITEL="certain Titel in Header"
PAT='/Text1234Text[0-9]{5,8}Text1.html'
PATNEW=''
modify_col

COLTITEL="certain Titel2 in Header"
PAT='/Text2234Text[0-9]{5,8}Text2.html'
PATNEW=''
modify_col

COLTITEL="certain Titel3 in Header"
PAT='/Text3234Text[0-9]{5,8}Text3.html'
PATNEW=''
modify_col

Sample file:

header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html
,,/Text2234Text7846642Text.html,/Text2234Text8974342Text2.html,/Text2234Text823242Text3.html
,,/Text2234Text7846643Text.html,/Text2234Text8974343Text2.html,/Text2234Text823243Text3.html

The result should be:

header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243

Thanks for your ideas :)

Answer 1

Could you please try the following, written and tested with the samples shown.

awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print
  next
}
{
  for(i=1;i<=NF;i++){
    sub(/^\/Text[0-9]+Text/,"",$i)
    sub(/Text.*/,"",$i)
  }
}
1
'  Input_file

Explanation: the same code with a detailed explanation of each step.

awk '
BEGIN{                                 ##Starting BEGIN section of code here.
  FS=OFS=","                           ##Setting FS and OFS to comma here.
}
FNR==1{                                ##Checking condition if FNR==1 then do following.
  print                                ##Printing the current line here.
  next                                 ##next will skip all further statements from here.
}
{
  for(i=1;i<=NF;i++){                  ##Starting a for loop to traverse into all fields here.
    sub(/^\/Text[0-9]+Text/,"",$i)     ##Substituting from starting Text digits Text with NULL in current field.
    sub(/Text.*/,"",$i)                ##Substituting everything from Text to till last of field value with NULL in current field.
  }
}
1                                      ##1 will print edited/non-edited line here.
'  Input_file                          ##Mentioning Input_file name here.
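
The loop above edits every field. If only specific columns should be touched, as the question asks, the column can be looked up by its header text first. The following is a minimal sketch and not part of the original answer: the name modify_col is borrowed from the question, while its four arguments, the prefix/suffix regexes and the temporary demo file are assumptions made for illustration. It also assumes a simple CSV without quoted fields or embedded commas.

```shell
#!/bin/bash
# modify_col FILE TITLE PREFIX_RE SUFFIX_RE   (hypothetical helper, not from the answer)
# Finds the column whose header equals TITLE (surrounding blanks ignored) and
# removes PREFIX_RE from the start and SUFFIX_RE from the end of every cell in it.
# Assumes a simple CSV and no backslashes in the arguments (awk -v processes escapes).
modify_col() {
    awk -v title="$2" -v pre="$3" -v suf="$4" '
    BEGIN { FS = OFS = "," }
    FNR == 1 {                               # header row: locate the column
        for (i = 1; i <= NF; i++) {
            h = $i
            gsub(/^[ \t]+|[ \t]+$/, "", h)   # trim blanks around the title
            if (h == title) col = i
        }
        print
        next
    }
    col {                                    # data rows, only if the title was found
        sub("^" pre, "", $col)
        sub(suf "$", "", $col)
    }
    { print }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo on a copy of the sample data (temporary file, for illustration only).
csv=$(mktemp)
cat > "$csv" <<'EOF'
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html
EOF

modify_col "$csv" "certain Titel in Header"  '/Text[0-9]+Text' 'Text[.]html'
modify_col "$csv" "certain Titel2 in Header" '/Text[0-9]+Text' 'Text2[.]html'
modify_col "$csv" "certain Titel3 in Header" '/Text[0-9]+Text' 'Text3[.]html'
cat "$csv"
```

If a title is not present in the header, the file is rewritten unchanged, so the same calls can be run against differently structured files.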

Answer 2

Assumptions:

the data looks exactly like the sample in the question, i.e. the literal string Text appears 3 times in each HTML file name

Sample data:

$ cat text.dat
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html
,,/Text2234Text7846642Text.html,/Text2234Text8974342Text2.html,/Text2234Text823242Text3.html
,,/Text2234Text7846643Text.html,/Text2234Text8974343Text2.html,/Text2234Text823243Text3.html

One awk solution:

$ awk -F"Text" '
BEGIN  { OFS="," }
FNR==1 { print ; next }
       { print ",,"$3,$6,$9 }
' text.dat

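Where:

-F"Text" - use Text as the input field separator
OFS="," - set the output field separator
FNR==1 {print ; next} - for line 1 (the header row), print the whole line and skip to the next line in the file
print ",,"$3,$6,$9 - print 2 commas, followed by fields 3, 6 and 9 (separated by OFS=",")

Result: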
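
To see why fields 3, 6 and 9 are the ones to print, it can help to dump every field that this separator produces for one data row. A small sketch; the variable name fields is only used for illustration:

```shell
# Split one sample row on the literal string "Text" and number the pieces.
fields=$(printf '%s\n' ',,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html' |
    awk -F"Text" '{ for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }')
printf '%s\n' "$fields"
# 1: [,,/]
# 2: [2234]
# 3: [7846641]
# 4: [.html,/]
# 5: [2234]
# 6: [8974341]
# 7: [2.html,/]
# 8: [2234]
# 9: [823241]
# 10: [3.html]
```

The commas end up inside fields 1, 4 and 7, which is why the print statement has to add the two leading commas itself.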

header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243

Answer 3

Here is a generic solution which finds numbers with five or more digits and removes everything else.

awk -F , 'BEGIN { OFS=FS }
  FNR>1{
    for(i=1;i<=NF;++i) {
        gsub(/(^|[^0-9])[0-9]{1,4}([^0-9]|$)/, "", $i);
        gsub(/[^0-9]+/, "", $i);
    }
  } 1' filename

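If you only have a single file name, there is probably no reason to use find. If you don't know the file name but there is only one file in the current directory, * will expand to that file name.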
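
A rough sketch of that idea, assuming bash's default globbing (dotglob off), where * does not match hidden names such as .DS_Store. The helper name only_file and the temporary demo directory are made up for this example; in the question's script the directory would be ./aufzubereiten.

```shell
# only_file DIR: print the single non-hidden regular file in DIR, or fail.
only_file() {
    set -- "$1"/*                      # positional parameters become the glob matches
    [ "$#" -eq 1 ] && [ -f "$1" ] && printf '%s\n' "$1"
}

# Demo directory standing in for ./aufzubereiten
demo=$(mktemp -d)
touch "$demo/.DS_Store" "$demo/data.csv"

CSVFILE=$(only_file "$demo")
FILENAME=$(basename "$CSVFILE")
echo "$FILENAME"                       # data.csv
```

The function returns a non-zero status when the directory is empty or holds more than one visible entry, which is easier to detect than find silently returning several paths.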

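The awk script above is slightly brittle: it won't do the right thing if two numbers in a field are separated by a single non-digit character. Fixing that is not hard, but I'm lazy and your requirements are somewhat vague.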
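
One possible way to avoid that weakness, sketched here as an alternative rather than as the answer's own fix, is to walk through the digit runs with match() and keep the first one that has at least five digits. The helper name extract_ids and the test row are assumptions; unlike the script above, fields without such a run are left unchanged instead of being emptied.

```shell
# extract_ids: hypothetical helper; reads CSV from stdin or from the files given as arguments.
extract_ids() {
    awk -F, 'BEGIN { OFS = FS }
    FNR > 1 {
        for (i = 1; i <= NF; ++i) {
            rest = $i
            while (match(rest, /[0-9]+/)) {           # next run of digits
                if (RLENGTH >= 5) {                   # long enough: keep only this run
                    $i = substr(rest, RSTART, RLENGTH)
                    break
                }
                rest = substr(rest, RSTART + RLENGTH) # too short: look further right
            }
        }
    } 1' "$@"
}

# First field has two short numbers separated by a single character.
out=$(printf '%s\n' 'h1,h2' 'x12y34z567890,/Text2234Text7846641Text.html' | extract_ids)
printf '%s\n' "$out"
# h1,h2
# 567890,7846641
```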

Answer 4

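I know the OP asked whether there is a way to do this with awk, but from the context provided I understand that any solution that can run inside a bash script would solve the OP's problem.

For this case, I believe sed is a more elegant solution: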

sed 's/[^,]\+[^0-9]\([0-9][0-9]\+\)[^,]\+/\1/g' data.csv

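It outputs any number with 2 or more digits close to the end of a field. The extended-regex version of the sed command may make this easier to visualize: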

sed -E 's/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' data.csv

Output:

rvb@ubuntu:~$ sed -E 's/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' data.csv
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243
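
The command above also runs on the header row, which is harmless for the sample but would rewrite a title that happens to contain two or more digits in the middle. A hedged variant restricts the substitution to lines 2 and later with a sed address; the header lat2020deg and the temporary file below are invented for this demonstration.

```shell
csv=$(mktemp)
cat > "$csv" <<'EOF'
lat2020deg, header2
/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html
EOF

# "2,$" limits the s command to the data rows; line 1 is printed untouched.
result=$(sed -E '2,$ s/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' "$csv")
printf '%s\n' "$result"
# lat2020deg, header2
# 7846641,8974341
```

Without the address, the first header field in this demo would be reduced to 2020.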

4 answers

Answer 1
2 accepted 2020-01-24 11:52:13

Answer 2
2 2020-01-24 12:35:19

Answer 3
1 2020-01-24 12:47:03

Answer 4
0 2020-01-24 19:16:35
