Data cleaning and formatting using awk or sed

Question

Here is an excerpt of my text file:

 namq_aux_lp   4 Last update of data 07.07.2014  t
 namq_aux_ulc   4 Last update of data 08.07.2014  
  namq_aux_gph   4 Last update of data 07.07.2014  
  prc_hicp_cann   4 Last update of data 17.07.2014 
 namq_nace10_k   4 Last update of data 02.07.2014  clas
sei_bsco_m   4 Last update of data 10.06.2014  
ei_bsin_m_r2   4 Last update of data 26.06.2014  
 lassei_bsbu_m_r2   4 Last update of data 26.06.2014  
assei_bsrt_m_r2   4 Last update of data 26.06.2014  t
 ei_bssi_m_r2   4 Last update of data 26.06.2014  t
ei_bsse_m_r2   4 Last update of data 26.06.2014  
 ei_bsci_m_r2   4 Last update of data 26.06.2014  
10    sts_trtu_m   4 Last update of data 17.07.2014 c

I'm trying to clean and format it, keeping only the first column and the date. However, as you can see, there is a stray 10 at the start of the last line. I cannot simply delete every 10, because doing so would also truncate the date for sei_bsco_m (10.06.2014).

Any help would be appreciated.

Note: the code is here: https://ideone.com/JbuRHK

Desired output would be:

namq_aux_lp     07.07.2014
namq_aux_ulc    08.07.2014 
...
assei_bsrt_m_r2 26.06.2014
...

Answer 1

Just look for the first date on each line from the 7th field on, and print that plus the field 6 positions before it:

$ awk '{
    for (i=7;i<=NF;i++)
        if ($i ~ /^([[:digit:]]{2}\.){2}[[:digit:]]{4}$/) {
            printf "%-20s%10s\n", $(i-6), $i
            next
        }
}' file
namq_aux_lp         07.07.2014
namq_aux_ulc        08.07.2014
namq_aux_gph        07.07.2014
prc_hicp_cann       17.07.2014
namq_nace10_k       02.07.2014
sei_bsco_m          10.06.2014
ei_bsin_m_r2        26.06.2014
lassei_bsbu_m_r2    26.06.2014
assei_bsrt_m_r2     26.06.2014
ei_bssi_m_r2        26.06.2014
ei_bsse_m_r2        26.06.2014
ei_bsci_m_r2        26.06.2014
sts_trtu_m          17.07.2014

The above doesn't care how many leading or trailing undesirable fields you might have, or what those fields might contain, as long as you don't have 7 leading undesirable fields with the 7th one being a date!
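As a quick check of that claim, here is a minimal sketch that runs the same loop over three lines from the question, one of them with the stray leading 10. The function name extract_name_date is only for illustration, and the date regex is written out with explicit [0-9] on the assumption that some awk builds may not support {n} intervals or [[:digit:]].

```shell
# extract_name_date: hypothetical wrapper around the loop from the answer.
# Reads the files given as arguments, or stdin when there are none.
extract_name_date() {
    awk '{
        # Scan from field 7 on for the first dd.mm.yyyy token.
        for (i = 7; i <= NF; i++)
            if ($i ~ /^[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9][0-9][0-9]$/) {
                # The name sits 6 fields before the date.
                printf "%-20s%10s\n", $(i-6), $i
                next
            }
    }' "$@"
}

# Three sample lines taken from the question.
printf '%s\n' \
    ' namq_aux_lp   4 Last update of data 07.07.2014  t' \
    'sei_bsco_m   4 Last update of data 10.06.2014  ' \
    '10    sts_trtu_m   4 Last update of data 17.07.2014 c' |
    extract_name_date
```

The last sample line has the date in field 8 rather than 7, so the name is picked from field 2 and the leading 10 is never printed.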

Alternatively, this just prints whatever comes first on each side of the string "4 Last update of data":

$ awk -F'[[:space:]]+[[:digit:]]+ Last update of data[[:space:]]+' '{
    sub(/.*[[:space:]]/,"",$1)
    sub(/[[:space:]].*$/,"",$2)
    printf "%-20s%10s\n", $1, $2
}' file
namq_aux_lp         07.07.2014
namq_aux_ulc        08.07.2014
namq_aux_gph        07.07.2014
prc_hicp_cann       17.07.2014
namq_nace10_k       02.07.2014
sei_bsco_m          10.06.2014
ei_bsin_m_r2        26.06.2014
lassei_bsbu_m_r2    26.06.2014
assei_bsrt_m_r2     26.06.2014
ei_bssi_m_r2        26.06.2014
ei_bsse_m_r2        26.06.2014
ei_bsci_m_r2        26.06.2014
sts_trtu_m          17.07.2014
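
To see why the two sub() calls are needed, this sketch (an illustration, not part of the original answer) prints what the regex field separator leaves in $1 and $2 before any trimming. The helper name show_split is made up, and [ \t] and [0-9] stand in for [[:space:]] and [[:digit:]] on the assumption that your awk may lack POSIX character classes.

```shell
# show_split: hypothetical helper that exposes the raw fields produced
# by splitting each line on "<number> Last update of data".
show_split() {
    awk -F'[ \t]+[0-9]+ Last update of data[ \t]+' '{
        printf "NF=%d left=[%s] right=[%s]\n", NF, $1, $2
    }' "$@"
}

printf '%s\n' '10    sts_trtu_m   4 Last update of data 17.07.2014 c' | show_split
# NF=2 left=[10    sts_trtu_m] right=[17.07.2014 c]
```

The left field still carries the stray 10 and the right field still carries the trailing c, which is what the sub() calls strip off.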

Answer 2

Here are some that may work:

awk '/^10/ {$1=""}1' file | column -t
namq_aux_lp       4  Last  update  of  data  07.07.2014  t
namq_aux_ulc      4  Last  update  of  data  08.07.2014
namq_aux_gph      4  Last  update  of  data  07.07.2014
prc_hicp_cann     4  Last  update  of  data  17.07.2014
namq_nace10_k     4  Last  update  of  data  02.07.2014  clas
sei_bsco_m        4  Last  update  of  data  10.06.2014
ei_bsin_m_r2      4  Last  update  of  data  26.06.2014
lassei_bsbu_m_r2  4  Last  update  of  data  26.06.2014
assei_bsrt_m_r2   4  Last  update  of  data  26.06.2014  t
ei_bssi_m_r2      4  Last  update  of  data  26.06.2014  t
ei_bsse_m_r2      4  Last  update  of  data  26.06.2014
ei_bsci_m_r2      4  Last  update  of  data  26.06.2014
sts_trtu_m        4  Last  update  of  data  17.07.2014  c

Or, to get your output:

awk '/^10/ {$1=""}1' file | awk '{print $1,$7}' OFS="\t"
namq_aux_lp     07.07.2014
namq_aux_ulc    08.07.2014
namq_aux_gph    07.07.2014
prc_hicp_cann   17.07.2014
namq_nace10_k   02.07.2014
sei_bsco_m      10.06.2014
ei_bsin_m_r2    26.06.2014
lassei_bsbu_m_r2        26.06.2014
assei_bsrt_m_r2 26.06.2014
ei_bssi_m_r2    26.06.2014
ei_bsse_m_r2    26.06.2014
ei_bsci_m_r2    26.06.2014
sts_trtu_m      17.07.2014

Or like this:

awk '/^10/ {$1=""}1' file | awk '{print $1,$7}' | column -t
namq_aux_lp       07.07.2014
namq_aux_ulc      08.07.2014
namq_aux_gph      07.07.2014
prc_hicp_cann     17.07.2014
namq_nace10_k     02.07.2014
sei_bsco_m        10.06.2014
ei_bsin_m_r2      26.06.2014
lassei_bsbu_m_r2  26.06.2014
assei_bsrt_m_r2   26.06.2014
ei_bssi_m_r2      26.06.2014
ei_bsse_m_r2      26.06.2014
ei_bsci_m_r2      26.06.2014
sts_trtu_m        17.07.2014
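
The two awk processes could probably be folded into one. Assigning to $1 rebuilds $0 but does not renumber the fields, so the sketch below adds $0 = $0 to force a re-split. The function name strip_leading_10 is hypothetical, and the pattern is tightened to require whitespace after the 10, which is my assumption about the data rather than something stated in the question.

```shell
# strip_leading_10: hypothetical single-process variant of the pipeline above.
strip_leading_10() {
    awk -v OFS='\t' '
        # Blank the stray first field, then re-split the record so the
        # name becomes $1 again.
        /^10[ \t]/ { $1 = ""; $0 = $0 }
        { print $1, $7 }
    ' "$@"
}

printf '%s\n' \
    'sei_bsco_m   4 Last update of data 10.06.2014  ' \
    '10    sts_trtu_m   4 Last update of data 17.07.2014 c' |
    strip_leading_10
```

Like the original, this only handles a leading 10; any other stray prefix would need a different pattern.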

Answer 3

You can use sed and column:

sed -nr 's|.*\b(\S+_\S+)\b.*\b([0-9]+[.][0-9]+[.][0-9]+)\b.*|\1\t\2|p' file | column -t

Output:

namq_aux_lp       07.07.2014
namq_aux_ulc      08.07.2014
namq_aux_gph      07.07.2014
prc_hicp_cann     17.07.2014
namq_nace10_k     02.07.2014
sei_bsco_m        10.06.2014
ei_bsin_m_r2      26.06.2014
lassei_bsbu_m_r2  26.06.2014
assei_bsrt_m_r2   26.06.2014
ei_bssi_m_r2      26.06.2014
ei_bsse_m_r2      26.06.2014
ei_bsci_m_r2      26.06.2014
sts_trtu_m        17.07.2014

Note:

The first column is matched as any whitespace-delimited token containing an underscore (_).
\S may not work in every sed, so you can also consider [^[:space:]] or [^ \t\r] instead.
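
Along those lines, here is a sketch of a variant that avoids \S, \b and \t altogether, since those may not be available outside GNU sed. It assumes your sed accepts -E for extended regular expressions; the function name name_and_date is made up for illustration, and the output uses a single space as separator (pipe it through column -t if you want aligned columns).

```shell
# name_and_date: hypothetical helper using only POSIX bracket expressions.
# Group 2 is the underscore-containing token, group 3 the dd.mm.yyyy date.
name_and_date() {
    sed -nE 's/(^|.*[[:space:]])([^[:space:]]+_[^[:space:]]+)[[:space:]].*[[:space:]]([0-9]{2}\.[0-9]{2}\.[0-9]{4})([[:space:]].*)?$/\2 \3/p' "$@"
}

printf '%s\n' \
    ' namq_aux_lp   4 Last update of data 07.07.2014  t' \
    'sei_bsco_m   4 Last update of data 10.06.2014  ' \
    '10    sts_trtu_m   4 Last update of data 17.07.2014 c' |
    name_and_date
```

The (^|.*[[:space:]]) group plays the role of the leading \b: the name must start either at the beginning of the line or right after whitespace.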

Answer 4

Yet another solution could be the following:
- removes the two leading digits, if any
- removes the first space on each line
- prints columns 1 and 7 with a tab as OFS (Output Field Separator)

$ sed 's/^[0-9][0-9]//' telecharge.txt |  sed 's/ //' | awk '{print $1,$7}' OFS='\t'
namq_aux_lp     07.07.2014
namq_aux_ulc    08.07.2014
namq_aux_gph    07.07.2014
prc_hicp_cann   17.07.2014
namq_nace10_k   02.07.2014
sei_bsco_m      10.06.2014
ei_bsin_m_r2    26.06.2014
lassei_bsbu_m_r2        26.06.2014
assei_bsrt_m_r2 26.06.2014
ei_bssi_m_r2    26.06.2014
ei_bsse_m_r2    26.06.2014
ei_bsci_m_r2    26.06.2014
sts_trtu_m      17.07.2014

Answer 1: score 2, accepted, 2014-07-22 12:08:12
Answer 2: score 1, 2014-07-22 11:12:08
Answer 3: score 1, 2014-07-22 11:19:34
Answer 4: score 0, 2014-07-22 11:28:58
