
Bash: Read in file, edit line, output to new file

I am new to Linux and new to scripting. I am working in a Linux environment using bash. I need to do the following things:

1. read a txt file line by line
2. delete the first line
3. remove the middle part of each line after the first
4. copy the changes to a new txt file

Each line after the first has three sections: the first always ends in .pdf and the third always begins with R0, but the middle section has no consistency.

Example of 2 lines in the file:

R01234567_High Transcript_01234567.pdf  High School Transcript  R01234567
R01891023_Application_01891023127.pdf   Application R01891023

Here is what I have so far. I'm just reading the file, printing it to screen and copying it to another file.

#! /bin/bash
cd /usr/local/bin;
#echo "list of files:";
#ls;
for index in *.txt;
do echo "file: ${index}";
echo "reading..."
exec<${index}
value=0
while read line
do
   #value='expr ${value} +1';
   echo ${line};
done
echo "read done for ${index}";
cp ${index} /usr/local/bin/test2;
echo "file ${index} moved to test2"; 
done 

So my question is, how can I delete the middle bit of each line, after .pdf but before the R0...?

Using sed:

sed 's/^\(.*\.pdf\).*\(R0.*\)$/\1 \2/g' file.txt 

This will remove everything between .pdf and R0 and replace it with a single space.

Result for your example:

R01234567_High Transcript_01234567.pdf R01234567
R01891023_Application_01891023127.pdf R01891023
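The command above only prints to the screen and keeps the first line. A minimal sketch of the full task (drop the first line, strip the middle part, write to a new file) could look like this. The header text, the temporary directory and the file names input.txt and output.txt are assumptions made up for the illustration.

```shell
# Hypothetical sample input: a made-up header plus the two lines from the question.
workdir=$(mktemp -d)
infile="$workdir/input.txt"
outfile="$workdir/output.txt"

printf '%s\n' \
  'FileName  Description  ID' \
  'R01234567_High Transcript_01234567.pdf  High School Transcript  R01234567' \
  'R01891023_Application_01891023127.pdf   Application R01891023' > "$infile"

# 1d deletes the first line; the substitution keeps the part ending in .pdf
# and the trailing R0... part, joined by a single space.
sed -e '1d' -e 's/^\(.*\.pdf\).*\(R0.*\)$/\1 \2/' "$infile" > "$outfile"

cat "$outfile"
```

The original file is left untouched; only output.txt receives the edited lines.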

Updated answer assuming tab delimiter

Since there is a tab delimiter, this is a cinch for awk. Borrowing from my own originally deleted answer and @geek1011's deleted answer:

awk -F"\t" '{print $1, $NF}' infile.txt

Here awk splits each record in your file by tab, then prints the first field $1 and the last field $NF. NF is the built-in awk variable for the record's number of fields; prepending a dollar sign turns it into "the value of the last field in the record".
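As a rough sketch of how this could cover the rest of the question, the variant below also skips the first line with NR > 1 and redirects the result to a new file. The header names, the temporary directory and the choice of a tab as output separator (OFS) are assumptions, not something stated in the question.

```shell
# Hypothetical tab-separated sample: a made-up header, then the two example lines.
workdir=$(mktemp -d)
infile="$workdir/infile.txt"
outfile="$workdir/outfile.txt"

printf '%s\t%s\t%s\n' \
  'File' 'Title' 'ID' \
  'R01234567_High Transcript_01234567.pdf' 'High School Transcript' 'R01234567' \
  'R01891023_Application_01891023127.pdf' 'Application' 'R01891023' > "$infile"

# NR > 1 skips the first record; OFS sets the separator used by print.
awk -F'\t' -v OFS='\t' 'NR > 1 { print $1, $NF }' "$infile" > "$outfile"

cat "$outfile"
```

Dropping -v OFS='\t' gives a single space between the two fields instead, as in the command above.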


Original answer assuming space delimiter

Leaving this here in case someone has space-delimited nonsense like I originally assumed.

You can use awk instead of using bash to read through the file:

awk 'NR>1{firstRec=""; for(i=1; i<=NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt

awk reads files line by line and processes each record it comes across. Fields are delimited automatically by white space. The first field is $1, the second is $2 and so on. awk has built-in variables; here we use NF, which is the number of fields contained in the record, and NR, which is the record number currently being processed.

This script does the following:

  1. If the record number is greater than 1 (not the header) then
  2. Loop through each field (separated by white space here) until we find a field that has "pdf" in it ($i!~/pdf/). Store everything we find up until that field in a variable called firstRec, separated by a space (firstRec=firstRec" "$i).
  3. Print out firstRec, then print out whatever field we stopped iterating on (the one that contains "pdf"), which is $i, and finally print out the last field in the record, which is $NF (print firstRec,$i,$NF).

You can direct this to another file:

awk 'NR>1{firstRec=""; for(i=1; i<=NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt > outfile.txt

sed may be a cleaner way to go here: if the pdf file name contains runs of more than one space, this awk approach collapses each run into a single space.

The Hard, Unreliable Way

It's verbose, and much less efficient than what would make sense if we knew that the fields were separated by literal tabs, but the following loop does this processing in pure native bash with no external tools:

shopt -s extglob
while IFS= read -r line; do
  [[ $line = *".pdf"*R0* ]] || continue # ignore lines that don't fit our format

  filename=${line%%.pdf*}.pdf
  id=R0${line##*R0}
  printf '%s\t%s\n' "$filename" "$id"
done

${line%%.pdf*} returns everything before the first .pdf in the line; ${line%%.pdf*}.pdf then appends .pdf to that content.

Similarly, ${line##*R0} expands to everything after the last R0; R0${line##*R0} thus expands to the final field starting with R0 (presuming that R0 does not occur again inside that final field).
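To see the two expansions in isolation, here is a small sketch run against the first example line from the question. The variable names before_pdf and after_r0 are made up for the illustration.

```shell
line='R01234567_High Transcript_01234567.pdf  High School Transcript  R01234567'

# %% removes the longest suffix matching ".pdf*", leaving the text before the first .pdf.
before_pdf=${line%%.pdf*}
# ## removes the longest prefix matching "*R0", leaving the text after the last R0.
after_r0=${line##*R0}

# Put back the pieces that the patterns consumed.
filename=${before_pdf}.pdf
id=R0${after_r0}

printf '%s\n' "$before_pdf" "$after_r0" "$filename" "$id"
```

Note that the first field also starts with R0; the ## form still picks the last occurrence, which is why the id comes from the end of the line.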


The Easy Way (Using Tab Delimiters)

If cat -t file (on macOS) or cat -A file (on Linux) shows ^I sequences between the fields (but not within the fields), use the following instead:

while IFS=$'\t' read -r filename title id; do
  printf '%s\t%s\n' "$filename" "$id"
done

This reads the three tab-separated fields into variables named filename, title and id, and emits the filename and id fields.
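The loop above reads standard input and does not say where the data comes from or goes. One possible way to wire it up, sketched under assumptions: the header names and file names are hypothetical, the first line is consumed by an extra read before the loop, and a tab variable stands in for $'\t' so the sketch also runs under plain sh.

```shell
# Hypothetical tab-separated sample: a made-up header, then the two example lines.
workdir=$(mktemp -d)
infile="$workdir/infile.txt"
outfile="$workdir/outfile.txt"

printf '%s\t%s\t%s\n' \
  'File' 'Title' 'ID' \
  'R01234567_High Transcript_01234567.pdf' 'High School Transcript' 'R01234567' \
  'R01891023_Application_01891023127.pdf' 'Application' 'R01891023' > "$infile"

tab=$(printf '\t')   # a literal tab character

{
  read -r header     # consume the first line so the loop never sees it
  while IFS="$tab" read -r filename title id; do
    printf '%s\t%s\n' "$filename" "$id"
  done
} < "$infile" > "$outfile"

cat "$outfile"
```

Both redirections sit on the closing brace, so the header read and the loop share the same input file and everything printed lands in the new file.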

You can use sed on each line like this:

line="R01234567_High Transcript_01234567.pdf  High School Transcript  R01234567"
echo "$line" | sed 's/\.pdf.*R0/.pdf R0/'
# output 
R01234567_High Transcript_01234567.pdf R01234567

This replaces anything between .pdf and R0 with a single space. It doesn't deal with some edge cases, but it is simple and clear.
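Tying this back to the loop over *.txt files in the question, a hedged sketch might apply the same substitution to every file and write each result into a separate directory. The directory name test2 comes from the question; the temporary directory, the sample file name batch1.txt and its header are assumptions.

```shell
# Hypothetical working directory with one sample file in it.
workdir=$(mktemp -d)
mkdir "$workdir/test2"

printf '%s\n' \
  'FileName  Description  ID' \
  'R01234567_High Transcript_01234567.pdf  High School Transcript  R01234567' \
  'R01891023_Application_01891023127.pdf   Application R01891023' > "$workdir/batch1.txt"

for f in "$workdir"/*.txt; do
  [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
  # 1d drops the first line; the substitution collapses the middle part.
  sed -e '1d' -e 's/\.pdf.*R0/.pdf R0/' "$f" > "$workdir/test2/$(basename "$f")"
done

cat "$workdir/test2/batch1.txt"
```

Writing to a different directory avoids reading and overwriting the same file in one command.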

