Bash: Read in file, edit line, output to new file
I am new to Linux and new to scripting. I am working in a Linux environment using bash. I need to do the following things:
1. read a txt file line by line
2. delete the first line
3. remove the middle part of each line after the first
4. copy the changes to a new txt file
Each line after the first has three sections: the first always ends in .pdf and the third always begins with R0, but the middle section has no consistency.
Example of 2 lines in the file:
R01234567_High Transcript_01234567.pdf High School Transcript R01234567
R01891023_Application_01891023127.pdf Application R01891023
Here is what I have so far. I'm just reading the file, printing it to the screen and copying it to another file.
#! /bin/bash
cd /usr/local/bin;
#echo "list of files:";
#ls;
for index in *.txt;
do
    echo "file: ${index}";
    echo "reading..."
    exec<${index}
    value=0
    while read line
    do
        #value='expr ${value} +1';
        echo ${line};
    done
    echo "read done for ${index}";
    cp ${index} /usr/local/bin/test2;
    echo "file ${index} moved to test2";
done
So my question is, how can I delete the middle bit of each line, after .pdf but before the R0...?
Using sed:
sed 's/^\(.*\.pdf\).*\(R0.*\)$/\1 \2/g' file.txt
This will remove everything between .pdf and R0 and replace it with a single space.
Result for your example:
R01234567_High Transcript_01234567.pdf R01234567
R01891023_Application_01891023127.pdf R01891023
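The question also asks to drop the first line and write the result to a new file. A minimal sketch of how the same substitution could cover both, assuming a made-up header line and the hypothetical file names file.txt and new.txt inside a temporary directory:

```shell
# Hypothetical sample input: a header line followed by the two example records.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'filename title id' \
  'R01234567_High Transcript_01234567.pdf High School Transcript R01234567' \
  'R01891023_Application_01891023127.pdf Application R01891023' \
  > "$tmpdir/file.txt"

# 1d deletes the first line; the substitution then trims the middle of every
# remaining line. The result is redirected to a new file.
sed -e '1d' -e 's/^\(.*\.pdf\).*\(R0.*\)$/\1 \2/' "$tmpdir/file.txt" > "$tmpdir/new.txt"

cat "$tmpdir/new.txt"
```

The original file is left untouched; only new.txt receives the edited lines.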
Since there is a tab delimiter, this is a cinch for awk. Borrowing from my originally deleted answer and @geek1011's deleted answer:
awk -F"\t" '{print $1, $NF}' infile.txt
Here awk splits each record in your file by tab, then prints the first field, $1, and the last field, $NF. NF is the built-in awk variable for the record's number of fields; prepending a dollar sign makes it mean "the value of the last field in the record".
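To also skip the first line and save the result, the same command can be given an NR>1 condition and an output redirect. This is only a sketch: the header line and the file names infile.txt and outfile.txt are assumptions for illustration, and the sample data is written with real tab characters between the fields.

```shell
# Hypothetical tab-separated sample: one header line, then the two example records.
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  'filename' 'title' 'id' \
  'R01234567_High Transcript_01234567.pdf' 'High School Transcript' 'R01234567' \
  'R01891023_Application_01891023127.pdf' 'Application' 'R01891023' \
  > "$tmpdir/infile.txt"

# NR > 1 skips the first record; $1 and $NF are the first and last tab-separated fields.
awk -F'\t' 'NR > 1 { print $1, $NF }' "$tmpdir/infile.txt" > "$tmpdir/outfile.txt"

cat "$tmpdir/outfile.txt"
```

Because the field separator is a tab, the space inside "R01234567_High Transcript_01234567.pdf" stays part of the first field.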
Leaving this here in case someone has space-delimited nonsense like I originally assumed.
You can use awk instead of using bash to read through the file:
awk 'NR>1{firstRec=""; for(i=1; i<NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt
awk reads files line by line and processes each record it comes across. Fields are delimited automatically by white space. The first field is $1, the second is $2 and so on. awk has built-in variables; here we use NF, which is the number of fields contained in the record, and NR, which is the record number currently being processed.
This script does the following:
1. If the record number is greater than 1 (NR>1), loop through the fields until we reach one that contains "pdf" ($i!~/pdf/). Store everything we find up until that field in a variable called firstRec, separated by a space (firstRec=firstRec" "$i).
2. If the record number is greater than 1, print out firstRec, then print out whatever field we stopped iterating on (the one that contains "pdf"), which is $i, and finally print out the last field in the record, which is $NF (print firstRec,$i,$NF).
You can direct this to another file:
awk 'NR>1{firstRec=""; for(i=1; i<NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt > outfile.txt
sed may be a cleaner way to go here: if the name of your pdf file contains a run of more than one space, awk splits on that run and the extra spaces are lost when the fields are printed back out.
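A small sketch of that difference, using a made-up line whose file name contains two consecutive spaces. The awk command here is a simplified stand-in that prints the fields by position, not the full loop above:

```shell
# Hypothetical record: note the two spaces between "High" and "Transcript".
line='R01234567_High  Transcript_01234567.pdf High School Transcript R01234567'

# awk splits on runs of whitespace, so the double space is gone once the
# fields are joined again with the single-space output separator.
awk_out=$(printf '%s\n' "$line" | awk '{ print $1, $2, $NF }')

# sed only rewrites the text between .pdf and the last R0, so the file name
# keeps its original spacing.
sed_out=$(printf '%s\n' "$line" | sed 's/\.pdf.*R0/.pdf R0/')

printf '%s\n' "awk: $awk_out" "sed: $sed_out"
```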
It's verbose, and less efficient than what we could do if we knew the fields were separated by literal tabs, but the following loop does this processing in pure native bash with no external tools:
shopt -s extglob
while IFS= read -r line; do
  [[ $line = *".pdf"*R0* ]] || continue # ignore lines that don't fit our format
  filename=${line%%.pdf*}.pdf
  id=R0${line##*R0}
  printf '%s\t%s\n' "$filename" "$id"
done
${line%%.pdf*} returns everything before the first .pdf in the line; ${line%%.pdf*}.pdf then appends .pdf to that content.
Similarly, ${line##*R0} expands to everything after the last R0; R0${line##*R0} thus expands to the final field starting with R0 (presuming that R0 does not occur again inside that field).
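To see the two expansions in isolation, here is a short sketch applied to the first example line from the question (the variable names match the loop above):

```shell
line='R01234567_High Transcript_01234567.pdf High School Transcript R01234567'

# %% strips the longest suffix matching ".pdf*", i.e. everything from the first .pdf on.
filename=${line%%.pdf*}.pdf

# ## strips the longest prefix matching "*R0", i.e. everything up to and including the last R0.
id=R0${line##*R0}

printf '%s\n' "$filename" "$id"
```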
If cat -t file (on macOS) or cat -A file (on Linux) shows ^I sequences between the fields (but not within the fields), use the following instead:
while IFS=$'\t' read -r filename title id; do
  printf '%s\t%s\n' "$filename" "$id"
done
This reads the three tab-separated fields into variables named filename, title and id, and emits the filename and id fields.
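As written, the loop reads standard input and writes standard output. One way to wire it up to files and drop the first line is sketched below; the header line and the file names infile.txt and outfile.txt are assumptions for illustration. IFS is set through printf so the sketch also runs under plain sh; in bash, IFS=$'\t' as above is equivalent.

```shell
# Hypothetical tab-separated sample: one header line, then the two example records.
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  'filename' 'title' 'id' \
  'R01234567_High Transcript_01234567.pdf' 'High School Transcript' 'R01234567' \
  'R01891023_Application_01891023127.pdf' 'Application' 'R01891023' \
  > "$tmpdir/infile.txt"

tab=$(printf '\t')

{
  read -r _header                     # consume and discard the first line
  while IFS=$tab read -r filename title id; do
    printf '%s\t%s\n' "$filename" "$id"
  done
} < "$tmpdir/infile.txt" > "$tmpdir/outfile.txt"

cat "$tmpdir/outfile.txt"
```

Grouping the two reads in braces lets them share one redirect, so the loop picks up where the header read left off.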
You can use sed on each line like this:
line="R01234567_High Transcript_01234567.pdf High School Transcript R01234567"
echo "$line" | sed 's/\.pdf.*R0/\.pdf R0/'
# output
R01234567_High Transcript_01234567.pdf R01234567
This replaces anything between .pdf and R0 with a single space. It doesn't deal with some edge cases, but it is simple and clear.