How does Linux know an embedded ^A from an end-of-line ^A?

I'm having a data issue with embedded ^A characters, which I can fully reproduce with this small file:


Observe that I have embedded ^A characters. I put them there using vi with the ^V technique.

Now, notice that I also put a line break after the "p,q," string on the third line. That was done with the Enter key, but it just puts in a ^A, as we can see here:

[ ~/hack ] cat t.csv
a,b,c,d,e
f,g,,i,j
k,l,,n,o
p,q,
,s,t
u,v,w,x,y
[ ~/hack ] xxd < t.csv > u.csv
[ ~/hack ] cat u.csv
0000000: 612c 622c 632c 642c 650a 662c 672c 012c  a,b,c,d,e.f,g,.,
0000010: 692c 6a0a 6b2c 6c2c 012c 6e2c 6f0a 702c  i,j.k,l,.,n,o.p,
0000020: 712c 0a2c 732c 740a 752c 762c 772c 782c  q,.,s,t.u,v,w,x,
0000030: 790a                                     y.
[ ~/hack ]

Note that in the "cat" listing, the double comma has the ^A in it; it just doesn't print to the screen with cat.

But notice also that the normal end-of-line is also a ^A. This is where it gets tricky: how does Linux differentiate between a ^A that is an embedded character and one that is the end of a line?

Note that in the hex dump, after the "e", there is an 0a, as expected. But there is an 0a between the two commas between 'l' and 'n' too. Yet my manually broken line between 'q' and 's' shows an actual line break, but it's just a 0a like any other!

My ultimate need is to programmatically find all broken lines like the p,q,.,s,t one and get rid of those line breaks. But sed can't see that as a line break. That is, if I replace ^A, it sees the ones on the 'f' and 'k' lines, but it can't find the one on the 'p' line.

So, 1) as a matter of conceptual understanding, can someone explain how on Earth Linux knows the difference between a 0a character that is embedded and one that is an end of line, and 2) what is the piece of code that would find the artificial line breaks and mend the line?

Thanks!

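^A is not 0a. ^A (control-A) is ASCII character 1 (hex 01), while the newline/linefeed character (hex 0a, ASCII 10) is ^J (control-J). In your hex dump, the bytes between the doubled commas on the 'f' and 'k' lines are 01, whereas every line break, including the one after "p,q,", is 0a. So there is nothing for Linux to tell apart: they are two different bytes.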
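For the second part of the question, here is a minimal Python sketch rather than a definitive fix. It rebuilds the bytes shown in the xxd dump, confirms that ^A and the newline are different bytes, and then joins physical lines until each record has the expected number of fields. The function name `mend` and the `expected_fields` parameter are made up for illustration, and the approach assumes every record should have exactly 5 comma-separated fields with no quoted commas inside a field.

```python
SOH = b"\x01"  # ^A (control-A), ASCII 1
LF = b"\n"     # ^J (control-J), ASCII 10, the Unix line terminator

# Caret notation: ^X is the letter's code with bit 0x40 flipped.
assert ord("A") ^ 0x40 == 0x01
assert ord("J") ^ 0x40 == 0x0A

# The bytes of t.csv, as shown in the xxd dump in the question.
data = (
    b"a,b,c,d,e\n"
    b"f,g,\x01,i,j\n"
    b"k,l,\x01,n,o\n"
    b"p,q,\n"
    b",s,t\n"
    b"u,v,w,x,y\n"
)


def mend(raw: bytes, expected_fields: int = 5) -> bytes:
    """Join physical lines until each record has expected_fields fields.

    Hypothetical helper: counts commas naively, so it assumes that no
    field contains a quoted comma.
    """
    records = []
    pending = b""
    for line in raw.split(LF):
        pending += line  # glue this piece onto any unfinished record
        if pending.count(b",") + 1 >= expected_fields:
            records.append(pending)
            pending = b""
    if pending:  # keep a trailing short record instead of dropping it
        records.append(pending)
    return b"".join(record + LF for record in records)


mended = mend(data)
```

For a real file you would read the bytes with `open("t.csv", "rb").read()` and write the result back in binary mode. On the sample above, the "p,q," and ",s,t" pieces are joined into the single record "p,q,,s,t", while the ^A bytes on the 'f' and 'k' lines are left untouched.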
