
Linux join utility complains about input file not being sorted

I have two files:

file1 has the format:

field1;field2;field3;field4

(file1 is initially unsorted)

file2 has the format:

field1

(file2 is sorted)

I run the following two commands:

sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2

I get the following message:

join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order

Why is this happening?

(I also tried sorting file1 by the entire line rather than only the first field, but with no success.)

sort -t\; -c file1 doesn't output anything. Around line 27497, the situation is indeed strange, which suggests that sort doesn't do its job correctly:

              XYZ113017;...
line 27497--> XYZ11301;...
              XYZ11301;...

To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (I found this post while researching a slightly different problem):

  • When using join, the input files must be sorted by the join field ONLY, otherwise you may see the warning reported by the OP.

There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:

  • If you do specify a field, it's easy to forget that you must also specify a stop field - even if you target only 1 field - because sort uses the remainder of the line if only a start field is specified; e.g.:

    • sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
    • sort -t, -k1,1 ... # Field 1 only
  • If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.

    • However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as just sorting by the 1st field:
    • sort ... # NOT always the same as 'sort -k1,1'! see below for example

Pitfall example:

#!/usr/bin/env bash

# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF

# NOTE: "WRONG" below refers to deviation from the expected outcome
#       of sorting by field 1 only, based on mistaken assumptions.
#       The commands do work correctly in a technical sense.

echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first

echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first

echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first

Explanation:

  • When NOT limiting sorting to the first field, it is the relative sort order of the characters ^ and A (column index 6) that matters in this example. In other words: the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A and therefore sorts after A, resulting in the line starting with nameAA^ sorting BEFORE the one starting with nameA^.

  • Note: It is possible for problems to surface on one platform but be masked on another, depending on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:

    • sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with their ASCII values)
    • sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: it sorts - before , [1]

[1] I don't know why - if somebody knows, please tell me. Test with sort <<<$'-\n,'
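One way to take the locale out of the picture is to set LC_ALL=C for the sort call. This is a sketch that assumes plain byte-wise ordering is acceptable for the data. In the C locale, characters are compared by byte value, so , (0x2C) sorts before - (0x2D) whatever the platform's default locale is:

```shell
# Same two one-character lines as in the test above, but with the
# locale forced to C so that sort compares raw byte values.
c_order=$(printf '%s\n' '-' ',' | LC_ALL=C sort)
printf '%s\n' "$c_order"   # ',' is printed first, then '-'
```

If the sorted output is then fed to join, the join call should run under the same locale setting, because join expects its input in the collating order of the locale it runs in.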

sort -k1 uses all fields starting from field 1 as the key. You need to specify a stop field.

sort -t\; -k1,1
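Applied to the two files from the question, this could look roughly as follows. The sample contents are made up for illustration (only the XYZ11301 / XYZ113017 keys are taken from the question), and a temporary directory is used so that no real file1 or file2 is touched:

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'XYZ113017;a;b;c' 'XYZ11301;d;e;f' 'ABC1;g;h;i' > "$workdir/file1"
printf '%s\n' 'ABC1' 'XYZ11301' > "$workdir/file2"

# Sort by field 1 ONLY: -k1,1 gives both a start and a stop field.
sort -t\; -k1,1 -o "$workdir/file1" "$workdir/file1"

# Check the order using the same key that join will look at;
# a plain 'sort -c' would check whole lines instead.
sort -t\; -k1,1 -c "$workdir/file1" && echo 'file1 is sorted by field 1'

# The join from the question, now without the "is not sorted" complaint.
joined=$(join -t\; -1 1 -2 1 -o 1.1,1.2,1.3,1.4 "$workdir/file1" "$workdir/file2")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

With this sample data, the lines for ABC1 and XYZ11301 are printed, while XYZ113017 is dropped because it has no match in file2.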

... or GNU sort is just as buggy as every other GNU command

Try to sort Gi1/0/11 vs. Gi1/0/1 and you'll never be able to get an actual regular textual sort suitable for join input, because someone added some extra intelligence to sort, which will happily use numeric or human-numeric sorting automagically in such cases, without even bothering to add a flag to force the regular behavior.

What is suitable for humans is seldom suitable for scripting.
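For what it's worth, a plain textual order can usually be forced for data like this. GNU sort compares numerically only when given an option such as -n, -h or -V; without one, the order follows the current locale's collation rules, which may ignore punctuation and blanks. The sketch below is not a diagnosis of the case above. It runs both sort and join under LC_ALL=C so that they agree on byte-wise ordering, and the file names and contents are hypothetical:

```shell
workdir=$(mktemp -d)
# Hypothetical data: interface name, then a blank-separated status.
printf '%s\n' 'Gi1/0/11 up' 'Gi1/0/1 down' > "$workdir/ports"
printf '%s\n' 'Gi1/0/11' 'Gi1/0/1' > "$workdir/wanted"

# Byte-wise sort on field 1 only, for both inputs.
LC_ALL=C sort -k1,1 -o "$workdir/ports" "$workdir/ports"
LC_ALL=C sort -k1,1 -o "$workdir/wanted" "$workdir/wanted"

# join runs under the same locale as the sort that produced its input.
ports_joined=$(LC_ALL=C join "$workdir/ports" "$workdir/wanted")
printf '%s\n' "$ports_joined"

rm -rf "$workdir"
```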

