Reversible CSV parsing

Prolog newbie here. In SWI-Prolog, I'm trying to figure out how to parse a simple line of CSV reversibly, but I'm stuck. Here's what I've got:

csvstring1(S, L) :-
  split_string(S, ',', ',', T),
  maplist(atom_number, T, L).
   
csvstring2(S, L) :-
  atomic_list_concat(T, ',', S),
  maplist(atom_number, T, L).

% This one is the same except that maplist comes first. 
csvstring3(S, L) :-
  maplist(atom_number, T, L),
  atomic_list_concat(T, ',', S).

Now csvstring1 and csvstring2 work in the "forward" direction:

?- csvstring1('1,2,3,4', L).
L = [1, 2, 3, 4].

?- csvstring2('1,2,3,4', L).
L = [1, 2, 3, 4].

But csvstring3 does not:

?- csvstring3('1,2,3,4', L).
ERROR: Arguments are not sufficiently instantiated

Moreover, csvstring3 works in reverse, but the other two predicates do not:

?- csvstring3(L, [1,2,3,4]).
L = '1,2,3,4'.

?- csvstring1(L, [1,2,3,4]).
ERROR: Arguments are not sufficiently instantiated

?- csvstring2(L, [1,2,3,4]).
ERROR: Arguments are not sufficiently instantiated

How can I combine these into a single predicate?

I don't know of a particularly newbie-friendly way to do it that doesn't compromise somewhere. This is the easiest:

csvString_list(String, List) :-
    ground(String),
    atomic_list_concat(Temp, ',', String),
    maplist(atom_number, Temp, List).

csvString_list(String, List) :-
    ground(List),
    maplist(atom_number, Temp, List),
    atomic_list_concat(Temp, ',', String).

but it creates and leaves spurious choicepoints, which is mildly annoying.

This version cuts the choicepoints, which is nice when using it, but cutting is a poor practice to get into without being aware of what it means:

csvString_list(String, List) :-
    ground(String),
    atomic_list_concat(Temp, ',', String),
    maplist(atom_number, Temp, List),
    !.

csvString_list(String, List) :-
    ground(List),
    maplist(atom_number, Temp, List),
    atomic_list_concat(Temp, ',', String).

This uses if/else, which is less code:

csvString_list(String, List) :-
    (   ground(String)
    ->  atomic_list_concat(Temp, ',', String),
        maplist(atom_number, Temp, List)
    ;   maplist(atom_number, Temp, List),
        atomic_list_concat(Temp, ',', String)
    ).

but it is logically bad: you should reify the branching with if_, which isn't built into SWI-Prolog and is less simple to use.

Or you could write a grammar with a DCG, which is not newbie territory:


:- set_prolog_flag(double_quotes, chars).
:- use_module(library(dcg/basics)).

csvTail([N|Ns]) --> [','], number(N), csvTail(Ns). 
csvTail([])     --> [].

csv([N|Ns]) --> number(N), csvTail(Ns).

e.g.

?- phrase(csv(Ns), "11,22,33,44,55").
Ns = [11, 22, 33, 44, 55]


?- phrase(csv([11, 22, 33, 44, 55]), String).
String = [49, 49, ',', 50, 50, ',', 51, 51, ',', 52, 52, ',', 53, 53]

but now you're back to it leaving spurious choicepoints while parsing, and you have to deal with the historic split between strings, atoms, and character codes in SWI-Prolog: the generated list above mixes character codes (49 is the code for the digit 1) with the character ',', so it does not look like "11,22,33,44,55" even though it spells out the same text.

split_string is not reversible. You can use a DCG instead; here is a simple multi-line DCG parser for CSV:

% Nicer formatting
% https://www.swi-prolog.org/pldoc/man?section=flags
:- set_prolog_flag(answer_write_options, [quoted(true), portray(true), spacing(next_argument), max_depth(100), attributes(portray)]).

% Show lists of codes as text (if 3 chars or longer)
:- portray_text(true).

csv_lines([]) --> [].
% Newline after every line
csv_lines([H|T]) --> csv_fields(H), [10], csv_lines(T).

csv_fields([H|T]) --> csv_field(H), csv_field_end(T).

csv_field_end([]) --> [].
% Comma between fields
csv_field_end(T) --> [44], csv_fields(T).

csv_field([]) --> [].
csv_field([H|T]) -->
    [H],
    % Fields cannot contain comma, newline or carriage return
    { maplist(dif(H), [44, 10, 13]) },
    csv_field(T).

To demonstrate reversibility:

% Note: z is char 122
?- phrase(csv_lines([[`def`, `cool`], [`abc`, [122]]]), Lines).
Lines = `def,cool\nabc,z\n` ;
false.

?- phrase(csv_lines(Fields), `def,cool\nabc,z\n`).
Fields = [[`def`, `cool`], [`abc`, [122]]] ;
false.

To parse the field contents and maintain reversibility, you can use e.g. atom_codes.
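As a minimal sketch of that last suggestion, the helper below relates a row of code-list fields (as produced by csv_fields above) to a row of atoms. The names field_atom and row_atoms are hypothetical and not part of the answer; the only built-ins used are atom_codes/2 and maplist/3.

```prolog
% Hypothetical helper (not from the answer above): relate one field,
% given as a list of character codes, to an atom.
% atom_codes/2 needs at least one of its arguments instantiated.
field_atom(Codes, Atom) :-
    atom_codes(Atom, Codes).

% Relate a parsed row (a list of code lists) to a list of atoms.
row_atoms(Fields, Atoms) :-
    maplist(field_atom, Fields, Atoms).
```

For example, ?- row_atoms([`def`, `cool`], As). gives As = [def, cool], and calling it with only the second argument bound yields the code lists again. When parsing, call it after phrase/2; when generating, call it before, so that atom_codes/2 always sees one instantiated side.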

Others have given some advice and a lot of code. With SWI-Prolog, to parse comma-separated integers, you would use library(dcg/basics) and library(dcg/high_order) to do that trivially:

?- use_module(library(dcg/basics)),
   use_module(library(dcg/high_order)),
   portray_text(true).
true.

?- phrase(sequence(integer, ",", Ns), `1,2,3,4`).
Ns = [1, 2, 3, 4].

?- phrase(sequence(integer, ",", [-7,6,42]), S).
S = `-7,6,42`.
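If the text arrives as an atom, as in the question, the same grammar can be wrapped in a predicate that converts to or from codes depending on which side is known. This is only a sketch: the name atom_integers is hypothetical, it assumes the default double_quotes flag so that "," is read as a string, and it relies on the same sequence//3 and integer//1 calls shown in the queries above.

```prolog
:- use_module(library(dcg/basics)).
:- use_module(library(dcg/high_order)).

% Hypothetical wrapper: Atom is comma-separated text such as '1,2,3',
% Ints the corresponding list of integers. The direction is chosen by
% checking whether Atom is already bound.
atom_integers(Atom, Ints) :-
    (   nonvar(Atom)
    ->  atom_codes(Atom, Codes),                      % parse: atom -> codes -> integers
        phrase(sequence(integer, ",", Ints), Codes)
    ;   phrase(sequence(integer, ",", Ints), Codes),  % generate: integers -> codes -> atom
        atom_codes(Atom, Codes)
    ).
```

The grammar itself stays reversible; only the conversion between the atom and the code list needs to know the direction.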

Of course, if you are trying to parse real CSV files, you should be using a CSV parser. Here is a minimal example of reading a CSV file and writing its contents as a TSV (tab-separated) file. If this is your input in a file called example.csv:

$ cat example.csv
id,name,salary,department
1,john,2000,sales
2,Andrew,5000,finance
3,Mark,8000,hr
4,Rey,5000,marketing
5,Tan,4000,IT

You can read it from the file and write it with tabs as separators like this:

?- csv_read_file('example.csv', Data),
   csv_write_file('example.tsv', Data).
Data = [row(id, name, salary, department),
        row(1, john, 2000, sales),
        row(2, 'Andrew', 5000, finance),
        row(3, 'Mark', 8000, hr),
        row(4, 'Rey', 5000, marketing),
        row(5, 'Tan', 4000, 'IT')].

The library guesses the field separator from the filename extension. Here it correctly guessed that 'csv' means the comma "," and 'tsv' means the tab. We can make the tabs explicitly visible with cat -t.

$ cat example.tsv 
id  name    salary  department
1   john    2000    sales
2   Andrew  5000    finance
3   Mark    8000    hr
4   Rey 5000    marketing
5   Tan 4000    IT
$ cat -t example.tsv 
id^Iname^Isalary^Idepartment^M
1^Ijohn^I2000^Isales^M
2^IAndrew^I5000^Ifinance^M
3^IMark^I8000^Ihr^M
4^IRey^I5000^Imarketing^M
5^ITan^I4000^IIT^M
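When the file name does not carry a helpful extension, the separator can be stated instead of guessed. The sketch below uses the three-argument forms csv_read_file/3 and csv_write_file/3 from library(csv) with the separator(Code) option; the predicate name csv_to_tsv is hypothetical.

```prolog
:- use_module(library(csv)).

% Hypothetical helper: copy the rows of a comma-separated file to a
% tab-separated file, naming both separators explicitly rather than
% relying on the file name extension.
% 44 is the character code for ',' and 9 is the code for the tab.
csv_to_tsv(CsvFile, TsvFile) :-
    csv_read_file(CsvFile, Rows, [separator(44)]),
    csv_write_file(TsvFile, Rows, [separator(9)]).
```

It would be called as ?- csv_to_tsv('example.csv', 'example.tsv'). and should produce the same file as the query above.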

How can I combine these into a single predicate?

csvstring(S, L) :-
  (  ground(S)
  -> atomic_list_concat(T, ',', S),
     maplist(atom_number, T, L)
  ;  maplist(atom_number, T, L),
     atomic_list_concat(T, ',', S)
  ).

... micro test ...

?- csvstring('1,2,3,4', L).
L = [1, 2, 3, 4].

?- csvstring(L, [1,2,3,4]).
L = '1,2,3,4'.
