简体   繁体   中英

Parse a csv file using Regex in java with '|' as seperator

Can anyone help me with Regex ? I got an Java Program, that reads in .csv files to load it into a Database.

Currently ist uses Pattern csvPattern = Pattern.compile("\\\\s*(\\"[^\\"]*\\"|[^|]*)\\\\s*,?");

but wenn I do matcher = csvPattern.matcher(line); to read the files line by line I only get null-values. The files have the following format (many morre lines, some with comma in it, '|' as a seperator and at the end of each line):

abstact of the first file:

0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |

second:

|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|

third:

5|Supplier#000000005|Gcdm2rJRzl5qlTVzc|11|21-151-690-3663|-283.84|. slyly regular pinto bea|
6|Supplier#000000006|tQxuVm7s7CnK|14|24-696-997-4969|1365.79|final accounts. regular dolphins use against the furiously ironic decoys. |
7|Supplier#000000007|s,4TicNGB4uO6PaSqNBUq|23|33-990-965-2201|6820.35|s unwind silently furiously regular courts. final requests are deposits. requests wake quietly blit|
8|Supplier#000000008|9Sq4bBH2FQEmaFOocY45sRTxo6yuoG|17|27-498-742-3860|7627.85|al pinto beans. asymptotes haggl|
9|Supplier#000000009|1KhUgZegwM3ua7dsYmekYBsK|10|20-403-398-8662|5302.37|s. unusual, even requests along the furiously regular pac|

fourth:

1|2|3325|771.64|, even theodolites. regular, final theodolites eat after the carefully pending foxes. furiously regular deposits sleep slyly. carefully bold realms above the ironic dependencies haggle careful|
1|2502|8076|993.49|ven ideas. quickly even packages print. pending multipliers must have to are fluff|
1|5002|3956|337.09|after the fluffily ironic deposits? blithely special dependencies integrate furiously even excuses. blithely silent theodolites could have to haggle pending, express requests; fu|
1|7502|4069|357.84|al, regular dependencies serve carefully after the quickly final pinto beans. furiously even deposits sleep quickly final, silent pinto beans. fluffily reg|

fifth:

1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |

sixth:

134823|saddle midnight thistle honeydew lime|Manufacturer#4|Brand#43|STANDARD BURNISHED BRASS|44|WRAP CAN|1857.82|ges. furiously ir|
134824|coral red indian thistle sandy|Manufacturer#5|Brand#55|PROMO BURNISHED COPPER|29|LG JAR|1858.82|final p|
134825|saddle purple orchid cornsilk medium|Manufacturer#4|Brand#44|PROMO POLISHED NICKEL|21|LG CASE|1859.82|nal accounts us|
134826|turquoise sky lime cornsilk peach|Manufacturer#1|Brand#11|SMALL BURNISHED TIN|25|SM CAN|1860.82| haggle|

seventh:

0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|

eighth:

4|136777|O|32151.78|1995-10-11|5-LOW|Clerk#000000124|0|sits. slyly regular warthogs cajole. regular, regular theodolites acro|
5|44485|F|144659.20|1994-07-30|5-LOW|Clerk#000000925|0|quickly. bold deposits sleep slyly. packages use slyly|

(the csv were created using the DBGen-tool from TPC fpr tpc-h, in case you wonder)

I hope you understand what I need and can help me out on this. Thank you very much!

EDIT: Using String.split("|");' sure seems obvious, but the thing is, that the programm I'm working with is quite complex and relies on the regex.pattern and regex.matcher at various parts. So since I'm not very familiar with the program and java itself, the only solution for me is to use the given code and just replace the regular expression by one that works for me.

EDIT2: the thing is I'm trying to use this TPC-H implementation from OLTP-Bench: https://github.com/ben-reilly/oltpbench/blob/master/src/com/oltpbenchmark/benchmarks/tpch/TPCHLoader.java#L347

where the problematic line is 347. It's a full implemetation of the TPC-H database benchmark, but without a data generator. So I use the dbgen tool provided by the TPC to generate the csv files. I can't get in contact with the developer sadly.

I would suggest using String.split("|"); . This will give you an array of strings representing the text around your line.

If however, you really want to use a regex:

Pattern csvPattern = Pattern.compile("\\s*(\\d*)\\|(([^|]+)\\|)+");

This one should match a number(or none), a pipe, and then a repeating pattern of Something|

Given your source, you could probably just replace the comma with a pipe, since from the comments, all that pattern does is split the string on a delimiter (except the ones in double quotes)

eg: from

\\s*(\"[^\"]*\"|[^,]*)\\s*,?

to

\\s*(\"[^\"]*\"|[^|]*)\\s*\\|?

As for your number exception, you need to debug the way you're calling the CSV loader.

I've never used that tool before, but if you look at line 352

for (int i = 0; i < types.length; ++i) {

Now look at the switch block that starts at line 362: it defines the types that each field should be casted to.

switch(types[i]) {
    case DOUBLE:
        prepStmt.setDouble(i+1, Double.parseDouble(field));
        break;
...

This type of conversion is likely going to cause issues if you don't properly specify the types.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM