Regex to find multiple occurrences of a string in column data in Oracle DB (11g)

In one of the columns, I have JSON data stored as text. Sample data looks like:

{
    "dummy_time": "2016-04-27T18:44:55+00:00",
    "timeout_data": "2016-04-29T16:22:35+00:00,2016-04-30T16:22:35+00:00,2016-05-29T16:22:35+00:00",
    "time_id": "T101",
    "time_desc": "bla bla bla"
}

I need to write a SQL query to find the rows for which this column has more than one date for the key timeout_data. I am not that good with regex. I wrote a simple query to find two occurrences of dates in the key timeout_data:

SELECT *
FROM table1 f
WHERE REGEXP_LIKE(f.data, '.+\"timeout_data\": \".+,.+\",\"time_id\".*')
AND ROWNUM<6;

This query has been running for ages. The table has around 35,000,000 rows. I am not sure how to find more than one occurrence of a date in the key timeout_data. Please let me know if you require more information.

Maybe instr() will be faster.

SELECT 'test'
FROM  (SELECT '{
    "dummy_time": "2016-04-27T18:44:55+00:00",
    "timeout_data": "2016-04-29T16:22:35+00:00,2016-05-29T16:22:35+00:00",
    "time_id": "T101",
    "time_desc": "bla bla bla"
}' data FROM dual)
WHERE instr(data,',',instr(data,'"timeout_data"')) < instr(data,'"',instr(data,'"timeout_data"'),4);

You can test it on a subset of your data. Also, instr(data,'"timeout_data"') is used twice; factoring it out may speed up the query.
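One way to factor out the repeated call is an inline view that computes the key position once. This is only a sketch: table1 and data come from the question, key_pos is a made-up alias, and whether Oracle really evaluates the expression a single time depends on the optimizer. It also adds guards, because INSTR returns 0 when nothing is found, and a comma position of 0 would otherwise compare as smaller than the quote position.

```sql
-- Sketch: compute the position of the key once, then reuse it.
-- key_pos is a hypothetical alias; table1/data are taken from the question.
SELECT t.data
FROM  (SELECT f.data,
              INSTR(f.data, '"timeout_data"') AS key_pos
       FROM   table1 f) t
WHERE  t.key_pos > 0                               -- key is present
AND    INSTR(t.data, ',', t.key_pos) > 0           -- some comma follows the key
AND    INSTR(t.data, ',', t.key_pos)               -- first comma after the key ...
         < INSTR(t.data, '"', t.key_pos, 4)        -- ... comes before the value's closing quote
AND    ROWNUM < 6;
```

The fourth double quote counted from the start of "timeout_data" is the one that closes the value, so a comma before it must sit inside the value.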

Update: There seems to be a problem with your regex: it needs to have .+ before \"time_id\".

Also, the regex itself is not the limiting factor here. On my machine, a loop of 100k SELECT .. FROM dual statements takes 4.3 seconds; adding the regex costs an additional 0.1 seconds.

1) Extract the line containing timeout_data.

2) Count the occurrences of the date pattern in that line.

select REGEXP_COUNT(regexp_substr(json,'^.*timeout_data.*$', 1,1,'m'), '\d{4}-\d{2}-\d{2}T')
from ( 
select '{
    "dummy_time": "2016-04-27T18:44:55+00:00",
    "timeout_data": "2016-04-29T16:22:35+00:00,2016-04-30T16:22:35+00:00,2016-05-29T16:22:35+00:00",
    "time_id": "T101",
    "time_desc": "bla bla bla"
}' json from dual
);
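To use this as a filter rather than a projection, the count can be moved into the WHERE clause. A sketch, using table1 and data from the question and assuming the JSON is stored with one key per line, as in the sample:

```sql
-- Sketch: keep only rows whose timeout_data line holds more than one date.
-- Assumes one JSON key per line; table1/data are taken from the question.
SELECT f.*
FROM   table1 f
WHERE  REGEXP_COUNT(
         REGEXP_SUBSTR(f.data, '^.*timeout_data.*$', 1, 1, 'm'),  -- the timeout_data line
         '\d{4}-\d{2}-\d{2}T'                                     -- start of each date
       ) > 1
AND    ROWNUM < 6;
```

If the whole document sits on a single line, the extracted "line" is the entire JSON text, so dates under other keys such as dummy_time would be counted as well.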

You can find an ISO8601 formatted date using the regular expression:

\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}

If you wanted multiple comma-separated copies of it then you can use:

pattern(,pattern)+

(or remove the trailing + if you only want to find exactly two matches)

So

\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}(,\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})+

Which would make your query:

SELECT *
FROM table1 f
WHERE REGEXP_LIKE(f.data, '"timeout_data":\s*"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}(,\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})+"')
AND ROWNUM<6;
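The pattern can be checked without touching the big table by running it against a couple of literal rows. The following is a small sketch; the samples subquery, its id column, and the two test strings are made up for illustration.

```sql
-- Sketch: two hypothetical rows, one with a single date and one with two.
WITH samples AS (
  SELECT 1 AS id,
         '{"timeout_data": "2016-04-29T16:22:35+00:00", "time_id": "T101"}' AS data
  FROM   dual
  UNION ALL
  SELECT 2,
         '{"timeout_data": "2016-04-29T16:22:35+00:00,2016-04-30T16:22:35+00:00", "time_id": "T102"}'
  FROM   dual
)
SELECT id
FROM   samples
WHERE  REGEXP_LIKE(data, '"timeout_data":\s*"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}(,\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})+"');
```

Only the row with id 2 should come back, since the (,pattern)+ group requires at least one comma followed by a second date before the closing quote.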

However, if you don't want to specifically match the dates and just want to test if there are multiple comma-separated values then you can use:

[^,"]+(,[^,"]+)+

(Note: ensuring that the pattern does not match the closing double quote - otherwise the pattern might match beyond the end of the attribute.)

Which would make your query:

SELECT *
FROM table1 f
WHERE REGEXP_LIKE(f.data, '"timeout_data":\s*"[^,"]+(,[^,"]+)+"')
AND ROWNUM<6;

Note: you do not need to look for anything before or after the segment you are filtering for and you do not need to escape the double quotes.
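If the number of values is needed, and not just a yes/no filter, one option is to pull out the attribute value with a capture group and count the commas in it. This is a sketch under assumptions: it relies on the subexpression argument of REGEXP_SUBSTR and on REGEXP_COUNT, both available in 11g, and the aliases timeout_value and value_count are made up.

```sql
-- Sketch: extract the timeout_data value, then count its comma-separated entries.
SELECT s.timeout_value,
       REGEXP_COUNT(s.timeout_value, ',') + 1 AS value_count
FROM  (SELECT REGEXP_SUBSTR(data,
                            '"timeout_data":\s*"([^"]*)"',
                            1, 1, NULL, 1) AS timeout_value   -- 1 = first capture group
       FROM  (SELECT '{
    "dummy_time": "2016-04-27T18:44:55+00:00",
    "timeout_data": "2016-04-29T16:22:35+00:00,2016-04-30T16:22:35+00:00,2016-05-29T16:22:35+00:00",
    "time_id": "T101",
    "time_desc": "bla bla bla"
}' AS data FROM dual)) s;
```

For the sample row, the value contains two commas, so value_count is 3. Rows where the key is missing or the value is empty yield NULL rather than a count.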
