
Flink SQL doesn't unpack gzipped source on the fly - but still parses PART of it

I've run into a strange issue and want to check whether I'm missing something. My original problem is parsing gzipped JSON from a plain file, but I've cut it down to a much simpler case:

I have a filesystem source with the raw format and a simple SQL query that counts lines. For an uncompressed test file of 1k lines, I get 1k as the count. For the same file gzipped from the terminal, I get 12.
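For anyone trying to reproduce this, here is a minimal sketch of how such a pair of test files could be built and sanity-checked with the Python standard library alone. The file names and the helpers `make_test_files` and `gzip_matches_plain` are hypothetical and not from the original setup. The check only confirms that the archive itself is intact, so any mismatch in the Flink count would come from how the file is read.

```python
import gzip
import shutil
from pathlib import Path


def make_test_files(directory, n_lines=1000):
    """Write a plain-text log and a gzipped copy; return both paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    plain = directory / "test.log"        # hypothetical file name
    zipped = directory / "test.log.gz"    # same content, gzip-compressed

    with plain.open("w", encoding="utf-8") as f:
        for i in range(n_lines):
            f.write(f"line {i}\n")

    # Similar in effect to gzipping the file from a terminal.
    with plain.open("rb") as src, gzip.open(zipped, "wb") as dst:
        shutil.copyfileobj(src, dst)

    return plain, zipped


def gzip_matches_plain(plain, zipped):
    """Return True if the decompressed archive is byte-identical to the plain file."""
    with gzip.open(zipped, "rb") as f:
        return f.read() == Path(plain).read_bytes()
```

If `gzip_matches_plain` returns True for the pair, both files hold the same 1k lines and the source table should report the same count for either one.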

The strangest thing is that when this is applied to a JSON log file (my original task), Flink actually parses PART of the JSON objects from the gzipped file.

This is my SQL:

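from pyflink.table import EnvironmentSettings, TableEnvironment

def main(logs_path):
    # The Python calls around the SQL and the SELECT clause were missing
    # from the post; they are reconstructed here as an assumption.
    table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    table_env.execute_sql(f"""
        CREATE TABLE logs_source (
            raw_row STRING
        ) WITH (
            'connector' = 'filesystem',
            'path' = '{logs_path}',
            'source.monitor-interval' = '10',
            'format' = 'raw'
        )
    """)

    table_env.execute_sql("""
        CREATE TABLE print_sink (
            ip_number BIGINT NOT NULL
        ) WITH (
            'connector' = 'print'
        )
    """)

    # Count the rows read from the source; wait() keeps a local run alive.
    table_env.execute_sql("""
        INSERT INTO print_sink
            SELECT COUNT(*)
            FROM logs_source
    """).wait()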

The documentation says somewhere that gzip is decompressed on the fly, based on the file extension (my file names look like *.log.gz).
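To make the expected behaviour concrete, here is a small sketch of extension-based decompression in plain Python. It only illustrates the idea and is not Flink's implementation; `count_lines` is a hypothetical helper. It also gives a reference line count to compare against whatever Flink prints for the same file.

```python
import gzip
from pathlib import Path


def count_lines(path):
    """Count lines, decompressing transparently when the name ends in .gz."""
    path = Path(path)
    # Pick the opener from the file extension alone, as the docs describe.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)
```

Calling `count_lines` on the plain file and on its *.log.gz copy should return the same number; that is the result one would expect from the source table as well.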

I searched for any option or parameter that enables parsing of gzipped files specifically, but found nothing.

Flink version 1.16.0; I'm using PyFlink with Python 3.9.

What's the issue here? Thanks for any ideas!

I have the exact same problem. It looks like Flink reads only part of a JSON file when it is gzipped. It works fine when processing the uncompressed JSON file. Flink version 1.16.0, using Flink with Java.
