
Can I stream data from a writer to a reader in golang?

I want to process a number of files whose contents don't fit in the memory of my worker. The solution I have found so far involves saving the results of the processing to the /tmp directory before uploading them to S3.

import (
    "bufio"
    "bytes"
    "context"
    "fmt"
    "log"
    "os"
    "runtime"
    "strings"
    "sync"

    "github.com/aws/aws-sdk-go-v2/service/s3"
    "github.com/korovkin/limiter"
    "github.com/xitongsys/parquet-go/parquet"
    "github.com/xitongsys/parquet-go/writer"
)

func DownloadWarc(
    ctx context.Context,
    s3Client *s3.Client,
    warcs []*types.Warc,
    path string,
) error {
    key := fmt.Sprintf("parsed_warc/%s.parquet", path)

    filename := fmt.Sprintf("/tmp/%s", path)
    file, err := os.Create(filename)
    if err != nil {
        return fmt.Errorf("error creating file: %s", err)
    }
    defer file.Close()

    bytesWriter := bufio.NewWriter(file)
    pw, err := writer.NewParquetWriterFromWriter(bytesWriter, new(Page), 4)
    if err != nil {
        return fmt.Errorf("Can't create parquet writer: %s", err)
    }

    pw.RowGroupSize = 128 * 1024 * 1024 //128M
    pw.CompressionType = parquet.CompressionCodec_SNAPPY

    mutex := sync.Mutex{}
    numWorkers := runtime.NumCPU() * 2
    fmt.Printf("Using %d workers\n", numWorkers)
    limit := limiter.NewConcurrencyLimiter(numWorkers)

    for i, warc := range warcs {
        i, warc := i, warc // capture loop variables for the goroutine (needed before Go 1.22)
        limit.Execute(func() {
            log.Printf("%d: %+v", i, warc)
            body, err := GetWarc(ctx, s3Client, warc)
            if err != nil {
                fmt.Printf("error getting warc: %s", err)
                return
            }

            page, err := Parse(body)
            if err != nil {
                key := fmt.Sprintf("unparsed_warc/%s.warc", path)
                s3Client.PutObject(
                    ctx,
                    &s3.PutObjectInput{
                        Body:   bytes.NewReader(body),
                        Bucket: &s3Record.Bucket.Name,
                        Key:    &key,
                    },
                )
                fmt.Printf("error getting page %s: %s", key, err)
                return
            }

            mutex.Lock()
            err = pw.Write(page)
            pw.Flush(true)
            mutex.Unlock()
            if err != nil {
                fmt.Printf("error writing page: %s", err)
                return
            }
        })
    }

    limit.WaitAndClose()
    err = pw.WriteStop()
    if err != nil {
        return fmt.Errorf("error writing stop: %s", err)
    }
    bytesWriter.Flush()

    file.Seek(0, 0)
    _, err = s3Client.PutObject(
        ctx,
        &s3.PutObjectInput{
            Body:   file,
            Bucket: &s3Record.Bucket.Name,
            Key:    &key,
        },
    )
    if err != nil {
        return fmt.Errorf("error uploading warc: %s", err)
    }

    return nil
}

Is there a way to avoid saving the contents into a temp file and use only a limited size byte buffer between the writer and the upload function?

In other words can I begin to stream data to a reader while still writing to the same buffer?

Yes, there is a way to write the same content to multiple writers: io.MultiWriter might allow you to avoid the temp file. That said, it can still be a good idea to use one.

I often use io.MultiWriter to write to a list of checksum calculators (sha256, ...). In fact, the last time I read the S3 client code, I noticed it does this under the hood to calculate the checksum. MultiWriter is pretty useful for piping big files between cloud services.
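
For illustration, here is a minimal sketch of that pattern (the destination path and payload are placeholders, not from your code): everything written to the MultiWriter goes both to the destination and to a sha256 hasher, so the checksum is computed in a single pass without buffering the whole payload.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "log"
    "os"
    "strings"
)

func main() {
    src := strings.NewReader("some large payload streamed from somewhere")

    // Stand-in for whatever destination writer you are filling.
    dst, err := os.Create("/tmp/example-copy")
    if err != nil {
        log.Fatal(err)
    }
    defer dst.Close()

    hasher := sha256.New()

    // Every byte written to mw lands in both the file and the hasher.
    mw := io.MultiWriter(dst, hasher)
    if _, err := io.Copy(mw, src); err != nil {
        log.Fatal(err)
    }

    fmt.Println("sha256:", hex.EncodeToString(hasher.Sum(nil)))
}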

Also, if you end up using temp files, you may want to create them with os.CreateTemp. Otherwise you may run into file-name collisions when your code runs in two processes at once or your files share the same name.
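
A minimal sketch of that (the pattern string is just a placeholder): the empty dir argument means the default temp directory, and the "*" in the pattern is replaced with a random string, so concurrent processes don't collide on file names.

f, err := os.CreateTemp("", "parsed-warc-*.parquet")
if err != nil {
    return fmt.Errorf("error creating temp file: %w", err)
}
defer os.Remove(f.Name()) // clean up when done
defer f.Close()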

Feel free to clarify your question and I can try to answer again. :)
