
How to speed up this long running postgres query with a large json document?

We aggregate listing offerings from different marketplaces (eBay, Amazon) into one application. The listings table has certain shared columns like:

sku, title, price, quantity, state, etc..., common to all listings...

Rather than create separate polymorphic tables to store marketplace-specific columns, we took advantage of the PostgreSQL 9.3 json column type and added a listing_data column to the listings table that holds the original listing JSON exactly as we receive it from the marketplace API.
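
Roughly, that part of the table looks like this (a simplified sketch; the column types here are illustrative, not our exact schema):

-- Approximate shape of the listings table (types are illustrative only)
CREATE TABLE listings (
    id           serial PRIMARY KEY,
    channel_id   integer NOT NULL,
    sku          text,
    title        text,
    price        numeric,
    quantity     integer,
    state        text,
    created_at   timestamp,
    listing_data json    -- raw marketplace payload, including the potentially huge description
);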

The web page that shows all eBay listings for one account makes a paginated query like this; we use the select list to restrict the columns and to pull some values out of the nested JSON:

SELECT
    id,
    sku,
    title,
    price,
    quantity,
    state,
    listing_data->>'listing_type' as listing_type,
    listing_data->>'listing_duration' as listing_duration,
    json_extract_path(listing_data, 'listing_details', 'end_time') as end_time,
    listing_data->>'start_price' as start_price,
    listing_data->>'buy_it_now_price' as buy_it_now_price,
    listing_data->>'reserve_price' as reserve_price,
    json_extract_path(listing_data, 'variations', 'variations') as ebay_variations
FROM "listings"
WHERE
    "listings"."channel_id" = $1 AND
    ("listings"."state" NOT IN ('deleted', 'archived', 'importing')) AND
    "listings"."state" IN ('online')
ORDER BY created_at DESC
LIMIT 25 OFFSET 0

The problem is that this query sometimes takes longer than 30 seconds and times out on Heroku. We are on the Heroku Ika Postgres plan with 7 GB of Postgres memory. What we've discovered in practice is that clients tend to put huge amounts of HTML into their listings (including even embedded binary video and Flash applications!); the eBay description alone can be up to 500 KB.

Here is an example EXPLAIN ANALYZE output for a query similar to the select statement above:

Limit  (cost=0.11..58.72 rows=25 width=205) (actual time=998.693..1005.286 rows=25 loops=1)
  ->  Index Scan Backward using listings_manager_new on listings  (cost=0.11..121084.58 rows=51651 width=205) (actual time=998.691..1005.273 rows=25 loops=1)
        Index Cond: ((channel_id = xyz) AND ((state)::text = 'online'::text) AND ((type)::text = 'ListingRegular'::text))
Total runtime: 1005.330 ms

I interpret this to mean Postgres is using the index, but the cost is still high. I've read that 9.3 stores json as a text blob, so extracting values from a large JSON document is expensive: even if we ignore the description key, the whole document has to be parsed. We are not filtering on the JSON data, so I'm hoping the parsing cost applies only to the 25 rows returned by the pagination, but I'm not sure.
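
I assume I can check this by running the query with buffer statistics and comparing runs with and without the listing_data expressions in the select list; if detoasting only happens for the 25 returned rows, the extra buffer reads should stay small. A rough sketch (with a placeholder channel id):

EXPLAIN (ANALYZE, BUFFERS)
SELECT
    id,
    sku,
    title,
    listing_data->>'listing_type' as listing_type
FROM "listings"
WHERE "listings"."channel_id" = 123 AND "listings"."state" IN ('online')
ORDER BY created_at DESC
LIMIT 25 OFFSET 0;
-- Compare the "Buffers: shared hit/read" numbers against the same query without
-- any listing_data expressions to see how much extra I/O the JSON column adds.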

I've read some other Stack Overflow answers and blogs suggesting that row size (a 'wide' table) affects performance because Postgres stores rows in 8 kB pages, and larger rows need more disk I/O to read across more pages. I don't know whether this is true only for sequential scans or whether it applies to index scans as well.
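
I suppose I can at least measure how big the JSON column actually is per row, and how much of the table's size it accounts for, with something like:

-- Average and maximum on-disk size of the JSON column per row
SELECT
    avg(pg_column_size(listing_data)) as avg_bytes,
    max(pg_column_size(listing_data)) as max_bytes
FROM "listings";

-- Heap size vs. total size including TOAST and indexes
SELECT
    pg_size_pretty(pg_relation_size('listings')) as heap_size,
    pg_size_pretty(pg_total_relation_size('listings')) as total_size;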

It might make sense to move the json column out of the main listings table and give it a one-to-one association with a separate table that contains only the JSON, but that would then require a join in the query.
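
A rough sketch of what that split might look like (table and column names here are just hypothetical):

-- Hypothetical one-to-one table holding only the bulky JSON
CREATE TABLE listing_documents (
    listing_id   integer PRIMARY KEY REFERENCES listings(id),
    listing_data json
);

SELECT
    l.id,
    l.sku,
    l.title,
    l.price,
    l.quantity,
    l.state,
    d.listing_data->>'listing_type' as listing_type
FROM "listings" l
JOIN "listing_documents" d ON d.listing_id = l.id
WHERE l.channel_id = 123 AND l.state IN ('online')
ORDER BY l.created_at DESC
LIMIT 25 OFFSET 0;

Presumably the planner would still drive this from the index on listings and only fetch the JSON rows for the 25 listings on the page, but I haven't verified that.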

Before I do anything I thought I would reach out and get some other opinions or advice on how to analyze where our bottleneck is and what might speed up this query.

PostgreSQL uses a technique called TOAST to store large attribute values that would otherwise be too large to fit in a page. Such values are stored out of line, in an associated TOAST relation referenced from the row they belong to.

Your JSON attributes are stored in a single column, so if the description field is as big as you say, then it is quite likely that the whole of the JSON data will be stored using TOAST for many such rows.

If a query references this column at all, then the whole column value needs to be read in, which causes a lot of I/O. Referencing the column in the WHERE clause would have the biggest impact, but that does not appear to be the case for the sample query you have shown. Even if it only appears in the SELECT list, the TOAST data has to be read in for every matching row.
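
You can see how much of the table's data actually lives out of line by looking up its TOAST relation, for example (assuming the table is simply named listings):

-- Locate the TOAST table for listings and report its size
SELECT
    c.reltoastrelid::regclass as toast_table,
    pg_size_pretty(pg_relation_size(c.reltoastrelid)) as toast_size
FROM pg_class c
WHERE c.relname = 'listings';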

My advice would be:

  • If you don't really need the massive description field, then filter it out before storing it in the database
  • If you do need it, consider splitting the JSON data into two fields, with frequently accessed fields in one column and the larger, less frequently used fields in another (see the sketch after this list).
  • It may be a good idea to have the larger less frequently used fields in a separate table altogether.
  • Avoid using the JSON fields in WHERE clauses - try to limit the query using data in the other columns
  • The above may make you rethink your original decision not to use an inherited table structure, particularly if there are components of the JSON structure that would be useful in WHERE clauses.
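
For the second point, a minimal sketch of what splitting the JSON column could look like (the column names here are hypothetical):

-- Split listing_data into a small, frequently read part and a bulky, rarely read part
ALTER TABLE listings
    ADD COLUMN listing_summary json,   -- listing_type, prices, end_time, ...
    ADD COLUMN listing_bulk    json;   -- description HTML and other large blobs

-- The paginated listing page then reads only the small column and never
-- touches the TOASTed bulk data:
SELECT
    id,
    sku,
    title,
    price,
    quantity,
    state,
    listing_summary->>'listing_type' as listing_type
FROM "listings"
WHERE "listings"."channel_id" = 123 AND "listings"."state" IN ('online')
ORDER BY created_at DESC
LIMIT 25 OFFSET 0;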
