
Join and filter JSON files using jq

I am working on the Yelp JSON corpus with jq, trying to accomplish a join-and-filter task. business.json contains categories and business_id, from which I can get the IDs of all restaurants. I then want to use those IDs to filter reviews.json and extract all reviews for restaurants.

This sounds straightforward in an RDBMS, but I want to learn the jq way.

Can anyone help?

Things I have tried.

  1. Extracted the business IDs and saved them in id.txt. But it seems impossible to refer to id.txt in jq.

  2. In a script, loop over all the IDs and execute jq --arg id $line '. | select( .business_id | contains($id))' reviews.json

  3. Joining the two JSON files may be possible, but I am reluctant to do so because of the size of the files (~1 GB).

Edited according to comments:

Simplified sample input: business.json

{
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
  "full_address": "4840 E Indian School Rd\\nSte 101\\nPhoenix, AZ 85018",
  "categories": ["Restaurant"]
}

reviews.json

{
  "date": "2012-05-15",
  "text": "Got a letter in the mail last week that said Dr. Goldberg is moving to Arizona to take a new position there in June. He will be missed very much. \\n\\nI think finding a new doctor in NYC that you actually like might almost be as awful as trying to find a date!",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

Best attempt: I am able to match documents against multiple IDs, like

jq '. | select( .business_id | contains("LRKJF43s9-3jG9Lgx4zODg", "uGykseHzyS5xAMWoN6YUqA"))' reviews.json

But I couldn't replace the query strings with variables. The following doesn't work:

jq --arg t vcNAWiLM4dR7D2nwwJ7nCA '. | select( .business_id | contains(env.t))' reviews.json

It's not clear to me from your description whether each business and each review is a top-level object. However, it appears that both businesses and reviews can be presented as streams of objects, so in the following I will assume that:

(a) both reviews.json and businesses.json are files of JSON objects;
(b) it is acceptable to read all the reviews into memory.

(If, conversely, it is only acceptable to read the businesses into memory, the following can easily be revised.)

The logic is: read all the reviews, and then for each restaurant, extract the reviews for that restaurant.

select(.categories | index("Restaurant"))
| .business_id as $business_id
| $reviews[]
| select( .type == "review" and .business_id == $business_id)

Invocation (with the filter above saved in a file named yelp.jq, which is passed to jq with -f):

$ jq --slurpfile reviews reviews.json -f yelp.jq businesses.json

Please note that the --slurpfile option is NOT available in jq 1.4.
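
To try the approach on a small scale, here is a minimal, self-contained sketch. It assumes jq 1.5 or later is on the PATH. The second business, its ID ("hypothetical-doctor-id"), and the review texts are made up for illustration and are not from the Yelp corpus. The script writes everything into a scratch directory created with mktemp.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: one restaurant and one non-restaurant.
cat > businesses.json <<'EOF'
{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "categories": ["Restaurant"]}
{"business_id": "hypothetical-doctor-id", "categories": ["Doctors"]}
EOF

# Hypothetical reviews: one for each business.
cat > reviews.json <<'EOF'
{"type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "text": "Great tacos."}
{"type": "review", "business_id": "hypothetical-doctor-id", "text": "Short wait."}
EOF

# The filter from the answer, saved as yelp.jq.
cat > yelp.jq <<'EOF'
select(.categories | index("Restaurant"))
| .business_id as $business_id
| $reviews[]
| select(.type == "review" and .business_id == $business_id)
EOF

# --slurpfile binds $reviews to an array of every object in reviews.json;
# -f reads the program from yelp.jq; -c prints one compact object per line.
result=$(jq -c --slurpfile reviews reviews.json -f yelp.jq businesses.json)
printf '%s\n' "$result"
```

Only the review whose business_id belongs to the restaurant should be printed. Note that this scans the whole $reviews array once per restaurant, which is fine for a demonstration but may be slow on the full corpus.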

(If reviews.json is already an array of JSON objects, then you could use --argfile reviews reviews.json, and thus would not need jq 1.5.)
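
For the converse case mentioned earlier, where only the businesses fit in memory, one possible revision is sketched below. It is an illustration under assumptions, not a tested recipe for the full corpus: it assumes jq 1.5 or later (for --slurpfile and inputs), and the sample data, the ID "hypothetical-doctor-id", and the variable name $is_restaurant are made up for the example. The idea is to build a lookup object keyed by restaurant ID once, then stream the reviews one at a time with -n and inputs.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: one restaurant and one non-restaurant.
cat > businesses.json <<'EOF'
{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "categories": ["Restaurant"]}
{"business_id": "hypothetical-doctor-id", "categories": ["Doctors"]}
EOF

cat > reviews.json <<'EOF'
{"type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "text": "Great tacos."}
{"type": "review", "business_id": "hypothetical-doctor-id", "text": "Short wait."}
EOF

# -n stops jq from consuming an input automatically, so the lookup object
# is built exactly once; `inputs` then yields the reviews one by one.
result=$(jq -nc --slurpfile businesses businesses.json '
  # Build {"<restaurant id>": true, ...} from the in-memory businesses.
  ( reduce ( $businesses[]
             | select(.categories | index("Restaurant"))
             | .business_id ) as $id
      ({}; .[$id] = true)
  ) as $is_restaurant
  # Stream the reviews and keep those whose business is a restaurant.
  | inputs
  | select(.type == "review" and $is_restaurant[.business_id])
' reviews.json)
printf '%s\n' "$result"
```

Because the lookup is an object, each review is checked with a single key lookup rather than a scan, and only one review needs to be held in memory at a time.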

