Removing pattern from multiple lines using sed or awk in two places in the same line

I have a JSON file with 12,166,466 lines. I want to remove the quotes from the values of the keys "timestamp" and "score", so that "timestamp": "1538564256", and "score": "10", become "timestamp": 1538564256, and "score": 10,.

Input:

{
    "title": "DNS domain", ,
    "timestamp": "1538564256",
    "domain": {
        "dns": [
            "www.google.com"
        ]
    },
    "score": "10",
    "link": "www.bit.ky/sdasd/asddsa"
    "id": "c-1eOWYB9XD0VZRJuWL6"
}, {
    "title": "DNS domain",
    "timestamp": "1538564256",
    "domain": {
        "dns": [
            "google.de"
        ]
    },
    "score": "10",
    "link": "www.bit.ky/sdasd/asddsa",
    "id": "du1eOWYB9XD0VZRJuWL6"
}
}

Expected output:

{
    "title": "DNS domain", ,
    "timestamp": 1538564256,
    "domain": {
        "dns": [
            "www.google.com"
        ]
    },
    "score": 10,
    "link": "www.bit.ky/sdasd/asddsa"
    "id": "c-1eOWYB9XD0VZRJuWL6"
}, {
    "title": "DNS domain",
    "timestamp": 1538564256,
    "domain": {
        "dns": [
            "google.de"
        ]
    },
    "score": 10,
    "link": "www.bit.ky/sdasd/asddsa",
    "id": "du1eOWYB9XD0VZRJuWL6"
}
}

I have tried:

sed -E '
s/"timestamp": "/"timestamp": /g
s/"score": "/"score": /g
'

The first part is straightforward, but how do I remove the closing quote (the " before the comma) at the end of the lines that contain "timestamp" and "score"? How can I do that with sed, awk, or another tool, bearing in mind that I have 12 million lines to process?

Assuming that you fix your JSON input file like this:

<file jq .
[
  {
    "title": "DNS domain",
    "timestamp": "1538564256",
    "domain": {
      "dns": [
        "www.google.com"
      ]
    },
    "score": "10",
    "link": "www.bit.ky/sdasd/asddsa",
    "id": "c-1eOWYB9XD0VZRJuWL6"
  },
  {
    "title": "DNS domain",
    "timestamp": "1538564256",
    "domain": {
      "dns": [
        "google.de"
      ]
    },
    "score": "10",
    "link": "www.bit.ky/sdasd/asddsa",
    "id": "du1eOWYB9XD0VZRJuWL6"
  }
]

You can use jq and its tonumber function to convert the relevant strings to numbers:

<file jq '.[].timestamp |= tonumber | .[].score |= tonumber'

If the JSON structure roughly matches your example (e.g., there is no other whitespace between "timestamp", the colon, and the value), then this awk command should be fine; note that gensub is a GNU awk extension. If jq is available, it is by far the better choice for JSON transformation!

awk '{print gensub(/("(timestamp|score)": )"([0-9]+)"/, "\\1\\3", "g")}' file

  1. Be warned that tonumber can lose precision. If using tonumber is unacceptable, and if the input was produced by jq (or is otherwise laid out with one key/value pair per line), then using awk as proposed elsewhere on this page is a good way to go. (If your awk does not have gensub, the awk program can easily be adapted.) Here is the same thing using sed, assuming its flag for extended regex processing is -E:

    sed -E -e 's/"(timestamp|score)": "([0-9]+)"/"\1": \2/'

  2. For reference, if there is any doubt about where the relevant keys are located, here is a jq filter that is agnostic about that:

    walk(if type == "object" then if has("timestamp") then .timestamp|=tonumber else . end | if has("score") then .score|=tonumber else . end else . end)

If your jq does not have walk/1, then simply copy its definition from the web, e.g. from https://raw.githubusercontent.com/stedolan/jq/master/src/builtin.jq

  3. If you wanted to convert all number-valued strings to numbers, you could write:

    walk(if type=="object" then map_values(tonumber? // .) else . end)
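
Point 1 above notes that the awk program can be adapted for an awk without gensub. A minimal sketch of one such adaptation follows, using only match, substr and sub, which POSIX awk provides. The function name unquote_numbers is made up for illustration, and the sketch assumes, like the other awk and sed answers, that exactly one space separates the colon from the quoted value.

```shell
# Hypothetical helper: strips the quotes around all-digit values of
# "timestamp" and "score". Reads the named files, or stdin if none given.
unquote_numbers() {
    awk '
    {
        line = $0
        out = ""
        # Handle each "timestamp": "<digits>" or "score": "<digits>" in turn.
        while (match(line, /"(timestamp|score)": "[0-9]+"/)) {
            hit = substr(line, RSTART, RLENGTH)
            sub(/: "/, ": ", hit)   # drop the quote that opens the value
            sub(/"$/, "", hit)      # drop the quote that closes the value
            out = out substr(line, 1, RSTART - 1) hit
            line = substr(line, RSTART + RLENGTH)
        }
        print out line
    }' "$@"
}
```

It could be called as unquote_numbers file > file.new. Like the gensub version, it processes one line at a time, so memory use should not grow with the size of the file.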

This might work for you (GNU sed):

sed ':a;/"timestamp":\s*"1538564256",/{s/"//3g;:b;n;/timestamp/ba;/"score":\s*"10"/s/"//3g;Tb}' file

On encountering a line that contains "timestamp": "1538564256", remove the third and all subsequent double quotes. Then read on until either another line containing timestamp turns up, in which case repeat, or a line containing "score": "10" turns up, in which case likewise remove the third and all subsequent double quotes.
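
The command above hardcodes the values 1538564256 and 10. As a rough sketch, the same s/"//3g idiom can be applied to any all-digit value by addressing lines with a regex instead. The function name unquote_with_sed is made up for illustration, and combining a number with g in the s flags is GNU sed behaviour, so this is not expected to work with other sed implementations.

```shell
# Hypothetical helper, GNU sed only: on lines where "timestamp" or "score"
# has a quoted all-digit value, delete the 3rd and every later double quote.
# Assumes one key/value pair per line, as in pretty-printed jq output.
unquote_with_sed() {
    sed -E '/"(timestamp|score)":[[:space:]]*"[0-9]+"/s/"//3g' "$@"
}
```

Lines that do not match the address, such as the "link" and "id" lines, pass through untouched. A matching line that carried further quoted text after the value would lose those quotes as well, hence the one-pair-per-line assumption.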
