
Split JSON data into multiple files using Python

I have a file that I fetched from the internet. It contains JSON-formatted data.

I am trying to split this file into smaller parts.

For example:

Original file:

{
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "nextPage": [
   {
    "title": "Google Custom Search - pagerank",
    "totalResults": "14700000",
    "searchTerms": "pagerank",
    "count": 10,
    "startIndex": 11,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "017576662512468239146:omuauf_lfve"
   }
  ],
  "request": [
   {
    "title": "Google Custom Search - pagerank",
    "totalResults": "14700000",
    "searchTerms": "pagerank",
    "count": 10,
    "startIndex": 1,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "017576662512468239146:omuauf_lfve"
   }
  ]
 },
 "context": {
  "title": "CS Curriculum",
  "facets": [
   [
    {
     "label": "lectures",
     "anchor": "Lectures",
     "label_with_op": "more:lectures"
    }
   ],
   [
    {
     "label": "assignments",
     "anchor": "Assignments",
     "label_with_op": "more:assignments"
    }
   ],
   [
    {
     "label": "reference",
     "anchor": "Reference",
     "label_with_op": "more:reference"
    }
   ]
  ]
 },
 "searchInformation": {
  "searchTime": 0.239874,
  "formattedSearchTime": "0.24",
  "totalResults": "14700000",
  "formattedTotalResults": "14,700,000"
 },
 "items": [
  {
   "kind": "customsearch#result",
   "title": "Lecture slides on PageRank",
   "htmlTitle": "Lecture slides on \u003cb\u003ePageRank\u003c/b\u003e",
   "link": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
   "displayLink": "www.cs.utexas.edu",
   "snippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.",
   "htmlSnippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms & \u003cb\u003ePageRank\u003c/b\u003e. \u003cbr\u003e\nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.",
   "cacheId": "CwgPK6hTEZQJ",
   "mime": "application/vnd.ms-powerpoint",
   "fileFormat": "Microsoft Powerpoint",
   "formattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
   "htmlFormattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-\u003cb\u003epagerank\u003c/b\u003e.ppt",
   "pagemap": {
    "metatags": [
     {
      "author": "jhebert",
      "last saved by": "Google"
     }
    ]
   }
  },
  {
   "kind": "customsearch#result",
   "title": "The PageRank Citation Ranking: Bringing Order to the Web January ...",
   "htmlTitle": "The \u003cb\u003ePageRank\u003c/b\u003e Citation Ranking: Bringing Order to the Web January \u003cb\u003e...\u003c/b\u003e",
   "link": "http://www.cis.upenn.edu/~mkearns/teaching/NetworkedLife/pagerank.pdf",
   "displayLink": "www.cis.upenn.edu",
   "snippet": "Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \nThis ranking, called PageRank, helps search engines and.",
   "htmlSnippet": "Jan 29, 1998 \u003cb\u003e...\u003c/b\u003e We compare \u003cb\u003ePageRank\u003c/b\u003e to an idealized random Web surfer. We show how to ... \u003cbr\u003e\nThis ranking, called \u003cb\u003ePageRank\u003c/b\u003e, helps search engines and.",
   "cacheId": "akmuPYNhiKMJ",
   "mime": "application/pdf",
   "fileFormat": "PDF/Adobe Acrobat",
   "formattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../pagerank.pdf",
   "htmlFormattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../\u003cb\u003epagerank\u003c/b\u003e.pdf",
   "pagemap": {
    "cse_image": [
     {
      "src": "x-raw-image:///9a2d934c7c41f83c4c97c3fb9a4cb4cc8fbcb453aaf1002ed6f970005773aa0e"
     }
    ],
    "cse_thumbnail": [
     {
      "width": "262",
      "height": "193",
      "src": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQcCouA-BJlMWA0HZNMSxsXzbqIZzgu6tXXRqiuse2sttpJaNK2b0cNbm4"
     }
    ],
    "metatags": [
     {
      "producer": "AFPL Ghostscript 7.0",
      "creator": "dvipsk 5.58f Copyright 1986, 1994 Radical Eye Software",
      "title": "prpaperdraft.dvi"
     }
    ]
   }
  },
  {
   "kind": "customsearch#result",
   "title": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...",
   "htmlTitle": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES \u003cb\u003e...\u003c/b\u003e",
   "link": "http://stanford.edu/class/math51/PageRank.pdf",
   "displayLink": "stanford.edu",
   "snippet": "Google's method1 is called the PageRank algorithm and was developed by \nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
   "htmlSnippet": "Google's method1 is called the \u003cb\u003ePageRank\u003c/b\u003e algorithm and was developed by \u003cbr\u003e\nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
   "cacheId": "RKV6ZEmHrjUJ",
   "mime": "application/pdf",
   "fileFormat": "PDF/Adobe Acrobat",
   "formattedUrl": "stanford.edu/class/math51/PageRank.pdf",
   "htmlFormattedUrl": "stanford.edu/class/math51/\u003cb\u003ePageRank\u003c/b\u003e.pdf",
   "pagemap": {
    "metatags": [
     {
      "producer": "pdfTeX-1.40.13",
      "creator": "TeX",
      "creationdate": "D:20130604152429-07'00'",
      "moddate": "D:20130604152429-07'00'",
      "fullbanner": "This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012) kpathsea version 6.1.0"
     }
    ]
   }
  },

File after processing

{u'snippet': u'Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \\nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.', u'title': u'Lecture slides on PageRank'}
{u'snippet': u'Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \\nThis ranking, called PageRank, helps search engines and.', u'title': u'The PageRank Citation Ranking: Bringing Order to the Web January ...'}
{u'snippet': u"Google's method1 is called the PageRank algorithm and was developed by \\nGoogle founders Sergey Brin and Larry Page while they were graduate students.", u'title': u'MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...'}

I want to split this into three separate files (.txt or .json),

each beginning with {u'snippet' ... '}

I am trying to do this in order to run a text comparison process.

PS: I have already stripped the file down to the only parts I need, which are the title and snippet. I might have lost the JSON formatting in the process.

Since you seem to be able to find the correct parts of your input, you should be able to write it to independent files. I assume that you have some kind of loop where you find the relevant data:

snippets = []  # placeholder: fill with the 'snippet' strings parsed from your input

for fileno, snippet_var in enumerate(snippets, start=1):
    # one file per snippet: file_01.txt, file_02.txt, ...
    with open('file_{:02d}.txt'.format(fileno), 'w') as f:
        f.write(snippet_var + "\n")

This will give you numbered files, starting from 1 with leading zeros. If my assumptions are wrong, please update your question to display your current way of doing things.

Furthermore, I would advise against your "preprocessing" if all you want to do is extract the "snippet" property of the JSON objects.
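
As a rough sketch of what skipping the preprocessing could look like: the json module can parse the original response directly, and each entry under "items" can then be written to its own file. This assumes Python 3 and that the original file is complete, valid JSON (the excerpt in the question is cut off after the third item). The function name split_items, the output directory, and the item_NN.json naming scheme are my own choices for illustration, not anything from the question.

```python
import json
import os


def split_items(data, out_dir, keys=("title", "snippet")):
    """Write one JSON file per entry in data["items"], keeping only `keys`.

    `data` is the already-parsed search response (a dict).
    Returns the list of file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for fileno, item in enumerate(data.get("items", []), start=1):
        # Keep only the wanted fields; a missing key is stored as null.
        record = {key: item.get(key) for key in keys}
        # Hypothetical naming scheme: item_01.json, item_02.json, ...
        path = os.path.join(out_dir, "item_{:02d}.json".format(fileno))
        with open(path, "w", encoding="utf-8") as f:
            json.dump(record, f, ensure_ascii=False, indent=1)
        paths.append(path)
    return paths
```

To use it, load the original file with json.load and pass the result in, for example split_items(json.load(open_file), "parts"), where open_file and "parts" stand in for your own file handle and output directory. Because each output file is valid JSON, the comparison step can read it back with json.load instead of dealing with the {u'snippet': ...} text form.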
