
How to read JSON with HTML as a pandas DataFrame

I am trying to convert this data: https://cv.iptc.org/newscodes/mediatopic/ into a pandas DataFrame.

I am mainly interested in the "Q-code", e.g. Concept ID (QCode) = medtop:01000000, and the Name (en-GB) that belongs to it, e.g. arts, culture, entertainment and media.

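My best attempt so far has been to download the data as a JSON file. At the top of the website there is a link: "View this scheme in other formats: NewsML G2 Knowledge Item | RDF/XML | RDF/Turtle | JSON-LD".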

After downloading the JSON file, I had to remove the first part:

"@context": "https://www.iptc.org/std/IKOS/IKOS.jsonld", 
"uri": "http://cv.iptc.org/newscodes/mediatopic/", 
"type": "http://www.w3.org/2004/02/skos/core#ConceptScheme", 
"prefSchemeAlias": "medtop", 
"authority": "http://www.iptc.org", 
"copyrightHolder": "IPTC, International Press Telecommunications Council - https://iptc.org", 
"licenceLink": "http://creativecommons.org/licenses/by/4.0/", 
"dateReleased": "2022-07-07T12:00:00+00:00", 
"prefLabel" : {
"en-GB" : "Media Topic"},
"definition" : {
"en-GB" : "Indicates a subject of an item."},
"note" : {
"en-GB" : "The Media Topic NewsCodes has been IPTC's primary subject taxonomy since 2010, with a focus on classification of text. The development started with our previous Subject Codes taxonomy and extended the tree to 5 levels and reused the same 17 top level terms. The terms below the top level have been revised and rearranged. Most Media Topic concepts provide a mapping back to one of the Subject Codes, and many provide a mapping to Wikidata."},
"hasTopConcept" : [
"http://cv.iptc.org/newscodes/mediatopic/01000000", "http://cv.iptc.org/newscodes/mediatopic/02000000", "http://cv.iptc.org/newscodes/mediatopic/03000000", "http://cv.iptc.org/newscodes/mediatopic/04000000", "http://cv.iptc.org/newscodes/mediatopic/05000000", "http://cv.iptc.org/newscodes/mediatopic/06000000", "http://cv.iptc.org/newscodes/mediatopic/07000000", "http://cv.iptc.org/newscodes/mediatopic/08000000", "http://cv.iptc.org/newscodes/mediatopic/09000000", "http://cv.iptc.org/newscodes/mediatopic/10000000", "http://cv.iptc.org/newscodes/mediatopic/11000000", "http://cv.iptc.org/newscodes/mediatopic/12000000", "http://cv.iptc.org/newscodes/mediatopic/13000000", "http://cv.iptc.org/newscodes/mediatopic/14000000", "http://cv.iptc.org/newscodes/mediatopic/15000000", "http://cv.iptc.org/newscodes/mediatopic/16000000", "http://cv.iptc.org/newscodes/mediatopic/17000000"
],

Then I loaded the JSON file in my Jupyter notebook:

import pandas as pd  # read_json lives in pandas; the json module is not needed here

df = pd.read_json("cptall-en-GB.json")

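The result has two columns: one with the index, and one that holds a single long string with all the information. The first two entries look like this: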

{'conceptSet': {0: {'uri': 'http://cv.iptc.org/newscodes/mediatopic/01000000',
   'qcode': 'medtop:01000000',
   'type': ['http://www.w3.org/2004/02/skos/core#Concept'],
   'inScheme': ['http://cv.iptc.org/newscodes/mediatopic/'],
   'modified': '2021-02-18T12:00:00+00:00',
   'prefLabel': {'en-GB': 'arts, culture, entertainment and media'},
   'definition': {'en-GB': 'All forms of arts, entertainment, cultural heritage and media'},
   'narrower': ['medtop:20000002', 'medtop:20000038', 'medtop:20000045'],
   'exactMatch': ['http://cv.iptc.org/newscodes/subjectcode/01000000'],
   'created': '2009-10-22T02:00:00+00:00'},
  1: {'uri': 'http://cv.iptc.org/newscodes/mediatopic/02000000',
   'qcode': 'medtop:02000000',
   'type': ['http://www.w3.org/2004/02/skos/core#Concept'],
   'inScheme': ['http://cv.iptc.org/newscodes/mediatopic/'],
   'modified': '2021-05-05T12:00:00+00:00',
   'prefLabel': {'en-GB': 'crime, law and justice'},
   'definition': {'en-GB': 'The establishment and/or statement of the rules of behaviour in society, the enforcement of these rules, breaches of the rules, the punishment of offenders and the organisations and bodies involved in these activities'},
   'narrower': ['medtop:20000082',
    'medtop:20000106',
    'medtop:20000119',
    'medtop:20000121',
    'medtop:20000129'],
   'exactMatch': ['http://cv.iptc.org/newscodes/subjectcode/02000000',
    'https://www.wikidata.org/entity/Q146491'],
   'created': '2009-10-22T02:00:00+00:00'}}}

Any suggestions on how to turn this into a nicer-looking DataFrame, with all the fields as columns?

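You can remove

"conceptSet":

from the beginning (together with the brace that opens it) and

}

from the end, so that the file is just a top-level list of concept objects. After that, try reading it again. I downloaded and imported the dataset, and it works fine for me.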
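The same result can probably be reached without editing the file by hand. The following is only a minimal sketch: it assumes the records sit under a top-level "conceptSet" key, as the question's output suggests, and the helper name `concepts_to_frame` is made up for illustration. A small inline document stands in for the downloaded file.

```python
import json

import pandas as pd


def concepts_to_frame(doc: dict) -> pd.DataFrame:
    """Return one row per concept from an already parsed JSON document."""
    # "conceptSet" is the key visible in the question's output; it is assumed
    # to hold a list of dicts, one per concept.
    return pd.DataFrame(doc["conceptSet"])


# Inline stand-in for the downloaded file, with the header fields trimmed.
sample = json.loads("""
{
  "uri": "http://cv.iptc.org/newscodes/mediatopic/",
  "conceptSet": [
    {"uri": "http://cv.iptc.org/newscodes/mediatopic/01000000",
     "qcode": "medtop:01000000",
     "prefLabel": {"en-GB": "arts, culture, entertainment and media"}},
    {"uri": "http://cv.iptc.org/newscodes/mediatopic/02000000",
     "qcode": "medtop:02000000",
     "prefLabel": {"en-GB": "crime, law and justice"}}
  ]
}
""")

df = concepts_to_frame(sample)
print(df.columns.tolist())
```

With the real file, the parsed document would come from `json.load` on the opened file instead of the inline string. The header fields can stay in the file, because only the list under "conceptSet" is passed to pandas.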

My code:

import pandas as pd

df = pd.read_json(r"C:\Users\hp\Downloads\cptall-en-US.json")
print(df)

My output:

                                                   uri  ... hasFacet
0     http://cv.iptc.org/newscodes/mediatopic/01000000  ...      NaN
1     http://cv.iptc.org/newscodes/mediatopic/02000000  ...      NaN
2     http://cv.iptc.org/newscodes/mediatopic/03000000  ...      NaN
3     http://cv.iptc.org/newscodes/mediatopic/04000000  ...      NaN
4     http://cv.iptc.org/newscodes/mediatopic/05000000  ...      NaN
...                                                ...  ...      ...
1351  http://cv.iptc.org/newscodes/mediatopic/20001355  ...      NaN
1352  http://cv.iptc.org/newscodes/mediatopic/20001356  ...      NaN
1353  http://cv.iptc.org/newscodes/mediatopic/20001357  ...      NaN
1354  http://cv.iptc.org/newscodes/mediatopic/20001358  ...      NaN
1355  http://cv.iptc.org/newscodes/mediatopic/20001359  ...      NaN

[1356 rows x 15 columns]
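Since the question is mainly about the Q-code and its name, a frame like the one above could be reduced to those two fields. This is a rough sketch, not part of the original answer: it assumes the `prefLabel` column holds dicts keyed by language tag (as in the JSON shown in this thread), and the `name` column and `lang` variable are introduced here only for illustration.

```python
import pandas as pd

# Two rows shaped like the DataFrame printed above (most columns omitted).
df = pd.DataFrame([
    {"qcode": "medtop:01000000",
     "prefLabel": {"en-US": "arts, culture, entertainment and media"}},
    {"qcode": "medtop:02000000",
     "prefLabel": {"en-US": "crime, law and justice"}},
])

lang = "en-US"  # use "en-GB" for the en-GB download

# Pull the label for the chosen language out of each dict; rows whose
# prefLabel is not a dict (e.g. NaN) get None instead of raising.
df["name"] = df["prefLabel"].apply(
    lambda label: label.get(lang) if isinstance(label, dict) else None
)

codes = df[["qcode", "name"]]
print(codes)
```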

Some of the data in my JSON file:

[
    {
        "uri": "http://cv.iptc.org/newscodes/mediatopic/01000000",
        "qcode": "medtop:01000000",
        "type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
        ],
        "inScheme": [
            "http://cv.iptc.org/newscodes/mediatopic/"
        ],
        "modified": "2021-02-18T12:00:00+00:00",
        "prefLabel": {
            "en-US": "arts, culture, entertainment and media"
        },
        "definition": {
            "en-US": "All forms of arts, entertainment, cultural heritage and media"
        },
        "narrower": [
            "medtop:20000002",
            "medtop:20000038",
            "medtop:20000045"
        ],
        "exactMatch": [
            "http://cv.iptc.org/newscodes/subjectcode/01000000"
        ],
        "created": "2009-10-22T02:00:00+00:00"
    },
    {
        "uri": "http://cv.iptc.org/newscodes/mediatopic/02000000",
        "qcode": "medtop:02000000",
        "type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
        ],
        "inScheme": [
            "http://cv.iptc.org/newscodes/mediatopic/"
        ],
        "modified": "2021-05-05T12:00:00+00:00",
        "prefLabel": {
            "en-US": "crime, law and justice"
        },
        "definition": {
            "en-US": "The establishment and/or statement of the rules of behavior in society, the enforcement of these rules, breaches of the rules, the punishment of offenders and the organizations and bodies involved in these activities"
        },
        "narrower": [
            "medtop:20000082",
            "medtop:20000106",
            "medtop:20000119",
            "medtop:20000121",
            "medtop:20000129"
        ],
        "exactMatch": [
            "http://cv.iptc.org/newscodes/subjectcode/02000000",
            "https://www.wikidata.org/entity/Q146491"
        ],
        "created": "2009-10-22T02:00:00+00:00"
    }]
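Records shaped like these still carry nested objects such as `prefLabel` and `definition`, which `pd.read_json` leaves as dicts inside a single column. One possible way to spread them into columns of their own is `pd.json_normalize`. The sketch below is an illustration under that assumption, using abbreviated copies of the two records above; list-valued fields such as `narrower` stay as Python lists.

```python
import pandas as pd

# Abbreviated copies of the two records shown above.
records = [
    {
        "qcode": "medtop:01000000",
        "prefLabel": {"en-US": "arts, culture, entertainment and media"},
        "narrower": ["medtop:20000002", "medtop:20000038", "medtop:20000045"],
    },
    {
        "qcode": "medtop:02000000",
        "prefLabel": {"en-US": "crime, law and justice"},
        "narrower": ["medtop:20000082", "medtop:20000106"],
    },
]

# Nested dicts are flattened into dotted column names such as
# "prefLabel.en-US"; lists are kept as single cell values.
flat = pd.json_normalize(records)
print(flat[["qcode", "prefLabel.en-US"]])
```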

