
Parsing a json file to pandas dataframe

I need to parse some JSON files into a pandas DataFrame. I want one column with the words present in the text and another column with the corresponding entity. The entity should be the "Type" from the JSON below when the "Value" matches the word; otherwise I want to assign the label 'O'.

Below is an example. This is the JSON file:

       {"Text": "I currently use a Netgear Nighthawk AC1900. I find it reliable.",
        "Entities": [
        {
            "Type": "ORGANIZATION ", 
            "Value": "Netgear"
        }, 
        {
            "Type": "DEVICE ", 
            "Value": "Nighthawk AC1900"
        }]
       }

Here is what I want to get:

              WORD                TAG
              I                    O
              currently            O
              use                  O
              a                    O
              Netgear              ORGANIZATION
              Nighthawk AC1900     DEVICE
              .                    O
              I                    O
              find                 O
              it                   O
              reliable             O
              .                    O

Can someone help me with the parsing? I can't use split() because sometimes a value consists of two words. I hope this is clear. Thank you!

This is a difficult problem, and the right approach depends on data that isn't shown in this example and on the output you need. Do the entity values repeat? Is order important? Do you want repetition in the output?

There are a few tools that can be used:

  • Make a trie out of the entity values before you search the string. This is good if you have overlapping versions of the same name, like "Netgear" and "Netgear INC.", and you want the longest version.
  • nltk.PunktSentenceTokenizer. This one is finicky to work with when it comes to nouns. This tutorial does a better job of explaining how to deal with them.
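
A minimal sketch of the trie idea, assuming a simple regex tokenizer that separates words from punctuation. The names tokenize, build_trie, tag_words and TAG_KEY are hypothetical helpers made up for this illustration, not part of any library. The trie is keyed on tokens, and the scan keeps the longest entity that matches at each position, so "Netgear INC." would win over "Netgear" when both are listed.

```python
import re

# Sentinel key marking the end of an entity; the tokenizer below can
# never produce it as a token, so it cannot collide with real words.
TAG_KEY = "<tag>"


def tokenize(text):
    """Split text into words and single punctuation marks (an assumption)."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_trie(entities):
    """Build a token-level trie from a list of {"Type", "Value"} dicts."""
    root = {}
    for ent in entities:
        node = root
        for tok in tokenize(ent["Value"]):
            node = node.setdefault(tok, {})
        # strip() removes the trailing space seen in the sample "Type" values
        node[TAG_KEY] = ent["Type"].strip()
    return root


def tag_words(text, entities):
    """Return (word, tag) rows, preferring the longest entity at each position."""
    trie = build_trie(entities)
    tokens = tokenize(text)
    rows = []
    i = 0
    while i < len(tokens):
        node, j = trie, i
        best_end, best_tag = None, None
        # walk the trie as far as the tokens allow, remembering the last full match
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if TAG_KEY in node:
                best_end, best_tag = j, node[TAG_KEY]
        if best_end is None:
            rows.append((tokens[i], "O"))
            i += 1
        else:
            # tokens are re-joined with spaces, so punctuation inside an
            # entity value ends up space-separated
            rows.append((" ".join(tokens[i:best_end]), best_tag))
            i = best_end
    return rows


sample = {
    "Text": "I currently use a Netgear Nighthawk AC1900. I find it reliable.",
    "Entities": [
        {"Type": "ORGANIZATION ", "Value": "Netgear"},
        {"Type": "DEVICE ", "Value": "Nighthawk AC1900"},
    ],
}
rows = tag_words(sample["Text"], sample["Entities"])
for word, tag in rows:
    print(word, tag)
```

The rows can be handed to pandas with pd.DataFrame(rows, columns=['WORD', 'TAG']). Repeated entities are tagged every time they occur, and the order of the text is preserved.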

I don't know whether what you need is strictly what you posted as the desired output. The solution I am giving you is "dirty": it produces more rows than you asked for, and unmatched words get 0 rather than 'O'. You can clean it up and put it in the format you need. As you didn't provide a piece of code to start from, you can finish it yourself. Eventually you will find that the purpose of Stack Overflow is not to get people to write the code for you, but to get help with the code you are trying to write.

import json
import pandas as pd

# open and read the JSON file
with open('netgear.json', 'r') as jfile:
    info = json.load(jfile)

# split the text on whitespace and pull out the entity list
words, tags = info['Text'].split(), info['Entities']

# list to hold the [Type, Value] pairs of the entities
prelist = []

for i in tags:

    # read the keys explicitly instead of relying on dict ordering
    j = [i['Type'], i['Value']]
    # ['ORGANIZATION ', 'Netgear']
    # ['DEVICE ', 'Nighthawk AC1900']

    prelist.append(j)

# DataFrames to be merged
dft = pd.DataFrame(prelist, columns=['TAG', 'WORD'])
dfw = pd.DataFrame(words, columns=['WORD'])

# combine the DataFrames and replace NaN with 0
df = dfw.merge(dft, on='WORD', how='outer').fillna(0)

This is the output:

                 WORD            TAG
0                  I              0
1                  I              0
2          currently              0
3                use              0
4                  a              0
5            Netgear  ORGANIZATION 
6          Nighthawk              0
7            AC1900.              0
8               find              0
9                 it              0
10         reliable.              0
11  Nighthawk AC1900        DEVICE 
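
The merge output still splits "Nighthawk AC1900" into separate words, keeps punctuation attached to words, and loses the order of the text. One possible way to get closer to the table in the question, offered as a sketch rather than a definitive fix, is to scan the raw text with a single regular expression that tries the entity values first (longest first) and falls back to plain words and punctuation marks. The function name tag_with_regex is made up for this example, and it assumes each value maps to exactly one type.

```python
import re


def tag_with_regex(text, entities):
    """Return (word, tag) rows in text order; unmatched tokens get 'O'."""
    # value -> type, with stray whitespace (e.g. "DEVICE ") stripped
    tags = {
        e["Value"].strip(): e["Type"].strip()
        for e in entities
        if e["Value"].strip()
    }
    # longest values first, so a longer entity wins over its own prefix
    values = sorted(tags, key=len, reverse=True)

    parts = []
    if values:
        alternatives = "|".join(re.escape(v) for v in values)
        # lookarounds stop "Netgear" from matching inside "Netgears"
        parts.append(rf"(?<!\w)(?P<ent>{alternatives})(?!\w)")
    parts.append(r"\w+")      # any other word
    parts.append(r"[^\w\s]")  # any single punctuation mark
    pattern = re.compile("|".join(parts))

    rows = []
    for match in pattern.finditer(text):
        word = match.group(0)
        # lastgroup is "ent" only when the entity alternative matched
        tag = tags[word] if match.lastgroup == "ent" else "O"
        rows.append((word, tag))
    return rows


text = "I currently use a Netgear Nighthawk AC1900. I find it reliable."
entities = [
    {"Type": "ORGANIZATION ", "Value": "Netgear"},
    {"Type": "DEVICE ", "Value": "Nighthawk AC1900"},
]
result = tag_with_regex(text, entities)
for word, tag in result:
    print(f"{word:<20}{tag}")
```

Passing the rows to pd.DataFrame(result, columns=['WORD', 'TAG']) gives the two-column frame. Matching is case-sensitive here; whether that is right depends on the data.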
