
Use a Generator To Convert JSON and TSV Data into a Dictionary

We need to load the data from the file file.data into a DataFrame. The problem is that each line of the file is in either JSON or tab-separated values (TSV) format.

The JSON lines are already in the correct format; they just need to be converted to native Python dicts.

The TSV lines need to be converted into dicts that match the JSON format.

Here is a sample of the file:

{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons    Persistent contextually-based standardization   018.666.0600    America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net   (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": "+1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}

Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each value in the correct format: a dict with the keys:

  • company
  • catch_phrase
  • phone
  • timezone
  • client_count

My code so far:

import json
import pandas as pd

df = pd.read_csv("file.data", sep="\t")
for col in df[["company"]]:
    obj = df[col]
    for item in obj.values:
        json_obj = json.loads(item)

Don't use pandas to read the entire file. Instead, read the file line by line and create a list of dicts. Then use pandas to build your DataFrame.

import json

dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
    for line in f:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_data.append(d)
        except json.JSONDecodeError:
            # json.loads() raises JSONDecodeError on a TSV line,
            # so split that line on tabs and keep it in the TSV list
            tsv_data.append(line.split("\t"))

After this, we have

dict_data = [{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': '+1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]

tsv_data = [['Bennett and Sons',
  'Persistent contextually-based standardization',
  '018.666.0600',
  'America/Los_Angeles',
  '492'],
 ['Ferguson-Garner',
  'Multi-layered tertiary neural-net',
  '(086)401-8955x53502',
  'America/Los_Angeles',
  '528']]

Notice that everything in tsv_data is a string, including client_count, so we're going to have to fix that at some point.
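
To make the type problem concrete, here is a small sketch that converts a single row from tsv_data by hand, pairing each value with a key and a converter. The names `field_types` and `row_to_dict` are hypothetical, introduced only for this illustration; the keys and types are assumed from the JSON lines in the sample.

```python
# Keys and converters in the column order of the TSV lines.
# `field_types` and `row_to_dict` are illustrative names, not part of the original code.
field_types = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]


def row_to_dict(row):
    """Build a dict from one split TSV row, applying each converter."""
    return {key: convert(value) for (key, convert), value in zip(field_types, row)}


row = ["Bennett and Sons", "Persistent contextually-based standardization",
       "018.666.0600", "America/Los_Angeles", "492"]
record = row_to_dict(row)
print(record["client_count"] + 1)  # 493: client_count is now an int, not the string '492'
```

Only client_count changes type here; the other four fields stay strings, which matches the JSON lines.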

Now, create a DataFrame from each of the two lists, dict_data and tsv_data, change the data types of the TSV DataFrame to match, and concatenate the two.

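import pandas as pd

data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)

# Cast each TSV column to the dtype of the matching JSON column
for column in df_tsv:
    df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_all = pd.concat([df_dict, df_tsv]).reset_index(drop=True)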

df_all looks like this:

   company           catch_phrase                                   phone                timezone                      client_count
0  Watkins Inc       Integrated radical installation                7712422719           America/New_York              442
1  Pennington PLC    Future-proofed tertiary frame                  +1-312-296-2956x137  America/Indiana/Indianapolis  638
2  Bennett and Sons  Persistent contextually-based standardization  018.666.0600         America/Los_Angeles           492
3  Ferguson-Garner   Multi-layered tertiary neural-net              (086)401-8955x53502  America/Los_Angeles           528

Applying the same idea to a generator function, as you originally wanted:

import json

def parse_file(file_iterator):
    dict_keys_types = None

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, record its keys and value types
            # so that you can parse the TSV lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError:  # raised when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Pair each TSV value with its (key, type) to build a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
        

Now you can pass a file iterator (for example, the open file object f) to this function, and it will yield the dictionaries you want:

list(parse_file(f))

[{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Bennett and Sons',
  'catch_phrase': 'Persistent contextually-based standardization',
  'phone': '018.666.0600',
  'timezone': 'America/Los_Angeles',
  'client_count': 492},
 {'company': 'Ferguson-Garner',
  'catch_phrase': 'Multi-layered tertiary neural-net',
  'phone': '(086)401-8955x53502',
  'timezone': 'America/Los_Angeles',
  'client_count': 528},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': '+1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]


If the first line of the file is not a JSON dict, this raises a TypeError, because dict_keys_types is still None when that TSV line is parsed: the keys and data types are not known yet. Instead of inferring the keys and types from the first JSON dict you see, you could either hardcode the keys and data types, or put the TSV lines that come before the first dict into a separate list to be parsed later.
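
The failure is easy to reproduce in isolation. In this sketch, dict_keys_types is left as None, as it would be if a TSV line arrived before any JSON line, and the same dict comprehension is run on one split row. The variable names `row` and `caught` are only for illustration.

```python
row = ("Bennett and Sons\tPersistent contextually-based standardization\t"
       "018.666.0600\tAmerica/Los_Angeles\t492").split("\t")
dict_keys_types = None  # no JSON line has been seen yet

try:
    tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(row, dict_keys_types)}
except TypeError as exc:
    # zip() cannot iterate over None, so the comprehension never produces a dict
    caught = exc
    print(type(exc).__name__)  # TypeError
```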

Hardcode approach:

import json

def parse_file(file_iterator):
    # Keys and types are fixed up front, in the column order of the TSV lines
    dict_keys_types = [('company', str),
                       ('catch_phrase', str),
                       ('phone', str),
                       ('timezone', str),
                       ('client_count', int)]

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            yield d
        except json.JSONDecodeError:  # raised when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Pair each TSV value with its (key, type) to build a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
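
If the file may contain blank lines or malformed rows, the hardcoded version can be made a little stricter. The following is a minimal sketch under that assumption, not part of the original answer: `FIELDS` and `parse_lines` are hypothetical names, blank lines are skipped, and a TSV row with the wrong number of columns raises a ValueError instead of being silently truncated by zip().

```python
import json

# Expected keys and converters, in TSV column order (assumed from the sample file).
FIELDS = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]


def parse_lines(lines):
    """Yield one dict per non-blank line of mixed JSON/TSV input."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than treating them as TSV
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            values = line.split("\t")
            if len(values) != len(FIELDS):
                raise ValueError(
                    f"line {number}: expected {len(FIELDS)} fields, got {len(values)}"
                ) from None
            yield {key: convert(value) for (key, convert), value in zip(FIELDS, values)}


sample = [
    '{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", '
    '"phone": "7712422719", "timezone": "America/New_York", "client_count": 442}\n',
    "\n",
    "Bennett and Sons\tPersistent contextually-based standardization\t"
    "018.666.0600\tAmerica/Los_Angeles\t492\n",
]
records = list(parse_lines(sample))
```

Any iterable of strings works as input, so an open file object can be passed in the same way as the in-memory `sample` list.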

Save-for-later approach:

import json

def parse_file(file_iterator):
    dict_keys_types = None
    unused_tsv_lines = []
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, record its keys and value types
            # so that you can parse the TSV lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError:  # raised when you try to loads() a TSV line
            tsv_data = line.split("\t")
            if dict_keys_types:  # Check whether the keys and types are known yet
                # If they are, pair each TSV value with its (key, type)
                # to build a dict with the correct types
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
            else:  # Otherwise keep the row for later
                unused_tsv_lines.append(tsv_data)

    # After you've finished reading the file, parse the rows
    # you skipped before
    if dict_keys_types:  # Before parsing, make sure dict_keys_types was set
        for tsv_data in unused_tsv_lines:
            # With each row, do the same thing as before
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
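
One consequence of the save-for-later version is that TSV lines seen before the first JSON line are yielded at the very end, so the output order differs from the file order. If order matters, one possible variation is to flush the buffered rows as soon as the first JSON line reveals the keys and types, before yielding that JSON dict. This is a sketch, and `parse_file_in_order`, `pending_rows` and `to_dict` are hypothetical names. Unlike the versions above, it records the keys and types from the first JSON line only.

```python
import json


def parse_file_in_order(file_iterator):
    """Like the save-for-later version, but yields records in file order."""
    dict_keys_types = None
    pending_rows = []  # TSV rows read before any JSON line was seen

    def to_dict(row):
        # Pair each TSV value with its (key, type) and convert it
        return {key: dtype(value) for value, (key, dtype) in zip(row, dict_keys_types)}

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
        except json.JSONDecodeError:
            row = line.split("\t")
            if dict_keys_types is None:
                pending_rows.append(row)  # cannot convert yet
            else:
                yield to_dict(row)
            continue
        if dict_keys_types is None:
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            # Flush buffered rows first so they stay ahead of this JSON record
            for row in pending_rows:
                yield to_dict(row)
            pending_rows.clear()
        yield d


lines = [
    "Bennett and Sons\tPersistent contextually-based standardization\t"
    "018.666.0600\tAmerica/Los_Angeles\t492\n",
    '{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", '
    '"phone": "7712422719", "timezone": "America/New_York", "client_count": 442}\n',
]
records = list(parse_file_in_order(lines))
print([r["company"] for r in records])  # ['Bennett and Sons', 'Watkins Inc']
```

As in the save-for-later version, if the input contains no JSON line at all, the buffered rows are never yielded.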
        
