Use a Generator To Convert JSON and TSV Data into a Dictionary
We need to get the data from the file file.data into a DataFrame. The problem is that the data on each line of the file is in either JSON or tab-separated values (TSV) format.
The JSON lines are already in the correct format; they just need to be converted to native Python dicts.
The TSV lines need to be converted into dicts that match the JSON format.
Here is a sample of the file:
{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons Persistent contextually-based standardization 018.666.0600 America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": "+1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}
Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each value in the correct format: a dict with the keys shown in the JSON lines above (company, catch_phrase, phone, timezone, and client_count).
My code so far:
    import json
    import pandas as pd

    df = pd.read_csv("file.data", sep="\t")
    for col in df[["company"]]:
        obj = df[col]
        for item in obj.values:
            json_obj = json.loads(item)
Don't use pandas to read the entire file. Instead, read the file line by line and create a list of dicts. Then use pandas to get your dataframe.
    import json

    dict_data = []
    tsv_data = []
    with open('file.data', 'r') as f:
        for line in f:
            line = line.strip()
            try:
                d = json.loads(line)
                dict_data.append(d)
            except json.JSONDecodeError:  # Raised when you try to loads() a TSV line
                tsv_data.append(line.split("\t"))  # Split the line by tabs, append to the tsv list
After this, we have:
    dict_data = [{'company': 'Watkins Inc',
                  'catch_phrase': 'Integrated radical installation',
                  'phone': '7712422719',
                  'timezone': 'America/New_York',
                  'client_count': 442},
                 {'company': 'Pennington PLC',
                  'catch_phrase': 'Future-proofed tertiary frame',
                  'phone': '+1-312-296-2956x137',
                  'timezone': 'America/Indiana/Indianapolis',
                  'client_count': 638}]
    tsv_data = [['Bennett and Sons',
                 'Persistent contextually-based standardization',
                 '018.666.0600',
                 'America/Los_Angeles',
                 '492'],
                ['Ferguson-Garner',
                 'Multi-layered tertiary neural-net',
                 '(086)401-8955x53502',
                 'America/Los_Angeles',
                 '528']]
Notice that everything in tsv_data is a string, including the client count, so we're going to have to fix that at some point.
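As a quick check of the two things the loop relies on (json.loads rejecting a TSV line, and split returning only strings), here is a minimal sketch. The helper name `classify_line` is made up for illustration, the JSON line is abbreviated, and the tabs in the TSV sample are written out as `\t`.

```python
import json

def classify_line(line):
    """Return ('json', parsed_value) or ('tsv', list_of_strings) for one line."""
    line = line.strip()
    try:
        return "json", json.loads(line)
    except json.JSONDecodeError:
        # json.loads rejects the tab-separated text, so treat it as TSV
        return "tsv", line.split("\t")

# Sample lines (the JSON one is shortened to two keys for this sketch)
json_line = '{"company": "Watkins Inc", "client_count": 442}\n'
tsv_line = (
    "Bennett and Sons\tPersistent contextually-based standardization\t"
    "018.666.0600\tAmerica/Los_Angeles\t492\n"
)

kind, parsed = classify_line(json_line)
print(kind, type(parsed["client_count"]))  # JSON numbers arrive as int

kind, fields = classify_line(tsv_line)
print(kind, type(fields[-1]))              # every TSV field is still a str
print(int(fields[-1]) + 1)                 # convert explicitly before arithmetic
```
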
Now, create a dataframe from each of the two lists dict_data and tsv_data, change the data types of the TSV dataframe to match, and join the two.
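    import pandas as pd

    data_cols = list(dict_data[0].keys())
    df_dict = pd.DataFrame(dict_data)
    df_tsv = pd.DataFrame(tsv_data, columns=data_cols)
    # Cast each TSV column to the dtype pandas inferred for the JSON data
    for column in df_tsv:
        df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df_all = pd.concat([df_dict, df_tsv]).reset_index(drop=True)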
df_all looks like this:

|   | company | catch_phrase | phone | timezone | client_count |
|---|---|---|---|---|---|
| 0 | Watkins Inc | Integrated radical installation | 7712422719 | America/New_York | 442 |
| 1 | Pennington PLC | Future-proofed tertiary frame | +1-312-296-2956x137 | America/Indiana/Indianapolis | 638 |
| 2 | Bennett and Sons | Persistent contextually-based standardization | 018.666.0600 | America/Los_Angeles | 492 |
| 3 | Ferguson-Garner | Multi-layered tertiary neural-net | (086)401-8955x53502 | America/Los_Angeles | 528 |

Applying this to a generator function like you originally wanted:
    import json

    def parse_file(file_iterator):
        dict_keys_types = None
        for line in file_iterator:
            line = line.strip()
            try:
                d = json.loads(line)
                # When you read a valid dict, set the keys and types
                # so you can parse the TSV lines
                dict_keys_types = [
                    (key, type(value))
                    for key, value in d.items()
                ]
                yield d
            except json.JSONDecodeError:  # Raised when you try to loads() a TSV line
                tsv_data = line.split("\t")
                # Iterate over tsv_data and dict_keys_types to convert the TSV data to a dict with the correct types
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
Now you can pass a file iterator to this function, and it'll yield dictionaries like you want:
    # f is the open file object, e.g. from open('file.data', 'r')
    list(parse_file(f))

    [{'company': 'Watkins Inc',
      'catch_phrase': 'Integrated radical installation',
      'phone': '7712422719',
      'timezone': 'America/New_York',
      'client_count': 442},
     {'company': 'Bennett and Sons',
      'catch_phrase': 'Persistent contextually-based standardization',
      'phone': '018.666.0600',
      'timezone': 'America/Los_Angeles',
      'client_count': 492},
     {'company': 'Ferguson-Garner',
      'catch_phrase': 'Multi-layered tertiary neural-net',
      'phone': '(086)401-8955x53502',
      'timezone': 'America/Los_Angeles',
      'client_count': 528},
     {'company': 'Pennington PLC',
      'catch_phrase': 'Future-proofed tertiary frame',
      'phone': '+1-312-296-2956x137',
      'timezone': 'America/Indiana/Indianapolis',
      'client_count': 638}]
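Since the generator only needs something that yields lines, you can try it without touching the disk. The sketch below is a condensed copy of the generator, fed from an `io.StringIO` that stands in for the open file; the two sample lines come from the excerpt in the question, with tabs written as `\t`.

```python
import io
import json

def parse_file(file_iterator):
    # Condensed copy of the generator shown earlier in this answer
    dict_keys_types = None
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            yield d
        except json.JSONDecodeError:
            tsv_data = line.split("\t")
            yield {key: dtype(value)
                   for value, (key, dtype) in zip(tsv_data, dict_keys_types)}

# io.StringIO behaves like an open text file: iterating it yields lines
sample = io.StringIO(
    '{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", '
    '"phone": "7712422719", "timezone": "America/New_York", "client_count": 442}\n'
    "Bennett and Sons\tPersistent contextually-based standardization\t"
    "018.666.0600\tAmerica/Los_Angeles\t492\n"
)

rows = parse_file(sample)
first = next(rows)    # the generator is lazy: one line is parsed per next() call
second = next(rows)
print(second["company"], second["client_count"], type(second["client_count"]))
# Bennett and Sons 492 <class 'int'>
```

The resulting list of dicts can then go straight into `pd.DataFrame(list(parse_file(f)))`, which skips the separate dtype conversion step.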
When the first line of the file is not a JSON dict, this will raise a TypeError, because dict_keys_types is still None when the first TSV line is parsed. Instead of inferring the keys and data types from the first JSON dict you see, you could either hardcode the keys and data types, or put the TSV lines that come before the first dict into a separate list to be parsed later.
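The failure is easy to reproduce in isolation: while dict_keys_types is still None, the zip call inside the dict comprehension has nothing to iterate over. A minimal sketch, using a shortened two-field TSV line made up for the demonstration:

```python
dict_keys_types = None  # state before any JSON line has been seen
tsv_data = "Bennett and Sons\t492".split("\t")

error = None
try:
    # Same comprehension as in parse_file
    {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```
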
Hardcode approach:
    def parse_file(file_iterator):
        dict_keys_types = [('company', str),
                           ('catch_phrase', str),
                           ('phone', str),
                           ('timezone', str),
                           ('client_count', int)]
        for line in file_iterator:
            line = line.strip()
            try:
                d = json.loads(line)
                yield d
            except json.JSONDecodeError:  # Raised when you try to loads() a TSV line
                tsv_data = line.split("\t")
                # Iterate over tsv_data and dict_keys_types to convert the TSV data to a dict with the correct types
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
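One thing to watch with a fixed schema: zip stops at the shorter of its inputs, so a TSV line with a missing field would quietly produce a dict with fewer keys. If that matters for your data, a guard along these lines might help. This is only a sketch; the helper name `tsv_to_dict` and the choice to raise ValueError are assumptions, not part of the original code.

```python
DICT_KEYS_TYPES = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]

def tsv_to_dict(line, keys_types=DICT_KEYS_TYPES):
    """Convert one TSV line to a dict, rejecting lines with the wrong field count."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(keys_types):
        raise ValueError(
            f"expected {len(keys_types)} fields, got {len(fields)}: {line!r}"
        )
    return {key: dtype(value) for value, (key, dtype) in zip(fields, keys_types)}

good = (
    "Ferguson-Garner\tMulti-layered tertiary neural-net\t"
    "(086)401-8955x53502\tAmerica/Los_Angeles\t528"
)
print(tsv_to_dict(good)["client_count"])

bad = "Ferguson-Garner\tMulti-layered tertiary neural-net"  # three fields missing
try:
    tsv_to_dict(bad)
except ValueError as exc:
    print("rejected:", exc)
```
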
Save-for-later approach:
    def parse_file(file_iterator):
        dict_keys_types = None
        unused_tsv_lines = []
        for line in file_iterator:
            line = line.strip()
            try:
                d = json.loads(line)
                # When you read a valid dict, set the keys and types
                # so you can parse the TSV lines
                dict_keys_types = [
                    (key, type(value))
                    for key, value in d.items()
                ]
                yield d
            except json.JSONDecodeError:  # Raised when you try to loads() a TSV line
                tsv_data = line.split("\t")
                if dict_keys_types:  # Check if this is set already
                    # If it is,
                    # iterate over tsv_data and dict_keys_types to convert the TSV data to a dict with the correct types
                    tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                    yield tsv_dict
                else:  # Else add to unused_tsv_lines
                    unused_tsv_lines.append(tsv_data)
        # After you've finished reading the file, try to reparse the lines
        # you skipped before
        if dict_keys_types:  # Before parsing, make sure dict_keys_types was set
            for tsv_data in unused_tsv_lines:
                # With each line, do the same thing as before
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
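One consequence worth knowing: TSV lines that arrive before the first JSON line are yielded at the very end, so the output order no longer matches the file order. The condensed sketch below shows this with shortened two-field lines made up for the demonstration; the inner helper `to_dict` is introduced here only to avoid repeating the comprehension.

```python
import json

def parse_file(file_iterator):
    keys_types = None
    deferred = []  # TSV lines seen before any JSON line

    def to_dict(fields):
        # Reads keys_types at call time, so it always sees the latest schema
        return {key: dtype(value) for value, (key, dtype) in zip(fields, keys_types)}

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            keys_types = [(key, type(value)) for key, value in d.items()]
            yield d
        except json.JSONDecodeError:
            fields = line.split("\t")
            if keys_types:
                yield to_dict(fields)
            else:
                deferred.append(fields)

    # Once the input is exhausted, parse whatever had to wait
    if keys_types:
        for fields in deferred:
            yield to_dict(fields)

lines = [
    "Bennett and Sons\t492\n",  # TSV first: no schema yet, so it is deferred
    '{"company": "Watkins Inc", "client_count": 442}\n',
    "Ferguson-Garner\t528\n",
]
result = list(parse_file(lines))
print([row["company"] for row in result])
# ['Watkins Inc', 'Ferguson-Garner', 'Bennett and Sons']
```

If the original order matters, the hardcoded schema is the simpler choice, since it never has to postpone a line.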