Remove all the EOFs (extra empty lines) at the end of jsonl files
I am working with jsonl files that look like this in the VSCode editor:
first.jsonl
1.{"ConnectionTime": 730669.644775033,"objectId": "eHFvTUNqTR","CustomName": "Relay Controller","FirmwareRevision": "FW V1.96","DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561","PeripheralType": 9,"updatedAt": "2016-12-13T15:50:41.626Z","Model": "DF Bluno","HardwareRevision": "HW V1.7","Serial": "0123456789","createdAt": "2016-12-13T15:50:41.626Z","Manufacturer": "DFRobot"}
2.{"ConnectionTime": 702937.7616419792, "objectId": "uYuT3zgyez", "CustomName": "Relay Controller", "FirmwareRevision": "FW V1.96", "DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561", "PeripheralType": 9, "updatedAt": "2016-12-13T08:08:29.829Z", "Model": "DF Bluno", "HardwareRevision": "HW V1.7", "Serial": "0123456789", "createdAt": "2016-12-13T08:08:29.829Z", "Manufacturer": "DFRobot"}
3.
4.
5.
6.
second.jsonl
1.{"ConnectionTime": 730669.644775033,"objectId": "eHFvTUNqTR","CustomName": "Relay Controller","FirmwareRevision": "FW V1.96","DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561","PeripheralType": 9,"updatedAt": "2016-12-13T15:50:41.626Z","Model": "DF Bluno","HardwareRevision": "HW V1.7","Serial": "0123456789","createdAt": "2016-12-13T15:50:41.626Z","Manufacturer": "DFRobot"}
2.{"ConnectionTime": 702937.7616419792, "objectId": "uYuT3zgyez", "CustomName": "Relay Controller", "FirmwareRevision": "FW V1.96", "DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561", "PeripheralType": 9, "updatedAt": "2016-12-13T08:08:29.829Z", "Model": "DF Bluno", "HardwareRevision": "HW V1.7", "Serial": "0123456789", "createdAt": "2016-12-13T08:08:29.829Z", "Manufacturer": "DFRobot"}
3.
4.
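And there are more files like these, each with a random number of trailing empty lines / EOF markers. I want each file to end with a single line ending, or at most one empty line. I keep getting this error:
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
when using this method: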
import glob
import io

filenames = glob.glob("folder_with_all_jsonl/*.jsonl")
# Read file by file, write file by file. Simple.
for f in filenames:
    # Path to the jsonl file(s)
    data_json = io.open(f, mode='r', encoding='utf-8-sig')  # opens the JSONL file
    data_python = extract_json(data_json)  # extract_json: helper not shown in this post
    # ..... code omitted
    for line in data_python:  # it would fail here because of an empty line
        print(line.get("objectId"))
        # and so on
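I manually deleted some of the extra lines and was then able to process my 2 jsonl files.
I have looked at these SO threads:
1> Removing newline characters from a json file using Python.
Please give me hints/help. I would appreciate it!!
I want every file to be in this format: first.jsonl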
1.{"ConnectionTime": 730669.644775033,"objectId": "eHFvTUNqTR","CustomName": "Relay Controller","FirmwareRevision": "FW V1.96","DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561","PeripheralType": 9,"updatedAt": "2016-12-13T15:50:41.626Z","Model": "DF Bluno","HardwareRevision": "HW V1.7","Serial": "0123456789","createdAt": "2016-12-13T15:50:41.626Z","Manufacturer": "DFRobot"}
2.{"ConnectionTime": 702937.7616419792, "objectId": "uYuT3zgyez", "CustomName": "Relay Controller", "FirmwareRevision": "FW V1.96", "DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561", "PeripheralType": 9, "updatedAt": "2016-12-13T08:08:29.829Z", "Model": "DF Bluno", "HardwareRevision": "HW V1.7", "Serial": "0123456789", "createdAt": "2016-12-13T08:08:29.829Z", "Manufacturer": "DFRobot"}
Edit: I used Zhengyang Song's answer and chepner's suggestion. I actually have two 4 GB files, and doing this:
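import glob
import json

results = []  # note: never reset, so it keeps growing across files
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile:
            try:
                results.append(json.loads(line))  # parse each line of the file
            except ValueError:
                print(f)  # empty or invalid line: report the file name
    # Write the parsed records back, one JSON object per line
    with open(f, 'w', encoding='utf-8-sig') as outfile:
        for result in results:
            outfile.write(json.dumps(result) + "\n")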
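resulted in this error:
line 852, in start _start_new_thread(self._bootstrap, ()) RuntimeError: can't start new thread
I am on my personal Windows machine.
Edit 2: I moved to my work machine and was able to get this working. Any input on how we can prevent this on a personal machine? Something like parallel processing??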
Just in response to your last code snippet.
You can change the line
json.dump(result, outfile, indent=None)
to something like:
for one_item in result:
    outfile.write(json.dumps(one_item) + "\n")
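For files as large as the 4 GB ones mentioned in the question, it may also help to avoid collecting every record in a list before writing. Below is a minimal sketch of that idea, not taken from the original post: it copies each file line by line and simply drops the blank lines, so only one line is held in memory at a time. The function name strip_blank_lines and the temporary-file approach are assumptions made for illustration, and the lines are not parsed as JSON, so invalid records would pass through unchanged.

```python
import os
import tempfile


def strip_blank_lines(path, encoding="utf-8-sig"):
    """Rewrite `path` in place without blank lines (hypothetical helper).

    Lines are streamed one at a time, so memory use stays small.
    Returns the number of lines kept.
    """
    kept = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in,
    # so the original is only replaced once the copy is complete.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="\n") as outfile, \
                open(path, "r", encoding=encoding) as infile:
            for line in infile:
                stripped = line.strip()
                if not stripped:
                    continue  # skip empty or whitespace-only lines
                outfile.write(stripped + "\n")
                kept += 1
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return kept
```

It could be called once per file, for example inside the same glob.glob("folder_with_all_jsonl/*.jsonl") loop used in the question. Each file then ends with exactly one newline after its last record.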