Remove nested newline characters in delimited file?

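I have a caret-delimited file. The only carets in the file are delimiters; none appear in the text. Some of the fields are free-text fields that contain embedded newline characters, which makes the file very difficult to parse. I need the newline at the end of each record, but I need to remove the newlines from inside the text fields.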

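This is open-source maritime piracy data from the Global Integrated Shipping Information System. Below are three records, preceded by the header row. In the first record the ship name is NORMANNIA, in the second it is "Unkown", and in the third it is KOTA BINTANG.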

ship_name^ship_flag^tonnage^date^time^imo_num^ship_type^ship_released_on^time_zone^incident_position^coastal_state^area^lat^lon^incident_details^crew_ship_cargo_conseq^incident_location^ship_status_when_attacked^num_involved_in_attack^crew_conseq^weapons_used_by_attackers^ship_parts_raided^lives_lost^crew_wounded^crew_missing^crew_hostage_kidnapped^assaulted^ransom^master_crew_action_taken^reported_to_coastal_authority^reported_to_which_coastal_authority^reporting_state^reporting_intl_org^coastal_state_action_taken
NORMANNIA^Liberia^24987^2009-09-19^22:30^9142980^Bulk carrier^^^Off Pulau Mangkai,^^South China Sea^3° 04.00' N^105° 16.00' E^Eight pirates armed with long knives and crowbars boarded the ship underway. They broke into 2/O cabin, tied up his hands and threatened him with a long knife at his throat. Pirates forced the 2/O to call the Master. While the pirates were waiting next to the Master’s door, they seized C/E and tied up his hands. The pirates rushed inside the Master’s cabin once it was opened. They threatened him with long knives and crowbars and demanded money. Master’s hands were tied up and they forced him to the aft station. The pirates jumped into a long wooden skiff with ship’s cash and crew personal belongings and escaped. C/E and 2/O managed to free themselves and raised the alarm^Pirates tied up the hands of Master, C/E and 2/O. The pirates stole ship’s cash and master’s, C/E & 2/O cash and personal belongings^In international waters^Steaming^5-10 persons^Threat of violence against the crew^Knives^^^^^^^^SSAS activated and reported to owners^^Liberian Authority^^ICC-IMB Piracy Reporting Centre Kuala Lumpur^-
Unkown^Marshall Islands^19846^2013-08-28^23:30^^General cargo ship^^^Cam Pha Port^Viet Nam^South China Sea^20° 59.92' N^107° 19.00' E^While at anchor, six robbers boarded the vessel through the anchor chain and cut opened the padlock of the door to the forecastle store. They removed the turnbuckle and lashing of the forecastle store's rope hatch. The robbers escaped upon hearing the alarm activated when they were sighted by the 2nd officer during the turn-over of duty watch keepers.^"There was no injury to the crew however, the padlock of the door to the forecastle store and the rope hatch were cut-opened.

Two centre shackles and one end shackle were stolen"^In port area^At anchor^5-10 persons^^None/not stated^Main deck^^^^^^^-^^^Viet Nam^"ReCAAP ISC via ReCAAP Focal Point (Vietnam)

ReCAAP ISC via Focal Point (Singapore)"^-
KOTA BINTANG^Singapore^8441^2002-05-12^15:55^8021311^Bulk carrier^^UTC^^^South China Sea^^^Seven robbers armed with long knives boarded the ship, while underway. They broke open accommodation door, held hostage a crew member and forced the Master to open his cabin door. They then tied up the Master and crew member, forced them back onto poop deck from where the robbers jumped overboard and escaped in an unlit boat^Master and cadet assaulted; Cash, crew belongings and ship's cash stolen^In territorial waters^Steaming^5-10 persons^Actual violence against the crew^Knives^^^^^^2^^-^^Yes. SAR, Djakarta and Indonesian Naval Headquarters informed^^ICC-IMB PRC Kuala Lumpur^-

You will notice that the first and third records are fine and easy to parse. The second record, "Unkown", contains nested newline characters.

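How should I go about removing the nested newline characters (but not the newlines at the end of each record) in a Python script, or by some other means if there is an easier way, so that I can import this data into SAS?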

Load the data into a string and then do:

import re
newa=re.sub('\n','',a)

After this, newa contains no newline characters at all, including the ones that end each record. Alternatively:

newa=re.sub('\n(?!$)','',a)

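This keeps the newline at the end and strips the rest. Note that without re.MULTILINE, $ only matches at the end of the string, so only the final newline survives; the newlines that end the earlier records are removed as well.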
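To see what the two substitutions above do in practice, here is a minimal sketch on a made-up two-record sample. The sample string and the helper strip_quoted_newlines are hypothetical and not part of the original answer. The third variant assumes, as in the "Unkown" record, that embedded newlines only occur inside double-quoted fields; if the full file breaks that assumption, this variant will miss them.

```python
import re

# Hypothetical sample: two records, the first with newlines inside a quoted field.
sample = 'x^"line one\n\nline two"^y\np^q^r\n'

# Removes every newline, so the two records run together.
all_removed = re.sub('\n', '', sample)

# Without re.MULTILINE, `$` matches only at the end of the string,
# so just the very last newline is kept.
only_last_kept = re.sub('\n(?!$)', '', sample)


def strip_quoted_newlines(text):
    """Replace each run of newlines inside a double-quoted field with one space."""
    return re.sub(r'"[^"]*"', lambda m: re.sub(r'\n+', ' ', m.group(0)), text)


# Record-ending newlines are untouched because they sit outside the quotes.
cleaned = strip_quoted_newlines(sample)

print(repr(all_removed))
print(repr(only_last_kept))
print(repr(cleaned))
```

Only the third result still has one line per record, which is what a line-oriented import such as SAS would need.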

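I see that you have tagged this as regex, but I would recommend using the built-in csv library to parse this instead. The csv library will parse the file correctly and keep the newlines where they belong.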

Python CSV examples: http://docs.python.org/2/library/csv.html
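As a rough sketch of the csv suggestion: the reader can be told to use the caret as its delimiter, and it then treats a double-quoted field that spans several lines as one field. The function name flatten_records and the tiny sample below are illustrative assumptions. The approach also assumes that every field with an embedded newline is wrapped in double quotes, as the "Unkown" record in the question appears to be.

```python
import csv
import io


def flatten_records(src, dst, delimiter='^'):
    """Copy delimited records from src to dst, one physical line per record.

    Newlines embedded in quoted fields are replaced by a single space.
    """
    reader = csv.reader(src, delimiter=delimiter, quotechar='"')
    writer = csv.writer(dst, delimiter=delimiter, lineterminator='\n')
    for row in reader:
        # splitlines() breaks a field at its embedded newlines; empty pieces
        # (from blank lines inside the field) are dropped before rejoining.
        writer.writerow(
            [' '.join(part for part in field.splitlines() if part) for field in row]
        )


# Hypothetical three-line sample modeled on the question's layout.
sample = 'name^details^org\nUnkown^"no injury\n\nshackles stolen"^-\nKOTA^ok^-\n'
out = io.StringIO()
flatten_records(io.StringIO(sample), out)
result = out.getvalue()
print(result)
```

With real files, the same function could be called on handles opened with newline='' (as the csv documentation recommends), for example open("events.csv", newline='') for reading and open("new_events.csv", "w", newline='') for writing.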

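I solved the problem by counting the number of delimiters encountered and manually switching to a new record once the count reached the number that belongs to a single record. I then stripped all the newline characters and wrote the data back out to a new file. In essence, the result is the original file with the newlines removed from inside the fields but with a newline at the end of each record. Here is the code: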

f = open("events.csv", "r")

carets_per_record = 33

final_file = []
temp_file  = []
temp_str   = ''
temp_cnt   = 0

building   = False

for i, line in enumerate(f):

    # If there are no carets on the line, we are building a string
    if line.count('^') == 0:
        building = True

    # If we are not building a string, then set temp_str equal to the line
    if building is False:
        temp_str = line
    else:
        temp_str = temp_str + " " + line

    # Count the number of carets on the line
    temp_cnt = temp_str.count('^')

    # If we do not have the proper number of carets, then we are building
    if temp_cnt < carets_per_record:
        building = True

    # If we do have the proper number of carets, then we are finished
    # and we can push this line to the list
    elif temp_cnt == carets_per_record:
        building = False
        temp_file.append(temp_str)

# Strip embedded newline characters from each assembled record
final_file = [record.replace('\n', '') for record in temp_file]

# Write the cleaned records out to a new file. Text mode ("w") is used
# because writing str to a file opened with "wb" raises TypeError on Python 3.
g = open("new_events.csv", "w")

# Write the records back to the file, one per line
for record in final_file:
    g.write(record + '\n')

# Close the files we were working with
f.close()
g.close()
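The caret-counting idea above can also be sketched as a small generator that works on any iterable of lines, which makes it easier to try on a few lines before running it on the whole file. The name merge_records and the choice to raise ValueError on a surplus of carets or on a dangling partial record are assumptions added for illustration; the original code silently ignores those cases. Like the original, it cannot detect a newline inside the last field of a record, because the caret count is already complete at that point.

```python
def merge_records(lines, carets_per_record=33, delimiter='^'):
    """Yield one string per record, joining physical lines until the
    expected number of delimiters has been seen."""
    buffer = ''
    for line in lines:
        piece = line.rstrip('\r\n')
        if not buffer:
            buffer = piece
        elif piece:
            # Continuation of a record: join with a space, skip blank lines.
            buffer = buffer + ' ' + piece
        count = buffer.count(delimiter)
        if count == carets_per_record:
            yield buffer
            buffer = ''
        elif count > carets_per_record:
            raise ValueError('too many delimiters in record: %r' % buffer[:40])
    if buffer:
        raise ValueError('incomplete record at end of input')


# Hypothetical sample with 2 carets per record instead of 33.
sample_lines = ['a^b^c\n', 'd^"x\n', '\n', 'y"^f\n', 'g^h^i\n']
records = list(merge_records(sample_lines, carets_per_record=2))
print(records)
```

Applied to the file from the question, the generator would be fed the open file object for events.csv with the default of 33 carets, and each yielded record written out followed by '\n'.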
