
How to remove weird encoding from txt file

I am trying to process a text file like this one:

http://www.sec.gov/Archives/edgar/data/789019/000119312514289961/0001193125-14-289961.txt

If you look in the middle of the file, you see something like this:

</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EXCEL
<SEQUENCE>21
<FILENAME>Financial_Report.xlsx
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 Financial_Report.xlsx
M4$L#!!0`!@`(````(0!):[_C#0,``+!)```3``@"6T-O;G1E;G1?5'EP97-=
M+GAM;""B!`(HH``"````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M``````````````````````````````````````#,W,M.VT`4QO%]I;Z#Y6V5
M>([OK@@L>EFV2*4/,+4GQ,(W>08*;]^)N0BA%(2*U/^&B,2>\\6+G[+YSM')
M==\%5V:V[3AL0EFK,#!#/3;M<+X)?YY]795A8)T>&MV-@]F$-\:&)\?OWQV=
MW4S&!O[NP6["G7/3QRBR]<[TVJ['R0S^D^TX]]KY?^?S:-+UA3XW4:Q4'M7C
MX,S@5FY_1GA\]-EL]67G@B_7_NW;)+/I;!A\NKUP/VL3ZFGJVEH[GS2Z&IHG
M4U9W$];^SN4:NVLG^\''"*.#$_:?_'W`W7W?_:.9V\8$IWIVWW3O8T377?1[
MG"]^C>/%^OE##J0<M]NV-LU87_;^":SM-!O=V)TQKN_6R^NZU^UPG_N9^<O%
M-EI>Y(V#[+_?<O`K<\20'`DD1PK)D4%RY)`<!21'"<E107*(H@2AB"H44H5B
MJE!0%8JJ0F%5**X*!5:AR!I39(TILL8466.*K#%%UI@B:TR1-:;(&E-DC2FR
M)A19$XJL"476A")K0I$UH<B:4&1-*+(F%%D3BJPI1=:4(FM*D36ER)I29$TI
MLJ8465.*K"E%UI0B:T:1-:/(FE%DS2BR9A19,XJL&476C")K1I$UH\B:4V3-
M*;+F%%ESBJPY1=:<(FM.D36GR)I39,TILA8460N*K`5%UH(B:T&1M:#(6E!D
M+2BR%A19"XJL)476DB)K29&UI,A:4F0M*;*6%%E+BJPE1=:2(FM%D;6BR%I1
M9*THLE8462N*K!5%UHHB:T61M:+(*HI"JRB*K:(HN(JBZ"J*PJLHBJ^B*,"*
MH@@KBD*L*(RQH#H6QEA.(8O3R.)4LCB=+$XIB]/*XM2R,+TLP12S!-/,$DPU
M2S#=+,&4LP33SA),/4LP_2S!%+0$T]"2_U;1<GX?CHF6O__^`W8YYH6%+-;=
M=,:^\1*%VT-?FKS3LVE^N-EO#GKS`(_/?BZ'WZMS.H^3]1N&9O/ZIW"_0FA_
M]VKR!YG9M>9AB="A93P/$_UVHM</?+(-R.SW'S6F.3`[6O8M'?\!``#__P,`
M4$L#!!0`!@`(````(0"U53`C]0```$P"```+``@"7W)E;',O+G)E;',@H@0"

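Is this an Excel file? Or an XBRL document? What is it, and how do I get rid of it (or "process" it somehow)? It goes on for thousands of lines, so I am guessing it is some kind of encoding of an attachment. Any idea how to deal with it?

I am trying to use BeautifulSoup in Python: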

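from bs4 import BeautifulSoup

# Read the raw filing and parse it with an explicitly named parser,
# so the result does not depend on which parsers happen to be installed.
with open("textWithHtml.txt") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# get_text() returns str; let the file object handle the UTF-8 encoding.
with open("processedText.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())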

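But not everything gets removed, and I noticed that in some cases not even all of the HTML tags are stripped. Sometimes running the code a second time removes more than the first BeautifulSoup pass did.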

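The encoding you are looking at is uuencode. In Python you can decode this blob with the uu module, or simply use stringdata.decode('uu'). Note that the str.decode('uu') form only works in Python 2; in Python 3 the equivalent is codecs.decode(data, 'uu') on a bytes object, or binascii.a2b_uu on individual lines. The uu module itself was deprecated in Python 3.11 and removed in 3.13.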
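As a minimal sketch of decoding such a blob in Python 3 without the uu module, the following walks over lines of text, finds each `begin <mode> <filename>` header, and decodes the body with `binascii.a2b_uu` until the matching `end` line. The names `decode_uu_blocks` and `BEGIN_RE` are made up for this illustration, and skipping blank lines inside a block is an assumption rather than part of the format.

```python
import binascii
import re

# Header of a uuencoded block, e.g. "begin 644 Financial_Report.xlsx".
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} (.+)$")


def decode_uu_blocks(lines):
    """Yield (filename, data) for each uuencoded block found in `lines`."""
    name = None      # filename of the block being read, or None outside a block
    chunks = []
    for raw in lines:
        # Strip only the line terminator: trailing spaces can be real data.
        line = raw.rstrip("\r\n")
        if name is None:
            match = BEGIN_RE.match(line)
            if match:
                name = match.group(1)
                chunks = []
        elif line == "end":
            yield name, b"".join(chunks)
            name = None
        elif line:
            # Assumption: blank lines inside a block are ignored.
            chunks.append(binascii.a2b_uu(line))


sample = [
    "<TEXT>\n",
    "begin 644 cat.txt\n",
    "#0V%T\n",
    "`\n",
    "end\n",
    "</TEXT>\n",
]
decoded = list(decode_uu_blocks(sample))
```

For the filing in the question, each yielded payload could be written to disk in binary mode under its filename, which should give back the embedded Financial_Report.xlsx attachment.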

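uuencode is an old format that was originally used to embed binary files in email (which at the time only allowed 7-bit US-ASCII; the format also makes some concessions to interoperability with the big-iron systems of the day, which used their own confusing character encodings). These days you would expect to see base64 in this role.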
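To make the comparison with base64 concrete, here is a small illustration that encodes the same three bytes both ways. It also shows why nearly every line of the dump starts with `M`: the first character of a uuencoded line carries the number of decoded bytes on that line, offset by 32, and a full line holds 45 bytes. The helper `uu_line_length` is a hypothetical name used only for this example.

```python
import base64
import binascii

data = b"Cat"

# One uuencoded line: a length character followed by the encoded payload.
uu_line = binascii.b2a_uu(data)

# The same bytes in base64, which has no per-line length prefix.
b64 = base64.b64encode(data)


def uu_line_length(line):
    """Return the decoded byte count announced by a uuencoded line's first character."""
    return (ord(line[0]) - 32) & 0x3F


full_line_bytes = uu_line_length("M4$L#!!0`!@`(")   # 'M' is chr(32 + 45)
last_line_bytes = uu_line_length("`")                # a backtick announces zero bytes
```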

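I posted an answer to a follow-up question that shows how to remove uuencode blobs while reading from a file handle or iterating over a bunch of lines of text.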
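The follow-up answer itself is not reproduced here. As a rough sketch of the idea, assuming every blob is framed by a `begin <mode> <filename>` line and a lone `end` line, a generator can drop those lines as they stream past. The names `strip_uu_blobs` and `UU_BEGIN_RE` are invented for this example.

```python
import re

# Start of a uuencoded block: "begin", an octal file mode, then a filename.
UU_BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S")


def strip_uu_blobs(lines):
    """Yield the lines of `lines`, skipping every uuencoded begin...end block."""
    in_blob = False
    for line in lines:
        stripped = line.rstrip("\r\n")
        if in_blob:
            # Drop everything up to and including the closing "end" line.
            if stripped == "end":
                in_blob = False
            continue
        if UU_BEGIN_RE.match(stripped):
            in_blob = True
            continue
        yield line


sample = [
    "<TEXT>\n",
    "begin 644 Financial_Report.xlsx\n",
    "#0V%T\n",
    "`\n",
    "end\n",
    "</TEXT>\n",
]
cleaned = list(strip_uu_blobs(sample))
```

Because it accepts any iterable of lines, the generator can be fed an open file object directly, and the joined result can then be handed to BeautifulSoup so that the parser never sees the binary payload.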

This can be solved effectively with the sed command provided here: sed command - apply to all text (.txt) files in a folder

