[英]How to Read .txt in Pandas
我正在嘗試將包含兩個數據系列的txt文件拉入pandas。 到目前為止,我已經嘗試了下面的變體,我從堆棧上的其他帖子中獲取。 到目前為止,它只會作為一個系列閱讀。
我正在使用的數據可在此處獲得
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", delim_whitespace=True, header=None)
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, sep="/t")
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, delimiter=r"\s+")
我確定我做的事情顯然是錯的,但是我看不到它。
嘗試使用sep=r'\\s{2,}'
作為分隔符 - 它意味着使用兩個或多個空格或制表符作為分隔符:
In [28]: df = pd.read_csv(url, sep=r'\s{2,}', engine='python', header=None, names=['ID','Name'])
In [29]: df
Out[29]:
ID Name
0 A000 Cholera due to Vibrio cholerae 01, biovar cholerae
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
5 A0102 Typhoid fever with heart involvement
6 A0103 Typhoid pneumonia
7 A0104 Typhoid arthritis
8 A0105 Typhoid osteomyelitis
9 A0109 Typhoid fever with other complications
10 A011 Paratyphoid fever A
11 A012 Paratyphoid fever B
12 A013 Paratyphoid fever C
13 A014 Paratyphoid fever, unspecified
14 A020 Salmonella enteritis
15 A021 Salmonella sepsis
16 A0220 Localized salmonella infection, unspecified
17 A0221 Salmonella meningitis
18 A0222 Salmonella pneumonia
19 A0223 Salmonella arthritis
20 A0224 Salmonella osteomyelitis
21 A0225 Salmonella pyelonephritis
22 A0229 Salmonella with other localized infection
23 A028 Other specified salmonella infections
24 A029 Salmonella infection, unspecified
.. ... ...
671 B188 Other chronic viral hepatitis
672 B189 Chronic viral hepatitis, unspecified
673 B190 Unspecified viral hepatitis with hepatic coma
674 B1910 Unspecified viral hepatitis B without hepatic coma
675 B1911 Unspecified viral hepatitis B with hepatic coma
676 B1920 Unspecified viral hepatitis C without hepatic coma
677 B1921 Unspecified viral hepatitis C with hepatic coma
678 B199 Unspecified viral hepatitis without hepatic coma
679 B20 Human immunodeficiency virus [HIV] disease
680 B250 Cytomegaloviral pneumonitis
681 B251 Cytomegaloviral hepatitis
682 B252 Cytomegaloviral pancreatitis
683 B258 Other cytomegaloviral diseases
684 B259 Cytomegaloviral disease, unspecified
685 B260 Mumps orchitis
686 B261 Mumps meningitis
687 B262 Mumps encephalitis
688 B263 Mumps pancreatitis
689 B2681 Mumps hepatitis
690 B2682 Mumps myocarditis
691 B2683 Mumps nephritis
692 B2684 Mumps polyneuropathy
693 B2685 Mumps arthritis
694 B2689 Other mumps complications
695 B269 Mumps without complication
[696 rows x 2 columns]
或者你可以使用read_fwf()方法
您的文件是固定寬度的文件,因此您可以使用read_fwf
,此處默認參數可以推斷列寬:
In [106]:
df = pd.read_fwf(r'icd10cm_codes_2017.txt', header=None)
df.head()
Out[106]:
0 1
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
如果您知道列名稱所需的名稱,則可以將這些名稱傳遞給read_fwf
:
In [107]:
df = pd.read_fwf(r'C:\Users\alanwo\Downloads\icd10cm_codes_2017.txt', header=None, names=['col1', 'col2'])
df.head()
Out[107]:
col1 col2
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
或者只是在閱讀后覆蓋columns
屬性:
df.columns = ['col1', 'col2']
至於你嘗試失敗的原因, read_table
使用tabs作為默認分隔符,但文件只有空格並且是固定寬度
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.