How to skip blank lines when reading a text file in Python

Question

I am trying to solve a problem from the Coursera Introduction to Data Science course:

Return a DataFrame of towns and the states they are in from the university_towns.txt list. The format of the DataFrame should be: DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ], columns=["State", "RegionName"] )
The following cleaning needs to be done: 1. For "State", remove the characters from "[" to the end. 2. For "RegionName", when applicable, remove every character from " (" to the end. 3. Depending on how you read the data, you may need to remove the newline character '\n'.

My script looks like this:

import re
import numpy as np
import pandas as pd

# Read one entry per line into a single RegionName column
uni_towns = pd.read_csv('university_towns.txt', header=None, names=['RegionName'])
# Rows containing 'edit' are state headers: copy them into State, then forward-fill
uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
uni_towns['State'] = uni_towns['State'].replace('', np.nan).ffill()
# Removing (...) from region names
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))
# Cut at the first "(" to catch rows whose parenthesis is never closed
split_string = "("
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])
# Removing [...] from region and state names
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
uni_towns['State'] = uni_towns['State'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
uni_towns = pd.DataFrame(uni_towns, columns=['State', 'RegionName']).sort_values(by=['State', 'RegionName'])
return uni_towns  # the script is the body of a function, hence the return

The first line obviously reads the text file. After that, every RegionName value that contains the word edit is also treated as a state:

uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
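To make this step concrete, here is a minimal sketch of the marking step combined with the forward fill from the full script. It runs on a few made-up rows; the sample values are assumptions for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows imitating the layout of university_towns.txt
sample = pd.DataFrame({'RegionName': [
    'Alabama[edit]',
    'Auburn (Auburn University)',
    'Wyoming[edit]',
    'Laramie (University of Wyoming)',
]})

# Rows containing 'edit' are state headers: copy them into State, leave the rest empty
is_state = sample['RegionName'].str.contains('edit')
sample['State'] = np.where(is_state, sample['RegionName'], '')

# Turn the empty strings into NaN and carry each state name down to its towns
sample['State'] = sample['State'].replace('', np.nan).ffill()

print(sample)
```

Note that the header rows themselves stay in the frame after this step, each paired with its own state name.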

Then I remove the text between parentheses () and between square brackets [] from every RegionName row:

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))

So a value like Alabama[edit] or Tuscaloosa (University of Alabama) becomes Alabama and Tuscaloosa respectively.

Then I do the same for the State column, because I copied RegionName values into it whenever they contain [edit].

I also use the following, because a few rows look like Tuscaloosa (University of Alabama, where there is only an opening ( and the regular expression does not match it:

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])
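Both the balanced and the unbalanced case could probably be handled by a single pattern that removes everything from the first opening bracket to the end of the string, which is how the assignment text words the cleaning rule. A sketch on plain strings, where clean_region is a hypothetical helper name and the inputs are made up:

```python
import re

def clean_region(name):
    """Drop everything from the first "(" or "[" to the end of the string."""
    # \s* also removes the whitespace just before the bracket
    return re.sub(r'\s*[\(\[].*$', '', name).strip()

for raw in ['Tuscaloosa (University of Alabama)',
            'Tuscaloosa (University of Alabama',
            'Alabama[edit]',
            'Auburn']:
    print(repr(clean_region(raw)))
```

On a pandas column the same pattern should work with Series.str.replace and regex=True, which would replace the three separate apply calls.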

The final result is: 567 rows × 2 columns

     State      RegionName
0    Alabama    Alabama
1    Alabama    Auburn
2    Alabama    Florence
3    Alabama    Jacksonville
...
564  Wisconsin  Whitewater
551  Wisconsin  Wisconsin
566  Wyoming    Laramie
565  Wyoming    Wyoming

whereas the correct result should be 517 rows × 2 columns.

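After looking at the txt file, I saw that some rows have \n on 2 consecutive lines when read, but the script does not detect that the second line, before the \n, still belongs to the same row.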

Here is the text content.

Answer 1

The pandas documentation shows that the read_csv function has a skip_blank_lines option, so you can add skip_blank_lines=True to the read_csv call.
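A small sketch of what the option does, using an in-memory string as a stand-in for the file (the sample text is made up). According to the pandas documentation skip_blank_lines already defaults to True, so passing it explicitly may not change the row count in the question.

```python
import io
import pandas as pd

# Made-up stand-in for the file, with one blank line between two entries
text = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)\n"
    "\n"
    "Florence (University of North Alabama)\n"
)

# Blank lines are dropped while parsing
skipped = pd.read_csv(io.StringIO(text), header=None, names=['RegionName'],
                      skip_blank_lines=True)

# Blank lines are kept and show up as NaN rows
kept = pd.read_csv(io.StringIO(text), header=None, names=['RegionName'],
                   skip_blank_lines=False)

print(len(skipped), len(kept))
```

Comparing the two lengths on the real file would show whether blank lines contribute any rows at all.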

Answer 2

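# Read the raw lines; the file name is taken from the question
with open('university_towns.txt') as f:
    lines = f.readlines()

last_data = []
for line in lines:
    # strip("\n") removes any newline characters at the ends of the string
    last_data.append(line.strip("\n"))

# Alternatively, skip blank lines entirely: if line == "\n": continue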
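The loop idea can be extended into a small line-by-line parser that skips blank lines and tracks the current state. This is only a sketch under assumptions: parse_towns is a hypothetical helper, the sample lines are made up, and state headers are assumed to be exactly the lines containing [edit]. Unlike the script in the question, it does not emit a row for the header line itself. The output in the question lists rows such as Alabama / Alabama, and 567 - 517 = 50, so those header rows may account for the difference, but that is a guess that would need checking against the real file.

```python
def parse_towns(lines):
    """Return [state, region] pairs from an iterable of raw text lines."""
    rows = []
    state = None
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: nothing to record
            continue
        if '[edit]' in line:
            # State header: remember it, but do not emit a row for it
            state = line.split('[', 1)[0].strip()
            continue
        # Town line: keep only the text before " ("
        region = line.split(' (', 1)[0].strip()
        rows.append([state, region])
    return rows

sample_lines = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)\n",
    "\n",
    "Tuscaloosa (University of Alabama\n",
    "Wyoming[edit]\n",
    "Laramie (University of Wyoming)\n",
]
rows = parse_towns(sample_lines)
print(rows)
```

The resulting list can be passed to pd.DataFrame(rows, columns=["State", "RegionName"]) to get the format the assignment asks for.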
