
Python: How to skip empty lines when reading a text file

I am trying to solve a question from Introduction to Data Science on Coursera:

Returns a DataFrame of towns and the states they are in from the university_towns.txt list. The format of the DataFrame should be: DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ], columns=["State", "RegionName"] )

The following cleaning needs to be done:

1. For "State", removing characters from "[" to the end.
2. For "RegionName", when applicable, removing every character from " (" to the end.
3. Depending on how you read the data, you may need to remove the newline character '\n'.

My script was like the following:

import re

import numpy as np
import pandas as pd

# Wrapped in a function (the name here is arbitrary) so the `return` is valid
def get_university_towns():
    uni_towns = pd.read_csv('university_towns.txt', header=None, names=['RegionName'])
    # Rows whose text contains 'edit' are state headers; every other row gets ''
    uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
    # Turn '' into NaN and forward-fill, so each town inherits the state header above it
    uni_towns['State'] = uni_towns['State'].replace('', np.nan).ffill()
    # Removing (...) from region names
    uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))
    # Fallback for rows with an unmatched '(' that the regex misses
    split_string = "("
    uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])
    # Removing [...] from region and state names
    uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
    uni_towns['State'] = uni_towns['State'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
    uni_towns = pd.DataFrame(uni_towns, columns=['State', 'RegionName']).sort_values(by=['State', 'RegionName'])
    return uni_towns

The read_csv call obviously reads the text file. Then, any field in RegionName that contains the word edit is really a state, not a town:

uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
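
As a quick illustration of how this works together with the ffill line that follows it in the script, here is a toy sample I made up (not the real file):

import numpy as np
import pandas as pd

sample = pd.DataFrame({'RegionName': [
    'Alabama[edit]',
    'Auburn (Auburn University)',
    'Florence (University of North Alabama)',
]})
# Rows containing 'edit' keep their value; all other rows get ''
sample['State'] = np.where(sample['RegionName'].str.contains('edit'),
                           sample['RegionName'], '')
# '' -> NaN -> forward-fill, so each town inherits the state header above it
sample['State'] = sample['State'].replace('', np.nan).ffill()
print(sample['State'].tolist())
# ['Alabama[edit]', 'Alabama[edit]', 'Alabama[edit]']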

Then I remove everything between parentheses () and square brackets [] from each RegionName row:

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))

So if a value was like Alabama[edit] or Tuscaloosa (University of Alabama), it becomes Alabama or Tuscaloosa.
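
A quick check of those two substitutions in the interpreter (note that re.sub leaves the space that preceded the parenthesis, so a trailing space survives):

import re

print(re.sub(r'\([^)]*\)', '', 'Tuscaloosa (University of Alabama)'))  # 'Tuscaloosa ' (trailing space)
print(re.sub(r'\[[^\]]*\]', '', 'Alabama[edit]'))                      # 'Alabama'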

Then I do the same thing for the State column, since I moved values from RegionName into it whenever they contained [edit].

I am also using the following, because there are a few rows like `Tuscaloosa (University of Alabama` where there is only an opening `(`, which the regex pattern does not detect:

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])
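
str.split handles the unbalanced case because it only needs the opening ( to be present:

print('Tuscaloosa (University of Alabama'.split('(', 1)[0])  # 'Tuscaloosa ' (trailing space again)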

The final result is 567 rows × 2 columns:

     State      RegionName
0    Alabama    Alabama
1    Alabama    Auburn
2    Alabama    Florence
3    Alabama    Jacksonville
...
564  Wisconsin  Whitewater
551  Wisconsin  Wisconsin
566  Wyoming    Laramie
565  Wyoming    Wyoming

While the correct result should be 517 rows × 2 columns.

After looking into the txt file, I saw that some rows span two consecutive lines with a '\n' in between when read, but the script does not detect that the second line still belongs to the same row.

Here is the text content.

The Pandas documentation shows that the read_csv function has a skip_blank_lines option, so you could add skip_blank_lines=True to the read_csv call.
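
For example (a minimal sketch reusing the read_csv call from the question; note that in current pandas skip_blank_lines already defaults to True, so passing it mainly makes the intent explicit):

import pandas as pd

uni_towns = pd.read_csv('university_towns.txt', header=None,
                        names=['RegionName'], skip_blank_lines=True)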

last_data = []
for line in lines:
    # strip("\n") removes the newline at the end of each string
    last_data.append(line.strip("\n"))

# alternatively, skip blank lines entirely: if line == "\n": continue
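
For context, here is a minimal sketch of how lines could come from the file in the first place, with blank lines dropped in the same pass (the file name is taken from the question; the rest is an assumption, not the answerer's exact code):

# Read the file and keep only non-blank lines, already stripped of '\n'
with open('university_towns.txt') as f:
    last_data = [line.strip('\n') for line in f if line.strip('\n')]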
