I am trying to solve a question from the Introduction to Data Science in Python course on Coursera:

> Returns a DataFrame of towns and the states they are in from the university_towns.txt list. The format of the DataFrame should be: `DataFrame([["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"]], columns=["State", "RegionName"])`
>
> The following cleaning needs to be done: 1. For "State", removing characters from "[" to the end. 2. For "RegionName", when applicable, removing every character from " (" to the end. 3. Depending on how you read the data, you may need to remove newline character '\n'.
My script was the following:

```python
import re

import numpy as np
import pandas as pd

# names must be a list (not a set) so the column is reliably called 'RegionName'
uni_towns = pd.read_csv('university_towns.txt', header=None, names=['RegionName'])

# Rows containing 'edit' are state headers; copy them into a State column...
uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
# ...then forward-fill each state over the towns that follow it
uni_towns['State'] = uni_towns['State'].replace('', np.nan).ffill()

# Removing (...) from region names
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))
# Handle an unmatched '(' that the regex above cannot catch
split_string = "("
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])

# Removing [...] from both columns
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
uni_towns['State'] = uni_towns['State'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))

uni_towns = pd.DataFrame(uni_towns, columns=['State', 'RegionName']).sort_values(by=['State', 'RegionName'])
return uni_towns  # this runs inside the assignment's function
```
The first line obviously reads the text file. Then every field in `RegionName` that contains the word `edit` is in fact a state header:

```python
uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
```
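The mark-then-forward-fill step can be sketched on a toy Series (made-up town names, not the real file):

```python
import numpy as np
import pandas as pd

# Toy version of the RegionName column: state header rows contain "[edit]".
region = pd.Series(['Alabama[edit]', 'Auburn', 'Florence', 'Alaska[edit]', 'Fairbanks'])

# Mark the state rows and blank everything else...
state = pd.Series(np.where(region.str.contains('edit'), region, ''))

# ...then turn the blanks into NaN and forward-fill the last seen state.
state = state.replace('', np.nan).ffill()
print(state.tolist())
# ['Alabama[edit]', 'Alabama[edit]', 'Alabama[edit]', 'Alaska[edit]', 'Alaska[edit]']
```

Each town row thus inherits the most recent state header above it; the `[edit]` suffix is stripped later.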
Then I remove everything between parentheses `()` and square brackets `[]` from each `RegionName` row:

```python
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
```
So values like `Alabama[edit]` or `Tuscaloosa (University of Alabama)` become `Alabama` and `Tuscaloosa`. I then apply the same `[...]` substitution to the `State` column, since the values I moved there from `RegionName` still contain `[edit]`.
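Checked interactively, the two substitutions behave as described; note that the `(...)` case leaves a trailing space behind, which `.strip()` would remove:

```python
import re

print(re.sub(r'\[[^\]]*\]', '', 'Alabama[edit]'))
# Alabama
print(re.sub(r'\([^)]*\)', '', 'Tuscaloosa (University of Alabama)').strip())
# Tuscaloosa
```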
I use the following `split` because a few rows look like `Tuscaloosa (University of Alabama`, with only an opening `(`, which is not detected by the regex pattern:

```python
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])
```
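A quick check of that fallback (example strings only): the regex requires a closing `)`, so the unmatched case passes through untouched, while `split` still truncates it:

```python
import re

broken = 'Tuscaloosa (University of Alabama'   # no closing parenthesis
print(re.sub(r'\([^)]*\)', '', broken))        # unchanged
print(broken.split('(', 1)[0].strip())
# Tuscaloosa
```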
The final result is 567 rows × 2 columns:

```
     State      RegionName
0    Alabama    Alabama
1    Alabama    Auburn
2    Alabama    Florence
3    Alabama    Jacksonville
...
564  Wisconsin  Whitewater
551  Wisconsin  Wisconsin
566  Wyoming    Laramie
565  Wyoming    Wyoming
```
While the correct result should be 517 rows × 2 columns.
After looking into the txt file, I saw that some rows span two consecutive lines separated by `\n` when read, but the script does not detect that the line after the `\n` still belongs to the same row. Here is the text content.
The pandas documentation shows that the `read_csv` function has a `skip_blank_lines` option, so you could add `skip_blank_lines=True` to the `read_csv` call.
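A minimal sketch of that call, using an in-memory buffer in place of the real `university_towns.txt` (note that `skip_blank_lines=True` is already the pandas default, so blank lines between entries are dropped either way):

```python
import io

import pandas as pd

# Stand-in for university_towns.txt with a blank line between entries.
text = "Alabama[edit]\n\nAuburn (Auburn University)\n"

uni_towns = pd.read_csv(io.StringIO(text), header=None,
                        names=['RegionName'], skip_blank_lines=True)
print(uni_towns['RegionName'].tolist())
# ['Alabama[edit]', 'Auburn (Auburn University)']
```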
```python
last_data = []
with open('university_towns.txt') as f:
    for line in f:
        # strip any trailing newline from the end of the string
        last_data.append(line.strip("\n"))
        # alternatively: if line == "\n": continue  (to skip blank lines)
```