Cleaning text data using Python

Question

I am learning Python through examples. I may need to study data structures to fully understand these functions, but I hope someone can help me at this stage.

I am cleaning text data stored in a pandas DataFrame.

I have the following result, and I want to keep only the last two elements of each split text.

[['Australian Centre for Ancient DNA',
  ' School of Biological Sciences',
  ' University of Adelaide',
  ' Adelaide',
  ' South Australia 5005',
  ' Australia'],
 ['Department of Ecology and Evolutionary Biology',
  ' Ramaley Biology',
  ' University of Colorado',
  ' Boulder',
  ' CO 80309',
  ' USA']]

My attempt was something like this:

df["zip"] = df["Af_split_split"]
i = 0
j = 0 
df.iloc[i,7][j] = df.iloc[i,6][j][len(df.iloc[i,6][j])-2:len(df.iloc[i,6][j])-1]

However, when I tried it, elements in another column of the DataFrame were also changed. (See: the first rows of Af_split, Af_split_split, and zip all have the same value.)

How can I handle this problem?

Answer 1

If I understand your problem correctly, the symptoms you describe point to a classic issue: you need to copy a list and modify the copy without modifying the original. This has already been answered on Stack Overflow; see "How to clone or copy a list?"
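To make the aliasing concrete, here is a minimal sketch in plain Python, with no pandas involved. The variable names and the sample data are made up for illustration. It shows that plain assignment only adds a second name for the same list, that a slice copies the outer list but still shares any nested lists, and that copy.deepcopy gives a fully independent copy.

```python
import copy

affiliations = [["University of Adelaide", " Adelaide", " Australia"]]

alias = affiliations                  # second name for the same list object
shallow = affiliations[:]             # new outer list, but the inner lists are shared
deep = copy.deepcopy(affiliations)    # independent copy, inner lists included

alias.append(["placeholder"])         # visible through `affiliations` as well
shallow[0][0] = "CHANGED"             # also changes affiliations[0][0]
deep[0][1] = "only in the deep copy"  # affects `deep` only

print(len(affiliations))    # 2
print(affiliations[0][0])   # CHANGED
print(affiliations[0][1])   # " Adelaide" (unchanged)
```

Since the data in the question is a list of lists, the nested case is probably the one that matters here.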

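For your specific example, note that df["zip"] = df["Af_split_split"] does create a new column, but every cell of that column still refers to the same list objects as the original, and slicing the Series with [:] does not change that. To get independent lists, copy each cell when you assign to df["zip"]:

import copy
df["zip"] = df["Af_split_split"].apply(copy.deepcopy)

copy.deepcopy duplicates the nested lists in each cell (instead of storing another reference to them), so that modifications to the copy do not affect the original.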
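One way to sidestep the shared-list problem altogether is to build the new column from fresh lists instead of editing cells in place. The following is only a sketch. It assumes that each cell of Af_split_split holds a list of affiliations, each of which is a list of comma-split parts, as in the output shown in the question. The one-row sample frame and the helper name last_two are made up for illustration, and the strip() call is an optional extra cleaning step.

```python
import pandas as pd

# Made-up one-row frame shaped like the data in the question.
df = pd.DataFrame({
    "Af_split_split": [
        [
            ["Australian Centre for Ancient DNA", " School of Biological Sciences",
             " University of Adelaide", " Adelaide", " South Australia 5005", " Australia"],
            ["Department of Ecology and Evolutionary Biology", " Ramaley Biology",
             " University of Colorado", " Boulder", " CO 80309", " USA"],
        ]
    ]
})

def last_two(affiliations):
    """Return new lists holding the last two parts of each affiliation in a cell."""
    # parts[-2:] takes the last two elements; strip() drops the leading spaces.
    return [[part.strip() for part in parts[-2:]] for parts in affiliations]

# apply() calls last_two once per cell and stores the new lists it returns,
# so the lists in Af_split_split are never modified.
df["zip"] = df["Af_split_split"].apply(last_two)

print(df.loc[0, "zip"])
# [['South Australia 5005', 'Australia'], ['CO 80309', 'USA']]
```

Note that the slice in the question, [len(x)-2:len(x)-1], selects only the second-to-last element, whereas [-2:] selects the last two.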
