How to define the first column as names in pd.read_csv

Question

I cannot get pd.read_csv to take the column names from the first row of my CSV file. The names are already set in the file, but if I also pass names=['...', '...', ...], pandas sets them again and I end up with the names twice. I want pd.read_csv to use the header row from the CSV as the column names.

    import pandas as pd
    import tkFileDialog
    import numpy as np
    import warnings
    warnings.filterwarnings('ignore')

    rating=tkFileDialog.askopenfilename()
    df = pd.read_csv(rating, sep='\t')
    print df.head()


    movies=tkFileDialog.askopenfilename()
    movie_titles=pd.read_csv(movies)
    print movie_titles.head()

    df=pd.merge(df,movies,on='movieId')
    print df.head()

And the error is:

    Traceback (most recent call last):
      File "C:/Users/Umer Selmani/Desktop/MP2/test panda.py", line 16, in <module>
        df=pd.merge(df,movies,on='movieId')
      File "C:\Users\Umer Selmani\Desktop\MP2\venv\lib\site-packages\pandas\core\reshape\merge.py", line 47, in merge
        validate=validate)
      File "C:\Users\Umer Selmani\Desktop\MP2\venv\lib\site-packages\pandas\core\reshape\merge.py", line 480, in __init__
        right = validate_operand(right)
      File "C:\Users\Umer Selmani\Desktop\MP2\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1752, in validate_operand
        'a {obj} was passed'.format(obj=type(obj)))
    TypeError: Can only merge Series or DataFrame objects, a <type 'unicode'> was passed

Answer 1

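The following line:

    df = pd.merge(df, movies, on='movieId')

should be:

    df = pd.merge(df, movie_titles, on='movieId')

The movies variable holds the file path string returned by tkFileDialog.askopenfilename(), not a DataFrame. The DataFrame read from that file is movie_titles.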
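A minimal sketch of the difference, written for Python 3 rather than the Python 2 of the question. It uses small in-memory CSV text in place of the files chosen through the dialog; the sample data and the "movies.csv" path are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the contents of the two selected files.
ratings_csv = "userId\tmovieId\trating\n1\t10\t4.0\n2\t20\t3.5\n"
movies_csv = "movieId,title\n10,Heat\n20,Alien\n"

df = pd.read_csv(io.StringIO(ratings_csv), sep="\t")
movies = "movies.csv"  # what a file dialog returns: a path string
movie_titles = pd.read_csv(io.StringIO(movies_csv))  # the actual DataFrame

# Passing the path string raises the same kind of TypeError as in the question.
try:
    pd.merge(df, movies, on="movieId")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the DataFrame works.
merged = pd.merge(df, movie_titles, on="movieId")
print(error_message)
print(merged)
```

The only change between the failing and the working call is which variable is passed as the second argument.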

Answer 2

I am not sure I understood what you want to do, but I can see three possible issues:

the merge call is passed the wrong object;
the merge generates duplicated columns (and values);
the error message mentions unicode.

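The first issue is the error itself. pd.merge(df, movies, on='movieId') is valid syntax, but movies is the file path string, so merge rejects it. Pass the movie_titles DataFrame instead, either to pd.merge or through the equivalent DataFrame.merge method.

Try this, instead:

    df = df.merge(movie_titles, on='movieId')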
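As a quick check, the method form and the function form give the same result when both receive DataFrames. The sketch below assumes Python 3 and uses made-up frames named ratings and movie_titles.

```python
import pandas as pd

# Made-up sample frames; only movieId 1 and 2 appear in both.
ratings = pd.DataFrame({"movieId": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})
movie_titles = pd.DataFrame({"movieId": [1, 2, 4], "title": ["A", "B", "D"]})

via_method = ratings.merge(movie_titles, on="movieId")
via_function = pd.merge(ratings, movie_titles, on="movieId")

# Both calls perform the same inner join on movieId.
same = via_method.equals(via_function)
print(same)
print(via_method)
```

Which form to use is a matter of style; what matters is that both operands are DataFrames (or Series).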

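The second issue is not a problem; it is the default behavior. When you merge two datasets that share column headers other than the merge key, pandas keeps both copies and appends the suffixes _x (left) and _y (right), so you get header_x and header_y.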

For instance:

    header1_x    header2_x    header1_y    header2_y
0           a            f            a            f
1           b            g            b            g
2           c            h            c            h
3           d            i            d            i
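The suffix behavior can be reproduced with two tiny frames. This is an illustrative Python 3 sketch; the frames and the title column are invented for the example.

```python
import pandas as pd

# Both frames have a non-key column called "title".
left = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
right = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})

merged = left.merge(right, on="movieId")

# The key column is kept once; the overlapping column gets _x and _y.
columns = list(merged.columns)
print(columns)
```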

One simple way of solving it is to keep only the columns you want (note that the column names are strings):

    df = df[['header1_x', 'header2_x']]
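Two other options, sketched here for Python 3 with invented sample frames, are to select the right-hand columns before merging so no duplicates are created, or to pass the suffixes parameter of merge so the duplicates get meaningful names.

```python
import pandas as pd

# Made-up frames that share the key "movieId" and the column "title".
ratings = pd.DataFrame(
    {"movieId": [1, 2], "title": ["A", "B"], "rating": [4.0, 3.5]}
)
movie_titles = pd.DataFrame(
    {"movieId": [1, 2], "title": ["A", "B"], "genre": ["Drama", "Sci-Fi"]}
)

# Option 1: bring in only the key and the new column from the right frame.
slim = ratings.merge(movie_titles[["movieId", "genre"]], on="movieId")

# Option 2: keep both copies, but label them explicitly.
labelled = ratings.merge(
    movie_titles, on="movieId", suffixes=("_rating", "_movie")
)

print(list(slim.columns))
print(list(labelled.columns))
```

Option 1 avoids carrying redundant data at all, while option 2 is useful when the two copies may differ and you want to compare them.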

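The third issue is the unicode object named in the TypeError. It is the type of the object that was passed to merge (the path string in movies), so it does not mean that the header movieId is encoded incorrectly.

If the headers still fail to match after you fix the previous issues, try normalizing them with unicodedata (see doc):

    import unicodedata
    # normalize() takes a text (unicode) string, not a DataFrame, so apply it to each header
    df.columns = [unicodedata.normalize("NFKD", c).encode("ascii", "ignore").decode("ascii") for c in df.columns]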
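For illustration, here is a Python 3 sketch of cleaning column headers with unicodedata. The helper name to_ascii and the sample headers (one with a leading byte-order mark, one with an accent) are assumptions made for the example, not something taken from the question's files.

```python
import unicodedata
import pandas as pd


def to_ascii(name):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical headers: "\ufeff" is a byte-order mark stuck to the first name.
df = pd.DataFrame({"\ufeffmovieId": [1, 2], "caf\u00e9": ["x", "y"]})

# Normalize each header individually; the data values are left untouched.
df.columns = [to_ascii(col) for col in df.columns]
columns = list(df.columns)
print(columns)
```

Note that this discards non-ASCII characters entirely, so it only suits headers whose meaning survives that loss.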

2 answers

solution1
2 2019-07-21 18:55:29

solution2
2 ACCEPTED 2019-07-21 19:12:05