简体   繁体   中英

How to read the csv file properly if each row contains different number of fields (number quite big)?

I have a text file from amazon, containing the following info:

 #      user        item     time   rating     review text (the header is added by me for explanation, not in the text file
  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant

As you see, the data is separated by space and there are different number of columns in each row. However, so it is the text content. Here is the code I have tried:

pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part

And such an error occurs:

ValueError: Passed header names mismatches usecols

When I tried to read all the columns:

pd.read_csv(filename, sep = " ", header = None)

And the error this time is:

Error tokenizing data. C error: Expected 229 fields in line 3, saw 320

And given the review text is so long in many rows , the method of adding header names for each column in this question can not work.

I wonder how to read the csv file if I want to keep the review text and skip them respectively. Thank you in advance!

EDIT:

The problem has been solved by Martin Evans perfectly. But now I am playing with another data set with similar but different format. Now the order of the data is converse:

     # review text                          user        item     time   rating      (the header is added by me for explanation, not in the text file
   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3     

Do you have any idea to read it properly? It would be appreciated for any help!

As suggested, DictReader could also be used as follows to create a list of rows. This could then be imported as a frame in pandas:

import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', 'rb') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1], restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError, e:
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant

If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:

import pandas as pd
import csv


rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv', 'rb') as f_input:
    for row in f_input:
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display:

  rating      time      item        user  \
0      5  13160032  TYh23hs9   isjiad123   
1      3  14423321  TGjsk123  hjf2329ccc   

                                    review  
0  I love this phone as it is easy to used  
1                          Suck restaurant  

row[::-1] is used to reverse the text of the whole line, the [2:] skips over the line ending which is now at the start of the line. Each line is then split on spaces. A list comprehension then re-reverses each split entry. Finally rows is appended to first by taking the fixed 5 column entries (now at the start). The remaining entries are then joined back together with a space and added as the final column.

The benefit of this approach is that it does not rely on your input data being in an exactly fixed width format, and you don't have to worry if the column widths being used change over time.

It looks like this is a fixed width file. Pandas supplies read_fwf for this exact purpose. The following code reads the file correctly for me. You may want to mess around with the widths a little if it doesn't work perfectly.

pandas.read_fwf('test.fwf', 
                 widths=[13, 12, 13, 5, 100], 
                 names=['user', 'item', 'time', 'rating', 'review'])

If the columns still line up with the edited version (where the rating comes first), you just need to add the correct specification. A guide line like the following helps to do this quickly:

0        1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
  I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                          hjf2329ccc   TGjsk123     14423321    3     

So the new command becomes:

pandas.read_fwf('test.fwf', 
                colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]], 
                names=['review', 'user', 'item', 'time', 'rating'])

Since the first four (now last four) of the fields are never going to contain spaces or need to be surrounded by quotes, let's forget about the csv library and use python's awesome string handling directly. Here is a one-liner that splits each line into exactly five columns, courtesy of the maxsplit argument to rsplit() :

with open("myfile.dat") as data:
    frame = pd.DataFrame(line.strip().rsplit(maxsplit=4) for line in data)

The above should solve your problem, but I prefer to unpack it into a generator function that is easier to understand, and can be extended if necessary:

def splitfields(data):
    """Generator that parses the data correctly into fields"""
    for line in data:
        fields = line.rsplit(maxsplit=4)
        fields[0] = fields[0].strip()   # trim line-initial spaces
        yield fields

with open("myfile.dat") as data:
    frame = pd.DataFrame(splitfields(data))

Both versions avoid having to build a large ordinary array in memory only to hand it over to the DataFrame constructor. As each line of input is read from the file, it is parsed and immediately added to the dataframe.

The above is for the format in the updated question, which has the free text on the left. (For the original format, use line.split instead of line.rsplit and strip the last field, not the first.)

    I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3

There's more you could do depending on what the data actually looks like: If the fields are separated by exactly four spaces (as it seems from your example), you could split on " " instead of splitting on all whitespace. That would also work correctly if some other fields can contain spaces. In general, pre-parsing like this is flexible and extensible; I leave the code simple since there's no evidence from your question that more is needed.

Usecols refers to the name of the columns in the input file. If your file doesn't have those columns named like that ( user, item, rating ) it won't know which columns you're referring to. Instead you should pass an index like usecols=[0,1,2] .

Also, names refers to what you're calling the columns you import. So, I think you cannot have four names upon importing 3 columns. Does this work?

pd.read_csv(filename, sep = " ", 
                      header = None, 
                      names = ["user","item","rating"], 
                      usecols = [0,1,2])

The tokenizing error looks like a problem with the delimiter. It may try to parse your review text column as many columns, because "I" "love" "this" ... are all separated by spaces. Hopefully if you're only reading the first three columns you can avoid throwing an error, but if not you could consider parsing row-by-row (for example, here: http://cmdlinetips.com/2011/08/three-ways-to-read-a-text-file-line-by-line-in-python/ ) and writing to a DataFrame from there.

I think the best approach is using pandas read_csv :

 import pandas as pd
import io

temp=u"""  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant so I love cooking pizza with onion ham garlic tomatoes """


#estimated max length of columns 
N = 20

#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), 
                 sep = "\s+", #separator is arbitrary whitespace 
                 header = None, #first row is not header, read all data to df
                 names=range(N)) 
print df
           0         1         2   3     4           5     6      7     8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone    as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant    so      I  love   

        9      10    11     12   13      14        15  16  17  18  19  
0       it     is  easy     to  use     NaN       NaN NaN NaN NaN NaN  
1  cooking  pizza  with  onion  ham  garlic  tomatoes NaN NaN NaN NaN

#get order of wanted columns
df = df.iloc[:, [0,1,2]]
#rename columns
df.columns = ['user','item','time']
print df
         user      item      time
0  disjiad123  TYh23hs9  13160032
1  hjf2329ccc  TGjsk123  14423321

If you need all columns, you need preprocessing for founding max length of columns for parameter usecols and then postprocessing join last columns to one:

import pandas as pd
import csv

#preprocessing
def get_max_len():
    with open('file1.csv', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            num.append(len(''.join(row).split()))
        m = max(num)
        #print m
        return m


df = pd.read_csv('file1.csv', 
                         sep = "\s+", #separator is arbitrary whitespace 
                         header = None, #first row is not header, read all data to df
                         usecols = range(get_max_len())) #filter first, second and fourth column (python count from 0)
print df
           0         1         2   3     4           5     6      7    8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone   as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant   NaN    NaN  NaN   

    9    10    11   12   13  
0   it   is  easy   to  use  
1  NaN  NaN   NaN  NaN  NaN 
#df from 4 col to last
print df.ix[:, 4:]
     4           5     6      7    8    9    10    11   12   13
0     I        love  this  phone   as   it   is  easy   to  use
1  Suck  restaurant   NaN    NaN  NaN  NaN  NaN   NaN  NaN  NaN

#concanecate columns to one review text
df['review text'] = df.ix[:, 4:].apply(lambda x: ' '.join([e for e in x if isinstance(e, basestring)]), axis=1)
df = df.rename(columns={0:'user', 1:'item', 2:'time',3:'rating'})

#get string columns
cols = [x for x in df.columns if isinstance(x, basestring)]

#filter only string columns
print df[cols]
         user      item      time  rating  \
0  disjiad123  TYh23hs9  13160032       5   
1  hjf2329ccc  TGjsk123  14423321       3   

                              review text  
0  I love this phone as it is easy to use  
1                         Suck restaurant  

I would iterate over each line and replace the consecutive spaces with semicolon. Then call str.split() and choose semicolon as separator. It could look like the following:

data = [["user","item","rating", "review"]]
with open("your.csv") as f:
    for line in f.readlines():
        for i in range(10, 1, -1):
            line = line.replace(' '*i, ';')
        data += [line.split(';')]

I think OP is using Amazon's review data , and if so I have also found this input file to be difficult to read. I'm not 100% sure, but I think the reason that pandas.read_csv has a hard time with this is that the review_body column has tabs that replace newline characters (for whatever reason).

I tried a few of the solutions and I ended up building a new solution based on that proposed by @alexis. The solutions here did not work because the files in the link I provided have the following column names (note that "review_body" is neither at the end or the beginning of the list):

['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline', 'review_body', 'review_date']

I apologize in advance for the similarity of the variable names. For example, there is a stopCol and stopCols . I know ... very bad form.

    # declare dictionary to contain columns from left-to-right search
    forwCols = {}
    # declare dictionary to contain "review_body" column
    stopCols = {}
    # declare dictionary to contain columns from right-to-left search
    revrCols = {}

    with open(filstr,'r') as TSVfile:
        lines    = TSVfile.readlines()
        # The header should have the maximum num of cols
        numCols  = len(lines[0].split())
        # Find which column index corresponds to 'review body' col
        stopCol  = lines[0].split().index('review_body')
        colNames = lines[0].split()

    for lineInt in range(1,len(lines)):
        # populate dict with cols until the column with tabs
        forwCols[lineInt] = lines[lineInt].\
                            split('\t',maxsplit=14)[:stopCol]
        # reverse list
        revrCols[lineInt] = lines[lineInt].rsplit('\t',maxsplit=2)[2:]
        forwLine = '\t'.join(forwCols[lineInt])
        revrLine = '\t'.join(revrCols[lineInt])
        # this next line removes the contents of the line that exists in
        # the dicts that are created already
        stopCols[lineInt] = \
                lines[lineInt].replace(forwLine,'').replace(revrLine,'')

    # Create three DFs using the three dicts just created
    revDF  = pd.DataFrame.from_dict(forwCols,orient='index',\
                            columns=colNames[:stopCol])
    dateDF = pd.DataFrame.from_dict(revrCols,orient='index',columns=['review_date'])
    revbodyDF = pd.DataFrame.from_dict(stopCols,orient='index',\
                                       columns=['review_body'])

    # join the three DFs together on indices
    combineDF1 = revbodyDF.merge(right=dateDF,how='outer',left_index=True,\
                                 right_index=True)
    combineDF = revDF.merge(right=combineDF1,how='outer',\
                                 left_index=True,right_index=True)

The solution above is a brute-force approach, but it is the only way I could see that this might work where the column containing tabs isn't the first or last column.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM