Example from “Python for Data Analysis”, Chapter 2

Question

I'm following along with the examples in Wes McKinney's "Python for Data Analysis".

In Chapter 2, we are asked to count the number of times each time zone appears in the 'tz' field, where some entries do not have a 'tz'.

McKinney's count of "America/New_York" comes out to 1251 (two of them are in the first 10 of the 3440 lines, as you can see below), whereas mine comes out to 1. Why does my code show 1?

I am using Python 2.7, installed from Enthought (epd-7.3-1-win-x86_64.msi) as McKinney instructs in the text. The data comes from https://github.com/Canuckish/pydata-book/tree/master/ch02 . As you can probably tell from the title of the book, I am new to Python, so please include instructions on how to get any information I have not provided.

import json

path = 'usagov_bitly_data2012-03-16-1331923249.txt'

open(path).readline()

records = [json.loads(line) for line in open(path)]
records[0]
records[1]
print records[0]['tz']

The last line here shows 'America/New_York'; the same lookup on records[1] shows 'America/Denver'.

#count unique time zones rating movies
#NOTE: NOT every JSON entry has a tz, so first line won't work
time_zones = [rec['tz'] for rec in records]

time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]

This shows the first ten time zone entries, where 8-10 are blank...
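
To make the filter concrete, here is a small sketch on made-up records (the three dicts below are illustrative and not taken from the data file). It suggests why blank entries still show up: `'tz' in rec` skips records that lack the key, but keeps records whose 'tz' value is an empty string.

```python
# Hypothetical sample records; the real ones come from json.loads on each line.
sample_records = [
    {'tz': 'America/New_York', 'c': 'US'},
    {'tz': '', 'c': 'US'},   # key present, value blank
    {'c': 'US'},             # no 'tz' key at all
]

# Same filter as in the question: keep only records that have a 'tz' key.
sample_time_zones = [rec['tz'] for rec in sample_records if 'tz' in rec]
print(sample_time_zones)  # ['America/New_York', '']
```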

#counting using a dict to store counts
def get_counts(sequence):
    counts = {}
        for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
        return counts

counts = get_counts(time_zones)
counts['America/New_York']

This returns 1, but it should be 1251.

len(time_zones)

This returns 3440, as it should.

Answer 1

The 'America/New_York' time zone occurs 1251 times in the input:

import json
from collections import Counter

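path = 'usagov_bitly_data2012-03-16-1331923249.txt'  # same file as in the question

with open(path) as f:
    # .get('tz') yields None for records that have no 'tz' key
    c = Counter(json.loads(line).get('tz') for line in f)
print(c['America/New_York']) # -> 1251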
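
The same Counter approach can be tried without the data file. This is a minimal sketch on hypothetical input lines (the JSON strings and the names `sample_lines` and `sample_counter` are made up for illustration); it shows that records without a 'tz' key are counted under None rather than raising a KeyError.

```python
import json
from collections import Counter

# Hypothetical input lines standing in for the data file.
sample_lines = [
    '{"tz": "America/New_York"}',
    '{"tz": "America/Denver"}',
    '{"tz": "America/New_York"}',
    '{"tz": ""}',
    '{"a": "no tz here"}',
]

# .get('tz') returns None when the key is missing, so no KeyError is raised.
sample_counter = Counter(json.loads(line).get('tz') for line in sample_lines)

print(sample_counter['America/New_York'])  # 2
print(sample_counter[None])                # 1, the record without 'tz'
print(sample_counter.most_common(1))       # [('America/New_York', 2)]
```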

It is not clear why the count is 1 for your code. Perhaps the code indentation is not correct:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
    else: #XXX wrong indentation
        counts[x] = 1 # it is run after the loop if there is no `break` 
    return counts

See "Why does python use 'else' after for and while loops?"
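
To see what the mis-indented version does in practice, here is a small sketch on a made-up list (the name `get_counts_misindented` and the sample values are only for illustration). Because the `else` belongs to the `for`, the `if` branch never fires on an empty dict, and only the last element ends up stored with a count of 1.

```python
def get_counts_misindented(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
    else:  # attached to the for loop, not to the if
        counts[x] = 1  # runs once, after the loop finishes without a break
    return counts

sample = ['a', 'b', 'a', 'b', 'c']
result = get_counts_misindented(sample)
print(result)  # {'c': 1}
```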

The correct indentation should be:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else: 
            counts[x] = 1 # it is run every iteration if x not in counts
    return counts
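
Another indentation slip that would give exactly 1 for the first item is a `return` indented into the loop body, which is how the `return counts` line in the question's listing appears to be aligned. This is only a guess at what the original script looked like; the function name and the sample list below are made up for illustration.

```python
def get_counts_early_return(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
        return counts  # indented into the loop: returns after the first item

sample_zones = ['America/New_York', 'America/Denver', 'America/New_York']
early = get_counts_early_return(sample_zones)
print(early)  # {'America/New_York': 1}
```

Since the first record in the question has 'tz' equal to 'America/New_York', a function shaped like this would return a count of 1 for that key.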

Check that you do not mix spaces and tabs for indentation; run your script with python -tt to find out.
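
If running with -tt is not convenient, a rough check can also be scripted. The sketch below uses a hypothetical helper, `find_mixed_indentation`, that reports lines whose leading whitespace contains both tabs and spaces. It is a simple heuristic and does not catch a file that uses tabs on some lines and spaces on others.

```python
def find_mixed_indentation(source):
    """Return 1-based line numbers whose indentation mixes tabs and spaces."""
    flagged = []
    for number, line in enumerate(source.splitlines(), 1):
        # The indent is the part of the line that lstrip would remove.
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if ' ' in indent and '\t' in indent:
            flagged.append(number)
    return flagged

# Hypothetical source text: line 3 starts with a tab followed by spaces.
sample_source = "def f(xs):\n    for x in xs:\n\t    print(x)\n    return xs\n"
print(find_mixed_indentation(sample_source))  # [3]
```

The standard library's tabnanny module (python -m tabnanny yourscript.py) performs a more thorough check of the same kind.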

1 answer

solution1
0 ACCEPTED 2014-05-23 02:53:06
