Decomposition - Fri, Oct 18

Modular Design

Separation of Concerns

A design principle: Isolate different parts of a program that address different concerns

Phases of the Hog Project

Phases of the Ants Project

The Yelp Example

Find the 3 most similar restaurants to each Thai restaurant in the dataset.

results = search('Thai')
for r in results:
    print(r, 'is similar to', r.similar(3))

To do this, we need to define search and similar:

def search(query):
    results = [r for r in Restaurant.all if query in r.name]
    return results

How do you rank these results?

def search(query, ranking=lambda r: -r.stars):
    ...
    return sorted(results, key=ranking)
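
The effect of the ranking argument can be seen in isolation. This is a minimal sketch, assuming hypothetical (name, stars) pairs in place of Restaurant objects; the data is made up for illustration.

```python
# Hypothetical (name, stars) pairs standing in for Restaurant objects.
places = [('Thai Basil', 4), ('Thai Delight', 5), ('Top Dog', 5)]

# Default-style ranking: negate stars so higher ratings sort first.
# sorted is stable, so ties keep their original relative order.
by_stars = sorted(places, key=lambda p: -p[1])

# Alternative ranking: alphabetical by name.
by_name = sorted(places, key=lambda p: p[0])

print(by_stars)  # [('Thai Delight', 5), ('Top Dog', 5), ('Thai Basil', 4)]
print(by_name)   # [('Thai Basil', 4), ('Thai Delight', 5), ('Top Dog', 5)]
```

Passing the ranking as a function keeps the question of how to order results separate from the question of which results match.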

To implement similar, we need to create the Restaurant class:

class Restaurant:
    all = []
    def __init__(self, name, stars):
        self.name, self.stars = name, stars
        Restaurant.all.append(self)
    
    def similar(self, k):
        "Return the K most similar restaurants to SELF."
        ...

Let's add a __repr__ method so that a restaurant displays as its name:

class Restaurant:
    ...
    def __repr__(self):
        return self.name

Does this work so far?

>>> Restaurant('Top Dog', 5)
Top Dog
>>> Restaurant('Thai Basil', 4)
Thai Basil
>>> Restaurant('Thai Delight', 5)
Thai Delight
>>> results = search('Thai')
>>> for r in results:
...     print(r, 'is similar to', r.similar(3))
... 
Thai Delight is similar to None
Thai Basil is similar to None

We're ready to write the similar function!

class Restaurant:
    def similar(self, k, similarity=reviewer_overlap):
        "Return the K most similar restaurants to SELF."
        others = list(Restaurant.all)
        others.remove(self)
        
        # sorted calls key with one argument, so fix SELF as similarity's second argument
        f = lambda r: similarity(r, self)
        return sorted(others, key=f, reverse=True)[:k]

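The default value for similarity, reviewer_overlap, has not been defined yet. Python evaluates default values when the def statement runs, so it must be defined before the Restaurant class: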

def reviewer_overlap(r, s):
    "Number of users who reviewed both R and S"
    return len([u for u in r.reviewers if u in s.reviewers])

The Restaurant class now needs reviewers!

class Restaurant:
    def __init__(self, name, stars, reviewers):
        ...
        self.reviewers = reviewers  # list of user_ids
        ...
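
Putting the pieces so far together gives something like the sketch below. It is one way the fragments above might be assembled, not necessarily the exact lecture code; the restaurants and user ids are made up for illustration.

```python
def reviewer_overlap(r, s):
    "Number of users who reviewed both R and S"
    return len([u for u in r.reviewers if u in s.reviewers])


class Restaurant:
    all = []  # every Restaurant constructed so far

    def __init__(self, name, stars, reviewers):
        self.name, self.stars = name, stars
        self.reviewers = reviewers  # list of user_ids
        Restaurant.all.append(self)

    def similar(self, k, similarity=reviewer_overlap):
        "Return the K most similar restaurants to SELF."
        others = list(Restaurant.all)
        others.remove(self)
        # Most similar first, then keep the first k.
        f = lambda r: similarity(r, self)
        return sorted(others, key=f, reverse=True)[:k]

    def __repr__(self):
        return self.name


# Made-up restaurants and user ids, for illustration only.
top_dog = Restaurant('Top Dog', 5, ['u1', 'u2'])
basil = Restaurant('Thai Basil', 4, ['u2', 'u3', 'u4'])
delight = Restaurant('Thai Delight', 5, ['u3', 'u4', 'u5'])

print(basil.similar(1))    # [Thai Delight]
print(top_dog.similar(2))  # [Thai Basil, Thai Delight]
```

Thai Basil shares two reviewers with Thai Delight and only one with Top Dog, so Thai Delight ranks first.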

We've been using fake data, so let's replace it with real data, read from files that hold one JSON object per line:

import json

# Map each business_id to the list of user_ids that reviewed it.
reviewers_by_restaurant = {}
for line in open('reviews.json'):
    r = json.loads(line)
    biz = r['business_id']
    if biz in reviewers_by_restaurant:
        reviewers_by_restaurant[biz].append(r['user_id'])
    else:
        reviewers_by_restaurant[biz] = [r['user_id']]

# Construct a Restaurant for each business, attaching its reviewers.
for line in open('restaurants.json'):
    r = json.loads(line)
    reviewers = reviewers_by_restaurant[r['business_id']]
    Restaurant(r['name'], r['stars'], reviewers)
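
The grouping step can be tried without the real files. This is a minimal sketch on in-memory data, assuming the same one-object-per-line layout; the records are made up, and collections.defaultdict is swapped in as an alternative to the explicit if/else, not as the method used in the demo.

```python
import io
import json
from collections import defaultdict

# Made-up stand-in for reviews.json: one JSON object per line.
fake_reviews = io.StringIO(
    '{"business_id": "b1", "user_id": "u1"}\n'
    '{"business_id": "b2", "user_id": "u2"}\n'
    '{"business_id": "b1", "user_id": "u3"}\n'
)

# defaultdict(list) creates an empty list the first time a key is used,
# so no membership check is needed before appending.
reviewers_by_restaurant = defaultdict(list)
for line in fake_reviews:
    review = json.loads(line)
    reviewers_by_restaurant[review['business_id']].append(review['user_id'])

print(dict(reviewers_by_restaurant))
# {'b1': ['u1', 'u3'], 'b2': ['u2']}
```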

Let's speed up the reviewer_overlap function. The first step is to store each restaurant's reviewers in sorted order:

# replace the last line in the demo above with:
    Restaurant(r['name'], r['stars'], sorted(reviewers))

Linear-Time Intersection of Sorted Lists

Given two sorted lists with no repeats, return the number of elements that appear in both.

Because both lists are sorted, compare the current element of each. If they differ, the smaller one cannot appear anywhere later in the other list, so it can be skipped.

def fast_overlap(s, t):
    i, j, count = 0, 0, 0
    while i < len(s) and j < len(t):
        if s[i] == t[j]:
            count, i, j = count + 1, i + 1, j + 1
        elif s[i] < t[j]:
            i = i + 1
        else:
            j = j + 1
    return count

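This takes time linear in the combined length of the two lists. By contrast, the list comprehension in reviewer_overlap scans s.reviewers once for every element of r.reviewers, which is quadratic.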
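
As a sanity check, fast_overlap can be compared against set intersection. This is a small sketch on made-up integer lists; the random test is an added illustration, not part of the original demo.

```python
import random


def fast_overlap(s, t):
    "Count elements in both sorted, repeat-free lists S and T."
    i, j, count = 0, 0, 0
    while i < len(s) and j < len(t):
        if s[i] == t[j]:
            count, i, j = count + 1, i + 1, j + 1
        elif s[i] < t[j]:
            i = i + 1
        else:
            j = j + 1
    return count


s = [3, 4, 6, 7, 9, 10]
t = [1, 3, 5, 7, 8]
print(fast_overlap(s, t))  # 2 (the shared elements are 3 and 7)

# Compare against set intersection on random sorted lists without repeats.
random.seed(0)
for _ in range(100):
    a = sorted(random.sample(range(50), 10))
    b = sorted(random.sample(range(50), 10))
    assert fast_overlap(a, b) == len(set(a) & set(b))
```

Each loop iteration advances at least one index, so the loop runs at most len(s) + len(t) times.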