
Welcome to DS2000 CS Practicum 10!


Today we will practice dictionaries, lists, sets, and file handling. We will work with real review data from yelp.com. There are two files: review.json and business.json.

Let's assume that the first part of our project is to write a function that takes a user_id as its argument and returns the number of reviews that user wrote and their average score. One way to do it would be to read the entire file line by line and, whenever a review belongs to that user, add its score to a list. That might be OK with 5000 reviews, but the full dataset has 7 million of them, and going through the whole file for every single user takes a very long time.
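To make the slow approach concrete, here is a minimal sketch of it. The function name user_activity_slow and the sample lines are made up for illustration, and the sketch assumes each review stores its rating under the key 'stars'. The point is that every call walks through all the lines again.

```python
import json

def user_activity_slow(user_id, lines):
    """Scan every review line and collect the stars left by one user."""
    stars = []
    for line in lines:
        review = json.loads(line)  # one JSON object per line
        if review['user_id'] == user_id:
            # assumption: the rating is stored under the key 'stars'
            stars.append(review['stars'])
    average = sum(stars) / len(stars) if stars else None
    return len(stars), average

# Made-up sample lines standing in for the real file
sample_lines = [
    '{"user_id": "user_a", "stars": 4.0}',
    '{"user_id": "user_b", "stars": 2.0}',
    '{"user_id": "user_a", "stars": 5.0}',
]
print(user_activity_slow('user_a', sample_lines))  # (2, 4.5)
```

Because the function accepts any iterable of lines, an open file object could be passed in place of sample_lines. Asking about 100 users this way would mean reading the file 100 times.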

Instead, we are going to read the whole file only once and load the important bits into a dictionary, where the keys are user_ids and the values are lists of the scores each user left. The end result will look somewhat like this:

scores = {
 'AHIF1b3RAMbs2VYb84d2WA': [1.0, 4.0, 3.0, 5.0, 5.0, 2.0, 1.0, 1.0, 5.0],
 'k3Hzz9vl5np_Jj33VTiAQg': [5.0],
 'uh5PUKFXAXrdnK-O8jRgfQ': [5.0],
 'JKyoQtpc3r-Hk-tlS5QJ5g': [5.0],
 'PFAmnXQglWdj-Xcg93w5oA': [5.0],
 'A8ZPlZvlY4t8UwyW8JB1mA': [4.0, 2.0],
 '4tYd1N1gK1Fl2hTBts-I4g': [5.0],
 '-aGusrG93bPe5y9C4Cw6Mw': [4.0],
 'pPGtRiEMo4sCYSUnYrLBqg': [5.0],
 'HvLZ0Uuy0NN1Hp9n66VAcw': [5.0, 5.0, 5.0],
 'bNNrOSGOhXaZTG-oSnFChg': [1.0],
 'B4bNgWW_QHl8rQT_vFVKkA': [5.0],
 # ... and so on for every other user
}

To do this, we can write a function like this:
import json

def load_reviews():
    scores = {}  # initialize an empty dictionary
    with open('review.json', 'r', encoding='utf-8') as review_file:
        for line in review_file:  # read line by line
            review = json.loads(line)  # load the line as a dictionary
            if review['user_id'] not in scores:
                # if we have no scores of that user yet, make an empty list
                scores[review['user_id']] = []
            # add the star rating from the current review to the list
            # of stars given by the current user
            # (assumes the rating is stored under the key 'stars')
            scores[review['user_id']].append(review['stars'])
    return scores
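If it is unclear what json.loads does to a single line, the short sketch below may help. The line here is made up for illustration (real reviews have more fields and real ids), and the key names are assumed to match the ones in review.json.

```python
import json

# A made-up line in the same one-object-per-line style as the review file
line = '{"user_id": "user_a", "business_id": "biz_1", "stars": 4.0}\n'

review = json.loads(line)  # the trailing newline is ignored
print(type(review).__name__)               # dict
print(review['user_id'], review['stars'])  # user_a 4.0
```

After this call, review is an ordinary Python dictionary, so its fields can be looked up by key just like in load_reviews().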

Exercise 1

  1. Download the review.json file.
  2. Paste the load_reviews() function definition in your submission file.
  3. Load the reviews in your main function.
  4. Write a function count_items(scores) that tells you how many unique users we have in the dataset. Hint: keys in the dictionary are unique, each corresponds to one user.
  5. Write a function activity(key, scores) that takes the id of a user as well as your computed scores dictionary and tells us how many reviews a user left and what their average score is. Test that it returns 9, 3.0 for user 'AHIF1b3RAMbs2VYb84d2WA'.
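As a hint for steps 4 and 5, here are the building blocks on a tiny hand-made dictionary, using two of the users from the example above. The names toy_scores, number_of_users, review_count, and average are only for illustration; wrapping this logic into count_items and activity is still up to you.

```python
# Two entries copied from the example dictionary shown earlier
toy_scores = {
    'AHIF1b3RAMbs2VYb84d2WA': [1.0, 4.0, 3.0, 5.0, 5.0, 2.0, 1.0, 1.0, 5.0],
    'A8ZPlZvlY4t8UwyW8JB1mA': [4.0, 2.0],
}

# len() of a dictionary counts its keys
number_of_users = len(toy_scores)

# Look up one user's list, then count and average it
stars = toy_scores['AHIF1b3RAMbs2VYb84d2WA']
review_count = len(stars)
average = sum(stars) / len(stars)

print(number_of_users)        # 2
print(review_count, average)  # 9 3.0
```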

Exercise 2

In the previous exercise we used the dictionary to group reviews by user so we can find out about user activity levels and their average scores. Let's find out a similar thing about businesses.

  1. Modify the load_reviews() function so that it groups reviews by 'business_id' (still reading from the review.json file), so that the keys of the dictionary are now business ids and the values are the lists of star ratings each business received.
  2. How many businesses were reviewed? How many reviews did 'zd1fJLPz0ZeV4aoSIsRYcg' get and what's their average score?

Exercise 3: Extra challenge

In the previous exercise we grouped reviews by business_id, but these IDs don't really tell us anything. Where is 'zd1fJLPz0ZeV4aoSIsRYcg'?! Yelp gave us this information in the business.json file.

  1. Write a load_businesses() function that's similar to load_reviews(). It should read the business.json file, and store the results in a dictionary where the keys are business ids, and the values are dictionaries that contain the name, city, and categories of each business.
  2. Write a function summary(business_id, scores, businesses) that takes the business_id, your loaded scores, and your loaded business information, and prints a readable summary such as: Ava's Cafe is in Rock Hill. They have 57 reviews with an average of 4.86 stars. They are in the Cafes, Desserts, Food, Restaurants, Coffee & Tea, Breakfast & Brunch categories.
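As a hint, the sketch below shows one possible shape for the nested dictionary and how an f-string can round a number to two decimals. The key 'example_business_id' and the average value are made up for illustration, and the categories are assumed here to be a single comma-separated string; check how they are actually stored in business.json.

```python
# One made-up entry: the outer key is a business id,
# the inner dictionary holds the details we care about
businesses = {
    'example_business_id': {
        'name': "Ava's Cafe",
        'city': 'Rock Hill',
        'categories': 'Cafes, Desserts, Food',  # assumed to be a string
    },
}

info = businesses['example_business_id']
average = 4.857142  # made-up value, only to show the rounding

# :.2f formats a float with exactly two digits after the decimal point
sentence = f"{info['name']} is in {info['city']} with an average of {average:.2f} stars."
print(sentence)  # Ava's Cafe is in Rock Hill with an average of 4.86 stars.
```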