Overview of the Assignment
In this assignment, you will implement the SON algorithm using the Apache Spark Framework. You will develop a program to find frequent itemsets in two datasets: one simulated dataset and one real-world dataset generated from the Yelp dataset. The goal of this assignment is to apply the algorithms you have learned in class to large datasets more efficiently in a distributed environment.
2.1 Programming Requirements
2.2 Programming Environment
Python 3.6, Scala 2.11, and Spark 2.3.3
We will use these library versions to compile and test your code. There will be a 20% penalty if we cannot run your code because of a library version inconsistency.
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code!
TAs will combine all the code we can find on the web (e.g., GitHub), as well as code from other students in this and previous sections, for plagiarism detection. We will report all detected plagiarism.
2.4 What you need to turn in
Your submission must be a zip file with the name firstname_lastname_hw2.zip (all lowercase). You need to pack the following files in the zip file (see Figure 1):
a. two Python scripts, named: (all lowercase)
firstname_lastname_task1.py
firstname_lastname_task2.py
b1. [OPTIONAL] two Scala scripts, named: (all lowercase)
firstname_lastname_task1.scala
firstname_lastname_task2.scala
b2. [OPTIONAL] one jar package, named: firstname_lastname_hw2.jar (all lowercase)
Figure 1: Submission Structure
In this assignment, you will use one simulated dataset and one real-world dataset. In Task 1, you will build and test your program with a small simulated CSV file that has been provided to you. In Task 2, you will build and test your program with large real-world data generated from a subset of business.json and review.json from the Yelp dataset (https://www.yelp.com/dataset), with the same structure as the simulated data. To save time, this dataset has also been provided to you.
Figure 2: Input Data Format
In this assignment, you will implement the SON algorithm to solve both tasks (Task 1 and Task 2) on top of the Apache Spark Framework. You need to find all possible combinations of frequent itemsets in any given input file within the required time. You can refer to Chapter 6 of the Mining of Massive Datasets book, concentrating on Section 6.4, Limited-Pass Algorithms. You need to use the A-Priori algorithm to process each chunk of the data.
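To make the two passes concrete, here is a minimal sketch of how SON with A-Priori on each chunk might be organized in PySpark. It is an illustration under assumptions, not a required structure: `baskets` is assumed to be an RDD of frozensets, and all function names are placeholders.

```python
# Minimal sketch of SON (illustrative only; all names are placeholders,
# and `baskets` is assumed to be an RDD of frozensets).
from collections import defaultdict

def apriori(chunk, threshold):
    """In-memory A-Priori over one chunk; returns frequent itemsets as frozensets."""
    counts = defaultdict(int)
    for basket in chunk:
        for item in basket:
            counts[frozenset([item])] += 1
    frequent = {s for s, c in counts.items() if c >= threshold}
    result = set(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets; keep only unions of exactly size k.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # A-Priori prune: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(c - {item} in frequent for item in c)}
        counts = defaultdict(int)
        for basket in chunk:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        frequent = {s for s, c in counts.items() if c >= threshold}
        result |= frequent
        k += 1
    return result

def son(baskets, support):
    n = baskets.count()

    # Pass 1: A-Priori on each partition with a proportionally lowered
    # threshold; the union of the local results is the candidate set.
    def local_frequent(partition):
        chunk = list(partition)
        return apriori(chunk, support * len(chunk) / n)

    candidates = baskets.mapPartitions(local_frequent).distinct().collect()

    # Pass 2: count each candidate's true support over all baskets.
    frequent = (baskets
                .flatMap(lambda b: [(c, 1) for c in candidates if c <= b])
                .reduceByKey(lambda x, y: x + y)
                .filter(lambda kv: kv[1] >= support)
                .keys()
                .collect())
    return candidates, frequent
```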
4.1 Task 1: Simulated data (6 pts)
There is a CSV file (test_data.csv) posted on the Blackboard. You can use this test file to debug your code.
In this task, you need to build two kinds of market-basket models.
Case 1 (3 pts):
You will calculate the combinations of frequent businesses (as singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. You need to create a basket for each user containing the business ids reviewed by that user. If a business was reviewed more than once by the same reviewer, it is counted only once; in other words, the business ids within each basket are unique. The generated baskets are similar to:
user1: [business11, business12, business13, …]
user2: [business21, business22, business23, …]
user3: [business31, business32, business33, …]
Case 2 (3 pts):
You will calculate the combinations of frequent users (as singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. You need to create a basket for each business containing the user ids that commented on this business. As in Case 1, the user ids within each basket are unique. The generated baskets are similar to:
business1: [user11, user12, user13, …]
business2: [user21, user22, user23, …]
business3: [user31, user32, user33, …]
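As a sketch of the basket construction for both cases: the code below assumes one "user_id,business_id" pair per CSV line after a header row, which is an assumption on my part; adjust it to the actual layout shown in Figure 2.

```python
# Sketch of basket construction for both cases. Assumes one
# "user_id,business_id" pair per CSV line after a header row; adjust
# to the actual layout shown in Figure 2.
import sys
from pyspark import SparkContext

case_number = int(sys.argv[1])
support = int(sys.argv[2])
input_file_path = sys.argv[3]
output_file_path = sys.argv[4]

sc = SparkContext(appName="task1")
lines = sc.textFile(input_file_path)
header = lines.first()
pairs = (lines.filter(lambda line: line != header)
              .map(lambda line: line.split(',')))

if case_number == 1:
    # Case 1: key by user, collect the businesses that user reviewed.
    keyed = pairs.map(lambda p: (p[0], p[1]))
else:
    # Case 2: key by business, collect the users who reviewed it.
    keyed = pairs.map(lambda p: (p[1], p[0]))

# frozenset() deduplicates, so a repeat review counts only once.
baskets = keyed.groupByKey().mapValues(frozenset).values()
```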
Input format:
Output format:
You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
You should use “Candidates:” as the tag. For each line, you should output the candidate frequent itemsets found after the first pass of the SON algorithm, followed by an empty line after each combination. The printed itemsets must be sorted in lexicographical order (both user_id and business_id are of type string).
You should use “Frequent Itemsets:” as the tag. For each line, you should output the final frequent itemsets found after finishing the SON algorithm. The format is the same as for the intermediate results. The printed itemsets must be sorted in lexicographical order.
Here is an example of the output file:
Both the intermediate results and final results should be saved in ONE output result file “firstname_lastname_task1.txt”.
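A sketch of one way to write both sections is shown below, reusing `output_file_path`, `candidates`, and `frequent` from the earlier sketches. The exact per-line format (quoting and separators) is an assumption here, so match it to the provided example output file.

```python
# Sketch of the output writer. The exact per-line format (quoting and
# separators) is an assumption; match it to the provided example file.
def write_section(f, tag, itemsets):
    f.write(tag + "\n")
    by_size = {}
    for itemset in itemsets:
        # Sort items inside each itemset so comparisons are lexicographic.
        by_size.setdefault(len(itemset), []).append(tuple(sorted(itemset)))
    for size in sorted(by_size):
        line = ",".join("('" + "', '".join(items) + "')"
                        for items in sorted(by_size[size]))
        f.write(line + "\n\n")  # empty line after each size group

with open(output_file_path, "w") as f:
    write_section(f, "Candidates:", candidates)
    write_section(f, "Frequent Itemsets:", frequent)
```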
Execution example:
Python:
spark-submit firstname_lastname_task1.py <case number> <support> <input_file_path> <output_file_path>
Scala:
spark-submit --class firstname_lastname_task1 firstname_lastname_hw2.jar <case number> <support> <input_file_path> <output_file_path>
4.2 Task 2: SON algorithm on Yelp data (6.5 pts)
In Task 2, you will explore the Yelp dataset to find the frequent business sets (only Case 1).
We generated a sample dataset in CSV format from business.json and review.json for the state of Arizona. You can download this dataset (AZ_Yelp.csv) from Blackboard.
Apply the SON algorithm
The requirements for Task 2 are similar to those of Task 1. However, you will test your implementation with the large dataset (AZ_Yelp.csv), and you need to report the total execution time. This execution time includes everything from reading the input file to writing the results to the output file. You are asked to find the frequent business sets (only Case 1) from the file we provide. The steps you need to do are: (1) filter out the qualified users, i.e., those who reviewed more than <filter threshold> businesses; (2) apply the SON algorithm, as in Task 1 Case 1, to the baskets of the qualified users. A sketch of the filter step follows.
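Reusing `baskets`, `son`, and `support` from the earlier sketches, the pre-filter might look like the following. Note that reading "qualified" as strictly more than the threshold is my interpretation; follow the spec's definition.

```python
# Sketch of the Task 2 pre-filter, reusing `baskets`, `son`, and
# `support` from the earlier sketches. "Strictly more than the
# threshold" is an assumption about the qualification rule.
import sys

filter_threshold = int(sys.argv[1])

# Keep only users whose baskets contain more than `filter_threshold`
# distinct businesses, then run SON exactly as in Task 1, Case 1.
qualified = baskets.filter(lambda b: len(b) > filter_threshold)
candidates, frequent = son(qualified, support)
```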
Input format:
Output format:
You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
The output file format is the same as in Task 1. Both the intermediate results and final results should be saved in ONE output result file, “firstname_lastname_task2.txt”.
Execution example:
Python:
spark-submit firstname_lastname_task2.py <filter threshold> <support> <input_file_path> <output_file_path>
Scala:
spark-submit --class firstname_lastname_task2 firstname_lastname_hw2.jar <filter threshold> <support> <input_file_path> <output_file_path>
Task 1:

Input File    | Case | Support | Runtime (sec)
test_data.csv | 1    | 4       | <=100
test_data.csv | 2    | 7       | <=200

Task 2:

Input File  | Filter Threshold | Support | Runtime (sec)
AZ_Yelp.csv | 70               | 50      | <=1000
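To report the “Duration” tag and check yourself against these limits, a simple wall-clock measurement around the whole run (reading the input through writing the output) is enough; for example:

```python
import time

start = time.time()
# ... read the input file, run SON, write the output file ...
print("Duration: %d" % (time.time() - start))
```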
(A % penalty is deducted from the points you would otherwise earn.)
Example situations
Task   | Score for Python             | Score for Scala (10% of the Python score if correct) | Total
Task 1 | Correct: 6 points            | Correct: 6 * 10%                                     | 6.6
Task 1 | Wrong: 0 points              | Correct: 0 * 10%                                     | 0.0
Task 1 | Partially correct: 3 points  | Correct: 3 * 10%                                     | 3.3
Task 1 | Partially correct: 3 points  | Wrong: 0                                             | 3.0