Goal: Gain experience developing a machine learning project on a real-world dataset by applying the concepts and algorithms that you have learned in class. This is an individual project for graduate students; undergraduate students are allowed (but not required) to work in pairs. You can choose one of three possible tasks:
Option 1: You are working for a nonprofit that is recruiting student volunteers to help with Alzheimer’s patients. You have been tasked with predicting how suitable a person is for this work by predicting how empathetic he or she is. Using the Young People Survey dataset (https://www.kaggle.com/miroslavsabo/young-people-survey/), predict a person’s “empathy” on a scale from 1 to 5. You can use any of the other attributes in the dataset to make this prediction.
Option 2: You have joined a startup that delivers healthy meals to people. You have been tasked with doing a marketing study and understanding how likely a person is to “pay more money for good, quality or healthy food” (on a scale from 1 to 5) using the Young People Survey dataset (https://www.kaggle.com/miroslavsabo/young-people-survey/). You can use any of the other attributes in the dataset to make this prediction.
Option 3: You have been hired as the first data scientist at a news organization and tasked with creating value for the company from their terabytes of data. As a first task, you decide to use machine learning to do something interesting with the Vox news corpus (https://data.world/elenadata/vox-articles). You need to define a supervised or unsupervised learning problem and solve it. You must do something more complicated than binary classification.
For all tasks, you can use existing Python packages, such as sklearn, libsvm, TensorFlow, Keras, etc., but make sure you give credit in your write-up. In your evaluation, you will need to define simple classifiers as baselines and show that your proposed method performs better than the baselines. Split the data into train/dev/test sets, tune hyperparameters on the dev data only, and report final results on the test data. You are welcome to report on multiple methods that you have tried.
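The evaluation workflow described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a required implementation: the toy data generator, the split fractions, the choice of k-nearest neighbours as the "proposed method", and all function names (`split_data`, `majority_baseline`, `knn_predictor`, `make_toy_data`) are assumptions made for the example. In a real project you would load one of the datasets above and would likely use a package such as sklearn instead.

```python
import random
from collections import Counter


def split_data(rows, dev_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows and split them into train/dev/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_dev = int(len(rows) * dev_frac)
    test = rows[:n_test]
    dev = rows[n_test:n_test + n_dev]
    train = rows[n_test + n_dev:]
    return train, dev, test


def majority_baseline(train):
    """Baseline: always predict the most common label in the training set."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common


def knn_predictor(train, k):
    """k-nearest-neighbour classifier using squared Euclidean distance."""
    def predict(features):
        nearest = sorted(
            train,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
        )[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


def accuracy(predict, rows):
    """Fraction of rows whose predicted label equals the gold label."""
    return sum(predict(features) == label for features, label in rows) / len(rows)


def make_toy_data(n=300, seed=0):
    """Synthetic stand-in data (an assumption): labels 1..5, two noisy features."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(1, 5)
        features = (label + rng.gauss(0, 0.7), rng.gauss(0, 1))
        rows.append((features, label))
    return rows


train, dev, test = split_data(make_toy_data())
baseline = majority_baseline(train)

# Tune the hyperparameter k on the dev set only.
dev_scores = {k: accuracy(knn_predictor(train, k), dev) for k in (1, 3, 5, 9, 15)}
best_k = max(dev_scores, key=dev_scores.get)
model = knn_predictor(train, best_k)

# The test set is touched once, after k has been fixed.
print("dev accuracy by k:", dev_scores)
print("baseline test accuracy:", accuracy(baseline, test))
print("k-NN test accuracy (k=%d):" % best_k, accuracy(model, test))
```

The point of the sketch is the order of operations: the baseline and the candidate models are fit on the training split, the hyperparameter is selected on the dev split, and the test split is used only for the final reported numbers.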
Be sure to answer the following questions: (a) what is your data and task? (b) what ML solution did you choose and, most importantly, why was this an appropriate choice? (c) how did you choose to evaluate success? (d) what software did you use and why did you choose it? (e) what are the results? (f) show some examples from the development data that your approach got correct and some it got wrong: if you were to try to fix the ones it got wrong, what would you do?
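For question (f), one possible way to collect development examples that a model got right and wrong is sketched below. The helper name `error_analysis`, the toy dev rows, and the toy predictor are all hypothetical and serve only to make the example runnable; substitute your own model's prediction function and your real dev split.

```python
def error_analysis(predict, dev_rows, max_examples=3):
    """Return up to max_examples correct and incorrect dev predictions."""
    correct, wrong = [], []
    for features, gold in dev_rows:
        guess = predict(features)
        record = {"features": features, "gold": gold, "predicted": guess}
        if guess == gold:
            correct.append(record)
        else:
            wrong.append(record)
    return correct[:max_examples], wrong[:max_examples]


# Hypothetical dev rows: (answers to two survey questions, gold label 1..5).
toy_dev = [((5, 4), 5), ((1, 2), 1), ((3, 3), 4), ((4, 5), 4), ((2, 1), 3)]


def toy_predict(features):
    """Hypothetical stand-in model: copy the first survey answer."""
    return features[0]


correct, wrong = error_analysis(toy_predict, toy_dev)
for record in wrong:
    print("WRONG:", record)
for record in correct:
    print("OK:   ", record)
```

Printing the features next to the gold and predicted labels makes it easier to spot patterns in the errors, which is what the last part of question (f) asks you to discuss.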
What to submit:
Your entire homework will be considered late if any of these parts are submitted late.