Homework #2 Solution



  • Aim

  • Please classify the patched images with the various methods taught in class up to Chapter Seven of the textbook. Beyond the necessary data preprocessing such as scaling, normalizing, etc., Homework #2 asks you to practice cross-validation and ensemble methods. You may apply new methods or use new packages to improve the classification performance, but if you do so, you have to give a brief introduction of the key concepts and provide the necessary citations, instead of just directly copy-pasting or importing. However, in this assignment, you are not allowed to use any neural-network-related models (e.g., multilayer perceptron, CNN, etc.). If any neural-network-related method is applied, you will receive no credit. Once an algorithm package is merged or imported into your code, please list the package link in your references and describe its mathematical concepts in your report, followed by the reason for adoption.

  • Dataset Description

  • The DeepSat (SAT-6) Airborne Dataset is downloaded from https://www.kaggle.com/crawford/deepsat-sat6 [1][2]. In order to save storage space and speed up the learning process, only a portion of the original dataset, labeled with 'building', 'grassland', and 'road', is given in this assignment. Each picture is a 28×28-pixel 4-band (red, green, blue, and near-infrared) image. The whole dataset is saved as CSV files. Here is the dataset format:

  • X_*.csv: 4-band ('R'ed, 'G'reen, 'B'lue, and near 'I'nfrared) image data.

    – Each cell represents one pixel value from 0 to 255 in the 'R'ed, 'G'reen, 'B'lue, or near 'I'nfrared band.

    – Each row is a separate 28×28-pixel 4-band image, flattened into a 1-D array in {color}_{rowIdx}_{colIdx} order: [R_0_0, R_0_1, ..., R_27_27, G_0_0, ..., G_27_27, B_0_0, ..., B_27_27, I_0_0, ..., I_27_27].

    – There is no header row in the CSV files.

  • y_*.csv: label data, where the row indexing matches that in X_*.csv. Each label is a 1×3 one-hot encoded vector standing for 'building', 'grassland', and 'road'.
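
  The flattened row layout above can be sketched in code. The snippet below reshapes one row back into a (28, 28, 4) image array; the `X_train.csv` file name in the comment is a placeholder, and random data stands in for the real CSVs:

  ```python
  import numpy as np

  # Each row of X_*.csv is one flattened 28x28 4-band image, stored
  # band by band in {color}_{rowIdx}_{colIdx} order:
  # [R_0_0 ... R_27_27, G_0_0 ... G_27_27, B_0_0 ... B_27_27, I_0_0 ... I_27_27]

  def row_to_image(row):
      """Reshape one 3136-element row into a (28, 28, 4) image array."""
      bands = np.asarray(row).reshape(4, 28, 28)  # (band, row, col)
      return bands.transpose(1, 2, 0)             # (row, col, band)

  # The CSVs have no header, so they would be read with, e.g.,
  #   X = pd.read_csv('X_train.csv', header=None).values
  # Random data stands in here:
  rng = np.random.default_rng(0)
  X = rng.integers(0, 256, size=(5, 4 * 28 * 28))
  img = row_to_image(X[0])
  print(img.shape)  # (28, 28, 4)
  ```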

  • You may refer to others' code on Kaggle [3]. If you are interested, you may also modify your code or ipynb for the full dataset and submit it to Kaggle.

  • Submission Format

  • You have to submit a compressed file hw2_studentID.zip which contains the following files:

  • hw2_studentID.ipynb: detailed report, Python code, results, discussion, and mathematical descriptions;

  • hw2_studentID.tplx: extra LaTeX-related settings, including the bibliography;

  • hw2_studentID.bib: citations in BibTeX format;

  • hw2_studentID.pdf: the PDF version of your report, exported from your ipynb with

    1. %% jupyter nbconvert --to latex --template hw2_studentID.tplx hw2_studentID.ipynb

    2. %% pdflatex hw2_studentID.tex

    3. %% bibtex hw2_studentID

    4. %% pdflatex hw2_studentID.tex

    5. %% pdflatex hw2_studentID.tex

  • Other files or folders in a workable path hierarchy relative to your Jupyter notebook (ipynb).

  • Coding Guidelines

  • For the purpose of individual demonstration with the TA, you are required to create a function in your Jupyter notebook, as specified below, to reduce the data dimensionality, learn a classification model, and evaluate the performance of the learned model.

  • PipelineModel = hw2_studentID_demo(in_x, in_label, mode)

    – in_x: [string] CSV file for the data.

    – in_label: [string] None, or the CSV file for the labels of the corresponding instances in in_x.

    – mode: [string] 'train' for building models; 'test' for using the built model to evaluate performance.

  • This function should return the best model trained with cross-validation in your program. Also, set this pipeline model as a global variable. Please note that the HW2 demonstration will be graded based on the final ranking of accuracy. Every demonstration should be completed within the selected time slot.

  • If mode='train', please return a PipelineModel trained via cross-validation in your program. When mode='test', please dump the results to the following files:

    1. hw2_studentID_results.csv: save the predicted labels in the same format as the file assigned to in_label when mode='train'.

    2. hw2_studentID_performance.csv: the accuracy in percent as a plain float, without any extra string characters.
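
  One possible skeleton of this function is sketched below. The scaler–PCA–AdaBoost pipeline and the parameter grid inside are only illustrative choices, not the assignment's prescribed setup:

  ```python
  import numpy as np
  import pandas as pd
  from sklearn.decomposition import PCA
  from sklearn.ensemble import AdaBoostClassifier
  from sklearn.model_selection import GridSearchCV
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler

  PipelineModel = None  # the required global pipeline model

  def hw2_studentID_demo(in_x, in_label, mode):
      """in_x: data CSV; in_label: None or label CSV; mode: 'train'/'test'."""
      global PipelineModel
      X = pd.read_csv(in_x, header=None).values
      y = None
      if in_label is not None:
          # one-hot labels -> class indices 0/1/2
          y = pd.read_csv(in_label, header=None).values.argmax(axis=1)

      if mode == 'train':
          pipe = Pipeline([('scale', StandardScaler()),
                           ('pca', PCA()),
                           ('clf', AdaBoostClassifier())])
          # n_jobs is left unset on purpose; verbose shows progress.
          search = GridSearchCV(pipe,
                                {'pca__n_components': [10, 20],
                                 'clf__n_estimators': [50, 100]},
                                cv=5, verbose=1)
          search.fit(X, y)
          PipelineModel = search.best_estimator_
          return PipelineModel

      # mode == 'test': dump predictions and accuracy to the required files
      pred = PipelineModel.predict(X)
      one_hot = np.eye(3, dtype=int)[pred]  # back to the one-hot label format
      pd.DataFrame(one_hot).to_csv('hw2_studentID_results.csv',
                                   header=False, index=False)
      acc = 100.0 * float(np.mean(pred == y))
      with open('hw2_studentID_performance.csv', 'w') as f:
          f.write(str(acc))
      return PipelineModel
  ```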

  • Report Requirement

  • List names of packages used in your program;

  • Describe the pipeline combinations in your program;

  • Describe the cross-validation methods in your program;

  • For better explanation, draw flowcharts of the methods or procedures used in the program;

  • Describe the mathematical concepts of any new algorithms or models employed as well as the roles they play in your feature selection/extraction or classification task in Markdown cells [4];

  • Discuss the performance among different classifiers with/without feature selection/extraction.

  • Basic Requirement

  • Combine feature engineering and classifiers into pipelines [5]. In your program, the pipeline combinations should cover at least 3 different feature engineerings and 6 different classifiers, including bagging and AdaBoost. There will be more than one pipeline in your program. Some classifiers can be turned into feature engineerings; in that case, you might need SelectFromModel [6] to merge them as part of the feature engineering in your pipeline.
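
  Two such pipeline combinations might look like the sketch below; the specific transformers and classifiers are only examples, and the full assignment needs at least 3 feature engineerings and 6 classifiers overall:

  ```python
  from sklearn.decomposition import PCA
  from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                RandomForestClassifier)
  from sklearn.feature_selection import SelectFromModel
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler

  pipelines = {
      'pca_bagging': Pipeline([
          ('scale', StandardScaler()),
          ('pca', PCA(n_components=30)),          # feature extraction
          ('clf', BaggingClassifier(n_estimators=50)),
      ]),
      'select_adaboost': Pipeline([
          ('scale', StandardScaler()),
          # a classifier turned into feature engineering via SelectFromModel
          ('select', SelectFromModel(RandomForestClassifier(n_estimators=50))),
          ('clf', AdaBoostClassifier(n_estimators=100)),
      ]),
  }
  ```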

  • Apply a cross-validation method to find better parameter combinations for each pipeline. If you apply GridSearchCV and the program halts for a long time, please remove the n_jobs setting. In addition, you can set verbose to make sure your cross-validation is still running.
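
  A minimal GridSearchCV sketch over one pipeline (the PCA/SVC combination and the grid values are illustrative only):

  ```python
  from sklearn.decomposition import PCA
  from sklearn.model_selection import GridSearchCV
  from sklearn.pipeline import Pipeline
  from sklearn.svm import SVC

  pipe = Pipeline([('pca', PCA()), ('clf', SVC())])

  # Grid keys follow the '<step name>__<parameter>' convention.
  param_grid = {
      'pca__n_components': [10, 20],
      'clf__C': [0.1, 1.0, 10.0],
  }

  # n_jobs is deliberately left unset (parallel workers can hang);
  # verbose=1 prints progress so you can see the search is still running.
  search = GridSearchCV(pipe, param_grid, cv=5, verbose=1)
  # After search.fit(X, y): inspect search.best_params_, search.best_estimator_
  ```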

  • Please make sure hw2_studentID_demo is functional and returns a trained pipeline model with the highest accuracy when mode='train'.

  • If you apply new methods or use new packages to improve the classification performance, you have to give a brief introduction of the key concepts and provide the necessary citations/links, instead of just directly copy-pasting or importing.

  • Please submit your ‘report’ in English. Be aware that a ‘report’ is much more than a ‘program.’

  • References

  • [1] DeepSat (SAT-6) Airborne Dataset on Kaggle. https://www.kaggle.com/crawford/deepsat-sat6. Accessed: 2018-05-01.

  • [2] SAT-4 and SAT-6 airborne datasets. http://csc.lsu.edu/~saikat/deepsat/. Accessed: 2018-05-01.

  • [3] Others' code on Kaggle. https://www.kaggle.com/crawford/deepsat-sat6/kernels. Accessed: 2018-05-01.

  • [4] Markdown. https://daringfireball.net/projects/markdown/basics. Accessed: 2018-03-29.

  • [5] Pipeline and FeatureUnion: combining estimators. http://scikit-learn.org/stable/modules/pipeline.html. Accessed: 2018-05-07.

  • [6] sklearn.feature_selection.SelectFromModel. http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html. Accessed: 2018-05-07.
