
Aim
Please cluster the words across instances (documents) to find the keyword combinations in clusters/topics from the given dataset, using the various methods taught in class up to Chapter Eleven of the textbook. Beyond the necessary data preprocessing such as scaling and normalization, Homework #3 requires you to practice NLP and clustering. Please also summarize your observations from the clustering results. You may apply new methods or use new packages to improve the quality of clustering, but if you do so, you must give a brief introduction of the key concepts and provide the necessary citations, rather than just copying, pasting, or importing. However, in this assignment you are not allowed to use any neural-network-related models (e.g., multilayer perceptron). If any neural-network-related method is applied, you will receive no credit. Once an algorithm package is merged or imported into your code, please list the package link in your references, describe its mathematical concepts in your report, and give the reason for adopting it.

Dataset Description
Artificial intelligence has become a hot research area in machine learning. Since most of the research at Google is application oriented, we are interested in what kinds of
AI topics are being investigated in Google's published research. Here we offer the dataset,
Google AI published Research, which was crawled from https://ai.google/research/pubs/ [1].
In this dataset, we offer only the 'title' and 'abstract' fields, concatenated together.

Submission Format
You have to submit a compressed file hw3_studentID.zip which contains the following files:

hw3_studentID.ipynb: detailed report, Python code, results, discussion, and mathematical descriptions;

hw3_studentID.tplx: extra LaTeX-related settings, including the bibliography;

hw3_studentID.bib: citations in BibTeX format;

hw3_studentID.pdf: the PDF version of your report, exported from your ipynb with

%% jupyter nbconvert --to latex --template hw3_studentID.tplx hw3_studentID.ipynb

%% pdflatex hw3_studentID.tex

%% bibtex hw3_studentID

%% pdflatex hw3_studentID.tex

%% pdflatex hw3_studentID.tex

Other files or folders in a workable path hierarchy relative to your Jupyter notebook (ipynb).

Coding Guidelines
For the purpose of the individual demonstration with the TA, you are required to create a function in your Jupyter notebook, as specified below, to reduce the data dimensionality, learn a classification model, and evaluate the performance of the learned model.
hw3_student_ID_demo(in_x, in_label, mode)
- in_x: [string] CSV file for 'data'.
- mode: [string]
  mode='preprocessing' transforms the text instances into a tokenized word-count matrix $M \in \mathbb{R}^{D \times v}$, which represents the contents of D documents with v words. Each row represents a document instance and each column a selected word; the (i, j)th entry of M is the count of the jth selected word in the ith document. M can be computed via CountVectorizer. Please set M as a global variable and return M when mode='preprocessing'. In addition, please write these v words as a column, in the same index order as the columns of M, and record this column of words in HW3_studentID_words.csv with the header 'words'.
  mode='clustering' builds the models and dumps the clustering results and some clustering parameters.
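The preprocessing mode can be sketched roughly as follows. This is a minimal illustration rather than the reference solution: the helper name `build_count_matrix`, the toy documents, and the temporary output path are assumptions made for the example, and a real submission would read the documents from the CSV file passed as in_x.

```python
import csv
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer


def build_count_matrix(documents, words_csv):
    """Return the D x v count matrix M and the v selected words (hypothetical helper)."""
    vectorizer = CountVectorizer()
    M = vectorizer.fit_transform(documents).toarray()
    # vocabulary_ maps each word to its column index in M, so sorting by that
    # index lists the words in the same order as the columns of M.
    words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    with open(words_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["words"])
        writer.writerows([w] for w in words)
    return M, words


# Toy documents standing in for the concatenated title + abstract strings.
docs = [
    "deep clustering of text",
    "text clustering with topics",
    "topics of research",
]
out_path = os.path.join(tempfile.gettempdir(), "HW3_studentID_words.csv")
M, words = build_count_matrix(docs, out_path)
print(M.shape, words)
```

Options such as stop-word removal or a minimum document frequency are left at the CountVectorizer defaults here; choosing them is part of the preprocessing you are expected to justify in the report.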
In mode='clustering', please output the following CSV files with headers. In the following, avg_silhouette $\in [-1, 1]$ is the average of the silhouette scores over all v words in HW3_studentID_words.csv. Most of the methods below are based on subpackages of sklearn.
KMeans: Please transpose the matrix M.
File 1 'HW3_studentID_KMeans.csv' with header
avg_silhouette, n_clusters
n_clusters: is the n_clusters parameter
File 2 'HW3_studentID_KMeans_output.csv': for each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with 'NA'. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
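The row layout of the output file can be sketched in plain Python. The function name `top_words_per_cluster` and the toy inputs are illustrative assumptions; in practice the per-word silhouette values would come from something like `sklearn.metrics.silhouette_samples` applied to the transposed matrix and the cluster labels.

```python
from collections import defaultdict


def top_words_per_cluster(words, labels, sil_values, n_top=20, pad="NA"):
    """Return one row per cluster: its n_top words by silhouette value, padded with pad."""
    by_cluster = defaultdict(list)
    for word, label, sil in zip(words, labels, sil_values):
        by_cluster[label].append((sil, word))
    rows = []
    for label in sorted(by_cluster):
        # Highest silhouette value first.
        ranked = sorted(by_cluster[label], key=lambda pair: pair[0], reverse=True)
        top = [word for _, word in ranked[:n_top]]
        rows.append(top + [pad] * (n_top - len(top)))
    return rows


header = [f"word{i}" for i in range(20)]
rows = top_words_per_cluster(
    words=["topic", "cluster", "speech", "audio", "voice"],
    labels=[0, 0, 1, 1, 1],
    sil_values=[0.4, 0.7, 0.2, 0.9, 0.5],
)
for row in rows:
    print(",".join(row))
```

The same helper would serve all five methods, since each output file shares the word0 to word19 header.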
KMeans++: Please transpose the matrix M.
File 1 'HW3_studentID_KMeanspp.csv' with header
avg_silhouette, n_clusters
n_clusters: is the n_clusters parameter
File 2 'HW3_studentID_KMeanspp_output.csv': for each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with 'NA'. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
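In scikit-learn, plain KMeans and KMeans++ differ only in the `init` argument of the same `KMeans` class. The sketch below contrasts the two on synthetic data; the data, the seed, and `n_init=10` are assumptions for illustration, and in the assignment `X` would be the transposed matrix M.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs standing in for word vectors.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 3)),
    rng.normal(5.0, 0.1, size=(10, 3)),
])

# Plain KMeans: initial centroids drawn at random from the data.
kmeans_random = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)
# KMeans++: initial centroids spread out by the k-means++ seeding rule.
kmeans_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

print(kmeans_random.labels_)
print(kmeans_pp.labels_)
```

Because `init="k-means++"` is the scikit-learn default, the plain KMeans variant has to request `init="random"` explicitly for the two result files to differ.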
Fuzzy KMeans: Please transpose the matrix M.
File 1 'HW3_studentID_FKMeans.csv' with header
avg_silhouette, n_clusters, fuzzy_coeff, HW3_silhouette_thr
The notations are described later.
File 2 'HW3_studentID_FKMeans_output.csv': for each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with 'NA'. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
Agglomerative: Please transpose the matrix M.
File 1 'HW3_studentID_Agglomerative.csv' with header
avg_silhouette, n_clusters, affinity, linkage
n_clusters: is n_clusters;
affinity: is affinity;
linkage: is linkage;
File 2 'HW3_studentID_Agglomerative_output.csv': for each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with 'NA'. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
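A minimal sketch of the agglomerative step with per-word silhouette values is shown below. The toy word-by-document matrix is an assumption. Note that recent scikit-learn releases renamed the `affinity` argument to `metric`, so the sketch passes only `linkage` and relies on the default Euclidean distance; the value reported in the 'affinity' column would then be 'euclidean'.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples

# Toy transposed matrix: 6 words (rows) described by their counts in 4 documents.
X = np.array([
    [3, 3, 0, 0],
    [2, 3, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
], dtype=float)

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)

# One silhouette value per word; their mean is avg_silhouette.
sil = silhouette_samples(X, labels)
avg_silhouette = float(sil.mean())
print(labels, avg_silhouette)
```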
LatentDirichletAllocation (LDA): Please do NOT transpose the matrix M.
File 1 'HW3_studentID_LDA.csv' with header
avg_silhouette, n_clusters, learning_method, HW3_silhouette_thr
n_clusters: is n_components;
learning_method: is learning_method;
HW3_silhouette_thr: is the threshold for the new silhouette score designed for soft clustering labels. It will be explained later.
File 2 'HW3_studentID_LDA_output.csv': for each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with 'NA'. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
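One possible way to obtain soft word-level labels from LDA is sketched below. The assignment does not prescribe how the word-by-topic matrix is derived, so normalising the transposed `components_` array row by row is an assumption of this sketch, as are the toy count matrix and the seed.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy D x v count matrix (4 documents, 5 words); LDA is fitted on M without transposing.
M = np.array([
    [4, 3, 0, 0, 1],
    [5, 2, 0, 1, 0],
    [0, 0, 4, 3, 1],
    [0, 1, 5, 2, 0],
])

lda = LatentDirichletAllocation(n_components=2, learning_method="batch", random_state=0)
lda.fit(M)

# components_ has shape (k, v). Transposing gives one row per word; dividing each
# row by its sum yields soft labels that sum to one across topics (an assumption).
Y = lda.components_.T
Y = Y / Y.sum(axis=1, keepdims=True)
print(Y.round(3))
```

The resulting matrix has one row per word and one column per topic, which is the shape the soft silhouette score expects.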
Each method counts for 20% of the demonstration grade. Please note that every method should write 'HW3_studentID_{method}.csv' using the number of clusters with the highest average silhouette value, or the one at the elbow of the average silhouette value in the figure of avg_silhouette vs. n_clusters, including the corresponding 'n_clusters'. Note that
method $\in$ {LDA, Agglomerative, KMeans, KMeanspp, FKMeans}.
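Selecting the number of clusters by the highest average silhouette value might look like the sketch below. The synthetic data, the candidate range 2 to 6, and the temporary output path are assumptions; the elbow alternative would need a plot and is not shown.

```python
import csv
import os
import tempfile

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs standing in for word vectors.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0.0, 0.2, size=(15, 2)) for c in centers])

# Average silhouette value for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)

# Write the single-row summary file with the required header.
out_path = os.path.join(tempfile.gettempdir(), "HW3_studentID_KMeans.csv")
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["avg_silhouette", "n_clusters"])
    writer.writerow([scores[best_k], best_k])

print(best_k, scores[best_k])
```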
In this homework assignment and demonstration, we need to install the fuzzy KMeans package and add an extra function for evaluating the silhouette score of the soft clustering labels in Fuzzy KMeans and Latent Dirichlet Allocation.
For Fuzzy KMeans, please install sklearn_extensions [2] with
pip install sklearn_extensions --upgrade
and use it as follows:
from sklearn_extensions.fuzzy_kmeans import FuzzyKMeans
import numpy as np
# k is the number of clusters, m is the fuzzy coefficient
fuzzy_kmeans = FuzzyKMeans(k=n_clusters, m=fuzzy_coeff)
# fit on the transposed matrix so that each row is a word
fuzzy_kmeans_model = fuzzy_kmeans.fit(np.transpose(M))
# soft membership of each word in each cluster
soft_cluster_label = fuzzy_kmeans_model.fuzzy_labels_
For the fuzzy KMeans method,
- n_clusters: is k;
- fuzzy_coeff: is m;
- HW3_silhouette_thr: is the threshold for the new silhouette score designed for soft clustering labels. It will be explained later.
In order to evaluate the quality of LDA and Fuzzy KMeans across different numbers of topics/clusters, we estimate their silhouette score as follows. Assume there are v words and k topics/clusters. Let $Y \in \mathbb{R}^{v \times k}$ be the matrix of soft cluster labels, where $Y_{ij}$ is the membership of word $w_i$ in topic/cluster $j$ and each row of $Y$ sums to one. Let $s$ be the threshold used to select elements of $Y$:

$$
\tilde{Y}_{ij} =
\begin{cases}
Y_{ij}, & \text{if } Y_{ij} \ge s, \\
Y_{ij}, & \text{if } \max_{j} Y_{ij} < s, \\
0, & \text{otherwise.}
\end{cases}
$$

Note that when $\max_{j} Y_{ij} < s$, no cluster dominates the soft clustering labels of the word $w_i$, so we just keep its original values. Let $\hat{Y}$ be the normalized $\tilde{Y}$, in which each row sums to one. Then we can compute the parameters of the silhouette score for each word $w_i$, $i \in \{1, \dots, v\}$, under the jth topic/cluster as

$$
a_{ij} = \frac{\sum_{r=1}^{v} \hat{Y}_{rj}\, d(w_i, w_r)}{\sum_{r=1}^{v} \hat{Y}_{rj}},
\qquad
b_{ij} = \min_{\ell \in \{1, \dots, k\},\ \ell \neq j} \frac{\sum_{r=1}^{v} \hat{Y}_{r\ell}\, d(w_i, w_r)}{\sum_{r=1}^{v} \hat{Y}_{r\ell}},
$$

where $d(w_i, w_r)$ is the metric between word $w_i$ and word $w_r$.
Hence the silhouette score $\sigma_i$ for the word $w_i$ would be

$$
\sigma_i = \sum_{j=1}^{k} \hat{Y}_{ij}\, \frac{b_{ij} - a_{ij}}{\max\{b_{ij}, a_{ij}\}},
$$

which is also in the range $[-1, 1]$.
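The soft silhouette score defined above can be sketched in NumPy as follows. This is an illustrative reading of the formulas, not an official implementation: the function names, the choice of `>=` at the threshold boundary, and the toy one-dimensional word positions are assumptions. It also assumes k is at least 2 and that every cluster keeps some membership mass after thresholding.

```python
import numpy as np


def threshold_and_normalise(Y, s):
    """Zero out memberships below s (unless the whole row is below s), then renormalise rows."""
    Y = np.asarray(Y, dtype=float)
    # Keep entries at or above s; keep the whole row when no entry reaches s.
    keep = (Y >= s) | (Y.max(axis=1, keepdims=True) < s)
    Y_tilde = np.where(keep, Y, 0.0)
    return Y_tilde / Y_tilde.sum(axis=1, keepdims=True)


def soft_silhouette(Y, dist, s):
    """Per-word soft silhouette scores from (v, k) soft labels and a (v, v) distance matrix."""
    Y_hat = threshold_and_normalise(Y, s)
    # mean_dist[i, j] = sum_r Y_hat[r, j] * d(w_i, w_r) / sum_r Y_hat[r, j]
    mean_dist = (dist @ Y_hat) / Y_hat.sum(axis=0)
    a = mean_dist
    k = Y_hat.shape[1]
    # b[i, j] is the smallest weighted mean distance to any cluster other than j.
    b = np.stack(
        [np.delete(mean_dist, j, axis=1).min(axis=1) for j in range(k)], axis=1
    )
    denom = np.maximum(a, b)
    ratio = np.divide(b - a, denom, out=np.zeros_like(a), where=denom > 0)
    return (Y_hat * ratio).sum(axis=1)


# Toy example: four words placed on a line, two obvious groups.
positions = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(positions[:, None] - positions[None, :])
Y = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]])
scores = soft_silhouette(Y, dist, s=0.5)
print(scores, scores.mean())
```

With s = 0.5 every row of the toy `Y` collapses to a hard assignment, so the result reduces to a weighted variant of the usual silhouette; lowering s keeps more of the soft memberships in play.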

Report Requirement
List the names of the packages used in your program; provide a flowchart for the preprocessing.
Compare the results among the 5 methods; describe your observations and conclusions;
Describe the mathematical concepts of any new algorithms or models employed, as well as the roles they play in your feature selection/extraction or classification task, in Markdown cells [3].
5.1 Basic Requirement
Implement five methods after the preprocessing is finished.
Based on the average of the silhouette scores, decide the number of clusters. The grading will also refer to this value.
Please make sure hw3_student_ID_demo is functional and can output the required files in both mode='preprocessing' and mode='clustering'.
If you apply new methods or use new packages to improve the classification performance, you must give a brief introduction of the key concepts and provide the necessary citations/links, rather than just copying, pasting, or importing.
Please submit your 'report' in English. Be aware that a 'report' is much more than a 'program.'
References

[1] Google AI published research. https://ai.google/research/pubs/. Accessed: 2018-05-17.

[2] Fuzzy k-means from a third party (open source). http://wdm0006.github.io/sklearn-extensions/fuzzy_k_means.html. Accessed: 2018-05-22.

[3] Markdown. https://daringfireball.net/projects/markdown/basics. Accessed: 2018-03-29.