Write Python code to implement the algorithms below.
K-Means algorithm
Write the K-Means algorithm from scratch. The code will first read
the dataset and the number of clusters as inputs. Then it will
cluster the users based on the data points provided in that dataset.
As output, the code will return the clustering results,
following output format (a):
{"0": 0, "1": 0, "2": 1}
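A minimal from-scratch sketch of the K-Means loop, assuming the dataset is already loaded as a list of coordinate tuples (the function name and parameters here are illustrative, not required by the assignment):

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain K-Means; returns {point index (str): cluster index}."""
    rng = random.Random(seed)
    # Initialize centroids by sampling k distinct points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        # under squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return {str(i): l for i, l in enumerate(labels)}
```

This matches output format (a): keys are point indices as strings, values are cluster indices.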
Bradley-Fayyad-Reina (BFR) algorithm
Write the Bradley-Fayyad-Reina (BFR) algorithm from scratch.
Implement K-Means as the main-memory clustering algorithm that you
will use in BFR.
Load the data points from a file and process them with the BFR
algorithm. See the pseudocode below for a general idea.
In BFR, there are three sets of points that you need to keep
track of: the Discard set (DS), the Compression set (CS), and the
Retained set (RS).
Each cluster in the DS and CS is summarized by:
N: the number of points
SUM: the sum of the coordinates of the points
SUMSQ: the sum of squares of the coordinates
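The point of these three statistics is that the centroid and the per-dimension variance of a cluster can be recovered from them without keeping the points themselves. A sketch, with illustrative names (the representation as a `[N, SUM, SUMSQ]` list is a choice, not a requirement):

```python
def new_summary(dim):
    # A cluster summary: N, per-dimension SUM, per-dimension SUMSQ.
    return [0, [0.0] * dim, [0.0] * dim]

def add_point(summary, p):
    # Fold a point into the summary, then discard the point itself.
    summary[0] += 1
    for i, x in enumerate(p):
        summary[1][i] += x
        summary[2][i] += x * x

def centroid(summary):
    # Centroid is SUM / N in each dimension.
    n, s, _ = summary
    return [si / n for si in s]

def variance(summary):
    # Per-dimension variance: SUMSQ/N - (SUM/N)^2.
    n, s, sq = summary
    return [sqi / n - (si / n) ** 2 for si, sqi in zip(s, sq)]
```

The variance is what the distance test (e.g. Mahalanobis distance, per the appendix) is computed against when deciding whether a new point belongs to a DS or CS cluster.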
The conceptual steps of the BFR algorithm: please refer
to the slides.
The implementation details of the BFR algorithm: please
refer to Section 7 (Appendix).
Execution commands
Python command: $ python3 kmeans.py <input_file> <n_cluster> <out_file1>
Scala command: $ spark-submit --class kmeans hw5.jar <input_file> <n_cluster> <out_file1>
Python command: $ python3 bfr.py <input_path> <n_cluster> <out_file1> <out_file2>
Scala command: $ spark-submit --class bfr hw5.jar <input_path> <n_cluster> <out_file1> <out_file2>
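The positional arguments can be read straight from sys.argv; a sketch for kmeans.py (the helper name is illustrative):

```python
import sys

def parse_args(argv):
    """Parse kmeans.py arguments: <input_file> <n_cluster> <out_file1>."""
    # argv[0] is the script name; the three task arguments follow.
    input_file, n_cluster, out_file1 = argv[1], int(argv[2]), argv[3]
    return input_file, n_cluster, out_file1

if __name__ == "__main__":
    input_file, n_cluster, out_file1 = parse_args(sys.argv)
```

bfr.py takes one extra argument, <out_file2>, for the intermediate results.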
Output format
For both K-Means and BFR: write your
clustering results in JSON format (see Figure 2). The key is
the data point index (as a string). The value is its corresponding
cluster index (any order of cluster indices is fine).
Figure 2: An example of the output clustering results
For BFR only: output the intermediate results
in CSV format (see Figure 3). Each line represents the
following components in each iteration: "round id" (starting from
1), "the number of clusters in the discard set", "the total number
of the discarded points", "the number of clusters in the
compression set", "the total number of the compressed points", and
"the number of points in the retained set". We will mainly measure
the following criteria: the total number of rounds must equal
the number of data files/chunks in the
folder, and the total number of discarded
points should only increase with iterations.
Figure 3: An example of the intermediate results
Grading
We will use the normalized mutual information (NMI) score to
evaluate your clustering results. The NMI should
be above 0.8 for all the datasets. We
will also evaluate the intermediate results to ensure your BFR is
correctly processing the data points. We will grade your code with
2 cases for K-Means, and all 10 cases for BFR.
You are provided with text files to do the clustering.
1. Load the data points from one file.
2. Run a clustering algorithm, K-Means or an improved
version of K-Means (e.g. K-Means++), on the data points or a random
sample of the data points (in case your code cannot handle all the
data points in the first file). Initialize the algorithm with a
large number of centroids (e.g. 3 or 5 times K) and use
Euclidean distance as the similarity measurement.
3. Among the result clusters of step 2, move all the clusters that
contain only one or very few points (up to you) to RS as outliers.
You will now have two groups of data points: outlier data points and
non-outlier data points.
4. Run the clustering algorithm again to
cluster the non-outlier data points into K clusters. Use these K
clusters as your DS. Discard their points and generate the DS
statistics.
5. Run the clustering algorithm again to cluster the
outlier data points into a large number of c
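Step 3 above can be sketched independently of the clustering code: given a label per point, split the data into outliers (sent to RS) and non-outliers. The size threshold, like the function name, is illustrative; the cutoff is up to you:

```python
from collections import Counter

def split_outliers(points, labels, min_size=2):
    """Move points belonging to clusters smaller than min_size to RS."""
    sizes = Counter(labels)
    rs = [p for p, l in zip(points, labels) if sizes[l] < min_size]
    inliers = [p for p, l in zip(points, labels) if sizes[l] >= min_size]
    return inliers, rs
```

The non-outlier group then feeds step 4 (K clusters forming the DS), and the RS group feeds step 5.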