The mnist dataset is a collection of 55,000 images of handwritten digits. Each digit is a 28 × 28 pixel image, but each image is represented as a one-dimensional array of length 784 (i.e., 28 × 28), where each array entry is a floating-point number from 0 to 1.
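For concreteness, here is what that representation looks like in NumPy (the variable names below are our own, for illustration only, not the names used in the homework scripts):

import numpy as np

# hypothetical layout: one flattened image per row, values in [0.0, 1.0]
images = np.zeros((55000, 784))
first_digit = images[0].reshape(28, 28)  # recover the 28 × 28 pixel grid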
In this homework, we will explore a semi-supervised extension of the k-means algorithm. The k-means algorithm produces a clustering from a given dataset, and is considered an unsupervised learning algorithm. In a semi-supervised extension, we shall provide “hints” to the k-means algorithm, to help it produce an improved clustering.
Suppose that a data scientist runs the k-means algorithm on their dataset, and finds two clusters that they believe should have been joined into a single cluster. Alternatively, suppose that they find a single cluster that should have been split into two. As an example, if we run k-means on the mnist dataset, where we intentionally withhold the label of each digit, we may obtain 10 clusters as follows:

[figure: the 10 cluster-center images, shown in two rows of five]
The ninth image (the fourth image in the second row) appears to be a cluster that combines both the 3-digit and the 8-digit. Further, there seem to be multiple clusters representing the 9-digit (the first, fourth and tenth clusters).
A data scientist may want to provide some partial clustering information to help the k-means algorithm find better clusters. Such partial information could be in the form of a “must-link” constraint, which asserts that a pair of images should be assigned to the same cluster. It may also come in the form of a “must-not-link” constraint, which asserts that two examples should be assigned to two different clusters. When we take advantage of this type of partial information about clusters, we say that we are performing “semi-supervised” clustering, in contrast to the usual “unsupervised” clustering, where we do not have such partial information. This is also in contrast to “supervised” learning, where we are provided fully labeled data.
Consider the basic outline of a k-means clustering algorithm, without “must-link” constraints:

• pick k initial cluster centers (by picking k random images)
• for each iteration:
– e-step: given the cluster centers, (re-)assign each image to its closest cluster
– m-step: given the assignments of images to clusters, update the cluster centers

Consider the basic outline of a modified k-means clustering algorithm, which uses “must-link” constraints:

• pick k initial cluster centers (by picking k random “must-link” constraints)
• for each iteration:
– e-step: given the cluster centers, (re-)assign each “must-link” constraint to its closest cluster
– m-step: given the assignments of images to clusters, update the cluster centers
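To make these outlines concrete, here is a minimal NumPy sketch of the unconstrained loop (our own illustrative code, not the implementation in kmeans.py; the constrained variant differs only in the two places discussed next):

import numpy as np

def kmeans_sketch(images, k, iterations=20, seed=0):
    # pick k initial cluster centers by picking k random images
    rng = np.random.default_rng(seed)
    centers = images[rng.choice(len(images), size=k, replace=False)]
    for _ in range(iterations):
        # e-step: squared distance from every image to every center, shape (N, k)
        d2 = ((images ** 2).sum(axis=1, keepdims=True)
              - 2.0 * images @ centers.T
              + (centers ** 2).sum(axis=1))
        assignments = d2.argmin(axis=1)
        # m-step: move each center to the mean of its assigned images
        for j in range(k):
            members = images[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assignments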
As presented above, there are two main updates that we need to make to the k-means algorithm so that it can support “must-link” constraints.
1. First, in the unconstrained case we pick k initial cluster centers by randomly picking k images from the dataset. If we have “must-link” constraints, we can pick k different “must-link” constraints instead, each constraint representing its own cluster center. If a constraint contains multiple images that must belong to the same cluster, then the center (mean) of those images can serve as the initial center of a cluster.
2. Second, consider the e-step, where we assign each image to its closest cluster. If we have a “must-link” constraint, we know that every image in the constraint must be assigned to the same cluster. So instead of separately assigning each image to its own closest cluster, we can assign all of the images in the constraint to the same “closest” cluster, although we now have to define what the closest cluster is, given a set of images (one possible choice is sketched below).
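As a sketch of one possible definition (our own illustration: take the centroid of the constrained images as the set's representative, then pick the cluster center closest to it; the homework leaves this choice to you):

import numpy as np

def summarize_constraint(images, constraint):
    # one reasonable representative for a must-link set: the mean
    # (centroid) of the images whose IDs appear in the set
    ids = sorted(constraint)
    return images[ids].mean(axis=0)

def assign_constraint(images, centers, constraint):
    # assign every ID in the set to the single cluster whose center
    # is closest to the set's representative point
    rep = summarize_constraint(images, constraint)
    d2 = ((centers - rep) ** 2).sum(axis=1)
    return int(d2.argmin())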
In this homework, we will augment an implementation of the k-means algorithm to support “must-link” constraints. We will use the mnist dataset to explore the idea of semi-supervised learning. Normally, when we perform clustering, we have no “ground truth” clusters. However, we can use a dataset normally used for classification (such as mnist), where we have known labels (the digits), and then try to reproduce those known labels through clustering. This gives us a way to tell whether our “semi-supervised” clustering is working. In particular, we want to see how many “must-link” constraints we need before we obtain the desired clustering. In this homework, a “must-link” constraint is represented as a set of images from the data, and in particular, by their IDs (their index in the dataset).
Consider the following three images of the digit 8:

[figure: three images of a handwritten 8]
Suppose that the first image on the left is assigned the ID 200 by the dataset, and 205 and 215 are the IDs of the second and third image, respectively. The set of IDs {200, 205, 215} represents a “must-link” constraint: it asserts that the three above images (denoted by their IDs) must be assigned the same label in a clustering.
The mnist dataset has been included in this homework. Make sure you import the dataset into your python project. An mnist.py script has been included to read in the dataset, and to simulate “must-link” constraints. A kmeans.py script has been included, which implements a basic k-means algorithm. The goal of this homework is to augment this simple implementation so that it supports “must-link” constraints. A main.py script has been included, which shows you how to load the dataset, how to use the k-means clustering that is included, and how to visualize the results. In particular, images are saved to an output folder (make sure an output folder exists in your project).
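If you want to create the output folder programmatically (assuming it is literally named output and lives next to main.py), a one-time setup such as the following works:

import os
os.makedirs("output", exist_ok=True)  # no-op if the folder already exists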
In the script main.py, the following line simulates “must-link” constraints:

constraints = dataset.simulate_constraints(1000)

This function call will partition all IDs into sets, where each set is a collection of IDs that represents a “must-link” constraint. It is a partition of the IDs, in the sense that every ID appears in one and exactly one set of the returned list of sets. Some of these sets may contain a single ID, which means that particular ID is not constrained to be the same as any other ID. The argument of this function (1,000 in the above example) is (roughly) the number of constraints to generate (in terms of the number of IDs that are constrained to be the same). The higher this number, the more label information we give to the k-means clustering algorithm. When this number is 0, the problem is unconstrained, and k-means runs as normal. Note that the ID of an image corresponds to its index in the dataset, which ranges from 0 to 54,999 if there are 55,000 images in the dataset, or from 0 to N − 1 if the dataset is of size N.
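For example, assuming dataset has been loaded as in main.py, the following snippet illustrates the shape of the returned value (the sanity check is our own addition, not part of the homework code):

constraints = dataset.simulate_constraints(1000)
# constraints is a list of sets of IDs partitioning 0..N-1, e.g.
#   [{200, 205, 215}, {17}, {3, 9}, ...]
# singleton sets correspond to unconstrained images

# sanity check: every ID appears in exactly one set
all_ids = sorted(i for c in constraints for i in c)
assert all_ids == list(range(len(all_ids)))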
The main script that you need to run is main.py. The only script that you need to modify is kmeans.py. Using a python debugger (such as PyCharm’s debugger) is highly suggested for this project. In particular, the debugger will allow you to step through the code line-by-line interactively, so that you can get a sense of how the code is working. Submitting the kmeans.py script is not required.
In particular, you will need to modify the following two functions in kmeans.py:
• k_sets(): this function picks k of the constraint sets in the list of constraints, which are used to pick the initial cluster centers. By default, it picks the first k constraint sets appearing in the list. In the unconstrained case, this amounts to picking k arbitrary points in the training set to use as the cluster centers. However, this is likely to perform poorly in the constrained case. In particular, a bad selection of initial clusters may result in some cluster being unused (one possible alternative is sketched after this list).
• summarize_set(constraint): this function takes as input a set of IDs to constrain together. This set of constrained IDs must all be assigned to the same cluster, so we need to decide which cluster to assign all of these IDs to. The desired output of this function is a point that will serve as a representative when deciding which cluster center the set will be assigned to. That is, the cluster center that is closest to this representative point will be selected. By default, the first ID in the constraint is returned as the representative. In the unconstrained case, there is a single ID in the constraint set, and that single ID will be its own representative, which is the expected behaviour in the unconstrained k-means algorithm. However, this choice may not perform well in the constrained case.
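As one possible starting point for k_sets() (a sketch of one strategy among many, not the required answer), one could prefer the k largest constraint sets, since larger must-link sets carry more label information than arbitrary single images:

def k_sets(constraints, k):
    # pick the k largest constraint sets instead of the first k;
    # ties between equally sized sets are broken arbitrarily
    return sorted(constraints, key=len, reverse=True)[:k]

For summarize_set(constraint), the centroid representative sketched earlier (after update 2 above) is one natural replacement for the default first-ID choice.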
Questions: modify k_sets() and summarize_set() in kmeans.py, as described above.