PLEASE HELP MODIFY THE FUNCTION _k_set() And _summarize_set() • k sets(): this function picks k of the constraint sets i
Posted: Sun Jul 10, 2022 11:27 am
PLEASE HELP MODIFY THE FUNCTION _k_set() And_summarize_set()
• k sets(): this function picks k of the constraint sets in thelist of constraints to use in order to pick initial clustercenters. By default, it will pick the first k constraint setsappearing in the list. In the unconstrained case, this is basicallypicking k arbitrary points in the training set to use as thecluster centers. However, this is likely to perform poorly in theconstrained case. In particular, a bad selection of initialclusters may result in some cluster being unused.
• summarize set(constraint): this function takes as input a setof IDs to constrain together. This set of constrained IDs must allbe assigned to the same cluster. However, we need to decide how todecide which cluster to assign all these IDs to. The desired outputof this function is another point that will serve as arepresentative, when deciding which cluster center that the setwill be assigned to. That is, the cluster center that is closest tothis representative point will be selected. By default, the firstID in this constraint is returned as the representative. In theunconstrained case, there is a single ID in the constraint set, andthe single ID will be its own representative, which is the expectedbehaviour in the unconstrained k-means algorithm. However, thischoice may not perform well in the constrained case.
Questions: Answer the following questions regardingrunning k-means on this dataset.
1. Consider the implementation of the k sets() function, wherewe pick the initial cluster centers, from the list of constraints.When we have no “must-link” constraints, then it is typical to justselect k images from the dataset at random, and use the locationsas initial cluster centers. However, when we use “must-link”constraints, the “must-link” constraints provides some partialinformation about the clusters, which we could take advantage ofwhen selecting initial cluster centers. Consider what happens as weincrease the number of “must-link” constraints that we provide.With few constraints, most images are independent, and the onesthat are linked together are linked to a small number of otherimages. With many constraints, few images are independent, and theones that are linked together are linked to a large number of otherimages. Based on this observation, provide an argument that weshould pick the k-largest sets of linked images to serve as initialcluster centers.
2. Consider the implementation of the summarize set(constraint)function. In the e-step of the kmeans algorithm, given a set ofimages that are constrained to be in the same cluster, we need tofind a single cluster center to assign the entire set ofimages to. Normally, we would simply assign an image to its closestcluster. However, every image that is linked together by a“must-link” constraint should be assigned to the same cluster. Tosimplify this task, given a set of images that are linked togetherby a “must-link” constraint, we can somehow pick a single image torepresent the entire set. We find the cluster that the singlerepresentative image is closest to, and every image linked togetheris assigned to that same closest cluster. Propose a way to picksuch a representative image, and provide an argument for why itmight work well.
3. Once you have implemented the k sets() and summarizeset(constraint) functions, as discussed in the previous twoquestions, try running k-means clustering with 100 random“must-link” constraints. Include a picture of the 10 foundclusters. Do similarly with 1,000 constraints, 10,000 constraintsand 100,000 constraints (include a picture of the resultingclusters in each case). Do the 10 clusters found tend to become the10 expected digits as you increase the number of “must-link”constraints? You may want to try multiple random number seeds, tosee the effect of different runs.
4. Roughly, how many “must-link” constraints do we need untilk-means clustering finds the 10 expected digits as clusters? Is itguaranteed to find the 10 expected digits? Why or why not? What ifwe use every possible “must-link” constraint?
Source code in Python:
• k sets(): this function picks k of the constraint sets in thelist of constraints to use in order to pick initial clustercenters. By default, it will pick the first k constraint setsappearing in the list. In the unconstrained case, this is basicallypicking k arbitrary points in the training set to use as thecluster centers. However, this is likely to perform poorly in theconstrained case. In particular, a bad selection of initialclusters may result in some cluster being unused.
• summarize set(constraint): this function takes as input a setof IDs to constrain together. This set of constrained IDs must allbe assigned to the same cluster. However, we need to decide how todecide which cluster to assign all these IDs to. The desired outputof this function is another point that will serve as arepresentative, when deciding which cluster center that the setwill be assigned to. That is, the cluster center that is closest tothis representative point will be selected. By default, the firstID in this constraint is returned as the representative. In theunconstrained case, there is a single ID in the constraint set, andthe single ID will be its own representative, which is the expectedbehaviour in the unconstrained k-means algorithm. However, thischoice may not perform well in the constrained case.
Questions: Answer the following questions regardingrunning k-means on this dataset.
1. Consider the implementation of the k sets() function, wherewe pick the initial cluster centers, from the list of constraints.When we have no “must-link” constraints, then it is typical to justselect k images from the dataset at random, and use the locationsas initial cluster centers. However, when we use “must-link”constraints, the “must-link” constraints provides some partialinformation about the clusters, which we could take advantage ofwhen selecting initial cluster centers. Consider what happens as weincrease the number of “must-link” constraints that we provide.With few constraints, most images are independent, and the onesthat are linked together are linked to a small number of otherimages. With many constraints, few images are independent, and theones that are linked together are linked to a large number of otherimages. Based on this observation, provide an argument that weshould pick the k-largest sets of linked images to serve as initialcluster centers.
2. Consider the implementation of the summarize set(constraint)function. In the e-step of the kmeans algorithm, given a set ofimages that are constrained to be in the same cluster, we need tofind a single cluster center to assign the entire set ofimages to. Normally, we would simply assign an image to its closestcluster. However, every image that is linked together by a“must-link” constraint should be assigned to the same cluster. Tosimplify this task, given a set of images that are linked togetherby a “must-link” constraint, we can somehow pick a single image torepresent the entire set. We find the cluster that the singlerepresentative image is closest to, and every image linked togetheris assigned to that same closest cluster. Propose a way to picksuch a representative image, and provide an argument for why itmight work well.
3. Once you have implemented the k sets() and summarizeset(constraint) functions, as discussed in the previous twoquestions, try running k-means clustering with 100 random“must-link” constraints. Include a picture of the 10 foundclusters. Do similarly with 1,000 constraints, 10,000 constraintsand 100,000 constraints (include a picture of the resultingclusters in each case). Do the 10 clusters found tend to become the10 expected digits as you increase the number of “must-link”constraints? You may want to try multiple random number seeds, tosee the effect of different runs.
4. Roughly, how many “must-link” constraints do we need untilk-means clustering finds the 10 expected digits as clusters? Is itguaranteed to find the 10 expected digits? Why or why not? What ifwe use every possible “must-link” constraint?
Source code in Python: