Pick the problems that interest you and study them.
1. Rewrite the k-means clustering algorithm to use truly random selection of the initial k centroids.
2. Rewrite k-means clustering to use a separate reducer for each of the k cluster centers.
3. Generate test data to check how well the k-means algorithm scales. Use inputGen.py to expand the input.txt data set along several dimensions: more data points, more centroids, and more attributes. (inputGen.py currently writes two real numbers per line, i.e. two attributes per sample; increase that significantly.) Generate a data set large enough that one processor takes a significant amount of time, then run with 3 and more processors to find where the run time stops shrinking linearly. Is that limit different when you increase the number of attribute dimensions versus the number of data points?
4. Code up the canopy clustering algorithm outlined in class.
5. How would you combine canopy clustering and k-means?
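
As a starting point for problems 1 and 4, here is a minimal Python sketch, not a reference solution. It assumes the usual two-threshold form of canopy clustering (a loose distance t1 and a tight distance t2, with t1 > t2) and Euclidean distance; the version outlined in class may differ. The names random_initial_centroids, canopy_clusters, t1, and t2 are made up for illustration and do not come from the course code.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_initial_centroids(points, k, rng=None):
    """Pick k distinct data points uniformly at random as starting centroids.

    Hypothetical helper for problem 1; raises ValueError if k > len(points).
    """
    if rng is None:
        rng = random.Random()
    return rng.sample(list(points), k)


def canopy_clusters(points, t1, t2, distance=euclidean, rng=None):
    """Group points into (possibly overlapping) canopies.

    Assumed formulation: t1 is the loose threshold, t2 the tight one.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    if rng is None:
        rng = random.Random()

    remaining = list(points)
    canopies = []
    while remaining:
        # Take a random remaining point as the next canopy center.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within the loose threshold: p joins this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: p may still start or join
                # another canopy later.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

For example, canopy_clusters(points, t1=3.0, t2=1.0) on two well-separated pairs of points yields two canopies. One possible direction for problem 5, offered only as a hint, is to feed the canopy centers to k-means as initial centroids instead of the output of random_initial_centroids.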