Saved by mike@mbowles.com
on February 15, 2012 at 2:49:12 pm
 

Machine Learning on Big Data with MapReduce

Course objectives:  
Participants will learn to adapt and execute machine learning algorithms in the MapReduce framework.  Participants should finish the class able to author their own machine learning algorithms for MapReduce and to run them on Amazon Web Services.  Amazon is providing AWS credits for class participants. 


Participants will learn to use Python code to author mappers and reducers for “hadoop-streaming”.  For most of the class we will employ “mrjob” - an open-source framework developed at Yelp.  With mrjob, class members can program mappers and reducers in Python; the framework then submits the job to run locally without Hadoop, on Amazon Web Services, or on a private Hadoop cluster.  This will simplify the programming tasks.
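
To make the mapper/reducer idea concrete, here is a minimal word-count sketch in plain Python. It does not use mrjob or Hadoop; the function names (mapper, reducer, run_local) are hypothetical, and run_local only imitates the sort-and-group step that Hadoop performs between the map and reduce phases. Treat it as an illustration of the pattern, not as the code used in class.

```python
import itertools


def mapper(line):
    """Emit a (word, 1) pair for every word on one input line."""
    for word in line.strip().lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum all the counts that arrived under one key."""
    yield word, sum(counts)


def run_local(lines):
    """Imitate map -> sort/group by key -> reduce in a single process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    # Hadoop sorts mapper output by key before handing it to reducers.
    mapped.sort(key=lambda kv: kv[0])
    results = {}
    for word, group in itertools.groupby(mapped, key=lambda kv: kv[0]):
        for key, total in reducer(word, (count for _, count in group)):
            results[key] = total
    return results


if __name__ == "__main__":
    sample = ["big data big clusters", "data on clusters"]
    print(run_local(sample))
```

The mapper sees one record at a time and the reducer sees one key with all of its values, so neither needs the whole data set in memory. That restriction is what lets a framework spread the work across many machines.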

Schedule: Here's a tentative schedule to give a rough idea of what we intend to cover.  This may change somewhat to meet the interests of the class participants. 

 

Week/Date | Topic | Notes
Week 1 | Implementing Algorithms on Big Data |
Jan 19 | MapReduce, Hadoop Streaming, Mahout, Amazon (AWS, EMR) | Lecture 1, mrjob installation
Jan 25 | mrjob | Lecture 2
Week 2 | Clustering |
Jan 26 | k-means, Canopy Clustering | Lecture 3, 1st Exercises
Feb 1 | Gaussian Mixture Model - EM | Lecture 4, 2nd Exercises
Week 3 | Supervised Learning |
Feb 9 | Regularized Regression - glmnet algo for elasticnet | Lecture 5
Feb 15 | SVM - Pegasos algo for two-class and one-class, extensions | Lecture 6, Project Suggestions
Week 4 | Other ML Tasks |
Feb 16 | Text Mining & Recommender Systems | Lecture 7
Feb 22 | SVD methods, SVD on MapReduce, Lanczos algo | Lecture 8
Week 5 | Student Projects |
Feb 23 | | Lecture 9
Feb 29 | | Lecture 10

 



Prerequisites:
-Facility with undergrad-level math and stats (vector calculus, density functions, etc.)
-Comfort programming in basic Python (version 2.6 or 2.7, NOT version 3).
-Some familiarity with NumPy, which you'll need to develop (the "random" family of functions, matrix(), array())
-mrjob and boto installed (both are Python packages)
-Familiarity with basic machine learning.

 

Background Material:

 

Reference material for python

Here's a page with links to Python tutorials to help you learn Python:  python references.  DO NOT INSTALL Python VERSION 3; it has incompatibilities.  You can find Python at www.python.org

 

mrjob

Here's some installation help with mrjob:  mrjob installation.  Class members will be using a wide variety of operating systems and setups.  If you make discoveries about the process when you install, add the info to the mrjob installation page. 

 

Here's some general documentation on mrjob and a Google group devoted to it:  mrjob resources

 

Amazon Web Services

You'll need to sign up for AWS.  This page has step-by-step signup directions:  AWS

 

Ricky Ho's Blog

 

Ricky Ho (one of the members of our class) has put great explanations on his blog.  Have a look at them. 

http://horicky.blogspot.com/2011/04/k-means-clustering-in-map-reduce.html

http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

http://horicky.blogspot.com/2010/08/mapreduce-to-recommend-people.html

http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html

http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html
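
Independently of the posts linked above, here is a minimal sketch of how one k-means iteration can be phrased as a map step and a reduce step. It is plain Python run in a single process, all names (nearest, mapper, reducer, kmeans_step) are hypothetical, and it is not taken from the blog or from the class materials.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    def dist2(i):
        return sum((p - c) ** 2 for p, c in zip(point, centroids[i]))
    return min(range(len(centroids)), key=dist2)


def mapper(point, centroids):
    """Emit (centroid index, (point, 1)) for one data point."""
    yield nearest(point, centroids), (point, 1)


def reducer(index, values):
    """Average all points assigned to one centroid."""
    total, n = None, 0
    for point, count in values:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        n += count
    yield index, tuple(t / n for t in total)


def kmeans_step(points, centroids):
    """Run one map -> group by key -> reduce pass and return new centroids."""
    groups = defaultdict(list)
    for point in points:
        for key, value in mapper(point, centroids):
            groups[key].append(value)
    new_centroids = list(centroids)  # a centroid with no points keeps its old value
    for index, values in groups.items():
        for key, centroid in reducer(index, values):
            new_centroids[key] = centroid
    return new_centroids


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)]))
```

A full run would repeat kmeans_step until the centroids stop moving, passing the current centroids to every mapper on each pass.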

 

Registration:

Register for the class at:  http://machinelearningbigdata.eventbrite.com/

 

People have asked to attend this class remotely, so we've added a teleconference ticket on eventbrite.  We need signups for remote attendees at least one day before the event so we have time to communicate connection info.

 

Thank you to Amazon Web Services for sponsoring this class. 

 
