Lecture 1

Page history last edited by mike@mbowles.com 9 years, 8 months ago

Here are papers that we'll cover in class as introductory material: 


These slides give some motivation for specialized big-data algorithms, background on map reduce and on general techniques for converting machine learning algorithms to map-reduce, and run through an overview of some of the algorithms that we'll cover in more detail in the course of the class.  Machine Learning on Big Data - ClassIntro.pdf


If you've got a machine learning problem that's got a "big-data" aspect, consider approaching your problem with a policy of gradual escalation.  Here's how.  ApproachingMLonBigData.pdf




Here are some slides that one of the students from an earlier class session put together to help people understand map reduce.  Map-ReduceHackerDojo.ppt


This is the original paper describing the map reduce framework for processing large data sets.  The nomenclature invented in this paper is now in common usage.  The introduction to mrjob will assume familiarity with this info.  mapreduce-osdi04.pdf


Here's a paper outlining how a variety of machine learning algorithms can be fit into the map-reduce framework.  nips06-mapreducemulticore.pdf


Here are links to source code for the mean-variance example running on AWS directly: simple mean-var - aws directly

and running with mrjob: mean-variance using mrjob
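
The linked mapper.py and reducer.py are not reproduced on this page, so the following is only a minimal sketch of how a mean-variance computation can be split into a map step and a reduce step. It assumes one number per input line and a tab-separated intermediate record (count, sum, sum of squares). Both choices, and the function names, are assumptions for illustration rather than the course's actual code.

```python
def mapper(lines):
    """Summarize one chunk of input as a single 'count<TAB>sum<TAB>sum_sq' record."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x = float(line)  # assumed input format: one number per line
        n += 1
        total += x
        total_sq += x * x
    if n:
        yield "%d\t%r\t%r" % (n, total, total_sq)


def reducer(lines):
    """Combine the partial records from all mappers into (count, mean, variance)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # ignore malformed records
        n += int(parts[0])
        total += float(parts[1])
        total_sq += float(parts[2])
    if n == 0:
        return None
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return n, mean, variance


# Two mappers each see part of the data; one reducer merges their output.
partials = list(mapper(["1", "2", "3"])) + list(mapper(["4", "5"]))
print(reducer(partials))
```

In a streaming job each function would be wrapped in a script that reads sys.stdin and prints to stdout. The point of the sketch is that counts, sums, and sums of squares can be added across chunks, so no single machine has to see all the data. The sum-of-squares formula is simple but can lose precision when the mean is large compared with the spread of the values.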


Running simple mean-variance on AWS.


Once you have signed up for all the required services, click on the S3 tab.  We need to upload our files to S3 so that the AWS version of Hadoop can find them.


1. Create a new bucket.  You can think of a bucket as a drive.  Pick a name for your bucket; these notes use mvBucket.
2. Once you have created the bucket, create two new folders in it: mvCode to hold the Python code and mvInput to hold the input file.  The Python mapper and reducer files go in mvCode, and the input to our Hadoop job goes in mvInput.
3. Upload the file inputFile.txt to the folder mvInput in the bucket mvBucket.
4. Upload the files mapper.py and reducer.py to the folder mvCode in the bucket mvBucket.

Now that all the files are uploaded, we are ready to launch our first Hadoop job on multiple machines.


5. Click on the tab that says: Elastic MapReduce.  Next click on the button that says: “Create New Job Flow”.  Name the job flow meanVar001.  Below that are two radio buttons and a drop-down box.  Select the radio button that says: Run your own application.  On the drop-down menu that says: “Choose a Job Type” select: Streaming, then hit the continue button.


6. This step is where you give the input arguments to Hadoop.  It is very important that you get these settings correct; otherwise your job will fail.  Enter the following values in the fields (be sure to include the quotes):
Input Location*: mvBucket/mvInput/inputFile.txt
Output Location*: mvBucket/meanVar001Log
Mapper*: "python s3n://mvBucket/mvCode/mapper.py"
Reducer*: "python s3n://mvBucket/mvCode/reducer.py"
You can leave the Extra Args field blank.  This is where you would specify extra arguments, such as restricting the number of reducers.
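
To make the Mapper and Reducer fields less opaque, here is a rough local stand-in for what a streaming job does with the two commands: the mapper reads input lines and writes "key<TAB>value" lines, the framework sorts those lines by key, and the reducer reads the sorted lines. The helper run_streaming and the parity-counting mapper and reducer are hypothetical names made up for this illustration. They are not part of Hadoop, AWS, or the course files.

```python
from itertools import groupby


def key_of(line):
    # Streaming treats the text before the first tab as the key.
    return line.split("\t", 1)[0]


def run_streaming(mapper, reducer, input_lines):
    """Local stand-in for a streaming job: map, sort by key, then reduce."""
    mapped = list(mapper(input_lines))
    mapped.sort(key=key_of)  # the framework sorts mapper output by key
    return list(reducer(mapped))


def parity_mapper(lines):
    # Hypothetical mapper: tag each integer as "even" or "odd".
    for line in lines:
        line = line.strip()
        if line:
            n = int(line)
            yield "%s\t%d" % ("even" if n % 2 == 0 else "odd", n)


def count_reducer(sorted_lines):
    # Hypothetical reducer: lines with the same key arrive together,
    # so it can count each run of identical keys.
    for key, group in groupby(sorted_lines, key=key_of):
        yield "%s\t%d" % (key, sum(1 for _ in group))


result = run_streaming(parity_mapper, count_reducer, ["1", "2", "3", "4", "5"])
print(result)
```

Running a mapper and reducer through a small harness like this on a few lines of input is a cheap way to catch mistakes before paying for a cluster. The real job differs in that the mapper and reducer run as separate processes on several machines and exchange data through stdin and stdout.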


7. The next window is the Configure EC2 Instances window.  This is where you specify the number of servers that will crunch your data.  The default is two; you can change it to one.  You can also specify the type of EC2 instance you want to use.  A more powerful machine with more memory is available, but it will cost more.  In practice big jobs are usually run on “Large” (or better) instances.  Please refer to http://aws.amazon.com/ec2/#instance for more details.  For this trivial demonstration you can use one “small” machine.  Make sure you enable logging, and for the Amazon S3


Recording of Lecture 1


Clicking on the links below will launch a viewer that plays back the audio and desktop from the lecture.


Yelp Presentation Slides


Here's a copy of the slides that Sudarshan presented in class. 






Nag's Screen Shots

Nag (one of our students) has made available his screen shots for

1.  Setting up AWS account

2.  Running simple map reduce job directly on AWS




Thanks to Nag for making these available to us.


