There are 4 files in this folder.
1. python code to generate the input data (inputGen.py)
2. python code for the mapper (mapper.py)
3. python code for the reducer (reducer.py)
4. text file with input data generated by inputGen.py (inputFile.txt)
Upload these to AWS to run this directly.
1. Create a new bucket. You can think of a bucket as a drive. Pick a name for your bucket; in this example we use: simple-example
2. Once you have created the bucket, create two folders inside it: one to hold the Python code (mvCode) and one to hold the input file (mvInput). We are going to upload the Python mapper and reducer files to the folder mvCode; the folder mvInput will hold the input to our Hadoop job.
3. Upload the file inputFile.txt to the folder mvInput in the bucket: simple-example.
4. Upload the files mapper.py and reducer.py to the folder mvCode in the bucket: simple-example. Now that all the files are uploaded, we are ready to launch our first Hadoop job on multiple machines.
5. Click on the tab that says: "Elastic MapReduce". Next click on the button that says: "Create New Job Flow". Name the job flow "meanVar001". Below that are two radio buttons and a drop-down box. Select the radio button that says: "Run your own application". On the pull-down menu that says: "Choose a Job Type" select: "Streaming", then hit the continue button.
6. This step is where you give the input arguments to Hadoop. It is very important that you get these settings correct, otherwise your job will fail. Enter the values in the following fields (be sure to include the quotes):
Input Location: simple-example/mvInput/inputFile.txt
Output Location: simple-example/meanVar001Log
Mapper: "python s3n://simple-example/mvCode/mapper.py"
Reducer: "python s3n://simple-example/mvCode/reducer.py"
You can leave the "Extra Args" field blank; this is where you would specify extra arguments, such as restricting the number of reducers.
7. The next window is the Configure EC2 Instances window. This is where you specify the number of servers that will crunch your data. The default is 2; you can change it to 1. You can also specify the type of EC2 instance you want to use. A more powerful machine with more memory will run faster but cost more. In practice, big jobs are usually run on "Large" (or better) instances; please refer to http://aws.amazon.com/ec2/#instance for more details. For this trivial demonstration you can use one "Small" machine. Make sure you enable logging.
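Before launching (and paying for) the EMR job, it helps to check the mapper/reducer logic locally. The actual contents of mapper.py and reducer.py are not shown in this folder listing, so the sketch below is an assumed implementation of a streaming mean/variance job: the mapper emits a count, sum, and sum-of-squares for each input value under a single shared key, and the reducer aggregates those partials into a mean and variance. The function names and the "stats" key are illustrative assumptions:

```python
import sys

def map_line(line):
    """Mapper step: emit key <TAB> count <TAB> sum <TAB> sum_of_squares
    for one input value. A single shared key sends all records to
    one reducer group. (Key name is an illustrative assumption.)"""
    x = float(line.strip())
    return "stats\t1\t%f\t%f" % (x, x * x)

def reduce_lines(lines):
    """Reducer step: aggregate mapper output lines and return
    (mean, population_variance) using E[x^2] - E[x]^2."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for line in lines:
        key, cnt, s, sq = line.strip().split("\t")
        n += int(cnt)
        total += float(s)
        total_sq += float(sq)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance

if __name__ == "__main__":
    # mapper.py would stream sys.stdin through map_line, and
    # reducer.py would stream sys.stdin through reduce_lines;
    # here we chain them in-process as a local smoke test.
    mapped = [map_line(l) for l in sys.stdin if l.strip()]
    print(reduce_lines(mapped))
```

Piping a few lines of inputFile.txt through this script should print a plausible (mean, variance) pair; if it does, the streaming job on EMR has a much better chance of succeeding on the first try.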