Monday, November 23, 2015

Reduce Side Join - Hadoop MapReduce



Design Pattern - REDUCE Side Join 


Use a reduce-side join when you need to join two or more datasets and none of them is small enough to fit in memory, for example when both are comparably large.

Dataset to be used

File Name - customerDetails.txt

Name, CustomerId

Example -

Aaron Hawkins,296334
Aaron Smayling,814503
Adam Bellavance,960803
Adam Hart,157942
Adam Shillingsburg,713629
Adrian Barton,525624
Adrian Hane,434995
Adrian Shami,813495


File Name - customerTransaction.txt

 transaction details......  , Name , ......

Example - 

1,3,13/10/2010,Low,6,261.54,0.04,Regular Air,-213.25,38.94,35,Muhammed MacIntyre,Nunavut,Nunavut,Small Business,Office Supplies,Storage & Organization,"Eldon Base for stackable storage shelf, platinum",Large Box,0.8,20/10/2010
49,293,01/10/2012,High,49,10123.02,0.07,Delivery Truck,457.81,208.16,68.02,Barry French,Nunavut,Nunavut,Consumer,Office Supplies,Appliances,"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",Jumbo Drum,0.58,02/10/2012
50,293,01/10/2012,High,27,244.57,0.01,Regular Air,46.71,8.69,2.99,Barry French,Nunavut,Nunavut,Consumer,Office Supplies,Binders and Binder Accessories,"Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl",Small Box,0.39,03/10/2012
80,483,10/07/2011,High,30,4965.7595,0.08,Regular Air,1198.97,195.99,3.99,Clay Rozendal,Nunavut,Nunavut,Corporate,Technology,Telephones and Communication,R380,Small Box,0.58,12/07/2011
3866,27559,30/10/2011,High,38,465.9,0.05,Regular Air,79.34,12.28,4.86,Aaron Hawkins,Nova Scotia,Atlantic,Home Office,Office Supplies,Paper,Xerox 1933,Small Box,0.38,31/10/2011

Each mapper attaches a marker (tag) to the values it emits, so that the reducer can tell which mapper, and therefore which dataset, each value came from.

MapReduce Program :
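
One way to sketch the tagging idea is the plain-Java simulation below, which runs the map, shuffle, and reduce steps in memory instead of using the Hadoop API. The class name, the CUST and TXN tags, and the choice to emit a transaction count per customer are assumptions made for illustration. The customer name is read from field 11 (0-based) of each transaction row, which matches the sample rows above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
class ReduceSideJoinSketch {

    // Hypothetical markers that identify which "mapper" emitted a value.
    static final String CUSTOMER_TAG = "CUST";
    static final String TRANSACTION_TAG = "TXN";

    // Mapper for customerDetails.txt: each line is "Name,CustomerId".
    // Emits key = name, value = tag + customer id.
    static String[] mapCustomer(String line) {
        String[] f = line.split(",");
        return new String[] { f[0].trim(), CUSTOMER_TAG + "\t" + f[1].trim() };
    }

    // Mapper for customerTransaction.txt: the name is field 11 (0-based),
    // which comes before any quoted field, so a plain split is enough here.
    // Emits key = name, value = tag + first field of the row.
    static String[] mapTransaction(String line) {
        String[] f = line.split(",");
        return new String[] { f[11].trim(), TRANSACTION_TAG + "\t" + f[0].trim() };
    }

    // Reducer: receives every value for one name and uses the tag to tell
    // the two sources apart. Returns null when either side is missing
    // (inner join).
    static String reduce(String name, List<String> values) {
        String customerId = null;
        int transactionCount = 0;
        for (String value : values) {
            String[] parts = value.split("\t", 2);
            if (CUSTOMER_TAG.equals(parts[0])) {
                customerId = parts[1];
            } else {
                transactionCount++;
            }
        }
        if (customerId == null || transactionCount == 0) {
            return null;
        }
        return name + "," + customerId + "," + transactionCount;
    }

    // Simulates map -> shuffle (group by key) -> reduce in memory.
    static List<String> join(List<String> customers, List<String> transactions) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String line : customers) {
            String[] kv = mapCustomer(line);
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        for (String line : transactions) {
            String[] kv = mapTransaction(line);
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            String joined = reduce(e.getKey(), e.getValue());
            if (joined != null) {
                output.add(joined);
            }
        }
        return output;
    }
}
```

Calling join with the sample customer and transaction lines shown above would return a single record, Aaron Hawkins,296334,1, because Aaron Hawkins is the only name that appears in both samples. In a real job the two map methods would be separate Mapper classes and the grouping would be done by the framework's shuffle.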

Wednesday, November 18, 2015

Map Side Join - Hadoop MapReduce


Design Pattern - MAP Side Join 


Use a map-side join when one of your tables is small enough to fit in memory. Because the join is done in the mapper, it reduces the overhead of sorting and shuffling data to the reducers.

Prerequisites:

  • The data should be partitioned and sorted in a particular way.
  • Each input dataset should be divided into the same number of partitions.
  • Each input dataset must be sorted by the same key.
  • All the records for a particular key must reside in the same partition.
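
The partitioning requirements above can be illustrated with a small plain-Java sketch. The class and method names are hypothetical; the formula is the one used by Hadoop's default HashPartitioner. If both inputs are partitioned with the same function and the same number of partitions, every record for a given key ends up in the same partition number on both sides.

```java
// Plain-Java sketch of hash partitioning; no Hadoop classes are used.
class PartitionSketch {

    // Masks off the sign bit so the result is never negative, then takes
    // the remainder to pick a partition in the range [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```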

Dataset to be used

File Name - u.item 

u.item     -- Information about the items (movies); each line is a list of
              fields delimited by | (as in the sample rows below):
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

Example -

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

File Name - u.data

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of
              user id | item id | rating | timestamp.
              The time stamps are unix seconds since 1/1/1970 UTC.

Example - 

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596

MapReduce Program :
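
A minimal plain-Java sketch of the map-side join idea, assuming u.item is the table that fits in memory. It loads the movie titles into a HashMap, then streams the u.data lines and looks up each title directly, with no shuffle needed for the join itself. The class and method names are hypothetical, and the averaging step is an assumption inferred from the sample output further down; it does not use the Hadoop API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a map-side join; no Hadoop classes are used.
class MapSideJoinSketch {

    // Builds the in-memory lookup table from u.item lines:
    // movie id -> movie title (fields are delimited by '|').
    static Map<String, String> loadTitles(List<String> itemLines) {
        Map<String, String> titles = new HashMap<>();
        for (String line : itemLines) {
            String[] f = line.split("\\|");
            titles.put(f[0], f[1]);
        }
        return titles;
    }

    // Streams u.data lines (user id, item id, rating, timestamp separated
    // by whitespace), joins each rating to its title through the lookup
    // table, and averages the ratings per title.
    static Map<String, Double> averageRatingByTitle(Map<String, String> titles,
                                                    List<String> ratingLines) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (String line : ratingLines) {
            String[] f = line.trim().split("\\s+");
            String title = titles.get(f[1]);
            if (title == null) {
                continue; // no matching movie: skip the rating (inner join)
            }
            double[] acc = sumAndCount.computeIfAbsent(title, k -> new double[2]);
            acc[0] += Double.parseDouble(f[2]); // running sum of ratings
            acc[1] += 1;                        // number of ratings
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }
}
```

In a real job the lookup table would typically be loaded once per mapper, for example in its setup step, and the per-title averaging would be done by a reducer or combiner rather than in a single in-memory map.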

The output will look like this:

Til There Was You (1997) 2.3333333333333335
1-900 (1994) 2.4285714285714284
101 Dalmatians (1996) 2.8536585365853657
12 Angry Men (1957) 3.6048387096774195
187 (1997) 3.5224913494809686