rmr2 is an R package for writing scalable algorithms. It is a connector between R and Hadoop that lets you write MapReduce jobs using Hadoop Streaming. MapReduce is a low-level paradigm for writing algorithms that run in parallel. Scaling a data analysis or machine learning technique may require rewriting some of its steps in MapReduce. Some algorithms are easy to rewrite in MapReduce, others require some changes, and others can’t be parallelized at all.
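To make the paradigm concrete, here is a minimal sketch of the three phases (map, shuffle, reduce) in base R only. It does not use rmr2 or Hadoop. The toy input and the names chunks, map_chunk, mapped, grouped and word_counts are made up for illustration.

```r
# Toy input: two "chunks" of text, standing in for two input splits.
chunks <- list(c("a b a", "b c"), c("a c c"))

# Map: each chunk independently emits (word, count) pairs.
map_chunk <- function(lines) {
  counts <- table(unlist(strsplit(lines, split = " ")))
  data.frame(key = names(counts), val = as.vector(counts),
             stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(chunks, map_chunk))

# Shuffle: group the emitted values by key.
grouped <- split(mapped$val, mapped$key)

# Reduce: each key is combined independently of the other keys.
word_counts <- sapply(grouped, sum)
print(word_counts)
```

The map calls never share state and each reduce call sees only one key, which is what allows a framework such as Hadoop to distribute both phases across machines.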

Using a low-level tool like rmr2 requires a lot of testing and debugging. For this purpose, it’s possible to use a local computing context to test the algorithms in memory on a small data sample. The following example performs a word count on some sample text, then counts the rows of the iris dataset by species.

print(getwd())
## [1] "/Users/micheleusuelli/Documents/1career/2projects/2myarticles/rmr2"
source("rmr2_lib.R")
vector_lines <- readLines("../R OOP/R and OOP methods 06.txt")
input_val <- list(vector_lines[1:10], vector_lines[1:10 + 10])
input <- lapply(input_val, function(val){
  keyval(NULL, data.frame(val, stringsAsFactors = FALSE))
})
map_word_count <- function(., v) {
  # Split every line of the chunk into words and pool them in one vector
  words_tot <- unlist(strsplit(v[, 1], split = " "))
  # Emit each distinct word together with its count in this chunk
  table_words <- table(words_tot)
  keyval(names(table_words), as.vector(table_words))
}
reduce_count <- function(k, v) {
  # browser()
  count_tot <- sum(v)
  if(count_tot > 10){
    keyval(k, count_tot)
  }else{
    NULL
  }
}
map_iris_count <- function(., v) {
  table_species <- table(v[, "Species"])
  keyval(names(table_species), as.vector(table_species))
}
mr_output <- mapreduce(input = input,
                       map = map_word_count,
                       reduce = reduce_count)
data.frame(mr_output)
##     key val
## 1     a  15
## 2 class  11
## 3    of  11
## 4   the  17
mr_output <- mapreduce(input = iris,
                       map = map_iris_count,
                       reduce = reduce_count)
data.frame(mr_output)
##          key val
## 1     setosa  50
## 2 versicolor  50
## 3  virginica  50

The algorithm is split

We have been able to run and test rmr2 in memory. However, the algorithm is still a black box, in the sense that we don’t know what happened inside each step. In particular, we can’t see the input that each reduce step receives.

One solution is to interrupt the execution of the R code at the start of each reduce step using browser().
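As a rough sketch of this idea, the reduce function below pauses on request so that k and v can be inspected. It is a hypothetical, simplified variant of reduce_count. The name reduce_count_debug and the debug argument are assumptions, it returns a plain sum instead of a keyval, and it drops the count threshold so that it runs without rmr2.

```r
# Hypothetical variant of reduce_count with an opt-in debugging pause.
reduce_count_debug <- function(k, v, debug = FALSE) {
  # Pause only on request and only in an interactive session,
  # so that batch runs are never blocked.
  if (debug && interactive()) {
    browser()  # inspect k and v here, then type 'c' to continue
  }
  sum(v)
}

# Simulate the reduce phase locally: one call per key.
grouped <- list(a = c(2, 1), b = 2)
result <- mapply(reduce_count_debug, names(grouped), grouped)
print(result)
```

Calling the function with debug = TRUE from an interactive session stops at the first key. At that point, printing k and v shows exactly what the reduce step receives.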