rmr2 is an R package designed for writing scalable algorithms. It is a connector between R and Hadoop that allows writing MapReduce jobs through Hadoop Streaming. MapReduce is a low-level paradigm for writing algorithms that run in parallel. Scaling a data analysis or machine learning technique may require rewriting some of its steps in MapReduce. Some algorithms can easily be rewritten in MapReduce, others require some changes, and others can't be parallelized at all.
Using a low-level tool like rmr2 requires a lot of testing and debugging. For this purpose, rmr2 provides a local computing context that runs the job in memory on a small data sample. The following example shows how to perform a word count on some sample data.
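The script sourced at the beginning of the example is not shown in the article; a minimal setup with the same effect, assuming it simply loads rmr2 and switches to the local backend, would be:

# Sketch of the setup assumed to live in rmr2_lib.R: load the package and
# switch from the Hadoop backend to the in-memory local backend.
library(rmr2)
rmr.options(backend = "local")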
print(getwd())
## [1] "/Users/micheleusuelli/Documents/1career/2projects/2myarticles/rmr2"
source("rmr2_lib.R")
vector_lines <- readLines("../R OOP/R and OOP methods 06.txt")
input_val <- list(vector_lines[1:10], vector_lines[1:10 + 10])
input <- lapply(input_val, function(val){
keyval(NULL, data.frame(val, stringsAsFactors = FALSE))
})
# map step of the word count: split each line into words and emit a (word, count) pair
map_word_count <- function(., v) {
  words_by_row <- lapply(v[, 1], strsplit, split = " ")
  words_by_row <- lapply(words_by_row, "[[", 1)
  words_tot <- do.call("c", words_by_row)
  table_words <- table(words_tot)
  keyval(names(table_words), as.vector(table_words))
}
# reduce step: sum the counts of each key and keep only the keys occurring more than 10 times
reduce_count <- function(k, v) {
  # browser()
  count_tot <- sum(v)
  if (count_tot > 10) {
    keyval(k, count_tot)
  } else {
    NULL
  }
}
# map step of the iris example: count the rows of each species in the chunk
map_iris_count <- function(., v) {
  table_species <- table(v[, "Species"])
  keyval(names(table_species), as.vector(table_species))
}
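Since these are ordinary R functions, a quick sanity check, not part of the job itself, is to call them directly on a small in-memory sample. The snippet below is only an illustration; it assumes keyval is available, as it is exported by rmr2.

# Illustration only: call the map and reduce functions directly on toy inputs
# to see the key-value pairs they emit.
map_check <- map_word_count(NULL,
                            data.frame(val = vector_lines[1:10],
                                       stringsAsFactors = FALSE))
str(map_check)

reduce_check <- reduce_count("the", c(8, 9))  # 8 + 9 = 17 > 10, so a pair is emitted
reduce_check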
mr_output <- mapreduce(input = input,
                       map = map_word_count,
                       reduce = reduce_count)
data.frame(mr_output)
##     key val
## 1     a  15
## 2 class  11
## 3    of  11
## 4   the  17
mr_output <- mapreduce(input = iris,
                       map = map_iris_count,
                       reduce = reduce_count)
data.frame(mr_output)
##          key val
## 1     setosa  50
## 2 versicolor  50
## 3  virginica  50
The algorithm is split into a map and a reduce step.
We have been able to run and test rmr2 in-memory. However, the algorithm is still a black box, in the sense that we don't know what happens inside each step. Furthermore, we don't know what the input of the reduce step looks like.
A solution is to interrupt the execution of the R code at the beginning of each reduce step using browser().
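For instance, a variant of reduce_count with the browser() call enabled (named reduce_count_debug here just for illustration) stops the execution at every reduce call when the job runs in the local backend, so the key and the values can be inspected interactively:

reduce_count_debug <- function(k, v) {
  browser()  # pause here for every key processed by the reduce step
  count_tot <- sum(v)
  if (count_tot > 10) {
    keyval(k, count_tot)
  } else {
    NULL
  }
}

mapreduce(input = input, map = map_word_count, reduce = reduce_count_debug)

# At the Browse prompt it is possible to look at the reduce input, e.g. print
# k and v, then type c to continue to the next key or Q to quit the browser.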