Questions tagged [map-reduce]

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
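As a rough illustration of the model described above, here is a minimal single-process sketch of a word count in Python. It only mimics the map, shuffle, and reduce phases in memory; it is not Hadoop's API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, chosen for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key so that each key
    ends up in exactly one reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster, each document (or input split) would be mapped
    # on a different node; here everything runs sequentially in one process.
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

In a distributed setting the shuffle step is where data moves across the network, which is why frameworks try to run map tasks close to where the input is stored.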

28 questions
23
votes
3 answers

Nearest neighbors search for very high dimensional data

I have a big sparse matrix of users and items they like (in the order of 1M users and 100K items, with a very low level of sparsity). I'm exploring ways in which I could perform kNN search on it. Given the size of my dataset and some initial tests I…
12
votes
3 answers

Does Amazon RedShift replace Hadoop for ~1XTB data?

There is plenty of hype surrounding Hadoop and its eco-system. However, in practice, where many data sets are in the terabyte range, is it not more reasonable to use Amazon RedShift for querying large data sets, rather than spending time and effort…
trienism
11
votes
3 answers

Can map-reduce algorithms written for MongoDB be ported to Hadoop later?

In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to select from for implementing the required…
Amir Ali Akbari
8
votes
4 answers

Data science and MapReduce programming model of Hadoop

What are the different classes of data science problems that can be solved using the MapReduce programming model?
10land
7
votes
1 answer

Linear Regression in R Mapreduce(RHadoop)

I'm new to RHadoop and also to RMR... I had a requirement to write a MapReduce job in R MapReduce. I have tried writing it, but executing it gives an error while trying to read the file from HDFS. Error: Error in mr(map = map, reduce = reduce,…
user3782364
6
votes
3 answers

Why does map reduce have a shuffle step?

I'm looking at a diagram of map reduce where there is a map step, a shuffle step and then the reduce step. Why shuffle?
sebastianspiegel
6
votes
1 answer

How to make k-means distributed?

After setting up a two-node Hadoop cluster, understanding Hadoop and Python and based on this naive implementation, I ended up with this code: def kmeans(data, k, c=None): if c is not None: centroids = c else: centroids = [] …
gsamaras
5
votes
4 answers

Can all statistical algorithms be parallelized using a Map Reduce framework

Is it correct to say that any statistical learning algorithm (linear/logistic regression, SVM, neural network, random forest) can be implemented inside a Map Reduce framework? Or are there restrictions? I guess there may be some algorithms that is…
Victor
5
votes
3 answers

What technologies are fastest at performing joins on large datasets?

By "large", I mean in the range of 100m to 10b rows. I'm currently using both Hadoop MapReduce and Amazon RedShift. MapReduce has been a little disappointing here. Redshift works very well if the data is distributed well for the given query. Are…
Bill
4
votes
1 answer

Data produced as an output to Dumbo API of Python not getting distributed to all the nodes of cluster

On the node from which I run Dumbo commands, all the output files are produced on that same node. For example, suppose there is a node named hvs on which I ran the script: dumbo start matrix2seqfile.py -input…
4
votes
1 answer

Timing sequence in MapReduce

I'm running a test of a MapReduce algorithm in different environments, like Hadoop and MongoDB, and using different types of data. What are the different methods or techniques for measuring the execution time of a query? If I'm inserting a huge amount…
syed
2
votes
1 answer

Dataset map function error : TypeError: Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor

I am currently trying to write a script to create a TFRecord file. Therefore, I am following the instructions on the official TensorFlow website: https://www.tensorflow.org/tutorials/load_data/tfrecord#writing_a_tfrecord_file However, when applying…
toom
2
votes
3 answers

Can Hadoop be beneficial when data is in database tables and not in a file system

I work for a bank. Most of our data is in the form of database tables. Would we benefit from implementing Hadoop? I am under the impression that Hadoop is more for a distributed file system (unstructured data) as opposed to OLAP databases (Netezza).
Victor
2
votes
1 answer

Difference Between Hadoop Mapreduce(Java) and RHadoop mapreduce

I understand Hadoop MapReduce and its features, but I am confused about R MapReduce. One difference I have read about is that R uses as much RAM as it can, so R is integrated with Hadoop to perform parallel processing. My doubt is: R can do all stats, math and…
user3782364
2
votes
1 answer

Time Complexity notation in Big Data platforms

I am redesigning some of the classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for writing Big-O style expressions to measure time complexity. For example, hypothetically, a simple average…
Mohitt