Questions tagged [map-reduce]

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
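As a rough illustration of the model described above, here is a minimal single-process sketch of a word count split into map, shuffle, and reduce phases. It assumes nothing about Hadoop or any other framework: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and the cluster is simulated with ordinary Python loops.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or input split) would be mapped on or
    # near the node that stores it; here the nodes are simulated by a loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["map reduce map", "reduce shuffle map"])
print(counts)
```

In a distributed setting the map calls would run in parallel where the data lives, and only the grouped intermediate pairs would cross the network, which is where the data-locality benefit mentioned above comes from.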

28 questions

3 answers

Nearest neighbors search for very high dimensional data

I have a big sparse matrix of users and items they like (on the order of 1M users and 100K items, with very few nonzero entries). I'm exploring ways in which I could perform kNN search on it. Given the size of my dataset and some initial tests I…

asked Aug 14 '14 at 00:50

cjauvin

3 answers

Does Amazon Redshift replace Hadoop for ~1XTB data?

There is plenty of hype surrounding Hadoop and its ecosystem. However, in practice, where many data sets are in the terabyte range, is it not more reasonable to use Amazon Redshift for querying large data sets, rather than spending time and effort…

apache-hadoop map-reduce aws

asked Jun 11 '14 at 04:24

trienism

3 answers

Can map-reduce algorithms written for MongoDB be ported to Hadoop later?

In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to select from for implementing the required…

scalability apache-hadoop map-reduce mongodb

asked May 18 '14 at 12:03

Amir Ali Akbari

4 answers

Data science and MapReduce programming model of Hadoop

What are the different classes of data science problems that can be solved using the MapReduce programming model?

apache-hadoop map-reduce

asked Jul 28 '14 at 16:17

10land

1 answer

Linear Regression in R MapReduce (RHadoop)

I'm new to RHadoop and also to RMR. I have a requirement to write a MapReduce job in R. I tried writing one, but executing it gives an error while trying to read the file from HDFS. Error: Error in mr(map = map, reduce = reduce,…

machine-learning r apache-hadoop map-reduce

asked Jul 03 '14 at 10:49

user3782364

3 answers

Why does map reduce have a shuffle step?

I'm looking at a diagram of map reduce where there is a map step, a shuffle step and then the reduce step. Why shuffle?

map-reduce

asked Apr 05 '16 at 17:25

sebastianspiegel

1 answer

How to make k-means distributed?

After setting up a two-node Hadoop cluster, getting to grips with Hadoop and Python, and starting from this naive implementation, I ended up with this code: def kmeans(data, k, c=None): if c is not None: centroids = c else: centroids = [] …

python apache-hadoop k-means map-reduce distributed

asked Feb 06 '16 at 02:38

gsamaras

4 answers

Can all statistical algorithms be parallelized using a MapReduce framework?

Is it correct to say that any statistical learning algorithm (linear/logistic regression, SVM, neural network, random forest) can be implemented inside a MapReduce framework? Or are there restrictions? I guess there may be some algorithms that are…

machine-learning apache-hadoop map-reduce

asked Aug 26 '15 at 20:44

Victor

3 answers

What technologies are fastest at performing joins on large datasets?

By "large", I mean in the range of 100 million to 10 billion rows. I'm currently using both Hadoop MapReduce and Amazon Redshift. MapReduce has been a little disappointing here. Redshift works very well if the data is distributed well for the given query. Are…

bigdata performance map-reduce aws

asked Nov 09 '14 at 14:11

Bill

1 answer

Output produced by Python's Dumbo API is not distributed across all nodes of the cluster

All output files are produced on the node from which I run the Dumbo commands. For example, suppose I ran the script on a node named hvs: dumbo start matrix2seqfile.py -input…

bigdata python apache-hadoop map-reduce

asked Jun 27 '15 at 06:34

Harshvardhan Solanki

1 answer

Timing sequence in MapReduce

I'm testing a MapReduce algorithm in different environments, such as Hadoop and MongoDB, using different types of data. What methods or techniques can I use to measure the execution time of a query? If I'm inserting a huge amount…

efficiency map-reduce performance experiments

asked Dec 06 '14 at 17:56

syed

1 answer

Dataset map function error : TypeError: Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor

I am currently trying to write a script to create a TFRecord file. Therefore, I am following the instructions on the official TensorFlow website: https://www.tensorflow.org/tutorials/load_data/tfrecord#writing_a_tfrecord_file However, when applying…

tensorflow map-reduce

asked Jul 08 '20 at 13:41

toom

3 answers

Can Hadoop be beneficial when data is in database tables and not in a file system?

I work for a bank. Most of our data is in the form of database tables. Would we benefit from implementing Hadoop? I am under the impression that Hadoop is more suited to a distributed file system (unstructured data) than to OLAP databases (Netezza).

apache-hadoop databases map-reduce

asked Aug 26 '15 at 20:49

Victor

1 answer

Difference between Hadoop MapReduce (Java) and RHadoop MapReduce

I understand Hadoop MapReduce and its features, but I am confused about R MapReduce. One difference I have read about is that R uses as much RAM as it can, so R is integrated with Hadoop to perform parallel processing. My question is: R can do all stats, math and…

machine-learning r apache-hadoop map-reduce

asked Jun 27 '14 at 12:03

user3782364

1 answer

Time Complexity notation in Big Data platforms

I am redesigning some of the classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for writing Big-O-style expressions to measure time complexity. For example, hypothetically, a simple average…

bigdata algorithms map-reduce

asked May 07 '15 at 06:22

Mohitt
