Questions tagged [apache-hadoop]

Hadoop is an Apache open-source project that provides software for reliable, scalable, distributed computing, along with a variety of complementary sub-projects.

"Hadoop" typically refers to the software in the project that implements the MapReduce data-analysis framework, plus the distributed file system (HDFS) that underlies it.

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g., heatmaps) and for visually inspecting MapReduce, Pig, and Hive applications and diagnosing their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Storm, a system for real-time and stream processing.
  • ZooKeeper, a coordination service for distributed nodes, similar to Google's Chubby.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Flink, a fast and reliable large-scale data processing engine.
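The MapReduce model described above can be illustrated with a minimal, single-process sketch in plain Python. This is only a conceptual illustration of the map/shuffle/reduce phases, not Hadoop's actual Java or Streaming API; the function names are invented for the example:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit an intermediate (key, value) pair — here (word, 1) —
    # for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group intermediate pairs by key (the framework
    # normally does this across nodes) and sum the values per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(map_phase(lines))
print(result["the"])  # 3
```

In a real Hadoop job the mapper and reducer run on different machines, with HDFS providing the input splits and the framework handling the shuffle between the two phases.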


Commercial support is available from a variety of companies.

118 questions

40 votes, 10 answers

Do I need to learn Hadoop to be a Data Scientist?

An aspiring data scientist here. I don't know anything about Hadoop, but as I have been reading about Data Science and Big Data, I see a lot of talk about Hadoop. Is it absolutely necessary to learn Hadoop to be a Data Scientist?
asked by Pensu
34 votes, 5 answers

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to map-reduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for…
15 votes, 2 answers

What is the difference between Hadoop and noSQL

I have heard about many tools/frameworks that help people process their data (in a big data environment). One is called Hadoop and the other is the NoSQL concept. What is the difference in terms of processing? Are they complementary?
asked by рüффп
12 votes, 3 answers

Does Amazon RedShift replace Hadoop for ~1XTB data?

There is plenty of hype surrounding Hadoop and its eco-system. However, in practice, where many data sets are in the terabyte range, is it not more reasonable to use Amazon RedShift for querying large data sets, rather than spending time and effort…
asked by trienism
12 votes, 2 answers

Tradeoffs between Storm and Hadoop (MapReduce)

Can someone kindly tell me about the trade-offs involved when choosing between Storm and MapReduce in Hadoop Cluster for data processing? Of course, aside from the obvious one, that Hadoop (processing via MapReduce in a Hadoop Cluster) is a batch…
asked by mbbce
11 votes, 3 answers

What are R's memory constraints?

In reviewing "Applied Predictive Modeling", a reviewer states: One critique I have of statistical learning (SL) pedagogy is the absence of computation performance considerations in the evaluation of different modeling techniques. With its…
asked by blunders
11 votes, 3 answers

Can map-reduce algorithms written for MongoDB be ported to Hadoop later?

In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to select from for implementing the required…
asked by Amir Ali Akbari
8 votes, 4 answers

Data science and MapReduce programming model of Hadoop

What are the different classes of data science problems that can be solved using the MapReduce programming model?
asked by 10land
8 votes, 3 answers

Good books for Hadoop, Spark, and Spark Streaming

Can anyone suggest good books for learning Hadoop and MapReduce basics? Also something for Spark and Spark Streaming? Thanks
asked by tsar2512
8 votes, 1 answer

Cascaded Error in Apache Storm

Going through the presentation and material of Summingbird by Twitter, one of the reasons that is mentioned for using Storm and Hadoop clusters together in Summingbird is that processing through Storm results in cascading of error. In order to avoid…
asked by mbbce
7 votes, 1 answer

Linear Regression in R Mapreduce(RHadoop)

I'm new to RHadoop and also to RMR... I had a requirement to write a MapReduce job in R. I have tried writing it, but while executing it gives an error when trying to read the file from HDFS: Error in mr(map = map, reduce = reduce,…
asked by user3782364
7 votes, 2 answers

Lambda Architecture - How to implement the Merge Layer / Query Layer

I am reading up on the lambda architecture. It makes sense: we have queue-based data ingestion, an in-memory store for very recent data, and HDFS for old data, so we have our entire data set in our system. Very good. But the…
asked by Knows Not Much
6 votes, 3 answers

Is there a benefit to using hadoop with only one node?

I just started learning about Hadoop. From what I understand, its primary strength is its ability to distribute a task across many nodes. Is there any benefit to using Hadoop with only a single node, other than the potential of expanding to more…
asked by Eric Anastas
6 votes, 2 answers

Processing data stored in Redshift

We're currently using Redshift as our data warehouse, which we're very happy with. However, we now have a requirement to do machine learning against the data in our warehouse. Given the volume of data involved, ideally I'd want to run the…
asked by deanj
6 votes, 1 answer

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code: def kmeans(data, k, c=None): if c is not None: centroids = c else: centroids = [] …
asked by gsamaras