Questions tagged [apache-hadoop]

Hadoop is an Apache open-source project that provides software for reliable, scalable, distributed computing, along with a variety of complementary sub-projects.

"Hadoop" typically refers to the software in the project that implements the MapReduce data-analysis framework, plus the distributed file system (HDFS) that underlies it.

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g., heatmaps) and for visually inspecting MapReduce, Pig, and Hive applications and diagnosing their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Storm, a system for real-time and stream processing.
  • ZooKeeper, a coordination service for distributed nodes, similar to Google's Chubby.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Flink, a fast and reliable large-scale data processing engine.
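The MapReduce model described above can be illustrated with a minimal, single-process sketch in plain Python. This is only a conceptual illustration of the map/shuffle/reduce phases, not Hadoop's actual Java or Streaming API; the function names are invented for the example:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit an intermediate (key, value) pair — here (word, 1) —
    # for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group intermediate pairs by key (the framework
    # normally does this across nodes) and sum the values per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(map_phase(lines))
print(result["the"])  # 3
```

In a real Hadoop job the mapper and reducer run on different machines, with HDFS providing the input splits and the framework handling the shuffle between the two phases.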


Commercial support is available from a variety of companies.

118 questions

40 votes, 10 answers

Do I need to learn Hadoop to be a Data Scientist?

An aspiring data scientist here. I don't know anything about Hadoop, but as I have been reading about Data Science and Big Data, I see a lot of talk about Hadoop. Is it absolutely necessary to learn Hadoop to be a Data Scientist?
asked by Pensu
34 votes, 5 answers

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to map-reduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for…
15 votes, 2 answers

What is the difference between Hadoop and noSQL

I have heard about many tools/frameworks that help people process their data (in a big data environment). One is called Hadoop and the other is the NoSQL concept. What is the difference in terms of processing? Are they complementary?
asked by рüффп
12 votes, 3 answers

Does Amazon RedShift replace Hadoop for ~1XTB data?

There is plenty of hype surrounding Hadoop and its eco-system. However, in practice, where many data sets are in the terabyte range, is it not more reasonable to use Amazon RedShift for querying large data sets, rather than spending time and effort…
asked by trienism
12 votes, 2 answers

Tradeoffs between Storm and Hadoop (MapReduce)

Can someone kindly tell me about the trade-offs involved when choosing between Storm and MapReduce in Hadoop Cluster for data processing? Of course, aside from the obvious one, that Hadoop (processing via MapReduce in a Hadoop Cluster) is a batch…
asked by mbbce
11 votes, 3 answers

What are R's memory constraints?

In reviewing "Applied Predictive Modeling", a reviewer states: One critique I have of statistical learning (SL) pedagogy is the absence of computation performance considerations in the evaluation of different modeling techniques. With its…
asked by blunders
11 votes, 3 answers

Can map-reduce algorithms written for MongoDB be ported to Hadoop later?

In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to select from for implementing the required…
asked by Amir Ali Akbari
8 votes, 4 answers

Data science and MapReduce programming model of Hadoop

What are the different classes of data science problems that can be solved using the MapReduce programming model?
asked by 10land
8 votes, 3 answers

Good books for Hadoop, Spark, and Spark Streaming

Can anyone suggest good books for learning Hadoop and MapReduce basics? Also something for Spark and Spark Streaming? Thanks
asked by tsar2512
8 votes, 1 answer

Cascaded Error in Apache Storm

Going through the presentation and material of Summingbird by Twitter, one of the reasons that is mentioned for using Storm and Hadoop clusters together in Summingbird is that processing through Storm results in cascading of error. In order to avoid…
asked by mbbce
7 votes, 1 answer

Linear Regression in R Mapreduce(RHadoop)

I'm new to RHadoop and also to RMR... I had a requirement to write a MapReduce job in R. I have tried writing it, but while executing it gives an error when trying to read the file from HDFS: Error in mr(map = map, reduce = reduce,…
asked by user3782364
7 votes, 2 answers

Lambda Architecture - How to implement the Merge Layer / Query Layer

I am reading up on the lambda architecture. It makes sense: we have queue-based data ingestion, an in-memory store for very recent data, and HDFS for old data, so we have our entire data set in our system. Very good. But the…
asked by Knows Not Much
6 votes, 3 answers

Is there a benefit to using hadoop with only one node?

I just started learning about Hadoop. From what I understand, its primary strength is its ability to distribute a task across many nodes. Is there any benefit to using Hadoop with only a single node, other than the potential of expanding to more…
asked by Eric Anastas
6 votes, 2 answers

Processing data stored in Redshift

We're currently using Redshift as our data warehouse, which we're very happy with. However, we now have a requirement to do machine learning against the data in our warehouse. Given the volume of data involved, ideally I'd want to run the…
asked by deanj
6 votes, 1 answer

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code: def kmeans(data, k, c=None): if c is not None: centroids = c else: centroids = [] …
asked by gsamaras