Questions tagged [apache-spark]

Apache Spark is an open source cluster computing system, originally developed in the AMPLab at UC Berkeley, that aims to make data analytics fast — both fast to run and fast to write.

From http://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms. To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
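A minimal sketch of the in-memory workflow described above, from the PySpark shell (the path and filter terms are illustrative, and sc is the shell's SparkContext):

    # load once, cache in cluster memory, then query repeatedly
    lines = sc.textFile("hdfs:///logs/app.log")      # path is hypothetical
    errors = lines.filter(lambda l: "ERROR" in l)
    errors.cache()                                   # keep the filtered set in memory
    errors.count()                                   # first action materializes the cache
    errors.filter(lambda l: "timeout" in l).count()  # served from memory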

Recommended reference sources:

Spark Documentation

Learning Spark - Lightning-Fast Big Data Analysis

AMP Camp 5 (Berkeley, CA, November 20-21, 2014)

AMP Camp 4 (Strata Santa Clara, Feb 2014) — focus on BlinkDB, MLlib, GraphX, Tachyon

AMP Camp 3 (Berkeley, CA, Aug 2013)

AMP Camp 2 (Strata Santa Clara, Feb 2013)

AMP Camp 1 (Berkeley, CA, Aug 2012)

238 questions
35 votes, 6 answers

Merging multiple data frames row-wise in PySpark

I have 10 data frames pyspark.sql.dataframe.DataFrame, obtained from randomSplit as (td1, td2, td3, td4, td5, td6, td7, td8, td9, td10) = td.randomSplit([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1], seed = 100). Now I want to join 9 td's into a single…
krishna Prasad • 1,147
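A common answer pattern, sketched under the question's own names (td comes from the question; unionAll is named union() from Spark 2.0 onward):

    from functools import reduce
    from pyspark.sql import DataFrame

    # fold the first nine splits back together, row-wise
    splits = td.randomSplit([0.1] * 10, seed=100)
    merged = reduce(DataFrame.unionAll, splits[:9])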
34 votes, 5 answers

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to map-reduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for…
15 votes, 3 answers

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select…
the3rdNotch • 253
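The question is in Scala, but the same idea sketched in PySpark (the column name "avg" and the DataFrame df are assumptions):

    from pyspark.sql import functions as F

    mean_val = df.agg(F.mean("avg")).first()[0]                    # column mean
    n = df.count()
    top_10pct = df.orderBy(F.desc("avg")).limit(max(1, n // 10))   # top 10% by value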
15 votes, 3 answers

How to convert categorical data to numerical data in Pyspark

I am using an IPython notebook to work with pyspark applications. I have a CSV file with lots of categorical columns, to determine whether the income falls under or over the 50k range. I would like to perform a classification algorithm taking all the…
SRS • 1,065
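One common route is Spark ML's StringIndexer, which maps a categorical column to numeric indices (a sketch; the column names and df are assumptions, and OneHotEncoder can expand the indices afterwards):

    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol="workclass", outputCol="workclass_idx")
    indexed = indexer.fit(df).transform(df)   # categories become numeric indices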
14 votes, 3 answers

Replace all numeric values in a pyspark dataframe by a constant value

Consider a pyspark dataframe consisting of 'null' elements and numeric elements. In general, the numeric elements have different values. How is it possible to replace all the numeric values of the dataframe by a constant numeric value (for example…
justus • 141
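A sketch that overwrites every non-null value while leaving nulls intact (assumes df holds only numeric columns):

    from pyspark.sql import functions as F

    const = 1.0
    result = df.select([
        # when() without otherwise() leaves null rows null
        F.when(F.col(c).isNotNull(), F.lit(const)).alias(c)
        for c in df.columns
    ])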
12 votes, 3 answers

How do I set/get heap size for Spark (via Python notebook)

I'm using Spark (1.5.1) from an IPython notebook on a MacBook Pro. After installing Spark and Anaconda, I start IPython from a terminal by executing: IPYTHON_OPTS="notebook" pyspark. This opens a webpage listing all my IPython notebooks. I can…
Kai • 303
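The driver heap is fixed when the JVM launches, before any Python runs, so one workaround for a plain IPython process is to pass the setting through PYSPARK_SUBMIT_ARGS before creating the context (a sketch; the 4g value is illustrative and sc._conf is a private attribute):

    import os
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 4g pyspark-shell"

    from pyspark import SparkContext
    sc = SparkContext()
    print(sc._conf.get("spark.driver.memory"))   # read back the effective value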
12 votes, 3 answers

Issue with IPython/Jupyter on Spark (Unrecognized alias)

I am working on setting up a set of VMs to experiment with Spark before I go out and spend money on building a cluster with some hardware. Quick note: I am an academic with a background in applied machine learning and work quite a bit in…
gcd • 121
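Newer Jupyter versions reject the old --profile aliases; one workaround, assuming the third-party findspark package, is to start a plain notebook and bootstrap Spark from inside it:

    import findspark
    findspark.init()          # locates SPARK_HOME and puts pyspark on sys.path

    from pyspark import SparkContext
    sc = SparkContext(appName="notebook")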
12 votes, 1 answer

Spark ALS: recommending for new users

The question: How do I predict the rating for a new user in an ALS model trained in Spark? (New = not seen during training time.) The problem: I'm following the official Spark ALS tutorial…
ciri • 236
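One standard answer is the fold-in trick: fit the new user's factor vector by least squares against the trained item factors. A rough sketch (model is an MLlib MatrixFactorizationModel, the item IDs are illustrative, and regularization is ignored):

    import numpy as np

    item_factors = dict(
        (i, np.array(f)) for i, f in model.productFeatures().collect()
    )
    new_ratings = {42: 5.0, 7: 3.0}                 # item -> rating for the new user
    Y = np.vstack([item_factors[i] for i in new_ratings])
    r = np.array([new_ratings[i] for i in new_ratings])
    user_vec, *_ = np.linalg.lstsq(Y, r, rcond=None)
    predicted = item_factors[13].dot(user_vec)      # score an unseen item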
11 votes, 1 answer

PySpark dataframe repartition

What happens when we repartition a PySpark dataframe based on a column? For example, dataframe.repartition('id'). Does this move the data with the same 'id' to the same partition? How does the spark.sql.shuffle.partitions value affect the…
Nikhil Baby • 213
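What can be checked directly (dataframe comes from the question): repartition on a column hash-partitions the rows, so equal 'id' values land in the same partition, and with no explicit count the partition number defaults to spark.sql.shuffle.partitions:

    repartitioned = dataframe.repartition("id")    # same 'id' -> same partition
    print(repartitioned.rdd.getNumPartitions())    # spark.sql.shuffle.partitions by default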
11 votes, 1 answer

Calculate cosine similarity in Apache Spark

I have a DataFrame with the IDF of certain words computed. For example (10,[0,1,2,3,4,5],[0.413734499590671,0.4244680552337798,0.4761400657781007, 1.4004620708967006,0.37876590175292424,0.48374466516332]) .... and so on. Now given a query Q, I can…
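A plain-NumPy sketch of scoring a query against stored IDF vectors (the features column, df, and the toy query are assumptions):

    import numpy as np

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u.dot(v) / denom) if denom else 0.0

    q = np.zeros(10)
    q[0], q[3] = 0.41, 1.40    # toy query in the same 10-dim space
    scores = df.rdd.map(lambda row: cosine(row.features.toArray(), q))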
11 votes, 3 answers

When does cache get expired for a RDD in pyspark?

We use .cache() on an RDD for persistent caching of a dataset. My concern is: when will this cache expire? dt = sc.parallelize([2, 3, 4, 5, 6]) dt.cache()
krishna Prasad • 1,147
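A cached RDD does not expire on a timer: it lives until unpersist() is called, its blocks are evicted under memory pressure (LRU), or the application ends. Extending the question's snippet:

    dt = sc.parallelize([2, 3, 4, 5, 6])
    dt.cache()
    dt.count()        # first action actually materializes the cache
    dt.unpersist()    # explicit release; otherwise LRU eviction or app exit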
10 votes, 1 answer

Server log analysis using machine learning

I was assigned the task of analyzing the server logs of our application, which contain exception logs, database logs, event logs, etc. I am new to machine learning; we use Spark with Elasticsearch and Spark's MLlib (or PredictionIO). An example of the…
elric • 111
10 votes, 1 answer

Spark, optimally splitting a single RDD into two

I have a large dataset that I need to split into groups according to specific parameters. I want the job to process as efficiently as possible. I can envision two ways of doing so. Option 1 - create a map from the original RDD and filter: def…
j.a.gartner • 1,215
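Spark offers no single-pass split of one RDD into two (aside from randomSplit), so the usual advice matches the question's Option 1: cache the parent, then filter twice. A toy sketch:

    data = sc.parallelize(range(100))
    data.cache()                                   # read the source only once
    evens = data.filter(lambda x: x % 2 == 0)
    odds = data.filter(lambda x: x % 2 != 0)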
9 votes, 3 answers

How to run a pyspark application in windows 8 command prompt

I have a Python script written with a SparkContext and I want to run it. I tried to integrate IPython with Spark, but I could not do that. So, I tried to set the Spark path [Installation folder/bin] as an environment variable and called…
SRS • 1,065
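A minimal standalone script that avoids the IPython integration entirely; once %SPARK_HOME%\bin is on PATH it can be launched from the command prompt with spark-submit script.py (paths and names are assumptions):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("windows-test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())   # quick smoke test
    sc.stop()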
9 votes, 1 answer

Extracting individual emails from an email thread

Most of the open source datasets are well formatted, i.e., each email message is separated cleanly, like the Enron email dataset. But out in the real world it is highly difficult to separate a top email message from a thread of emails. For example, consider…
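A rough heuristic sketch that splits a raw thread on common reply markers; real threads vary widely, so this is a starting point, not a robust parser:

    import re

    # matches lines like "On ... wrote:", "From: ...", "--- Original Message ---"
    REPLY_MARKER = re.compile(
        r"^(?:On .+ wrote:|From: .+|-+ ?Original Message ?-+)\s*$",
        re.MULTILINE,
    )

    def split_thread(raw):
        parts = REPLY_MARKER.split(raw)
        return [p.strip() for p in parts if p.strip()]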