Questions tagged [apache-spark]

Apache Spark is an open source cluster computing system, originally developed in the AMPLab at UC Berkeley, that aims to make data analytics fast — both fast to run and fast to write.

From http://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms. To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
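A minimal sketch of the in-memory workflow described above, from the PySpark shell (the path and filter terms are illustrative, and sc is the shell's SparkContext):

    # load once, cache in cluster memory, then query repeatedly
    lines = sc.textFile("hdfs:///logs/app.log")      # path is hypothetical
    errors = lines.filter(lambda l: "ERROR" in l)
    errors.cache()                                   # keep the filtered set in memory
    errors.count()                                   # first action materializes the cache
    errors.filter(lambda l: "timeout" in l).count()  # served from memory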

Recommended reference sources:

Spark Documentation

Learning Spark - Lightning-Fast Big Data Analysis

AMP Camp 5 (Berkeley, CA, November 20-21, 2014)

AMP Camp 4 (Strata Santa Clara, Feb 2014) — focus on BlinkDB, MLlib, GraphX, Tachyon

AMP Camp 3 (Berkeley, CA, Aug 2013)

AMP Camp 2 (Strata Santa Clara, Feb 2013)

AMP Camp 1 (Berkeley, CA, Aug 2012)

238 questions
35 votes, 6 answers

Merging multiple data frames row-wise in PySpark

I have 10 data frames pyspark.sql.dataframe.DataFrame, obtained from randomSplit as (td1, td2, td3, td4, td5, td6, td7, td8, td9, td10) = td.randomSplit([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1], seed = 100). Now I want to join 9 td's into a single…
krishna Prasad • 1,147
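A common answer pattern, sketched under the question's own names (td comes from the question; unionAll is named union() from Spark 2.0 onward):

    from functools import reduce
    from pyspark.sql import DataFrame

    # fold the first nine splits back together, row-wise
    splits = td.randomSplit([0.1] * 10, seed=100)
    merged = reduce(DataFrame.unionAll, splits[:9])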
34 votes, 5 answers

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to map-reduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for…
15 votes, 3 answers

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select…
the3rdNotch • 253
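The question is in Scala, but the same idea sketched in PySpark (the column name "avg" and the DataFrame df are assumptions):

    from pyspark.sql import functions as F

    mean_val = df.agg(F.mean("avg")).first()[0]                    # column mean
    n = df.count()
    top_10pct = df.orderBy(F.desc("avg")).limit(max(1, n // 10))   # top 10% by value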
15 votes, 3 answers

How to convert categorical data to numerical data in Pyspark

I am using an IPython notebook to work with pyspark applications. I have a CSV file with lots of categorical columns, to determine whether the income falls under or over the 50k range. I would like to perform a classification algorithm taking all the…
SRS • 1,065
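One common route is Spark ML's StringIndexer, which maps a categorical column to numeric indices (a sketch; the column names and df are assumptions, and OneHotEncoder can expand the indices afterwards):

    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol="workclass", outputCol="workclass_idx")
    indexed = indexer.fit(df).transform(df)   # categories become numeric indices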
14 votes, 3 answers

Replace all numeric values in a pyspark dataframe by a constant value

Consider a pyspark dataframe consisting of 'null' elements and numeric elements. In general, the numeric elements have different values. How is it possible to replace all the numeric values of the dataframe by a constant numeric value (for example…
justus • 141
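A sketch that overwrites every non-null value while leaving nulls intact (assumes df holds only numeric columns):

    from pyspark.sql import functions as F

    const = 1.0
    result = df.select([
        # when() without otherwise() leaves null rows null
        F.when(F.col(c).isNotNull(), F.lit(const)).alias(c)
        for c in df.columns
    ])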
12 votes, 3 answers

How do I set/get heap size for Spark (via Python notebook)

I'm using Spark (1.5.1) from an IPython notebook on a MacBook Pro. After installing Spark and Anaconda, I start IPython from a terminal by executing: IPYTHON_OPTS="notebook" pyspark. This opens a webpage listing all my IPython notebooks. I can…
Kai • 303
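The driver heap is fixed when the JVM launches, before any Python runs, so one workaround for a plain IPython process is to pass the setting through PYSPARK_SUBMIT_ARGS before creating the context (a sketch; the 4g value is illustrative and sc._conf is a private attribute):

    import os
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 4g pyspark-shell"

    from pyspark import SparkContext
    sc = SparkContext()
    print(sc._conf.get("spark.driver.memory"))   # read back the effective value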
12 votes, 3 answers

Issue with IPython/Jupyter on Spark (Unrecognized alias)

I am working on setting up a set of VMs to experiment with Spark before I go out and spend money on building a cluster with some hardware. Quick note: I am an academic with a background in applied machine learning and work quite a bit in…
gcd • 121
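Newer Jupyter versions reject the old --profile aliases; one workaround, assuming the third-party findspark package, is to start a plain notebook and bootstrap Spark from inside it:

    import findspark
    findspark.init()          # locates SPARK_HOME and puts pyspark on sys.path

    from pyspark import SparkContext
    sc = SparkContext(appName="notebook")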
12 votes, 1 answer

Spark ALS: recommending for new users

The question: How do I predict the rating for a new user in an ALS model trained in Spark? (New = not seen during training time.) The problem: I'm following the official Spark ALS tutorial…
ciri • 236
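One standard answer is the fold-in trick: fit the new user's factor vector by least squares against the trained item factors. A rough sketch (model is an MLlib MatrixFactorizationModel, the item IDs are illustrative, and regularization is ignored):

    import numpy as np

    item_factors = dict(
        (i, np.array(f)) for i, f in model.productFeatures().collect()
    )
    new_ratings = {42: 5.0, 7: 3.0}                 # item -> rating for the new user
    Y = np.vstack([item_factors[i] for i in new_ratings])
    r = np.array([new_ratings[i] for i in new_ratings])
    user_vec, *_ = np.linalg.lstsq(Y, r, rcond=None)
    predicted = item_factors[13].dot(user_vec)      # score an unseen item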
11 votes, 1 answer

PySpark dataframe repartition

What happens when we repartition a PySpark dataframe based on a column? For example, dataframe.repartition('id'). Does this move the data with the same 'id' to the same partition? How does the spark.sql.shuffle.partitions value affect the…
Nikhil Baby • 213
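What can be checked directly (dataframe comes from the question): repartition on a column hash-partitions the rows, so equal 'id' values land in the same partition, and with no explicit count the partition number defaults to spark.sql.shuffle.partitions:

    repartitioned = dataframe.repartition("id")    # same 'id' -> same partition
    print(repartitioned.rdd.getNumPartitions())    # spark.sql.shuffle.partitions by default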
11 votes, 1 answer

Calculate cosine similarity in Apache Spark

I have a DataFrame with the IDF of certain words computed. For example (10,[0,1,2,3,4,5],[0.413734499590671,0.4244680552337798,0.4761400657781007, 1.4004620708967006,0.37876590175292424,0.48374466516332]) .... and so on. Now given a query Q, I can…
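A plain-NumPy sketch of scoring a query against stored IDF vectors (the features column, df, and the toy query are assumptions):

    import numpy as np

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u.dot(v) / denom) if denom else 0.0

    q = np.zeros(10)
    q[0], q[3] = 0.41, 1.40    # toy query in the same 10-dim space
    scores = df.rdd.map(lambda row: cosine(row.features.toArray(), q))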
11 votes, 3 answers

When does cache get expired for a RDD in pyspark?

We use .cache() on an RDD for persistent caching of a dataset. My concern is: when will this cache expire? dt = sc.parallelize([2, 3, 4, 5, 6]) dt.cache()
krishna Prasad • 1,147
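A cached RDD does not expire on a timer: it lives until unpersist() is called, its blocks are evicted under memory pressure (LRU), or the application ends. Extending the question's snippet:

    dt = sc.parallelize([2, 3, 4, 5, 6])
    dt.cache()
    dt.count()        # first action actually materializes the cache
    dt.unpersist()    # explicit release; otherwise LRU eviction or app exit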
10 votes, 1 answer

Server log analysis using machine learning

I was assigned the task of analyzing the server logs of our application, which contain exception logs, database logs, event logs, etc. I am new to machine learning; we use Spark with Elasticsearch and Spark's MLlib (or PredictionIO). An example of the…
elric • 111
10 votes, 1 answer

Spark, optimally splitting a single RDD into two

I have a large dataset that I need to split into groups according to specific parameters. I want the job to process as efficiently as possible. I can envision two ways of doing so. Option 1 - create a map from the original RDD and filter: def…
j.a.gartner • 1,215
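Spark offers no single-pass split of one RDD into two (aside from randomSplit), so the usual advice matches the question's Option 1: cache the parent, then filter twice. A toy sketch:

    data = sc.parallelize(range(100))
    data.cache()                                   # read the source only once
    evens = data.filter(lambda x: x % 2 == 0)
    odds = data.filter(lambda x: x % 2 != 0)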
9 votes, 3 answers

How to run a pyspark application in windows 8 command prompt

I have a Python script written with a SparkContext and I want to run it. I tried to integrate IPython with Spark, but I could not do that. So, I tried to set the Spark path [Installation folder/bin] as an environment variable and called…
SRS • 1,065
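A minimal standalone script that avoids the IPython integration entirely; once %SPARK_HOME%\bin is on PATH it can be launched from the command prompt with spark-submit script.py (paths and names are assumptions):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("windows-test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())   # quick smoke test
    sc.stop()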
9 votes, 1 answer

Extracting individual emails from an email thread

Most of the open source datasets are well formatted, i.e., each email message is separated cleanly, like the Enron email dataset. But out in the real world it is highly difficult to separate a top email message from a thread of emails. For example, consider…
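A rough heuristic sketch that splits a raw thread on common reply markers; real threads vary widely, so this is a starting point, not a robust parser:

    import re

    # matches lines like "On ... wrote:", "From: ...", "--- Original Message ---"
    REPLY_MARKER = re.compile(
        r"^(?:On .+ wrote:|From: .+|-+ ?Original Message ?-+)\s*$",
        re.MULTILINE,
    )

    def split_thread(raw):
        parts = REPLY_MARKER.split(raw)
        return [p.strip() for p in parts if p.strip()]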