Highest Voted 'scalability' Questions - Data Science Stack Exchange

94

votes

12 answers

How big is big data?

Lots of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore potential solutions must have good performance. Of course, big data always carry associated…

asked May 14 '14 at 03:56

Rubens

4,117
5
25
42

15

votes

4 answers

Data Science Tools Using Scala

I know that Spark is fully integrated with Scala. Its use case is specifically for large data sets. Which other tools have good Scala support? Is Scala best suited for larger data sets? Or is it also suited for smaller data sets?

scalability scala

asked Dec 10 '14 at 06:37

sheldonkreger

1,169
8
20

14

votes

4 answers

Looking for example infrastructure stacks/workflows/pipelines

I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. hadoop, mongodb/nosql, storm, kafka, ... I know that this is quite a wide range of tools used for different types, but I'd like to get to know…

machine-learning bigdata efficiency scalability distributed

asked Jun 17 '14 at 10:37

chrshmmmr

143
7

11

votes

3 answers

Can map-reduce algorithms written for MongoDB be ported to Hadoop later?

In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to select from for implementing the required…

scalability apache-hadoop map-reduce mongodb

asked May 18 '14 at 12:03

Amir Ali Akbari

1,393
3
13
25

10

votes

3 answers

How do various statistical techniques (regression, PCA, etc) scale with sample size and dimension?

Is there a known general table of statistical techniques that explains how they scale with sample size and dimension? For example, a friend of mine told me the other day that the computation time of simply quick-sorting one-dimensional data of size n…

bigdata statistics efficiency scalability

asked Aug 05 '14 at 18:36

Bridgeburners

229
1
7

9

votes

1 answer

Learning signal encoding

I have a large number of samples which represent Manchester encoded bit streams as audio signals. The frequency at which they are encoded is the primary frequency component when it is high, and there is a consistent amount of white noise in the…

machine-learning data-mining scalability algorithms feature-selection

asked Jun 18 '14 at 03:19

ragingSloth

1,854
3
14
15

8

votes

3 answers

How to compare experiments run over different infrastructures

I'm developing a distributed algorithm that, to improve efficiency, relies both on the number of disks (one per machine) and on an efficient load-balancing strategy. With more disks, we're able to reduce the time spent on I/O; and with an…

bigdata efficiency performance scalability distributed

asked Jun 15 '14 at 00:00

Rubens

4,117
5
25
42

6

votes

2 answers

How to best accomplish high speed comparison of like data?

I frequently attack this problem inefficiently because it's always pretty low on the priority list and my clients are resistant to change until things break. I would like some input on how to speed things up. I have multiple datasets of…

efficiency scalability sql

asked Jun 13 '14 at 10:57

Steve Kallestad

3,208
4
23
41

5

votes

3 answers

How can one quickly look up people from a large database?

Vocabulary. Face detection: Finding all faces in an image. Face representation: The simplest way to represent a face is as an image (pixels / color values). This is not very space efficient and likely makes follow-up tasks hard. Face embeddings are…

bigdata image-recognition search scalability

asked Apr 19 '19 at 10:16

Martin Thoma

19,540
36
98
170

4

votes

1 answer

LSTM Time series prediction for multiple multivariate series

I have to predict next-minute traffic for multiple cities (100+). I am thinking of using an LSTM. My main concern is how to scale with the number of cities. How does an LSTM learn the different amounts of traffic and other related features of all cities to predict…

time-series lstm scalability

asked Feb 20 '19 at 14:15

maggs

345
4
11

4

votes

2 answers

How to measure execution time on distributed system

I'm planning to run experiments with large datasets on a distributed system in order to evaluate efficiency gains in comparison with previous proposals. I have a limited number of machines (nearly ten), having 200 GB of free space on hard disk on…

bigdata scalability distributed

asked Jun 17 '14 at 05:55

Rubens

4,117
5
25
42

4

votes

1 answer

Use Cases of Neo4J and Spark GraphX

I have used Neo4J to implement a content recommendation engine. I like Cypher, and find graph databases to be intuitive. Looking at scaling to a larger data set, I am not confident Neo4J + Cypher will be performant. Spark has the GraphX project,…

scalability graphs neo4j

asked Dec 10 '14 at 22:47

sheldonkreger

1,169
8
20

3

votes

0 answers

Clustering large set of images

I've got some big datasets of images (a few million each), and I would like to cluster them according to images' visual similarities. I've extracted a feature vector for each image; the space of feature representations is the one I would like to…

machine-learning clustering bigdata apache-spark scalability

asked Jan 25 '21 at 09:44

Overloop

31
2

3

votes

1 answer

scikit-learn OMP mem error

I tried to use the OMP algorithm available in scikit-learn. My net data size, which includes both the target signal and the dictionary, is ~1G. However, when I ran the code, it exited with a memory error. The machine has 16G RAM, so I don't think this should have…

python bigdata feature-selection scikit-learn scalability

asked Nov 02 '14 at 11:36

sshanks

31
2

2

votes

1 answer

Where and how to do large scale supervised machine learning?

I'm a beginner in ML and I have a large dataset that has 15 features with 6M rows, so it becomes challenging to work on it locally. I can train one model locally, but to perform hyperparameter tuning and cross-validation with my MacBook Pro, it runs…

random-forest supervised-learning pyspark scalability cloud

asked Jun 02 '21 at 14:07

ro23

35
1
4
