Questions tagged [scalability]
33 questions
94
votes
12 answers
How big is big data?
Lots of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore potential solutions must have good performance. Of course, big data always carry associated…
Rubens
- 4,117
- 5
- 25
- 42
15
votes
4 answers
Data Science Tools Using Scala
I know that Spark is fully integrated with Scala. Its use case is specifically for large data sets. Which other tools have good Scala support? Is Scala best suited for larger data sets? Or is it also suited for smaller data sets?
sheldonkreger
- 1,169
- 8
- 20
14
votes
4 answers
Looking for example infrastructure stacks/workflows/pipelines
I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. hadoop, mongodb/nosql, storm, kafka, ... I know that this is quite a wide range of tools used for different types, but I'd like to get to know…
chrshmmmr
- 143
- 7
11
votes
3 answers
Can map-reduce algorithms written for MongoDB be ported to Hadoop later?
In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to select from for implementing the required…
Amir Ali Akbari
- 1,393
- 3
- 13
- 25
10
votes
3 answers
How do various statistical techniques (regression, PCA, etc) scale with sample size and dimension?
Is there a known general table of statistical techniques that explains how they scale with sample size and dimension? For example, a friend of mine told me the other day that the computation time of simply quick-sorting one-dimensional data of size n…
Bridgeburners
- 229
- 1
- 7
9
votes
1 answer
Learning signal encoding
I have a large number of samples which represent Manchester encoded bit streams as audio signals. The frequency at which they are encoded is the primary frequency component when it is high, and there is a consistent amount of white noise in the…
ragingSloth
- 1,854
- 3
- 14
- 15
8
votes
3 answers
How to compare experiments run over different infrastructures
I'm developing a distributed algorithm, and to improve efficiency, it relies both on the number of disks (one per machine), and on an efficient load balance strategy. With more disks, we're able to reduce the time spent with I/O; and with an…
Rubens
- 4,117
- 5
- 25
- 42
6
votes
2 answers
How to best accomplish high speed comparison of like data?
I frequently attack this problem inefficiently because it's always pretty low on the priority list and my clients are resistant to change until things break. I would like some input on how to speed things up.
I have multiple datasets of…
Steve Kallestad
- 3,208
- 4
- 23
- 41
5
votes
3 answers
How can one quickly look up people from a large database?
Vocabulary
Face detection: Finding all faces in an image.
Face representation: The simplest way to represent a face is as an image (pixels / color values). This is not very space efficient and likely makes follow-up tasks hard. Face embeddings are…
Martin Thoma
- 19,540
- 36
- 98
- 170
4
votes
1 answer
LSTM Time series prediction for multiple multivariate series
I have to predict the next minute's traffic for multiple cities (100+). I am thinking of using an LSTM. My main concern is how to scale with the number of cities. How does an LSTM learn the different amounts of traffic and other related features of all cities to predict…
maggs
- 345
- 4
- 11
4
votes
2 answers
How to measure execution time on distributed system
I'm planning to run experiments with large datasets on a distributed system in order to evaluate efficiency gains in comparison with previous proposals.
I have a limited number of machines, nearly ten, each having 200 GB of free space on hard disk on…
Rubens
- 4,117
- 5
- 25
- 42
4
votes
1 answer
Use Cases of Neo4J and Spark GraphX
I have used Neo4J to implement a content recommendation engine. I like Cypher, and find graph databases to be intuitive.
Looking at scaling to a larger data set, I am not confident Neo4J + Cypher will be performant. Spark has the GraphX project,…
sheldonkreger
- 1,169
- 8
- 20
3
votes
0 answers
Clustering large set of images
I've got some big datasets of images (a few million each), and I would like to cluster them according to images' visual similarities. I've extracted a feature vector for each image; the space of feature representations is the one I would like to…
Overloop
- 31
- 2
3
votes
1 answer
scikit-learn OMP mem error
I tried to use the OMP algorithm available in scikit-learn. My net data size, which includes both the target signal and the dictionary, is ~1G. However, when I ran the code, it exited with a memory error.
The machine has 16G RAM, so I don't think this should have…
sshanks
- 31
- 2
2
votes
1 answer
Where and how to do large scale supervised machine learning?
I'm a beginner in ML and I have a large dataset that has 15 features with 6M rows, so it becomes challenging to work on it locally. I can train one model locally, but to perform hyperparameter tuning and cross-validation with my MacBook Pro, it runs…
ro23
- 35
- 1
- 4