Questions tagged [big-data]

27 questions
8
votes
4 answers

Applying algorithms on large data

Is there any book or tutorial that teaches us how to efficiently apply the common algorithms (sorting, searching, etc.) on large data (i.e. data that cannot be fully loaded into main memory) and how to efficiently apply those algorithms considering…
Arani
  • 523
  • 4
  • 11
4
votes
1 answer

What is the difference between FLOPS and OPS?

I have typically heard computer performance discussed in terms of FLOPS. However, I have recently seen multiple references that instead use OPS, i.e. operations per second, typically in the context of Big Data. What is the difference between FLOPS and…
user1887919
  • 141
  • 1
  • 3
3
votes
4 answers

If I have a large random array of 0s and 1s that I want to sort, what kind of algorithm and data structures should I consider?

What kinds of things need to be considered if I need to sort a large random array of 0s and 1s? You can assume the array has on the order of millions or billions of elements. I understand there are tons of sorting algorithms out there (quick,…
user1068636
  • 217
  • 3
  • 6
3
votes
2 answers

Bloom Filter for 208 million URLs

I need to create a Bloom filter of 208 million URLs. What would be a good choice of bit vector size and number of hash functions? I tried a bit vector of size 1 GB and 4 hash functions, but it resulted in too many false positives while reading. I…
3
votes
4 answers

High Dimensional Data Structures

I have a 20-dimensional dataset with a large number of data points. I would like to have each dimension discretized into bins. Per bin, I would like to be able to access two neighbours per dimension (i.e. +1 and -1 per dimension). Basically I want…
danielvdende
  • 133
  • 2
3
votes
2 answers

How do x-fast and y-fast tries compare to B-trees in performance?

I learned that y-fast tries support amortized O(log log u) time insertions and deletions, and O(log log u) time membership, successor, and predecessor operations, with O(n) space. So when n is close to u in dense big data environments, y-fast tries seem…
thambi
  • 125
  • 9
3
votes
2 answers

Semi-streaming algorithm for $s$-$t$ connectivity

Let $G=(V,E)$ be an undirected graph. Given a pair of vertices $s,t \in V$, how can we construct a semi-streaming algorithm which determines whether $s$ and $t$ are connected? Is there any way to construct such an algorithm which scans the input stream…
KaliTheGreat
  • 181
  • 7
2
votes
1 answer

(a,b)-tree vs B-tree

I would like to know what the differences are between an (a,b)-tree and a B-tree. I have been studying different papers for a few days, and I am seeing different definitions that confuse me. For example, in External Memory geometric Data Structure…
M a m a D
  • 1,561
  • 2
  • 18
  • 33
2
votes
0 answers

Examples of real world graphs that are too big for a single commodity-type machine

I've been reading about distributed systems for processing large graphs. The most prominent examples include Pregel (developed by Google) and Apache Giraph. Most of these systems justify their existence by saying they are for "big data" processing, i.e.,…
2
votes
2 answers

Hashing by doing modulo $m$ for $m=p^2$ for a prime $p$ instead of using a prime $m$ - is it that bad?

I am doing an exercise from a Big Data course I'm taking on Coursera (this exercise is for experimenting with a big-data problem and is not for any credit or homework). The assignment was described briefly: Your task is to quickly find the number…
Belgi
  • 267
  • 1
  • 9
2
votes
0 answers

Looking for dynamic network data sets

There are a number of collections of network (or graph) data sets freely available on the web, e.g. http://snap.stanford.edu/data/index.html and http://www.cc.gatech.edu/dimacs10/downloads.shtml . I am looking for dynamic network data sets, i.e.…
cls
  • 21
  • 2
1
vote
0 answers

Clarification on MapReduce description in textbook

I am reading through chapter 2 of the free textbook "Mining of Massive Datasets" (http://www.mmds.org/). On page 28 the following is stated: "It is reasonable to create one Map task for every chunk of the input file(s), but we may wish to create…
ClownInTheMoon
  • 323
  • 2
  • 9
1
vote
1 answer

Why are split text files bigger than one large file with the same content?

I have a large text file that is about 2 GB when unzipped. I split it into multiple (more than 5 million) files, and now I have a folder of about 20 GB. How is this possible?
1
vote
0 answers

What are the internal clustering indices for binary data? And are any applicable to massive clusters?

I was wondering what internal clustering indices currently exist for binary data. I already know the silhouette and Davies-Bouldin indices for Euclidean space; I suppose they work as well in binary space using, for example, Hamming distance. Tell me if…
KyBe
  • 235
  • 3
  • 9
1
vote
1 answer

Taking intersection in large search

As I understand it, you can build the word -> pages index in Google or a large SQL database, since indexed search has complexity O(1) -- a lookup gives you a billion-page result at once: computer -> About 2.14 bln results; science -> About 1.93 bln…
Little Alien
  • 195
  • 6