Questions tagged [dataset]

A dataset is a collection of data, often in tabular or matrix form.

This tag is NOT intended for data requests ("where can I find a dataset about ..."); for those, see OpenData.

A dataset, or data set, is a collection of data whose data points are typically related in some way.

Most commonly a dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the dataset in question. The dataset lists values for each of the variables, such as height and weight of an object, for each member of the dataset. Each value is known as a datum or data point.
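The table layout described above can be sketched in a few lines of plain Python. This is only an illustration: the member names, the variable names (height_cm, weight_kg) and all values are made up, and the column() helper is a hypothetical convenience, not part of any library.

```python
# Hypothetical dataset: each dict is one member (row),
# and each key is one variable (column).
dataset = [
    {"object": "A", "height_cm": 10.0, "weight_kg": 2.5},
    {"object": "B", "height_cm": 12.5, "weight_kg": 3.0},
    {"object": "C", "height_cm": 8.0, "weight_kg": 1.5},
]


def column(rows, variable):
    """Return every value (datum) recorded for one variable, in row order."""
    return [row[variable] for row in rows]


heights = column(dataset, "height_cm")  # one column of the table
n_members = len(dataset)                # number of rows
n_variables = len(dataset[0])           # number of columns, including the identifier
n_data_points = n_members * n_variables
```

Each individual value, such as the 12.5 stored for member "B", is a single datum in the sense used above.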

The term dataset may also be used more loosely, to refer to the data in a collection of closely related tables.

1514 questions
203 votes, 35 answers

Publicly Available Datasets

One of the common problems in data science is gathering data from various sources in a somewhat cleaned (semi-structured) format and combining metrics from those sources to make a higher-level analysis. Looking at other people's efforts,…
Amir Ali Akbari
63 votes, 5 answers

Is it always better to use the whole dataset to train the final model?

A common technique, after training, validating and testing the preferred machine learning model, is to use the complete dataset, including the testing subset, to train a final model for deployment, e.g. in a product. My question is: Is it always…
pcko1
59 votes, 6 answers

Should I go for a 'balanced' dataset or a 'representative' dataset?

My 'machine learning' task is to separate benign Internet traffic from malicious traffic. In a real-world scenario, most (say 90% or more) Internet traffic is benign. Thus I felt that I should choose a similar data setup for training my…
pnp
35 votes, 10 answers

Why is it wrong to train and test a model on the same dataset?

What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?
karalis1
35 votes, 4 answers

Quick guide into training highly imbalanced data sets

I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set, so this data set is quite unbalanced. A plain random forest just marks all test samples as the majority class. Some good answers…
IgorS
28 votes, 3 answers

Data Science Project Ideas

I don't know if this is the right place to ask this question, but a community dedicated to Data Science should be the most appropriate place, in my opinion. I have just started with Data Science and Machine Learning. I am looking for long term project…
Kevin Desai
28 votes, 7 answers

Publicly available social network datasets/APIs

As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics…
Rubens
25 votes, 4 answers

Is there any data tidying tool for python/pandas similar to R tidyr tool?

I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalents of gather(), separate() and spread(), which can be found in R's tidyr package.
cpumar
23 votes, 2 answers

Loading own train data and labels in dataloader using pytorch?

I have x_data and labels separately. How can I combine and load them in the model using torch.utils.data.DataLoader? I have a dataset that I created; the training data has 20k samples and the labels are stored separately. Let's say I want to load a…
Amarnath
23 votes, 6 answers

Uploading images folder from my system into Google Colab

I want to train a deep learning model on a dataset containing around 3000 images. Since the dataset is huge, I want to use Google Colab, since it supports GPUs. How do I upload this full image folder into my notebook and use it?
22 votes, 3 answers

Dataset for Named Entity Recognition on Informal Text

I'm currently searching for labeled datasets to train a model to extract named entities from informal text (something similar to tweets). Because capitalization and grammar are often lacking in the documents in my dataset, I'm looking for out of…
Madison May
22 votes, 3 answers

How to generate synthetic dataset using machine learning model learnt with original dataset?

Generally, a machine learning model is built on datasets. I'd like to know if there is any way to generate a synthetic dataset using such a trained machine learning model while preserving the original dataset's characteristics? [original data --> build machine…
m-bhole
18 votes, 5 answers

Downloading a large dataset on the web directly into AWS S3

Does anyone know if it's possible to import a large dataset into Amazon S3 from a URL? Basically, I want to avoid downloading a huge file and then reuploading it to S3 through the web portal. I just want to supply the download URL to S3 and wait…
Will Stedden
18 votes, 4 answers

One hot encoding alternatives for large categorical values

I have a data frame with a categorical variable that has over 1600 categories. Is there any alternative so that I don't end up with over 1600 columns? I found this interesting link, but they are converting to class/object, which I don't want. I…
18 votes, 3 answers

When should we consider a dataset as imbalanced?

I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced. My question is, are there any rules of thumb that tell us when we should subsample the large category in order to force some kind of balancing in…
Rami