Statistics is a scientific approach to inductive inference and prediction based on probabilistic models of the data. By extension, it covers the design of experiments and surveys to gather data for this purpose.
Questions tagged [statistics]
1116 questions
132
votes
1 answer
How to get correlation between two categorical variables, and between a categorical variable and a continuous variable?
I am building a regression model and I need to calculate the below to check for correlations
Correlation between 2 Multi level categorical variables
Correlation between a Multi level categorical variable and continuous variable
VIF(variance…
GeorgeOfTheRF
60
votes
5 answers
Neural networks: which cost function to use?
I am using TensorFlow for experiments mainly with neural networks. Although I have done quite some experiments (XOR-Problem, MNIST, some Regression stuff, ...) now, I struggle with choosing the "correct" cost function for specific problems because…
daniel451
47
votes
12 answers
Data Science in C (or C++)
I'm an R language programmer. I'm also in the group of people who are considered Data Scientists but who come from academic disciplines other than CS.
This works out well in my role as a Data Scientist, however, by starting my career in R and only…
Hack-R
38
votes
3 answers
Calculation and Visualization of Correlation Matrix with Pandas
I have a pandas data frame with several entries, and I want to calculate the correlation between the income of some type of stores. There are a number of stores with income data, classification of area of activity (theater, cloth stores, food ...)…
gdlm
29
votes
4 answers
Books about the "Science" in Data Science?
What are the books about the science and mathematics behind data science? It feels like so many "data science" books are programming tutorials and don't touch things like data generating processes and statistical inference. I can already code, what…
Anton
29
votes
10 answers
Any Online R console?
I am looking for an online console for the R language: I write the code, and the server executes it and returns the output.
Similar to the website Datacamp.
Gotham
26
votes
7 answers
Is Python a viable language to do statistical analysis in?
I originally came from R, but Python seems to be the more common language these days. Ideally, I would do all my coding in Python as the syntax is easier and I've had more real life experience using it - and switching back and forth is a pain.
Out…
confused
23
votes
4 answers
What statistical model should I use to analyze the likelihood that a single event influenced longitudinal data
I am trying to find a formula, method, or model to use to analyze the likelihood that a specific event influenced some longitudinal data. I am having difficulty figuring out what to search for on Google.
Here is an example scenario:
Imagine you own a…
Peter Kirby
21
votes
3 answers
Overfitting in Linear Regression
I'm just getting started with machine learning and I have trouble understanding how overfitting can happen in a linear regression model.
Considering we use only 2 feature variables to train a model, how can a flat plane possibly be overfitted to a…
Sachin Krishna
21
votes
2 answers
What is the correct meaning and interpretation of p-values?
I’m posting this question, and an answer, to help dispel a few misunderstandings about what p-values are. As a hiring manager interviewing mid-level and senior data scientists, I have noticed these misunderstandings many times. I have also noticed…
Robert Long
19
votes
2 answers
Why does data science see class imbalance as a problem for supervised learning when statistics does not?
Why does data science see class imbalance as a problem in supervised learning when statistics says it is not?
Data science seems to see class imbalance as problematic and as needing special techniques to remedy it.
For instance, this DS.SE…
Dave
17
votes
2 answers
High-dimensional data: What are useful techniques to know?
Due to various curses of dimensionality, the accuracy and speed of many of the common predictive techniques degrade on high dimensional data. What are some of the most useful techniques/tricks/heuristics that help deal with high-dimensional data…
ASX
16
votes
5 answers
Beginner math books for Machine Learning
I'm a Computer Science engineer with no background in statistics or advanced math.
I'm studying the book Python Machine Learning by Raschka and Mirjalili, but when I tried to understand the math behind machine learning, I wasn't able to understand…
Tantaros
16
votes
1 answer
How many features to sample using Random Forests
The Wikipedia page which quotes "The Elements of Statistical Learning" says:
Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split.
I understand that this is a fairly good educated…
Valentin Calomme
15
votes
2 answers
Analyzing A/B test results which are not normally distributed, using independent t-test
I have a set of results from an A/B test (one control group, one feature group) which do not fit a Normal Distribution.
In fact the distribution resembles more closely the Landau Distribution.
I believe the independent t-test requires that the…
teebszet