Selçuk Korkmaz
Trying to simplify statistics | Deputy editor @balkanmedj (SCIE, Q2) | Assoc. Prof in Biostatistics
Jan 10 6 tweets 2 min read
In statistics and probability theory, a sample space is the set of all possible outcomes of a random experiment. It provides a comprehensive framework for understanding all potential results that could occur in a given scenario. The sample space is typically denoted by the symbol S.

#Statistics #DataScience #Research #Science

Examples:

1. Coin Toss: When flipping a fair coin, the sample space consists of two possible outcomes: heads (H) and tails (T). Thus, the sample space can be represented as:
S = {H, T}

2. Rolling a Six-Sided Die: For a single roll of a standard six-sided die, the sample space includes all six possible outcomes:
S = {1, 2, 3, 4, 5, 6}

3. Tossing Two Coins: When tossing two coins simultaneously, the sample space comprises all possible pairs of outcomes:
S = {(H, H), (H, T), (T, H), (T, T)}
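
As a quick illustration (mine, not from the thread), these sample spaces can be enumerated in R:

coin <- c("H", "T")                          # sample space of one coin toss
die  <- 1:6                                  # sample space of one roll of a six-sided die

expand.grid(first = coin, second = coin)     # all ordered pairs for two coin tosses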
Dec 18, 2024 9 tweets 3 min read
The Wisdom of Crowds

The wisdom of crowds is a phenomenon where the collective judgment or estimate of a group can be remarkably accurate, often surpassing individual expertise. This principle is grounded in the idea that individual errors tend to cancel each other out when aggregated, provided the crowd is diverse, independent, and sufficiently large.

David Spiegelhalter’s jellybean experiment illustrates this concept vividly and highlights its statistical underpinnings.

1. The Experiment

• Spiegelhalter and James Grime conducted a simple yet revealing test of crowd intelligence. They posted a YouTube video displaying a jar of jellybeans and asked viewers to guess how many beans were inside.

• A total of 915 guesses were collected, ranging from 219 to an absurd 31,337.
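
The thread goes on to how the guesses were aggregated; as a purely illustrative sketch (simulated guesses, not the actual jellybean data), here is how the median of many noisy guesses can land close to the truth even when individual guesses are far off:

set.seed(42)
true_count <- 1620                                                    # hypothetical jellybean count, for illustration only
guesses <- round(true_count * rlnorm(915, meanlog = 0, sdlog = 0.5))  # 915 noisy, right-skewed guesses

median(guesses)                  # the crowd's aggregated estimate (close to true_count)
mean(abs(guesses - true_count))  # typical error of an individual guess (much larger)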
Nov 25, 2024 5 tweets 2 min read
In statistical modeling, particularly within the context of regression analysis and analysis of variance (ANOVA), fixed effects and random effects are two fundamental concepts that describe different types of variables or factors in a model. Here’s a straightforward explanation:

#Statistics #DataScience #Research #Science

Fixed Effects:

Fixed effects refer to variables or factors whose levels are specifically chosen and are of primary interest in the study. These effects are considered constant and non-random, meaning the conclusions drawn from them are applicable only to the specific levels included in the analysis.

Imagine you’re studying the impact of different teaching methods on student performance. If you specifically choose and focus on three methods—lecture, discussion, and online learning—these are your fixed effects. You’re interested in understanding how each of these particular methods affects performance, and your conclusions will apply only to these methods.
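
A minimal sketch of how that distinction is written down in a model, assuming the lme4 package and a hypothetical simulated data set (my example, not from the original thread):

library(lme4)

# Hypothetical data: 90 students, three teaching methods (fixed), nine classrooms (random)
scores <- data.frame(
  method      = rep(c("lecture", "discussion", "online"), each = 30),
  classroom   = rep(paste0("class_", 1:9), each = 10),
  performance = rnorm(90, mean = 70, sd = 8)
)

# method is a fixed effect: only these three levels are of interest.
# classroom is a random effect: classrooms are treated as a sample from a larger population.
fit <- lmer(performance ~ method + (1 | classroom), data = scores)
summary(fit)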
Nov 24, 2024 7 tweets 3 min read
Heteroscedasticity refers to a condition in regression analysis where the variance of the error terms, or residuals, is not constant across all levels of the independent variables. In other words, the spread of the residuals changes systematically with the values of the predictors. This violates the assumption of homoscedasticity, which states that residuals should have constant variance.

#Statistics #DataScience #Research #Science

Implications of Heteroscedasticity in Regression Analysis

1. Inefficiency of OLS Estimates: While ordinary least squares (OLS) estimators remain unbiased in the presence of heteroscedasticity, they are no longer efficient. This inefficiency means that OLS estimators do not achieve the minimum variance among all unbiased estimators, leading to less precise coefficient estimates.

2. Biased Standard Errors: Heteroscedasticity causes the estimated variances of the regression coefficients to be biased, leading to unreliable hypothesis testing. The t-statistics may appear more significant than they truly are, potentially resulting in incorrect conclusions about the relationships between variables.

3. Misleading Inferences: Due to biased standard errors, statistical tests (such as t-tests for individual coefficients) may lead to incorrect conclusions. For instance, a variable might appear statistically significant when it is not, or vice versa.

4. Invalid Goodness-of-Fit Measures: Measures like the R-squared statistic may be misleading in the presence of heteroscedasticity, as they assume constant variance of the residuals. This can lead to overestimating the model’s explanatory power.
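
In practice, a quick way to diagnose the problem and obtain robust standard errors in R is sketched below (my example; it assumes the lmtest and sandwich packages, and the thread itself may suggest other remedies):

library(lmtest)    # for bptest() and coeftest()
library(sandwich)  # for vcovHC()

fit <- lm(mpg ~ wt + hp, data = mtcars)

bptest(fit)                                      # Breusch-Pagan test: small p-value suggests heteroscedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))  # heteroscedasticity-consistent (robust) standard errors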
Nov 18, 2024 10 tweets 3 min read
🧵 Understanding Degrees of Freedom in Statistics

In statistics, degrees of freedom (d.f.) are the number of independent values that can vary in your data after certain constraints are applied.

#Statistics #DataScience #Research #Science

Imagine a prize behind 1 of 3 doors. If you open 2 doors and find no prize, the 3rd door is fixed. Here, you have 2 degrees of freedom.
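
The same constraint is easy to see numerically; a quick R sketch (mine, not from the thread): once the mean is known, the deviations must sum to zero, so only n - 1 of them are free to vary.

x <- c(4, 7, 9, 12, 18)
dev <- x - mean(x)

sum(dev)          # 0 (up to rounding): the deviations are constrained
-sum(dev[1:4])    # so the last deviation is fully determined by the first four
dev[5]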
Nov 16, 2024 8 tweets 2 min read
Understanding confidence intervals can clarify statistics and enhance your data interpretation skills. Let’s try to break it down:

#Statistics #DataScience #Science #Research

A CI is a range of values, derived from sample data, that likely contains the true population parameter. It provides a measure of uncertainty around an estimate, indicating the precision of the data.

nlm.nih.gov/oet/ed/stats/0…
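
As a concrete illustration (my example, not from the linked page), a 95% CI for a mean in R:

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)    # simulated sample of 50 observations

t.test(x, conf.level = 0.95)$conf.int
# Interpretation: if we repeated the sampling many times, about 95% of intervals
# constructed this way would contain the true population mean.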
Nov 5, 2024 5 tweets 1 min read
Say goodbye to cumbersome machine learning workflows in R! Meet fastml – your new go-to package for streamlined training, evaluation, and comparison of multiple ML models with minimal code. #rstats #MachineLearning #DataScience

github.com/selcukorkmaz/f…

Whether you're a data scientist, analyst, or enthusiast, fastml simplifies the ML process by offering:

• Comprehensive data preprocessing
• Support for a wide range of algorithms
• Automated hyperparameter tuning
• Performance metrics & visualization tools
#rstats #DataScience
Oct 27, 2024 15 tweets 2 min read
Rethinking p-Values in Scientific Research

This paper addresses the ongoing debate about p-values and offers innovative solutions to improve statistical practice. Let’s explore the key insights! #Statistics #Research #OpenScience #DataScience

tandfonline.com/doi/full/10.10…

Why p-Values are Problematic:

• Misinterpretation as the probability the null hypothesis is true.
• Encourages a binary “significant” vs. “not significant” mindset.
• Over-reliance can lead to irreproducible results and publication bias.
Aug 10, 2023 11 tweets 3 min read
Reporting Regression Results Beautifully Using R 📈✨

💎 Intro:
So, you've run a regression in R and now you're staring at a wall of numbers? Let's transform that data mountain into a readable, pretty format!

#RStats #DataScience

💎 Why the Fuss About Presentation?
Data speaks, but in whispers. To make it sing, we need to dress it up and make it understandable and shareable, especially for non-stats folks.

First, let's fit a linear regression:

# Fit a linear model predicting fuel efficiency (mpg) from quarter-mile time, rear axle ratio, and weight
regression_model <- lm(mpg ~ qsec + drat + wt, data = mtcars)
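
From there, one common route to a readable table is to tidy the model and render it; a sketch assuming the broom and knitr packages (the thread may showcase a different package):

library(broom)
library(knitr)

tidy(regression_model, conf.int = TRUE) |>   # coefficients, SEs, p-values, CIs as a data frame
  kable(digits = 3, caption = "Linear regression of mpg on qsec, drat, and wt")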
Jul 31, 2023 6 tweets 2 min read
🧐 Intro:
Ever wondered about the difference between Bayesian and Frequentist reasoning? Let's dive into a chat between two friends, Frequentist (F) and Bayesian (B), as they discuss their views.

#DataScience #Statistics

1/ 🧐 Topic: Probability

F: "I view probabilities as long-run frequencies. Flip a coin often enough, and I'll predict the proportion of heads in a hypothetical infinite series."

B: "For me, probabilities represent belief. Say there's a 70% chance of rain? That's your confidence… twitter.com/i/web/status/1…
Jul 30, 2023 9 tweets 2 min read
📊 Diving into the world of data reduction techniques! Let's compare two popular methods: Factor Analysis (FA) and Principal Component Analysis (PCA). A thread! 🧵

#DataScience

📍PCA:

PCA is a technique to reduce the dimensionality of data. It identifies orthogonal (perpendicular) axes (principal components) in the data that maximize variance.

#DataScience
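
A minimal PCA in base R (my example, using a few mtcars variables):

# Standardize the variables so each contributes equally to the variance
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")], scale. = TRUE)

summary(pca)       # proportion of variance explained by each principal component
pca$rotation       # loadings: how the original variables combine into components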
Jul 29, 2023 13 tweets 2 min read
1/ 🧵 Dive deep into the differences between Logit and Probit Models! A common question in #DataScience, these two have nuances worth understanding. Let's explore.🔍

2/ Both Logit & Probit models are stalwarts in statistics when modeling binary dependent variables. But what sets them apart? The devil's in the details.📊 #DataScience
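
To make the comparison concrete, a small sketch (my example): the two models differ only in the link function passed to glm().

logit_fit  <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))
probit_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "probit"))

# Coefficients sit on different scales (logistic vs. standard normal latent error),
# but the fitted probabilities are usually almost identical:
cor(fitted(logit_fit), fitted(probit_fit))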
Jul 29, 2023 9 tweets 2 min read
1/ 🧵 Let's dive into the age-old debate between #Statistics and #MachineLearning. While they both deal with data, their perspectives, goals, and techniques can differ. Here's a breakdown:

2/ Origin & History:
•Statistics: Has its roots in probability theory & has been around for centuries. Traditionally used in areas like economics, biology, and social sciences.
•Machine Learning: Born from computer science & AI. Rose with big data & computing advances.
Jul 28, 2023 8 tweets 2 min read
Is R-square Useful or Dangerous? 📈

1/ R-square, also known as the coefficient of determination, measures the proportion of variance in the dependent variable that can be explained by the independent variables in a regression model.
#DataScience #Statistics

2/ Useful? Definitely! R^2 gives us an idea of how well our model explains the variance in the data. Higher values suggest the model explains a lot of the variation; however, it's not the sole criterion for a "good" model.
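
For reference, pulling R^2 (and its adjusted counterpart) out of a fitted model in R (my example):

fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)$r.squared        # proportion of variance explained
summary(fit)$adj.r.squared    # penalized for the number of predictors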
Jul 28, 2023 8 tweets 2 min read
1/8 📌 Intro
Both correlation and covariance provide insights into the relationship between two variables. While they might seem similar, there are key differences to note. Let's dive in! #DataScience #Statistics

2/8 📊 Covariance
Covariance measures the directional relationship between two variables. It can be positive (both variables increase together), negative (one variable decreases as the other increases), or zero (no consistent pattern).
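
A quick numerical illustration (mine, using mtcars) of how the two quantities relate:

x <- mtcars$wt
y <- mtcars$mpg

cov(x, y)                      # covariance: the sign gives the direction, the scale depends on the units
cor(x, y)                      # correlation: same sign, rescaled to lie between -1 and 1
cov(x, y) / (sd(x) * sd(y))    # correlation is just covariance standardized by the two SDs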
Jul 25, 2023 10 tweets 2 min read
1/ 🤔 Ever wonder how "bootstrapping" works? I recently used it for estimating confidence intervals & someone asked me about its logic. At first, I was stumped, even though I've used it often! Here's my attempt to clarify.

#Statistics #Bootstrapping #DataScience 📈📉
https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307

2/ 🥾 What's bootstrapping? It's a resampling technique where you repeatedly draw samples (with replacement) from your sample data & analyze each one. The idea? These resamples give us an insight into the variability in our sample.
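
A bare-bones version of that logic in base R (my sketch, not from the linked article): each bootstrap sample is drawn with replacement and has the same size as the original sample.

set.seed(123)
x <- rnorm(40, mean = 5, sd = 2)                                  # the original sample

boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))    # 2000 bootstrap estimates of the mean

quantile(boot_means, c(0.025, 0.975))                             # percentile bootstrap 95% CI for the mean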
Jul 21, 2023 8 tweets 2 min read
🧵 Difference Between Confidence Interval & Credible Interval
1/ Intro
Both Confidence Intervals (CIs) and Credible Intervals (CrIs) provide a range for estimating an unknown parameter. But they're based on different philosophies and interpretations. #DataScience #Stats

2/ Confidence Interval (CI) 📉
•Based on frequentist statistics.
•If we were to repeat a study many times, ~95% (or another chosen level) of the CIs would contain the true parameter.
•It's about the intervals and their likelihood of capturing the true value.
Jul 21, 2023 10 tweets 2 min read
1/ 🧵 Let's dive into a common statistical question: When calculating standard deviation, why do we square the differences rather than taking their absolute value? Let's break this down. 📊 #DataScience #rstats

2/ Historical Context:
To start, the idea of squaring differences has a historical basis. Sir Francis Galton, a cousin of Charles Darwin, introduced it. Galton's work influenced the development of the variance (and subsequently the standard deviation).
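
To see the two choices side by side (my example, not from the thread):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

sd(x)                      # root of the (n - 1)-averaged squared deviations
mean(abs(x - mean(x)))     # mean absolute deviation: the "absolute value" alternative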
Jul 19, 2023 10 tweets 2 min read
1/10 🧵 Dive into Data Visualization with #ggplot2! 📊
Let's explore the foundation of this popular #R package and how to create stunning plots using its components. Follow along! #DataScience #Rstats
https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/

2/10 🖼️ The Canvas:
ggplot(data = your_data) creates the canvas. Every ggplot plot begins here. You're specifying the dataset you're working with. But just this alone won’t visualize anything! #RStats #DataScience
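
A minimal build-up from empty canvas to finished plot (my example with mtcars):

library(ggplot2)

ggplot(data = mtcars)                          # the canvas alone draws nothing

ggplot(data = mtcars, aes(x = wt, y = mpg)) +  # add aesthetics and a geom to see points
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")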
Jul 13, 2023 15 tweets 3 min read
1/15 🧵 Want to level up your #R programming skills? Whether you're a beginner or an intermediate R user, this thread is for you! Follow along for valuable tips, resources, and strategies to become a more confident and skilled R programmer. 🚀 #RStats #DataScience

2/15 R is a powerful language for data manipulation, analysis, and visualization. To elevate your skills, start by understanding the language at its core. This includes the syntax, data types, vectors, matrices, lists, and data frames. #RStats
May 7, 2023 8 tweets 3 min read
1/ 📊📈 Let's dive into the fascinating world of #statistics and explore two key concepts: Odds Ratio and Relative Risk! Understanding the differences and applications of these two measures is crucial for interpreting study results and making informed decisions. #DataScience

2/ 🎲 Odds Ratio (OR): The Odds Ratio is a measure of association between an exposure and an outcome. It represents the odds of an event occurring in one group compared to the odds in another group. OR is particularly useful in case-control studies. #DataScience
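
A quick hand calculation of an OR in R (hypothetical 2x2 counts, purely illustrative):

# Hypothetical case-control study:   Cases  Controls
#   Exposed                             30        20
#   Unexposed                           70        80

odds_exposed_cases    <- 30 / 70      # odds of exposure among cases
odds_exposed_controls <- 20 / 80      # odds of exposure among controls

odds_exposed_cases / odds_exposed_controls   # OR ≈ 1.71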