Selçuk Korkmaz, PhD
Apr 24 · 17 tweets · 8 min read
🧵1/7 Understanding the difference between test set and validation set is crucial for building accurate and robust machine learning models. In this thread, we'll discuss the key differences between these two sets and their importance in model development. #MachineLearning #RStats https://www.brainstobytes.c...
🧵2/7 Validation set: It is used during model development to tune hyperparameters and make decisions about the model architecture. It helps evaluate the model's performance and prevents overfitting by providing an unbiased estimate of how well the model generalizes to new data.
🧵3/7 Test set: This is a separate dataset not used during model training or validation. It's only used after the model has been finalized to assess its performance on completely unseen data. This provides an unbiased evaluation of the final model. #RStats #DataScience
🧵4/7 Differences:
🔹Purpose: the validation set fine-tunes the model; the test set provides the final assessment.
🔹Usage: the validation set is used during model development; the test set only after finalization.
🔹Impact: the validation set directly shapes the model; the test set does not influence it. #RStats #DataScience
🧵5/7 It's important to maintain the independence between these sets. Repeatedly using the test set to make model adjustments can lead to overfitting, as the model will start to perform well on the test set specifically, but may not generalize well to new data. #RStats
🧵6/7 In practice, data is often split into three parts: training, validation, and test sets. The training set is for model learning, the validation set for hyperparameter tuning, and the test set for final evaluation. This ensures unbiased model assessment. #DataScience #RStats
🧵7/7 To sum up, understanding the distinction between test and validation sets is crucial for building effective and generalizable machine learning models. They serve different purposes in the model development process, and maintaining their independence is key. #DataScience🤖🎓
Let's demonstrate the process of splitting the data into training, validation, and test sets using R. We'll use the iris dataset, which is built into R, for this example.

1. First, let's load the necessary libraries and the iris dataset:

library(dplyr)
data(iris)

2. Now, let's shuffle the dataset and split it into training, validation, and test sets (60%, 20%, and 20% respectively):

set.seed(42) # Set a seed for reproducibility
iris_shuffled <- iris %>% sample_frac(1)

# Define the sizes of the data splits
train_size <- floor(0.6 * nrow(iris_shuffled))
validation_size <- floor(0.2 * nrow(iris_shuffled))

# Split the data
train_data <- iris_shuffled[1:train_size,]
validation_data <- iris_shuffled[(train_size+1):(train_size+validation_size),]
test_data <- iris_shuffled[(train_size+validation_size+1):nrow(iris_shuffled),]

3. Now you can train a model using the training set and tune its hyperparameters using the validation set. For instance, let's train a simple k-Nearest Neighbors (kNN) classifier using the 'class' package:
library(class)

k_values <- c(1, 3, 5, 7, 9) # Define a range of k values for tuning
best_k <- k_values[1]
best_accuracy <- 0

# Loop through the k values and evaluate the model on the validation set
for (k in k_values) {
  predicted_labels <- knn(train_data[, -5], validation_data[, -5], train_data[, 5], k = k)
  accuracy <- sum(predicted_labels == validation_data[, 5]) / nrow(validation_data)
  cat("k =", k, ", accuracy =", accuracy, "\n")
  if (accuracy > best_accuracy) {
    best_accuracy <- accuracy
    best_k <- k
  }
}
cat("Best k =", best_k, ", Best accuracy =", best_accuracy, "\n")

4. Finally, evaluate the model on the test set:
test_predicted_labels <- knn(train_data[, -5], test_data[, -5], train_data[, 5], k = best_k)
test_accuracy <- sum(test_predicted_labels == test_data[, 5]) / nrow(test_data)
cat("Test accuracy =", test_accuracy, "\n")
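An optional refinement, not shown in the thread: once the best k has been chosen, you can refit on the combined training and validation rows before the single test-set run, so the final model learns from as much data as possible. Below is a minimal self-contained sketch; the 90/30/30 split sizes and the `best_k <- 5` value are illustrative assumptions, not results from the thread.

```r
library(class)

# Rebuild the 60/20/20 split from scratch so this sketch runs on its own
set.seed(42)
iris_shuffled <- iris[sample(nrow(iris)), ]
train_data      <- iris_shuffled[1:90, ]
validation_data <- iris_shuffled[91:120, ]
test_data       <- iris_shuffled[121:150, ]
best_k <- 5  # placeholder: in practice this comes from the validation loop above

# Refit on training + validation rows, then score once on the held-out test set
final_train <- rbind(train_data, validation_data)
final_pred  <- knn(final_train[, -5], test_data[, -5], final_train[, 5], k = best_k)
final_acc   <- mean(final_pred == test_data[, 5])
cat("Final test accuracy =", final_acc, "\n")
```

Because the test set is still touched only once, this keeps the final evaluation unbiased. #RStats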

In this example, we demonstrated how to split the data into training, validation, and test sets using R, and how to train a kNN model using the training set, tune its hyperparameters using the validation set, and evaluate its performance on the test set. #RStats #DataScience
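Tweet 5/7 warns against repeatedly using the test set to make model adjustments. A common safeguard, not covered in the thread, is k-fold cross-validation on the training data: each row serves as validation exactly once, and the test set stays untouched until the end. A minimal base-R sketch of the fold bookkeeping (the 120-row count and 5 folds are illustrative assumptions):

```r
# Illustrative sketch (an addition, not from the thread): 5-fold
# cross-validation indices over 120 training rows, so hyperparameters can be
# tuned without ever touching the held-out test set.
set.seed(42)
n_train <- 120
k_folds <- 5
fold_id <- sample(rep(1:k_folds, length.out = n_train))  # random fold label per row

for (f in 1:k_folds) {
  val_rows   <- which(fold_id == f)  # validation rows for this round
  train_rows <- which(fold_id != f)  # fitting rows for this round
  # fit on train_rows, score on val_rows, then average the k_folds scores
}
```

Packages such as caret automate this loop, but the index logic above is all that is conceptually involved. #RStats #DataScience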
