What do we mean when we say data is skewed?

2/22/2026 · 1 min read

When we say data is skewed, we mean that the distribution of values is not symmetric: it has a longer tail on one side. In other words, most of the data is concentrated toward one end, with a few extreme values stretching out in the other direction.

Intuition

If you plotted the data as a histogram:

- A symmetric distribution (like a bell curve) looks balanced.
- A skewed distribution looks lopsided.

Types of skewness

Right-skewed (positive skew)
- Long tail on the right (toward larger values)
- Most values are small, a few are very large
- Examples: income, response times, file sizes
- Typical relationship (a rule of thumb for unimodal data, not a guarantee):
  mean > median > mode

Left-skewed (negative skew)
- Long tail on the left (toward smaller values)
- Most values are large, a few are very small
- Example: exam scores where most people score high
- Typical relationship (same caveat):
  mean < median < mode
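
To make the sign convention concrete, here is a minimal sketch that computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). The function name `sample_skewness` and the three small datasets are made up for illustration; libraries may apply a bias correction that this sketch omits.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no spread, treat as unskewed
    return m3 / m2 ** 1.5


# Illustrative datasets (not from any real source).
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value on the right
left_tailed = [-x for x in right_tailed]   # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print("right-tailed:", sample_skewness(right_tailed))  # positive
print("left-tailed: ", sample_skewness(left_tailed))   # negative
print("symmetric:   ", sample_skewness(symmetric))     # zero
```

A positive result means the tail is on the right, a negative result means it is on the left, and a value near zero suggests a roughly symmetric distribution.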

Simple example

Suppose response times (ms):

50, 52, 55, 56, 57, 58, 500

Most values are around 55 ms, but one slow request (500 ms) stretches the distribution to the right, so the data is right-skewed. The mean is about 118 ms, while the median is 56 ms.
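
A quick way to see the effect is to compute the mean and median of these seven values with Python's standard `statistics` module. The variable names are only illustrative.

```python
import statistics

response_times_ms = [50, 52, 55, 56, 57, 58, 500]

mean_ms = statistics.mean(response_times_ms)      # pulled up by the 500 ms request
median_ms = statistics.median(response_times_ms)  # middle value of the sorted data

# The same data without the single slow request, for comparison.
without_outlier = [x for x in response_times_ms if x < 500]
mean_without_ms = statistics.mean(without_outlier)

print(f"mean with outlier:    {mean_ms:.1f} ms")
print(f"median with outlier:  {median_ms} ms")
print(f"mean without outlier: {mean_without_ms:.1f} ms")
```

One slow request more than doubles the mean, while the median stays close to where most requests actually fall.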

Why skewed data matters

Averages can be misleading
The mean is pulled toward the tail. In skewed data, the median often better represents a “typical” value.
Model assumptions
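Many statistical methods assume roughly normal or symmetric distributions, for example linear regression (for its residuals) and many parametric tests. Strong skew can violate those assumptions and make the results less reliable.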
System behavior & performance
In systems and data engineering, skew often means hotspots: a small number of keys or partitions dominate the load, which can hurt scalability and latency. The term is used the same way in distributed systems.
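
The hotspot effect can be sketched with a toy hash partitioner. Everything here is an assumption for illustration: the key names, the 8 partitions, the workload mix, and the use of `zlib.crc32` as the hash (real systems use their own partitioners).

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a key to a partition with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many events land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: one hot key with 9,000 events,
# plus 1,000 other keys with one event each.
events = ["user_hot"] * 9000 + [f"user_{i}" for i in range(1000)]

loads = partition_loads(events, num_partitions=8)
average = sum(loads) / len(loads)

print("events per partition:", loads)
print(f"hottest partition carries {max(loads) / average:.1f}x the average load")
```

Because every event for the hot key hashes to the same partition, that partition does most of the work no matter how many partitions are added.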

Common ways to handle skew

- Use the median and percentiles (p90, p99) instead of the mean
- Apply transformations (log, square root) before modeling
- Bucket values, or cap outliers (winsorization)
- In distributed systems: repartitioning, salting keys, load-aware strategies
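
The sketch below walks through these options on small, made-up inputs using only the standard library. The nearest-rank `percentile` helper, the 85th-percentile cap, the 16 salts, the 8 partitions, and the CRC32 partitioner are all illustrative assumptions rather than recommended settings.

```python
import math
import random
import statistics
import zlib
from collections import Counter

response_times_ms = [50, 52, 55, 56, 57, 58, 500]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# 1. Robust summaries instead of the mean.
median_ms = statistics.median(response_times_ms)
p90_ms = percentile(response_times_ms, 90)

# 2. Log transform: compresses the long right tail (requires positive values).
log_times = [math.log(x) for x in response_times_ms]

# 3. Winsorization: cap values above a chosen percentile (85 is arbitrary here).
cap = percentile(response_times_ms, 85)
winsorized = [min(x, cap) for x in response_times_ms]


# 4. Salting: append a random suffix to a hot key so its events spread out.
def partition_for(key, num_partitions):
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def max_partition_load(keys, num_partitions):
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return max(counts.values())


rng = random.Random(0)  # fixed seed so runs are repeatable
events = ["user_hot"] * 9000 + [f"user_{i}" for i in range(1000)]
salted = [f"{k}#{rng.randrange(16)}" if k == "user_hot" else k for k in events]

print("median:", median_ms, "p90:", p90_ms)
print("mean before winsorizing:", round(statistics.mean(response_times_ms), 1))
print("mean after winsorizing: ", round(statistics.mean(winsorized), 1))
print("max partition load before salting:", max_partition_load(events, 8))
print("max partition load after salting: ", max_partition_load(salted, 8))
```

Salting trades simplicity for balance: anything that aggregates by key afterwards has to combine the partial results from all salted variants of that key.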
