What do we mean when we say data is skewed?
2/22/2026 · 1 min read
When we say data is skewed, we mean that the distribution of values is not symmetric—it has a long tail on one side. In other words, most of the data is concentrated on one end, with a few extreme values stretching out in one direction.
Intuition
If you plotted the data as a histogram:
A symmetric distribution (like a bell curve) looks balanced.
A skewed distribution looks lopsided.
Types of skewness
Right‑skewed (positive skew)
Long tail on the right (toward larger values)
Most values are small, a few are very large
Examples: income, response times, file sizes
Typical relationship:
mean > median > mode
Left‑skewed (negative skew)
Long tail on the left (toward smaller values)
Most values are large, a few are very small
Example: exam scores where most people score high
Typical relationship:
mean < median < mode
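To make the two cases concrete, here is a minimal sketch that computes the moment coefficient of skewness (third central moment divided by the 1.5 power of the variance) using only the standard library. The helper name `sample_skewness` and both sample lists are made up for illustration; a positive result suggests a right tail and a negative result a left tail.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        # All values identical: skewness is undefined, so return 0.0 by convention.
        return 0.0
    return m3 / m2 ** 1.5

# Illustrative data: mostly small values with one large one (right-skewed).
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 20]
# Mirror image: mostly high values with one low one (left-skewed).
left_skewed = [100 - x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "mean =", mean(data), "median =", median(data),
          "skewness sign =", "+" if sample_skewness(data) > 0 else "-")
```

In the right-skewed sample the mean (4.4) sits above the median (2.5), and in the mirrored sample the mean (95.6) sits below the median (97.5), which matches the typical orderings listed above.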
Simple example
Suppose response times (ms):
50, 52, 55, 56, 57, 58, 500
Most values are around 55 ms, but one slow request (500 ms) stretches the distribution to the right, so the data is right‑skewed.
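A quick check with Python's standard library shows how far that single slow request pulls the mean. This is only a sketch on the seven values above; the variable names are arbitrary.

```python
from statistics import mean, median

response_times_ms = [50, 52, 55, 56, 57, 58, 500]

avg = mean(response_times_ms)
mid = median(response_times_ms)
avg_without_outlier = mean(response_times_ms[:-1])  # drop the 500 ms request

print(f"mean   = {avg:.1f} ms")                            # mean   = 118.3 ms
print(f"median = {mid} ms")                                # median = 56 ms
print(f"mean without 500 = {avg_without_outlier:.1f} ms")  # mean without 500 = 54.7 ms
```

The mean lands at roughly twice any typical request, while the median stays with the bulk of the data.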
Why skewed data matters
Averages can be misleading
The mean is pulled toward the tail. In skewed data, the median often better represents a “typical” value.
Model assumptions
Many statistical methods assume roughly normal or symmetric distributions (e.g., the residuals in linear regression, or the data in many parametric tests).
System behavior & performance
In systems and data engineering, and in distributed systems generally, skew often means hotspots: a small number of keys or partitions receive most of the load, which can hurt scalability and latency.
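One way to see this kind of skew is to count how many events land on each partition. The sketch below assumes simple hash partitioning on a string key; the helper names (`partition_for`, `partition_loads`, `imbalance_ratio`) and the workload are hypothetical and do not correspond to any particular system's API.

```python
from collections import Counter
import zlib

def partition_for(key, num_partitions):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_loads(keys, num_partitions):
    """Return the number of events assigned to each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

def imbalance_ratio(loads):
    """Max partition load divided by average load; 1.0 means perfectly even."""
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical workload: one hot key accounts for about 90% of 1,000 events.
events = ["user-42"] * 900 + [f"user-{i}" for i in range(100)]

loads = partition_loads(events, num_partitions=4)
print("events per partition:", loads)
print("imbalance ratio:", round(imbalance_ratio(loads), 2))
```

Because every event for the hot key hashes to the same partition, that partition carries at least 900 of the 1,000 events no matter how many partitions are added.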
Common ways to handle skew
Use the median and percentiles (p90, p99) instead of the mean
Apply transformations (log, square root) for modeling
Bucket outliers, or cap them at a chosen percentile (winsorization)
In distributed systems: repartitioning, salting keys, load‑aware strategies
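The techniques above can be sketched in a few lines of standard-library Python. Everything here is illustrative: `percentile` uses the nearest-rank definition (other definitions interpolate and give slightly different values), the 5th/85th winsorization bounds are chosen only because the sample has seven values, and `salted_key` is a hypothetical helper showing the idea of appending a random suffix so one hot key spreads over several partitions.

```python
import math
import random
from statistics import mean, median

def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def winsorize(values, lower_p, upper_p):
    """Clamp values below/above the given percentiles to those percentiles."""
    low = percentile(values, lower_p)
    high = percentile(values, upper_p)
    return [min(max(x, low), high) for x in values]

def log_transform(values):
    """Natural log of each value; only valid for strictly positive data."""
    return [math.log(x) for x in values]

def salted_key(key, num_salts, rng):
    """Append a random salt so one hot key maps to num_salts distinct keys."""
    return f"{key}#{rng.randrange(num_salts)}"

times_ms = [50, 52, 55, 56, 57, 58, 500]

# Median and percentiles instead of the mean.
print(median(times_ms), percentile(times_ms, 90))  # 56 500

# Winsorization: the 500 ms value is capped at the 85th percentile (58).
capped = winsorize(times_ms, 5, 85)
print(capped)                  # [50, 52, 55, 56, 57, 58, 58]
print(round(mean(capped), 1))  # 55.1

# Log transform: a 10x spread in raw values becomes a spread of ln(10).
logged = log_transform(times_ms)
print(round(max(logged) - min(logged), 2))  # 2.3

# Salting: writes for "user-42" are spread over up to 4 salted keys.
rng = random.Random(0)  # seeded only to make the demo repeatable
salted = {salted_key("user-42", 4, rng) for _ in range(100)}
print(sorted(salted))
```

Note the trade-off with salting: anything that reads or aggregates by the original key has to combine the results from all salted variants.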