What do we mean when we say data is skewed?

2/22/2026 · 1 min read

When we say data is skewed, we mean that the distribution of values is not symmetric: it has a longer tail on one side than on the other. In other words, most of the data is concentrated toward one end, with a few extreme values stretching out in the opposite direction.

Intuition

If you plotted the data as a histogram:

  • A symmetric distribution (like a bell curve) looks balanced.

  • A skewed distribution looks lopsided.

Types of skewness

  1. Right‑skewed (positive skew)

    • Long tail on the right (toward larger values)

    • Most values are small, a few are very large

    • Examples: income, response times, file sizes

    • Typical relationship (a rule of thumb, not a guarantee):
      mean > median > mode

  2. Left‑skewed (negative skew)

    • Long tail on the left (toward smaller values)

    • Most values are large, a few are very small

    • Example: exam scores where most people score high

    • Typical relationship (a rule of thumb, not a guarantee):
      mean < median < mode
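
The direction of skew can also be expressed as a single number. The sketch below uses one common definition, the moment-based skewness coefficient (the mean cubed deviation divided by the cubed standard deviation). The function name `skewness` and the sample lists are made up for illustration. A positive result suggests a right tail, a negative result a left tail, and a value near zero a roughly symmetric shape.

```python
def skewness(values):
    """Moment-based skewness: mean cubed deviation / (std deviation ** 3)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # variance (population form)
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical; returning 0 here is a convention
    return m3 / m2 ** 1.5

# Illustrative data, not taken from any real system.
right_tailed = [1, 2, 2, 3, 3, 3, 20]
left_tailed = [-x for x in right_tailed]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)  # True: long tail on the right
print(skewness(left_tailed) < 0)   # True: long tail on the left
print(skewness(symmetric))         # 0.0
```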

Simple example

Suppose response times (ms):

50, 52, 55, 56, 57, 58, 500

Most values are around 55 ms, but one slow request (500 ms) stretches the distribution to the right, so the data is right‑skewed.
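
To see how much that single slow request moves the average, here is a small sketch using Python's standard `statistics` module on the same seven values. The variable names are illustrative only.

```python
from statistics import mean, median

response_times_ms = [50, 52, 55, 56, 57, 58, 500]

avg = mean(response_times_ms)
mid = median(response_times_ms)

print(f"mean   = {avg:.1f} ms")  # mean   = 118.3 ms
print(f"median = {mid} ms")      # median = 56 ms

# Dropping the one slow request shows how strongly it pulls the mean.
without_outlier = response_times_ms[:-1]
print(f"mean without the 500 ms request = {mean(without_outlier):.1f} ms")  # 54.7 ms
```

The mean lands at roughly twice any typical request, while the median stays at 56 ms.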

Why skewed data matters

  • Averages can be misleading
    The mean is pulled toward the tail. In skewed data, the median often better represents a “typical” value.

  • Model assumptions
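    Many statistical methods assume roughly normal or symmetric distributions. For example, inference in linear regression assumes approximately normal residuals, and parametric tests such as the t‑test assume approximately normal data.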

  • System behavior & performance
    In systems and data engineering, “skew” usually refers to hotspots: a small number of keys or partitions dominate the load, which can hurt scalability and latency.
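
For the systems sense of skew, a quick check is to measure how much of the traffic each key accounts for. The following is a minimal sketch with a made-up event stream. The key names and the `load_share` helper are hypothetical and not tied to any particular framework.

```python
from collections import Counter

# Hypothetical event stream: one "hot" customer generates most of the traffic.
events = ["customer_a"] * 90 + ["customer_b"] * 5 + ["customer_c"] * 5

def load_share(keys):
    """Return (key, fraction of all records) pairs, largest share first."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [(key, count / total) for key, count in counts.most_common()]

shares = load_share(events)
hottest_key, hottest_share = shares[0]
print(hottest_key, hottest_share)  # customer_a 0.9
```

If records are partitioned by key, whichever partition owns `customer_a` would receive about 90% of the work in this toy example.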

Common ways to handle skew

  • Use the median and percentiles (p90, p99) instead of the mean

  • Apply transformations (log, square root) for modeling

  • Cap outliers at a chosen limit (winsorization), or bucket values into ranges

  • In distributed systems: repartitioning, salting keys, load‑aware strategies
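
The techniques listed above can be sketched with the standard library alone, reusing the response times from the earlier example. This is an illustration under simplifying assumptions, not a recommended implementation. The helpers `winsorize_upper` and `salted_key` are hypothetical names, and using p90 as the cap is an arbitrary choice for the demo.

```python
import math
import random
from statistics import median, quantiles

response_times_ms = [50, 52, 55, 56, 57, 58, 500]

# 1. Robust summaries: the median and a high percentile instead of the mean.
typical = median(response_times_ms)
# quantiles(n=10) returns 9 cut points; the last one estimates the 90th percentile.
p90 = quantiles(response_times_ms, n=10, method="inclusive")[-1]

# 2. Log transform: compresses large values (requires strictly positive data).
logged = [math.log(x) for x in response_times_ms]

# 3. Winsorization: cap values above a chosen limit instead of dropping them.
def winsorize_upper(values, limit):
    return [min(x, limit) for x in values]

capped = winsorize_upper(response_times_ms, p90)

# 4. Key salting: append a random suffix so one hot key maps to several sub-keys.
def salted_key(key, num_salts):
    return f"{key}#{random.randrange(num_salts)}"

print(typical, p90)
print(max(capped))                  # the 500 ms value is now capped at p90
print(salted_key("customer_a", 4))  # e.g. customer_a#2 (suffix is random)
```

With salting, any downstream aggregation has to strip the suffix and combine the partial results for each original key. That is the cost of spreading the load.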