Ch 04. Descriptive Statistics Visualization (Histograms, Box Plots) - Practical R Statistical Analysis

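In the previous chapter we checked the mean, median, and standard deviation as numbers. This chapter looks at the same information as graphs. Visual patterns are usually easier to take in at a glance than a table of numbers: the shape of a distribution, the location of outliers, and the difference between two groups all show up far more intuitively in a single plot.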

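ggplot2 Basics

Every visualization in this book uses ggplot2.

library(ggplot2)

# Basic structure of a ggplot2 call.
# This is a template, not runnable code: replace the upper-case placeholders
# with your own data frame, columns, layer type, and settings.
# ggplot(data = DATA_FRAME, aes(x = X_COLUMN, y = Y_COLUMN)) +
#   geom_LAYER_TYPE() +
#   EXTRA_SETTINGS()

Inside aes(), you specify which columns are mapped to the x-axis, the y-axis, color, and so on. The geom_* function determines what kind of plot is drawn.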
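As a concrete instance of the structure above, the following sketch fills in the template with the built-in iris data. The choice of columns and of geom_point() is only for illustration. The last two assignments look inside the plot object to show that the aesthetic mapping and the geom layer are stored separately.

```r
library(ggplot2)

# Data + aesthetic mapping + one geom layer + one extra setting
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  theme_minimal()

print(p)  # Draw the plot

# The mapping declared in aes() and the geom layer are separate parts of the plot object
mapped_aesthetics <- names(p$mapping)
layer_geom_class  <- class(p$layers[[1]]$geom)[1]

mapped_aesthetics
layer_geom_class
```

Swapping geom_point() for another geom_* function changes the kind of plot without touching the data or the mapping.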

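Histograms: Seeing the Shape of a Distribution

A histogram divides a continuous variable into intervals (bins) and draws a bar for the number of observations that fall in each bin.

library(tidyverse)

# Basic histogram (ggplot2 prints a message that it is using the default bins = 30)
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram()

# Adjusting the bins: binwidth sets the width of each bar
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Iris Sepal Length",
    x     = "Sepal length (cm)",
    y     = "Count"
  ) +
  theme_minimal()

Specify either bins (the number of bars) or binwidth (the width of each bar), not both; if both are given, binwidth takes precedence. The default is bins = 30, but it is better to choose a value that suits your data.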
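To see what the two arguments actually control, the sketch below builds one histogram with bins and one with binwidth and inspects the bars that ggplot2 computed. The values 12 and 0.5 are arbitrary choices for illustration. layer_data() is the ggplot2 helper that returns the computed data for a layer.

```r
library(ggplot2)

p_bins  <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 12)
p_width <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(binwidth = 0.5)

# layer_data() returns one row per bar, with its count and its left/right edges
bars_bins  <- layer_data(p_bins)
bars_width <- layer_data(p_width)

nrow(bars_bins)                                         # number of bars when bins = 12
unique(round(bars_width$xmax - bars_width$xmin, 8))     # bar width when binwidth = 0.5
c(sum(bars_bins$count), sum(bars_width$count))          # both cover every row of iris
```

Either argument partitions the same observations; only the resolution of the picture changes, which is why it is worth trying a few values before settling on one.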

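# Color by species: map a categorical variable to fill
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2, alpha = 0.6, position = "identity") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Distribution of Sepal Length by Species",
    x     = "Sepal length (cm)",
    y     = "Count",
    fill  = "Species"
  ) +
  theme_minimal()

position = "identity" draws the bars on top of one another instead of stacking them (the default), and alpha makes them semi-transparent, so the distributions of the three species can be compared at a glance.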
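The difference between overlaying and the default stacking is easy to miss in the picture, so here is a small sketch that checks it in the computed data instead. It compares where each bar starts (ymin) under the two positions; the variable names are just for this example.

```r
library(ggplot2)

base_plot <- ggplot(iris, aes(x = Sepal.Length, fill = Species))

# Default position for geom_histogram() is "stack"
p_stack    <- base_plot + geom_histogram(binwidth = 0.2)
p_identity <- base_plot + geom_histogram(binwidth = 0.2, position = "identity", alpha = 0.6)

# ymin is the height at which each bar starts
stack_starts    <- layer_data(p_stack)$ymin
identity_starts <- layer_data(p_identity)$ymin

any(stack_starts > 0)       # stacked: some bars sit on top of another species' bar
all(identity_starts == 0)   # identity: every bar starts at zero
```

With stacking, only the total height of a column is easy to read; with identity, each species keeps its own baseline, which is what makes the shapes comparable.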

Density Curves: A Smooth Version of the Histogram

# Density curve
ggplot(iris, aes(x = Sepal.Length)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  labs(
    title = "Density of Sepal Length",
    x     = "Sepal length (cm)",
    y     = "Density"
  ) +
  theme_minimal()

# Histogram + density curve: both layers must share the density scale on the y-axis
# (linewidth needs ggplot2 3.4.0 or later; older versions use size instead)
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.2, fill = "steelblue", color = "white", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  labs(
    title = "Distribution of Sepal Length (Histogram + Density Curve)",
    x     = "Sepal length (cm)",
    y     = "Density"
  ) +
  theme_minimal()

# Compare density curves by species
ggplot(iris, aes(x = Sepal.Length, fill = Species, color = Species)) +
  geom_density(alpha = 0.3) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Density of Sepal Length by Species",
    x     = "Sepal length (cm)",
    y     = "Density"
  ) +
  theme_minimal()

The shape of the density curve tells you whether the data are close to a normal distribution, whether they are skewed, and whether there is more than one peak.
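A visual impression of skew can be backed up with a number. The sketch below computes the moment-based sample skewness (third central moment divided by the 1.5 power of the second central moment) for each species. sample_skewness() is a hypothetical helper written for this example, not a function from base R or the tidyverse, and other packages may use slightly different bias corrections.

```r
# Moment-based sample skewness: mean((x - m)^3) / mean((x - m)^2)^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]            # ignore missing values
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Skewness of sepal length within each species
skew_by_species <- sapply(split(iris$Sepal.Length, iris$Species), sample_skewness)
round(skew_by_species, 2)
```

Values near 0 suggest a roughly symmetric distribution, positive values a longer right tail, and negative values a longer left tail. The number does not detect multiple peaks, so it complements the density plot rather than replacing it.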

Box Plots: Quartiles at a Glance

A box plot shows Q1, the median, Q3, the whiskers, and outliers in a single graphic.

# Elements of a box plot
# ─ Upper whisker: largest value that is at most Q3 + 1.5 × IQR
# ┐ Q3 (75%)
# │ (inside the box)
# ─ Median (50%)
# │ (inside the box)
# ┘ Q1 (25%)
# ─ Lower whisker: smallest value that is at least Q1 - 1.5 × IQR
# ∙ Points beyond the whiskers: outliers

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(
    title = "Sepal Length by Species (Box Plot)",
    x     = "Species",
    y     = "Sepal length (cm)"
  ) +
  theme_minimal()

# Add color
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.size = 2) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Sepal Length by Species (Box Plot)",
    x     = "Species",
    y     = "Sepal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")   # Drop the legend (the x-axis already names the species)
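The whisker rule listed in the comments above can be reproduced by hand. The sketch below does so for the virginica sepal lengths, using quantile() with its default settings. box_stats() is a hypothetical helper for this example; tools that define quartiles differently (for instance, hinge-based functions such as boxplot.stats()) can give slightly different values.

```r
# Compute the pieces of a box plot from a numeric vector
box_stats <- function(x) {
  q   <- unname(quantile(x, c(0.25, 0.5, 0.75)))  # Q1, median, Q3
  iqr <- q[3] - q[1]
  lower_fence <- q[1] - 1.5 * iqr
  upper_fence <- q[3] + 1.5 * iqr
  list(
    q1 = q[1], median = q[2], q3 = q[3],
    whisker_low  = min(x[x >= lower_fence]),   # smallest value inside the lower fence
    whisker_high = max(x[x <= upper_fence]),   # largest value inside the upper fence
    outliers     = x[x < lower_fence | x > upper_fence]
  )
}

virginica <- iris$Sepal.Length[iris$Species == "virginica"]
stats <- box_stats(virginica)
str(stats)
```

Note that a whisker ends at an actual observation, not at the fence itself; the fence only decides which observations count as outliers.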

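Adding Data Points to a Box Plot

A box plot alone does not show how many observations there are or how they are spread within the box. Overlay the individual data points with geom_jitter().

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +   # Hide outlier points (the jittered points already show them)
  geom_jitter(width = 0.2, height = 0, alpha = 0.4, size = 1.5) +   # height = 0: spread sideways only, keep y exact
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Sepal Length by Species (Box Plot + Data Points)",
    x     = "Species",
    y     = "Sepal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")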
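Jitter is random, so the points land in slightly different places on every run, and when height is left unset geom_jitter() also adds a small amount of vertical noise. One way to get a reproducible plot whose y values are exact is to pass position_jitter() with a seed to geom_point(), as sketched here. The seed value 42 is arbitrary.

```r
library(ggplot2)

p_jitter <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(
    # width: horizontal spread; height = 0: no vertical noise; seed: same layout every run
    position = position_jitter(width = 0.2, height = 0, seed = 42),
    alpha = 0.4, size = 1.5
  )

print(p_jitter)

# Layer 2 is the point layer; its y values should be the measured values, unchanged
jittered <- layer_data(p_jitter, 2)
isTRUE(all.equal(sort(jittered$y), sort(iris$Sepal.Length)))
```

Fixing the seed matters mostly when the figure goes into a report and should not change between renders.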

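Violin Plots: Box Plots with Density Information

A violin plot combines a box plot with a density curve, so it also shows the shape of the distribution.

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin(alpha = 0.6, trim = FALSE) +   # trim = FALSE: do not cut the tails at the observed min/max
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Sepal Length by Species (Violin Plot)",
    x     = "Species",
    y     = "Sepal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

The wider the violin is at a given value, the more observations are concentrated there.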

# facet_wrap: draw one panel per category, arranged side by side or stacked
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  facet_wrap(~ Species, ncol = 1) +
  labs(
    title = "Distribution of Sepal Length by Species",
    x     = "Sepal length (cm)",
    y     = "Count"
  ) +
  theme_minimal()

# Several variables at once: reshape to long format with pivot_longer
iris_long <- iris %>%
  pivot_longer(cols = -Species,
               names_to  = "variable",
               values_to = "value")

# scales = "free" gives each panel its own x and y range
ggplot(iris_long, aes(x = value, fill = Species)) +
  geom_histogram(binwidth = 0.3, alpha = 0.6, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Distribution of Each Iris Variable") +
  theme_minimal()
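Before faceting on the reshaped data, it helps to confirm what pivot_longer() produced. The sketch below repeats the reshaping step on its own (loading only tidyr) and compares the shapes of the wide and long tables.

```r
library(tidyr)

# Every column except Species becomes a (variable, value) pair
iris_long <- pivot_longer(iris,
                          cols      = -Species,
                          names_to  = "variable",
                          values_to = "value")

dim(iris)                    # wide: 150 rows, 5 columns
dim(iris_long)               # long: 600 rows (150 x 4 measurements), 3 columns
table(iris_long$variable)    # each measurement name appears 150 times
head(iris_long, 4)           # the first original row, now spread over four rows
```

The variable column is what facet_wrap(~ variable) splits on, which is why the long format is needed for one panel per measurement.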

Marking the Mean

# Add a point for the mean to the box plot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.6) +
  stat_summary(fun = mean, geom = "point",
               shape = 23, size = 3, fill = "white", color = "black") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Sepal Length by Species (Diamond: Mean)",
    x     = "Species",
    y     = "Sepal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

The diamond-shaped point is the mean. A large gap between the median line and the mean point is a sign that the data are asymmetric, because the mean is pulled toward the longer tail while the median is not.
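How large is "a large gap"? One rough way to judge it is to express the difference between mean and median in units of the standard deviation, as in the sketch below. The column names and the idea of scaling by the standard deviation are choices made for this example, not a rule from the text.

```r
library(dplyr)

gap_by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_len   = mean(Sepal.Length),
    median_len = median(Sepal.Length),
    sd_len     = sd(Sepal.Length),
    # Positive: mean above median (longer right tail); negative: the reverse
    gap_in_sd  = (mean_len - median_len) / sd_len
  )

gap_by_species
```

A gap close to zero is consistent with a symmetric distribution; the sign of a larger gap indicates which tail is longer.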

Connecting Descriptive Statistics and Visualization

Let's apply everything covered so far together.

library(tidyverse)

# 1. Compute descriptive statistics
iris %>%
  group_by(Species) %>%
  summarise(
    n      = n(),
    mean   = round(mean(Sepal.Length), 2),
    median = median(Sepal.Length),
    sd     = round(sd(Sepal.Length), 2),
    IQR    = IQR(Sepal.Length)
  )

# 2. Visualize the distributions
p1 <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Density", x = "Sepal length") +
  theme_minimal()

p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3,
               fill = "white") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Box plot", x = "Species", y = "Sepal length") +
  theme_minimal() +
  theme(legend.position = "none")

# Place the two plots side by side with the patchwork package
library(patchwork)
p1 + p2

Numbers (descriptive statistics) and graphs (visualization) complement each other. A graph catches information that the mean and standard deviation alone would miss; conversely, a graph alone does not give exact values. Make a habit of using the two together.
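A classic illustration of this point, not taken from this chapter, is Anscombe's quartet, which ships with R as the anscombe data frame. The sketch below shows that the four y variables have nearly the same mean and standard deviation, then plots the four x/y pairs so the differences become visible. The reshaping is done by hand to keep the example free of extra packages.

```r
library(ggplot2)

# anscombe has columns x1..x4 and y1..y4: four small data sets side by side
y_cols <- anscombe[, c("y1", "y2", "y3", "y4")]
y_means <- round(sapply(y_cols, mean), 2)   # nearly identical means
y_sds   <- round(sapply(y_cols, sd), 2)     # nearly identical standard deviations
y_means
y_sds

# Stack the four sets into one long data frame so they can be faceted
anscombe_long <- data.frame(
  set = rep(paste("Set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[, c("x1", "x2", "x3", "x4")], use.names = FALSE),
  y   = unlist(y_cols, use.names = FALSE)
)

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set) +
  theme_minimal()
```

The summary numbers cannot tell the four sets apart, while the panels make the curved pattern and the single extreme points obvious.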

The next PART covers how to interpret data in the language of probability.