본문 바로가기
Study/Data Analysis

문과생의 통계학 기초 개념 정리 ❶ (Statistics Fundamentals for Quant Research)

by 하이앤 2025. 6. 7.
728x90
반응형
SMALL

이것이 얼마 만에 보는 수학인가..

데이터 분석을 위해 잊었던 통계지식들을 다시금 머리에 집어넣고 있는데, 한국어로도 간만인걸 영어로도 알아야하다보니

깔끔하게 정리해두는게 좋을 것 같아 해보았습니다.

참고로!

이번 포스팅에서는 순수 기술통계와 시각적인 이해, EDA 기초에 대해서만 다루었습니다:)

반응형

1. Basic Statistical Concepts (기초 통계 개념)

1.1 Key Terms (주요 용어)

Variable (변량)

  • Definition: Numerical values or data points in a dataset
  • Korean: 자료의 수치, 데이터의 값
  • Example: Heights of 100 people randomly selected

Frequency (도수)

  • Definition: Number of observations that fall within each class interval
  • Korean: 각 계급에 속하는 변량의 개수

Relative Frequency (상대도수)

  • Definition: Proportion of observations in each class interval
  • Korean: 각 계급에 속하는 변량의 비율
  • Formula: Relative Frequency = Frequency / Total Number of Observations

Class Interval (계급)

  • Definition: A range of values that divides the data into groups
  • Korean: 변량을 일정한 간격으로 나눈 구간
  • Note: Consider minimum and maximum values when setting intervals

1.2 Types of Data (자료의 종류)

Categorical Data (범주형 자료)

  • Nominal Data (명목형): Categories with no order (성별, 혈액형)
  • Ordinal Data (순서형): Categories with meaningful order (학점, 만족도)

Quantitative Data (양적 자료)

  • Discrete Data (이산형): Countable values (자녀 수, 판매량)
  • Continuous Data (연속형): Measurable values (키, 체중, 시간)
  • Interval Data (구간형): Equal intervals, no true zero (온도)
  • Ratio Data (비율형): Equal intervals with true zero (소득, 나이)

2. Measures of Central Tendency (중심경향성)

Central Tendency: The tendency of data to cluster around a central value
Korean: 데이터 분포의 중심을 보여주는 값

이미지 출처: "Understanding Measures of Central Tendency", Nitesh (@nitesh.py), Medium

2.1 Mode (최빈값)

  • Definition: Most frequently occurring value in a dataset
  • Korean: 가장 빈번하게 나타나는 값
  • Best for: Categorical data (범주형 자료)
  • Advantage: Can be used for any type of data
  • Limitation: May not exist or may have multiple modes

2.2 Median (중앙값)

  • Definition: Middle value when data is arranged in ascending order
  • Korean: 자료를 크기 순으로 나열했을 때 가운데 위치하는 값
  • Calculation:
    • Odd n: Middle value
    • Even n: Average of two middle values
  • Advantage: Robust to outliers (이상치에 크게 영향받지 않음)
  • Best for: Ordinal data and skewed distributions (순서형 자료, 치우친 분포)

2.3 Mean (평균)

  • Definition: Sum of all values divided by number of observations
  • Korean: 자료의 값을 모두 더해서 자료의 수로 나눈 값
  • Formula: x̄ = Σxi / n
  • Best for: Continuous data with normal distribution (연속형 자료, 정규분포)
  • Disadvantage: Sensitive to outliers (이상치에 영향을 크게 받음)

2.4 Other Types of Means

Weighted Mean (가중평균)

  • Definition: Average that assigns different weights to different values
  • Korean: 자료의 중요도에 따라 가중치를 부여한 평균
  • Formula: x̄w = Σ(wi × xi) / Σwi
  • Example: GPA calculation with credit hours

Geometric Mean (기하평균)

  • Definition: Average useful for rates of growth or ratios
  • Korean: 성장률 등 이전 시점에 대한 비율의 평균
  • Formula: GM = ⁿ√(x₁ × x₂ × ... × xₙ)
  • Examples: CAGR (Compound Annual Growth Rate), investment returns

Harmonic Mean (조화평균)

  • Definition: Reciprocal of arithmetic mean of reciprocals
  • Formula: HM = n / Σ(1/xi)
  • Use: Averaging rates (속도, 비율의 평균)

3. Measures of Dispersion (퍼짐정도)

Dispersion: How spread out the data points are from the center
Korean: 자료가 얼마나 흩어져있고 얼마나 모여있는지

3.1 Range (범위)

  • Definition: Difference between maximum and minimum values
  • Korean: 최대값 - 최소값
  • Formula: Range = Max - Min
  • Advantages: Easy to calculate and interpret
  • Disadvantages:
    • No information about distribution within range
    • Heavily influenced by outliers (극단치에 매우 민감)

3.2 Interquartile Range (IQR)

  • Definition: Q3 - Q1 (3rd quartile minus 1st quartile)
  • Korean: 제3사분위수 - 제1사분위수
  • Use: Better measure for skewed distributions (치우친 분포)
  • Advantage: Resistant to outliers
  • Application: Used in box plots for outlier detection

3.3 Variance (분산)

  • Definition: Average of squared deviations from the mean
  • Korean: 편차 제곱의 평균
  • Formula: σ² = Σ(xi - μ)² / n (population)
  • Formula: s² = Σ(xi - x̄)² / (n-1) (sample)
  • Purpose: Measures variability around the mean
  • Unit: Squared units of original data

3.4 Standard Deviation (표준편차)

  • Definition: Square root of variance
  • Korean: 분산의 제곱근
  • Formula: σ = √variance
  • Advantages:
    • Same units as original data
    • Easier to interpret than variance
    • Good for standardizing data scales

4. Distribution Shape (분포의 형태)

4.1 Skewness (왜도)

  • Definition: Measure of asymmetry in distribution
  • Korean: 분포의 좌우 비대칭성 정도

Types of Skewness:

이미지 출처: "Understanding Measures of Central Tendency", Nitesh (@nitesh.py), Medium

  • Positive Skew (Right Skew):
    • Longer right tail (오른쪽 꼬리가 긴 분포)
    • Mode < Median < Mean
    • 데이터 분포: 대부분의 값이 왼쪽(낮은 값)에 몰려있고, 소수의 큰 값들이 오른쪽에 있음
    • Examples: 소득 분포, 집값 분포
  • Zero Skew:
    • Symmetric distribution (대칭 분포)
    • Mean = Median = Mode
  • Negative Skew (Left Skew):
    • Longer left tail (왼쪽 꼬리가 긴 분포)
    • Mean < Median < Mode
    • 데이터 분포: 대부분의 값이 오른쪽(높은 값)에 몰려있고, 소수의 작은 값들이 왼쪽에 있음
    • Examples: 연령별 사망률, 쉬운시험점수, 고품질 제품의 수명

왜도(Skewness) 치우침 판단 기준값

일반적인 기준

  • -0.5 ~ +0.5: 거의 대칭적 (대부분 정규분포로 간주 가능)
  • -1 ~ -0.5 또는 +0.5 ~ +1: 보통 정도의 치우침
  • -1 미만 또는 +1 초과: 심하게 치우침

엄격한 기준 (일부 연구에서 사용)

  • 절댓값 2 초과: 매우 심한 치우침
  • 절댓값 3 초과: 극심한 치우침 (정규성 가정 심각하게 위반)

가장 보편적으로 사용되는 기준은 ±0.5와 ±1입니다.

4.2 Kurtosis (첨도)

  • Definition: Measure of tail thickness and peak sharpness
  • Korean: 분포의 뾰족한 정도, 양쪽 꼬리의 두터움 정도

Types of Kurtosis:

© AnalystPrep - CFA Level 1 Quantitative Methods

  • Leptokurtic (Excess Kurtosis > 0):
    • Sharp peak, thick tails (뾰족하고 두터운 꼬리)
    • More extreme values than normal distribution
  • Platykurtic (Excess Kurtosis < 0):
    • Flat peak, thin tails (평평하고 얇은 꼬리)
    • Fewer extreme values than normal distribution
  • Mesokurtic (Excess Kurtosis = 0):
    • Normal distribution shape (정규분포와 같은 첨도)

Important Note: High kurtosis indicates higher probability of extreme events (이상치 발생 가능성이 높음)


5. Descriptive Statistics Applications (기술통계 응용)

5.1 What is Descriptive Statistics? (기술통계란?)

  • Definition: Methods to summarize and describe data characteristics
  • Korean: 데이터의 간결한 요약 정보
  • Purpose: Understand data patterns without making inferences
  • Components: Numerical summaries + Visualizations

5.2 Descriptive vs Inferential Statistics

Aspect Descriptive Statistics Inferential Statistics

Aspect  Descriptive Statistics Inferential Statistics
Purpose Describe sample data Make inferences about population
Korean 기술통계 추론통계
Scope What we observe What we can conclude
Methods Summaries, charts Hypothesis tests, confidence intervals
Examples Mean height = 170cm Population mean likely between 165-175cm

5.3 Five-Number Summary (다섯 수치 요약)

  1. Minimum (최솟값)
  2. Q1 (1st Quartile, 제1사분위수): 25th percentile
  3. Median (중앙값): 50th percentile
  4. Q3 (3rd Quartile, 제3사분위수): 75th percentile
  5. Maximum (최댓값)

Use: Foundation for box plots and outlier detection

5.4 Outlier Detection (이상치 탐지)

  • IQR Method:
    • Lower fence: Q1 - 1.5 × IQR
    • Upper fence: Q3 + 1.5 × IQR
    • Values beyond fences are potential outliers
  • Z-score Method: |z| > 2 or 3 (depending on strictness)

6. Data Visualization & EDA (데이터 시각화와 탐색적 분석)

6.1 Frequency Distribution Visualization

Frequency Distribution Table (도수분포표)

  • Purpose: Organize data by class intervals and their frequencies
  • Components: Class intervals, frequency, relative frequency, cumulative frequency
  • Advantages: Easy to understand distribution patterns
  • Disadvantages: Loss of individual data points

Histogram (히스토그램)

  • Definition: Visual representation of frequency distribution
  • Korean: 도수분포표를 시각화하는 가장 기본적인 방법
  • Features: Bars touch each other (연속된 막대)
  • Use: Shows distribution shape, central tendency, spread

6.2 Other Important Visualizations

Box Plot (상자그림)

  • Components: Box (IQR), whiskers, median line, outliers
  • Advantages: Shows five-number summary and outliers clearly
  • Use: Comparing distributions across groups

Dot Plot (점도표)

  • Use: Small datasets, shows exact values
  • Advantage: No loss of information

Stem-and-Leaf Plot (줄기잎그림)

  • Advantage: Retains original data values
  • Use: Quick visualization for small to medium datasets

6.3 Exploratory Data Analysis (EDA)

  • Definition: Process of analyzing data to summarize main characteristics
  • Korean: 탐색적 자료 분석
  • Goals:
    1. Understand data structure
    2. Detect outliers and errors
    3. Find patterns and relationships
    4. Test assumptions
    5. Suggest hypotheses

EDA Steps:

  1. Data Quality Check: Missing values, duplicates, errors
  2. Univariate Analysis: Individual variable distributions
  3. Bivariate Analysis: Relationships between pairs of variables
  4. Multivariate Analysis: Relationships among multiple variables

Key Formulas Summary (주요 공식 정리)

Statistic Population Formula Sample Formula Purpose
Mean μ = Σxi / N x̄ = Σxi / n Central tendency
Variance σ² = Σ(xi - μ)² / N s² = Σ(xi - x̄)² / (n-1) Dispersion
Std Dev σ = √σ² s = √s² Dispersion (same units)
Range Max - Min Max - Min Simple spread measure
IQR Q3 - Q1 Q3 - Q1 Robust spread measure

Practical Guidelines for Data Analysis

When to Use Which Measure?

Central Tendency:

  • Normal Distribution: Use mean
  • Skewed Distribution: Use median
  • Categorical Data: Use mode
  • Multiple Modes: Report all modes

Dispersion:

  • Normal Distribution: Use standard deviation
  • Skewed Distribution: Use IQR
  • Quick Overview: Use range (but report its limitations)

Distribution Shape:

  • Always check skewness and kurtosis before choosing analysis methods
  • Visualize first: Use histograms and box plots
  • Consider transformation for highly skewed data

Data Quality Checklist:

  1. ✅ Check for missing values
  2. ✅ Identify and handle outliers
  3. ✅ Verify data types are correct
  4. ✅ Look for impossible values
  5. ✅ Check for duplicate records
  6. ✅ Validate ranges and constraints

 

728x90
반응형
LIST