SECOM Semiconductor Manufacturing Data - Exploratory Data Analysis

Dataset Background

The SECOM dataset originates from a semiconductor fabrication facility where products undergo hundreds of sensor measurements during production. Each sample represents one production entity (e.g., a wafer or chip), and the binary label indicates whether the product passed or failed quality control.

Attribute   Value
Samples     1,567 production entities
Features    590 sensor measurements
Labels      Pass (-1) / Fail (1)

Key Questions This Analysis Addresses

  1. How much missing data exists, and which features are most affected?
  2. What is the class distribution (pass vs. fail)?
  3. Which features show the strongest relationship with the outcome?
  4. Are there redundant (highly correlated) features we can remove?
  5. What preprocessing steps are needed before modeling?

1. Environment Setup and Data Loading

We begin by importing the necessary Python libraries and loading the dataset files. The SECOM data consists of two files: - secom.data: Contains 590 sensor measurements per sample (space-separated, NaN for missing) - secom_labels.data: Contains the pass/fail label and timestamp for each sample

# Load libraries
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Suppress warnings
warnings.filterwarnings("ignore")

# Set visualization style for consistent, publication-quality plots
plt.style.use("seaborn-v0_8-whitegrid")


print("Libraries loaded successfully")
Libraries loaded successfully

DATA LOADING

  1. secom.data - Raw sensor measurements (590 features)
  2. secom_labels.data - Pass/Fail labels with timestamps
  3. Labels use -1 for Pass and 1 for Fail
  4. 1 label column (pass/fail)
  5. 1 timestamp column
  6. Total: 592 columns
# DEFINE THE PATH TO THE DATA DIRECTORY
data_dir = Path("../data/secom")

# Load the sensor measurements (features)
df = pd.read_csv(
    data_dir / "secom.data",
    sep=" ",
    header=None,
    na_values="NaN",
)

# Assign meaningful column names (sensor_0, sensor_1, ..., sensor_589)
df.columns = [f"sensor_{i}" for i in range(df.shape[1])]

# Load the labels file containing pass/fail status and timestamps
labels = pd.read_csv(
    data_dir / "secom_labels.data",
    sep=" ",
    header=None,
    names=["label", "timestamp"],
)

# Merge labels into the main dataframe
df["label"] = labels["label"]

# Convert timestamp strings to proper datetime objects
# The strip('"') removes surrounding quotes from the timestamp strings
df["timestamp"] = pd.to_datetime(labels["timestamp"].str.strip('"'))

# Display dataset dimensions
print("Dataset Successfully Loaded")
print("=" * 40)
print(f"Total columns:             {df.shape[1]:,}")
print(f"  - Sensor features:       {df.shape[1] - 2}")
print("  - Additional columns:    label, timestamp")
print(f"Total rows (samples):      {df.shape[0]:,}")
Dataset Successfully Loaded
========================================
Total columns:             592
  - Sensor features:       590
  - Additional columns:    label, timestamp
Total rows (samples):      1,567

2. Data Overview

Before diving into detailed analysis, let’s get a high-level view of the data structure, including data types, memory usage, and the time period over which the data was collected.

df.head()
sensor_0 sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 sensor_6 sensor_7 sensor_8 sensor_9 ... sensor_582 sensor_583 sensor_584 sensor_585 sensor_586 sensor_587 sensor_588 sensor_589 label timestamp
0 3030.93 2564.00 2187.7333 1411.1265 1.3602 100.0 97.6133 0.1242 1.5005 0.0162 ... 0.5005 0.0118 0.0035 2.3630 NaN NaN NaN NaN -1 2008-07-19 11:55:00
1 3095.78 2465.14 2230.4222 1463.6606 0.8294 100.0 102.3433 0.1247 1.4966 -0.0005 ... 0.5019 0.0223 0.0055 4.4447 0.0096 0.0201 0.0060 208.2045 -1 2008-07-19 12:32:00
2 2932.61 2559.94 2186.4111 1698.0172 1.5102 100.0 95.4878 0.1241 1.4436 0.0041 ... 0.4958 0.0157 0.0039 3.1745 0.0584 0.0484 0.0148 82.8602 1 2008-07-19 13:17:00
3 2988.72 2479.90 2199.0333 909.7926 1.3204 100.0 104.2367 0.1217 1.4882 -0.0124 ... 0.4990 0.0103 0.0025 2.0544 0.0202 0.0149 0.0044 73.8432 -1 2008-07-19 14:43:00
4 3032.24 2502.87 2233.3667 1326.5200 1.5334 100.0 100.3967 0.1235 1.5031 -0.0031 ... 0.4800 0.4766 0.1045 99.3032 0.0202 0.0149 0.0044 73.8432 -1 2008-07-19 15:22:00

5 rows × 592 columns

print("Data Types Summary")
print(df.dtypes.value_counts())
Data Types Summary
float64           590
int64               1
datetime64[ns]      1
Name: count, dtype: int64

DATA COLLECTION TIME PERIOD

Understanding the time range helps us: - Identify if data represents seasonal/temporal patterns - Check for time-based drift in the manufacturing process - Adjust train/test splits if needed

print("Data Collection Period")
print(f"Start date: {df['timestamp'].min()}")
print(f"End date:   {df['timestamp'].max()}")
print(f"Duration:   {(df['timestamp'].max() - df['timestamp'].min()).days} days")
Data Collection Period
Start date: 2008-07-19 11:55:00
End date:   2008-10-17 06:07:00
Duration:   89 days

3. Missing Value Analysis

Missing values are common in manufacturing sensor data due to: - Sensor malfunctions or calibration issues - Data transmission errors - Certain sensors not being applicable for all product types

# Calculate both absolute counts and percentages to understand:
# Which features have the most missing data
# How severe the missing data problem is overall

feature_cols = [col for col in df.columns if col.startswith("sensor_")]

# Count missing values for each feature
missing_counts = df[feature_cols].isnull().sum()

# Calculate missing percentage
missing_pct = (missing_counts / len(df)) * 100

# Summary dataframe sorted by missing percentage
missing_df = pd.DataFrame({
    "missing_count": missing_counts,
    "missing_pct": missing_pct,
}).sort_values("missing_pct", ascending=False)

# Print summary statistics
print("Missing Value Summary")
print(f"Total sensor features:           {len(feature_cols)}")
print(f"Features with ANY missing:       {(missing_counts > 0).sum()}")
print(f"Features with >50% missing:      {(missing_pct > 50).sum()}")
print(f"Features 100% missing (useless): {(missing_pct == 100).sum()}")
Missing Value Summary
Total sensor features:           590
Features with ANY missing:       538
Features with >50% missing:      28
Features 100% missing (useless): 0

VISUALIZE MISSING VALUE DISTRIBUTION

Two plots help us understand the missing data pattern: 1. Histogram: Shows the overall distribution of missingness 2. Bar chart: Highlights the worst offending features

# Create two plots to show missing values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# LEFT PLOT: Histogram of missing percentages across all features
# This shows how many features fall into each "missingness" bucket
axes[0].hist(missing_pct, bins=50, edgecolor="black", alpha=0.7, color="#3498db")
axes[0].set_xlabel("Percentage of Missing Values", fontsize=11)
axes[0].set_ylabel("Number of Features", fontsize=11)
axes[0].set_title("Distribution of Missing Values Across Features", fontsize=12)
axes[0].axvline(x=50, color="red", linestyle="--", linewidth=2, label="50% threshold")
axes[0].legend()

# RIGHT PLOT: Top 20 features with highest missing percentages
top_missing = missing_df.head(20)
axes[1].barh(range(len(top_missing)), top_missing["missing_pct"], color="#e74c3c")
axes[1].set_yticks(range(len(top_missing)))
axes[1].set_yticklabels(top_missing.index, fontsize=9)
axes[1].set_xlabel("Percentage of Missing Values", fontsize=11)
axes[1].set_title("Top 20 Features with Most Missing Values", fontsize=12)
axes[1].invert_yaxis()  # Highest at top

plt.tight_layout()
plt.show()

  • Insight: Features with >50% missing data likely provide little value and should be candidates for removal during preprocessing.

Why remove features with >50% missing data? It reduces noise and can improve model performance: when more than half of a feature's values are missing, imputation has too little information to be reliable, so the feature contributes little to the model.

In semiconductor manufacturing, a sensor with >50% missing data likely indicates: - Sensor malfunction - Sensor only applies to certain product types - Data transmission failures

ANALYZE MISSING VALUES PER SAMPLE

It’s also important to check if certain SAMPLES have excessive missing data, which might indicate data quality issues for those specific production runs.

Runs = Samples

Columns = Features

# Count missing features for each sample
sample_missing = df[feature_cols].isnull().sum(axis=1)
sample_missing_pct = (sample_missing / len(feature_cols)) * 100

print("Missing Values Per Sample")
print(f"Average features missing per sample: {sample_missing.mean():.1f} ({sample_missing_pct.mean():.1f}%)")
print(f"Minimum missing per sample:          {sample_missing.min()}")
print(f"Maximum missing per sample:          {sample_missing.max()}")
Missing Values Per Sample
Average features missing per sample: 26.8 (4.5%)
Minimum missing per sample:          4
Maximum missing per sample:          152

Insight: Each sample has some missing values, which is typical for sensor data. We’ll need to impute these before modeling.


4. Descriptive Statistics

Summary statistics for each feature help identify: - Constant features: Zero variance (useless for prediction) - Scale differences: Features with vastly different ranges - Outliers: Extreme values that may need special handling

# Get pandas describe output and transpose for easier viewing
stats = df[feature_cols].describe().T

# Add missing percentage for reference
stats["missing_pct"] = missing_pct

# Add range (max - min)
stats["range"] = stats["max"] - stats["min"]

# Add coefficient of variation (measures relative variability)
# CV = std / |mean| - useful for comparing variability across different scales
stats["cv"] = stats["std"] / stats["mean"].abs()

print("Sample of Descriptive Statistics (first 25 features):")
stats.head(25)
Sample of Descriptive Statistics (first 25 features):
count mean std min 25% 50% 75% max missing_pct range cv
sensor_0 1561.0 3014.452896 73.621787 2743.2400 2966.260000 3011.49000 3056.650000 3356.3500 0.382897 613.1100 0.024423
sensor_1 1560.0 2495.850231 80.407705 2158.7500 2452.247500 2499.40500 2538.822500 2846.4400 0.446713 687.6900 0.032217
sensor_2 1553.0 2200.547318 29.513152 2060.6600 2181.044400 2201.06670 2218.055500 2315.2667 0.893427 254.6067 0.013412
sensor_3 1553.0 1396.376627 441.691640 0.0000 1081.875800 1285.21440 1591.223500 3715.0417 0.893427 3715.0417 0.316313
sensor_4 1553.0 4.197013 56.355540 0.6815 1.017700 1.31680 1.525700 1114.5366 0.893427 1113.8551 13.427535
sensor_5 1553.0 100.000000 0.000000 100.0000 100.000000 100.00000 100.000000 100.0000 0.893427 0.0000 0.000000
sensor_6 1553.0 101.112908 6.237214 82.1311 97.920000 101.51220 104.586700 129.2522 0.893427 47.1211 0.061686
sensor_7 1558.0 0.121822 0.008961 0.0000 0.121100 0.12240 0.123800 0.1286 0.574346 0.1286 0.073561
sensor_8 1565.0 1.462862 0.073897 1.1910 1.411200 1.46160 1.516900 1.6564 0.127632 0.4654 0.050515
sensor_9 1565.0 -0.000841 0.015116 -0.0534 -0.010800 -0.00130 0.008400 0.0749 0.127632 0.1283 17.973588
sensor_10 1565.0 0.000146 0.009302 -0.0349 -0.005600 0.00040 0.005900 0.0530 0.127632 0.0879 63.821983
sensor_11 1565.0 0.964353 0.012452 0.6554 0.958100 0.96580 0.971300 0.9848 0.127632 0.3294 0.012912
sensor_12 1565.0 199.956809 3.257276 182.0940 198.130700 199.53560 202.007100 272.0451 0.127632 89.9511 0.016290
sensor_13 1564.0 0.000000 0.000000 0.0000 0.000000 0.00000 0.000000 0.0000 0.191449 0.0000 NaN
sensor_14 1564.0 9.005371 2.796596 2.2493 7.094875 8.96700 10.861875 19.5465 0.191449 17.2972 0.310548
sensor_15 1564.0 413.086035 17.221095 333.4486 406.127400 412.21910 419.089275 824.9271 0.191449 491.4785 0.041689
sensor_16 1564.0 9.907603 2.403867 4.4696 9.567625 9.85175 10.128175 102.8677 0.191449 98.3981 0.242628
sensor_17 1564.0 0.971444 0.012062 0.5794 0.968200 0.97260 0.976800 0.9848 0.191449 0.4054 0.012417
sensor_18 1564.0 190.047354 2.781041 169.1774 188.299825 189.66420 192.189375 215.5977 0.191449 46.4203 0.014633
sensor_19 1557.0 12.481034 0.217965 9.8773 12.460000 12.49960 12.547100 12.9898 0.638162 3.1125 0.017464
sensor_20 1567.0 1.405054 0.016737 1.1797 1.396500 1.40600 1.415000 1.4534 0.000000 0.2737 0.011912
sensor_21 1565.0 -5618.393610 626.822178 -7150.2500 -5933.250000 -5523.25000 -5356.250000 0.0000 0.127632 7150.2500 0.111566
sensor_22 1565.0 2699.378435 295.498535 0.0000 2578.000000 2664.00000 2841.750000 3656.2500 0.127632 3656.2500 0.109469
sensor_23 1565.0 -3806.299734 1380.162148 -9986.7500 -4371.750000 -3820.75000 -3352.750000 2363.0000 0.127632 12349.7500 0.362599
sensor_24 1565.0 -298.598136 2902.690117 -14804.5000 -1476.000000 -78.75000 1377.250000 14106.0000 0.127632 28910.5000 9.721059

IDENTIFY PROBLEMATIC FEATURES

Constant features (std=0) carry no information and should be removed. Very low variance features are also suspicious.

# Find features with exactly zero variance (all values the same)
constant_features = stats[stats["std"] == 0].index.tolist()

# Find features with very low variance (std < 0.01)
low_var_features = stats[stats["std"] < 0.01].index.tolist()

print("Problematic Features")
print(f"Constant features (zero variance):  {len(constant_features)}")
print(f"Near-zero variance (std < 0.01):    {len(low_var_features)}")
Problematic Features
Constant features (zero variance):  116
Near-zero variance (std < 0.01):    171

Constant features should be removed because they provide no information: if the standard deviation is 0, every sample has exactly the same value, so the feature cannot distinguish passes from failures.

Near-zero variance is a warning sign for the same reason:

High variance → Feature changes across samples → Could explain differences in outcome

Zero variance → Feature never changes → Cannot explain anything


5. Class Imbalance Analysis

In manufacturing quality control, defect rates are typically low (most products pass). This creates a class imbalance problem where the minority class (failures) is underrepresented. If not addressed, machine learning models may: - Predict “pass” for everything and achieve high accuracy - Fail to detect actual defects (which is the whole point!)

-1 = Pass (product passed quality control)
 1 = Fail (product failed quality control - DEFECT)

# Count samples in each class
label_counts = df["label"].value_counts()

# Calculate percentages
label_pct = df["label"].value_counts(normalize=True) * 100

print("Class Distribution")
print(f"Pass (-1): {label_counts[-1]:,} samples ({label_pct[-1]:.1f}%)")
print(f"Fail (1):  {label_counts[1]:,} samples ({label_pct[1]:.1f}%)")
print(f"Imbalance ratio: {label_counts[-1] / label_counts[1]:.1f}:1")
Class Distribution
Pass (-1): 1,463 samples (93.4%)
Fail (1):  104 samples (6.6%)
Imbalance ratio: 14.1:1

Class Imbalance Visualization

Visual representation helps understand the severity of the imbalance problem.

# Two Plots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Define colors: green for pass, red for fail
colors = ["#2ecc71", "#e74c3c"]
labels_text = ["Pass (-1)", "Fail (1)"]

# LEFT: Bar chart showing absolute counts
axes[0].bar(labels_text, [label_counts[-1], label_counts[1]], color=colors, edgecolor="black")
axes[0].set_ylabel("Number of Samples", fontsize=11)
axes[0].set_title("Class Distribution (Counts)", fontsize=12)
# Add count labels on top of bars
for i, v in enumerate([label_counts[-1], label_counts[1]]):
    axes[0].text(i, v + 20, str(v), ha="center", fontweight="bold")

# RIGHT: Pie chart showing proportions
axes[1].pie(
    [label_counts[-1], label_counts[1]],
    labels=labels_text,
    autopct="%1.1f%%",
    colors=colors,
    explode=(0, 0.1),  # Emphasize the minority class
    startangle=90,
    textprops={"fontsize": 11},
)
axes[1].set_title("Class Proportion", fontsize=12)

plt.tight_layout()
plt.show()


6. Feature Distribution Analysis

Comparing feature distributions between pass and fail samples helps identify: - Discriminative features: Features that differ between classes - Distribution shapes: Normal, skewed, bimodal, etc. - Overlap: How separable the classes are in feature space

We focus on features with low missing values (<5%) to get reliable distribution estimates.

# Get features with less than 5% missing data
low_missing_features = missing_df[missing_df["missing_pct"] < 5].index.tolist()[:20]

print(f"Analyzing {len(low_missing_features)} features with <5% missing values")
Analyzing 20 features with <5% missing values
# plot features with low missing values
fig, axes = plt.subplots(4, 5, figsize=(16, 12))
axes = axes.flatten()

for i, feature in enumerate(low_missing_features):
    ax = axes[i]

    # Plot overlapping histograms for each class
    for label_val, color, label_name in [(-1, "#2ecc71", "Pass"), (1, "#e74c3c", "Fail")]:
        data = df[df["label"] == label_val][feature].dropna()
        ax.hist(data, bins=30, alpha=0.5, color=color, label=label_name, density=True)

    # Use shortened names for readability
    ax.set_title(feature.replace("sensor_", "S"), fontsize=9)
    ax.tick_params(labelsize=7)

# Add legend to first subplot only
axes[0].legend(fontsize=8)

plt.suptitle("Feature Distributions by Class (Pass vs Fail)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print(
    "Look for features where the red and green distributions differ significantly - these are potentially predictive features."
)

Look for features where the red and green distributions differ significantly - these are potentially predictive features.

7. Correlation Analysis

Correlation analysis serves two purposes: 1. Feature-to-feature correlation: Identify redundant features that measure the same thing 2. Feature-to-label correlation: Find features most associated with the outcome

We use features with <10% missing to ensure reliable correlations. Limiting to 50 features keeps the heatmap readable.

# Select features for correlation analysis
analysis_features = missing_df[missing_df["missing_pct"] < 10].index.tolist()[:50]

# Compute Pearson correlation matrix
corr_matrix = df[analysis_features].corr()

print(f"Computing correlations for {len(analysis_features)} features")
print(f"Correlation matrix shape: {corr_matrix.shape}")
Computing correlations for 50 features
Correlation matrix shape: (50, 50)

CORRELATION HEATMAP

plt.figure(figsize=(14, 12))

# Create mask for upper triangle (avoid redundant info)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Create heatmap
sns.heatmap(
    corr_matrix,
    mask=mask,
    cmap="RdBu_r",  # Red-Blue diverging colormap
    center=0,  # Center colormap at 0
    vmin=-1,
    vmax=1,
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.8, "label": "Correlation"},
)

plt.title("Feature Correlation Heatmap (Top 50 Features)", fontsize=14)
plt.tight_layout()
plt.show()

Insight: Blocks of red/blue indicate groups of correlated features. These could be reduced using PCA or by removing redundant features.

# High Correlation pairs
# Find all pairs with correlation > 0.9
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        corr_val = corr_matrix.iloc[i, j]
        if abs(corr_val) > 0.9:
            high_corr_pairs.append({
                "feature_1": corr_matrix.columns[i],
                "feature_2": corr_matrix.columns[j],
                "correlation": corr_val,
            })

# Create dataframe of highly correlated pairs
high_corr_df = pd.DataFrame(high_corr_pairs).sort_values("correlation", ascending=False)

print(f"Found {len(high_corr_df)} highly correlated pairs (|r| > 0.9)")
print("\nTop 10 most correlated pairs:")
high_corr_df.head(10)
Found 15 highly correlated pairs (|r| > 0.9)

Top 10 most correlated pairs:
feature_1 feature_2 correlation
11 sensor_525 sensor_253 0.999362
2 sensor_362 sensor_224 0.995710
13 sensor_351 sensor_213 0.995094
9 sensor_350 sensor_212 0.993534
0 sensor_497 sensor_225 0.993071
4 sensor_211 sensor_349 0.988676
5 sensor_355 sensor_217 0.987291
14 sensor_391 sensor_253 0.987185
10 sensor_525 sensor_391 0.986747
8 sensor_352 sensor_214 0.979281

CORRELATION WITH TARGET LABEL

The most important correlations are with the target (pass/fail). Features with high absolute correlation are likely predictive.

# Calculate correlation of each feature with the label
label_corr = df[feature_cols].corrwith(df["label"]).dropna()

# Sort by absolute correlation value
label_corr = label_corr.sort_values(key=abs, ascending=False)

print("Top 10 Features Most Correlated with Pass/Fail Label:")
for feat, corr in label_corr.head(10).items():
    direction = "↑ (higher = more fails)" if corr > 0 else "↓ (higher = more passes)"
    print(f"   {feat}: r = {corr:+.4f} {direction}")
Top 10 Features Most Correlated with Pass/Fail Label:
   sensor_59: r = +0.1558 ↑ (higher = more fails)
   sensor_103: r = +0.1512 ↑ (higher = more fails)
   sensor_510: r = +0.1316 ↑ (higher = more fails)
   sensor_348: r = +0.1302 ↑ (higher = more fails)
   sensor_158: r = +0.1213 ↑ (higher = more fails)
   sensor_431: r = +0.1209 ↑ (higher = more fails)
   sensor_293: r = +0.1145 ↑ (higher = more fails)
   sensor_111: r = -0.1139 ↓ (higher = more passes)
   sensor_434: r = +0.1121 ↑ (higher = more fails)
   sensor_430: r = +0.1101 ↑ (higher = more fails)

VISUALIZE TOP CORRELATED FEATURES WITH LABEL

top_corr_features = label_corr.head(10).index.tolist()

fig, ax = plt.subplots(figsize=(10, 6))

# Color by direction: red for positive, blue for negative
colors = ["#e74c3c" if x > 0 else "#3498db" for x in label_corr.head(10).values]

ax.barh(range(10), label_corr.head(10).values, color=colors)
ax.set_yticks(range(10))
ax.set_yticklabels(top_corr_features)
ax.set_xlabel("Correlation with Label (Fail=1)", fontsize=11)
ax.set_title("Top 10 Features Correlated with Pass/Fail Outcome", fontsize=12)
ax.axvline(x=0, color="black", linewidth=0.5)
ax.invert_yaxis()

plt.tight_layout()
plt.show()

print("Insight: Red bars = higher values predict FAILURE. Blue bars = higher values predict PASS.")

Insight: Red bars = higher values predict FAILURE. Blue bars = higher values predict PASS.

8. Outlier Detection

Outliers in sensor data may represent: - Genuine anomalies: Unusual but valid measurements - Sensor errors: Malfunctions or calibration issues
- Data entry errors: Mistakes during data collection

We use the Interquartile Range (IQR) method to identify outliers.

IQR Method: Q1 = 25th percentile, Q3 = 75th percentile. IQR = Q3 - Q1 (the middle 50% of the data). Outliers: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

# Function to count outliers using IQR method
def count_outliers_iqr(series):
    """Count the number of outliers in a pandas Series using IQR method.

    Parameters
    ----------
        series: pandas Series of numeric values

    Returns
    -------
        int: Number of outlier values

    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return ((series < lower_bound) | (series > upper_bound)).sum()


# Count outliers for each feature with low missing values
outlier_counts = {}
for col in low_missing_features:
    outlier_counts[col] = count_outliers_iqr(df[col].dropna())

# Create summary dataframe
outlier_df = pd.DataFrame(
    {
        "outlier_count": outlier_counts.values(),
        "outlier_pct": [c / len(df) * 100 for c in outlier_counts.values()],
    },
    index=outlier_counts.keys(),
).sort_values("outlier_pct", ascending=False)

print("Outlier Analysis Summary")
print(f"Average outlier percentage: {outlier_df['outlier_pct'].mean():.1f}%")
print(f"Max outlier percentage:     {outlier_df['outlier_pct'].max():.1f}%")
print("\nFeatures with most outliers:")
outlier_df.head(10)
Outlier Analysis Summary
Average outlier percentage: 4.4%
Max outlier percentage:     10.5%

Features with most outliers:
outlier_count outlier_pct
sensor_362 165 10.529675
sensor_224 153 9.763880
sensor_496 149 9.508615
sensor_483 122 7.785578
sensor_485 104 6.636886
sensor_348 100 6.381621
sensor_211 77 4.913848
sensor_349 72 4.594767
sensor_355 72 4.594767
sensor_350 67 4.275686

BOX PLOTS FOR FEATURES WITH MOST OUTLIERS

Box plots show the distribution and outliers by class, helping us see if outliers are associated with failures.

top_outlier_features = outlier_df.head(8).index.tolist()

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for i, feature in enumerate(top_outlier_features):
    ax = axes[i]
    df.boxplot(column=feature, by="label", ax=ax)
    ax.set_title(feature, fontsize=10)
    ax.set_xlabel("Label (-1=Pass, 1=Fail)")
    ax.set_ylabel("Value")

plt.suptitle("Box Plots: Features with Most Outliers (by Class)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("Insight: Outliers that appear only in one class may be important predictive signals rather than errors.")

Insight: Outliers that appear only in one class may be important predictive signals rather than errors.

9. Key Findings Summary

SECOM EDA - KEY FINDINGS REPORT

DATASET OVERVIEW
  • Total samples: 1,567
  • Sensor features: 590
  • Data collection period: 89 days

MISSING VALUES
  • Features with missing data: 538 (91.2%)
  • Features with >50% missing: 28
  • Avg missing per sample: 27 features
  → ACTION: Impute or remove features with >50% missing

CLASS IMBALANCE
  • Pass samples: 1,463 (93.4%)
  • Fail samples: 104 (6.6%)
  • Imbalance ratio: 14.1:1
  → ACTION: Use SMOTE, class weights, or stratified sampling

FEATURE REDUNDANCY
  • Highly correlated pairs (|r| > 0.9): 15
  • Constant (zero variance) features: 116
  • Near-zero variance features: 171
  → ACTION: Remove constant features; consider PCA

TOP PREDICTIVE FEATURES
  • Best correlated with outcome: sensor_59 (r = 0.1558)
  → These features should be prioritized in modeling


10. Preprocessing Recommendations

Based on this exploratory analysis, the following preprocessing pipeline is recommended before building predictive models:

Step 1: Handle Missing Values

Action             Criteria       Rationale
Remove features    >50% missing   Too much missing data to impute reliably
Impute remaining   <50% missing   Use median imputation (robust to outliers)
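A minimal sketch of this step in pandas. The function name and the tiny demonstration frame are illustrative, not part of the notebook; the drop threshold would be applied to the real `df[feature_cols]`.

```python
import numpy as np
import pandas as pd


def handle_missing(df, feature_cols, drop_threshold=0.5):
    """Drop features missing more than drop_threshold of their values,
    then median-impute whatever remains."""
    missing_frac = df[feature_cols].isnull().mean()
    keep = missing_frac[missing_frac <= drop_threshold].index.tolist()
    return df[keep].fillna(df[keep].median())


# Tiny demo: sensor_b is 75% missing and gets dropped;
# the remaining NaN in sensor_a is filled with the column median (3.0)
demo = pd.DataFrame({
    "sensor_a": [1.0, np.nan, 3.0, 5.0],
    "sensor_b": [np.nan, np.nan, np.nan, 2.0],
})
clean = handle_missing(demo, ["sensor_a", "sensor_b"])
print(clean.columns.tolist())       # only sensor_a survives
print(clean["sensor_a"].tolist())   # NaN replaced by the median
```

Median imputation is preferred over mean imputation here because, as the outlier analysis showed, many sensors have heavy-tailed distributions.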

Step 2: Feature Selection

  • Remove constant features (zero variance) - provide no information
  • Address multicollinearity - remove one of each highly correlated pair (|r| > 0.95)
  • Consider PCA for dimensionality reduction
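The first two bullets can be sketched with pandas alone; `select_features` and the synthetic demo frame are illustrative names, and the greedy pair-dropping below is one common heuristic, not the only option.

```python
import numpy as np
import pandas as pd


def select_features(df, corr_threshold=0.95):
    """Drop zero-variance columns, then one member of each
    highly correlated pair (|r| above corr_threshold)."""
    # Remove constant features (std == 0)
    variable = df.loc[:, df.std() > 0]
    # Scan the upper triangle of the absolute correlation matrix
    corr = variable.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return variable.drop(columns=to_drop)


rng = np.random.default_rng(0)
x = rng.normal(size=100)
demo = pd.DataFrame({
    "s0": x,
    "s1": x * 2 + 0.001 * rng.normal(size=100),  # near-duplicate of s0
    "s2": rng.normal(size=100),                  # independent signal
    "s3": np.ones(100),                          # constant
})
print(select_features(demo).columns.tolist())    # s1 and s3 removed
```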

Step 3: Handle Class Imbalance

Choose one or more approaches: - SMOTE (Synthetic Minority Over-sampling Technique) - Class weights in model training (e.g., class_weight='balanced') - Stratified sampling for train/test splits
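For the class-weight option, the "balanced" heuristic can be computed by hand; this sketch mirrors scikit-learn's `class_weight='balanced'` formula (n_samples / (n_classes * count_per_class)) without requiring the library, and `balanced_class_weights` is an illustrative name.

```python
import numpy as np
import pandas as pd


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = pd.Series(labels).value_counts()
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# With the SECOM distribution (1,463 pass / 104 fail), the fail class
# ends up weighted about 14x more heavily than the pass class.
labels = np.array([-1] * 1463 + [1] * 104)
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

Passed to a model's loss function, these weights penalize a missed failure far more than a missed pass, which counters the "predict pass for everything" failure mode described in Section 5.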

Step 4: Outlier Treatment

  • Use RobustScaler instead of StandardScaler (less sensitive to outliers)
  • Investigate outliers that appear only in failure samples (potential predictive signals)
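To see why robust scaling helps here, this sketch reproduces the core of `RobustScaler` by hand (center on the median, divide by the IQR); `robust_scale` is an illustrative name.

```python
import pandas as pd


def robust_scale(series):
    """Center on the median and scale by the IQR, so extreme
    values cannot distort the location and spread estimates."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return (series - series.median()) / (q3 - q1)


# A single extreme reading leaves the scaled "normal" points untouched;
# with (x - mean) / std the same outlier would compress them toward zero.
s = pd.Series([10.0, 11.0, 12.0, 13.0, 1000.0])
print(robust_scale(s).tolist())  # [-1.0, -0.5, 0.0, 0.5, 494.0]
```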

Step 5: Feature Scaling

  • Standardize all features before modeling (sensors have different scales)
  • This is essential for distance-based algorithms (KNN, SVM) and neural networks
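Standardization itself is a one-liner; this sketch (illustrative `standardize` name and demo values) shows the z-score transform that puts sensors with ranges like 2,700-3,400 and 0.1-0.2 on the same footing. In a real pipeline, the mean and std should be computed on the training split only and reused on the test split to avoid leakage.

```python
import pandas as pd


def standardize(df):
    """Z-score each column: subtract the mean, divide by the std."""
    return (df - df.mean()) / df.std()


demo = pd.DataFrame({"s0": [3000.0, 3100.0, 2900.0], "s1": [0.1, 0.2, 0.3]})
z = standardize(demo)
print(z.round(3))
# Every column now has mean 0 and unit (sample) standard deviation,
# so distance-based models no longer favor large-scale sensors.
```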