Series-level Exploration Examples

Simple examples of series exploration methods.

Libraries Used

Some of the methods below rely on the following libraries:

import plotly.io as pio
pio.renderers.default = "notebook"

Basic Info Method

info() - Shows summary statistics for a series.

For a numeric column:

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
titanic['age'].explore.info()
Summary Statistics for "age" (Type: Float)

Summary                   Percentiles    Detailed Stats             Value Counts
Total          714 (80%)  Max    80      Mean                29.70  24  30 (3%)
Missing        177 (20%)  99%    65.87   Trimmed Mean (10%)  29.27  22  27 (3%)
Distinct       88 (10%)   95%    56      Mode                24     18  26 (3%)
Non-Duplicate  16 (2%)    75%    38      Range               79.58  28  25 (3%)
Duplicates     802 (90%)  50%    28      IQR                 17.88  30  25 (3%)
Dup. Values    72 (8%)    25%    20.12   Std                 14.53  19  25 (3%)
Zeros          ---        5%     4       MAD                 13.34  21  24 (3%)
Negative       ---        1%     1       Kurt                0.18   25  23 (3%)
Memory Usage   <1 Mb      Min    0.42    Skew                0.39   36  22 (2%)
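The Summary column can be approximated in plain pandas. The sketch below is inferred from the figures above (for example, 16 non-duplicate values plus 72 duplicated values give the 88 distinct values) and is not frameon's implementation; `summary_counts` is a hypothetical helper, and counting repeated missing values as duplicates is an assumption.

```python
import pandas as pd


def summary_counts(s: pd.Series) -> dict:
    """Approximate the 'Summary' counts of a series (assumed definitions)."""
    # Occurrences of each distinct non-null value.
    counts = s.value_counts(dropna=True)
    return {
        "total": int(s.notna().sum()),            # non-missing values
        "missing": int(s.isna().sum()),
        "distinct": int(s.nunique(dropna=True)),
        "non_duplicate": int((counts == 1).sum()),  # values seen exactly once
        # Rows that repeat an earlier row; pandas treats repeated NaN as duplicates.
        "duplicates": int(s.duplicated().sum()),
        "dup_values": int((counts > 1).sum()),      # values seen more than once
    }


demo = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, None, None])
result = summary_counts(demo)
print(result)
```

In this toy series only the value 1.0 is unique, so `non_duplicate` is 1 and `dup_values` is 2, which together make up the 3 distinct values.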

For a categorical column:

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
titanic['class'].explore.info()
Summary Statistics for "class" (Type: Categorical)

Summary                             Text Metrics                      Value Counts
Total Values            891 (100%)  Avg Word Count         1.0        Third   491 (55%)
Missing Values          ---         Max Length (chars)     6.0        First   216 (24%)
Empty Strings           ---         Avg Length (chars)     5.2        Second  184 (21%)
Distinct Values         3 (<1%)     Median Length (chars)  5.0
Non-Duplicates          ---         Min Length (chars)     5.0
Exact Duplicates        888 (99%)   Most Common Length     5 (79.3%)
Fuzzy Duplicates        888 (99%)   Avg Digit Ratio        0.00
Values with Duplicates  3 (<1%)
Memory Usage            <1 Mb

For a datetime column:

from frameon import load_dataset, FrameOn as fo

superstore = fo(load_dataset('superstore'))
superstore['Order Date'].explore.info()
Summary Statistics for "Order Date" (Type: Datetime)

Summary                         Data Quality Stats         Temporal Stats
First date          2014-01-03  Values       9.99k (100%)  Missing Years        ---
Last date           2017-12-30  Zeros        ---           Missing Months       ---
Avg Days Frequency  0.15        Missings     ---           Missing Weeks        ---
Min Days Interval   0           Distinct     1.24k (12%)   Missing Days         221 (15%)
Max Days Interval   4           Duplicates   8.76k (88%)   Weekend Percentage   33.7%
Memory Usage        <1 Mb       Dup. Values  1.12k (11%)   Most Common Weekday  Monday
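To make the Temporal Stats column more concrete, here is a minimal pandas sketch of two of its figures. It assumes that "Missing Days" counts calendar days between the first and last date with no observations, and that "Weekend Percentage" is the share of rows falling on Saturday or Sunday; both definitions are guesses from the output, and `date_coverage` is a hypothetical helper, not part of frameon.

```python
import pandas as pd


def date_coverage(dates: pd.Series) -> dict:
    """Count uncovered calendar days and the weekend share (assumed definitions)."""
    # Drop missing values and truncate timestamps to midnight.
    days = pd.to_datetime(dates).dropna().dt.normalize()
    # Every calendar day between the first and last observation.
    full_range = pd.date_range(days.min(), days.max(), freq="D")
    observed = pd.DatetimeIndex(days.unique())
    missing_days = full_range.difference(observed)
    return {
        "missing_days": len(missing_days),
        "missing_days_pct": len(missing_days) / len(full_range),
        # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
        "weekend_pct": float((days.dt.dayofweek >= 5).mean()),
    }


demo = pd.Series(
    pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-03", "2024-01-06"])
)
coverage = date_coverage(demo)
print(coverage)
```

The toy series spans six calendar days, three of which have no rows, and one of its four rows falls on a Saturday.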

For a text column:

from frameon import load_dataset, FrameOn as fo

reviews = fo(load_dataset('reviews'))
reviews['Text'].explore.info()
Summary Statistics for "Text" (Type: Text)

Summary                               Text Metrics
Total Values            1.00k (100%)  Avg Word Count         10.2
Missing Values          ---           Max Length (chars)     149.0
Empty Strings           ---           Avg Length (chars)     55.2
Distinct Values         990 (99%)     Median Length (chars)  48.0
Non-Duplicates          980 (98%)     Min Length (chars)     11.0
Exact Duplicates        10 (1%)       Most Common Length     13 (2.3%)
Fuzzy Duplicates        13 (1%)       Avg Digit Ratio        0.00
Values with Duplicates  10 (1%)
Memory Usage            <1 Mb

Detect Anomalies

detect_anomalies() - Detects anomalies in the series using the specified method.

Returns a boolean mask in which True marks the anomalous values of the series.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
mask = titanic['age'].explore.detect_anomalies(
    anomaly_type='missing'
)
mask.head()
0    False
1    False
2    False
3    False
4    False
Name: age, dtype: bool
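A boolean mask like this is typically used to count or filter the flagged rows. The sketch below uses plain pandas on a made-up series and assumes that the 'missing' anomaly type corresponds to `isna()`; it does not call frameon.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "age" column (values are illustrative).
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0], name="age")

# True where the value is missing, mirroring the mask shown above.
mask = ages.isna()

n_missing = int(mask.sum())      # number of flagged rows
missing_share = mask.mean()      # fraction of flagged rows
clean_ages = ages[~mask]         # keep only the non-anomalous rows

print(n_missing, missing_share, clean_ages.tolist())
```

Because the mask shares the index of the original series, it can also be passed to `DataFrame.loc` to select the matching rows of the parent table.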

Detect Outliers

detect_outliers() - Detects outliers in a series using statistical and machine learning methods.

from frameon import load_dataset, FrameOn as fo

tips = fo(load_dataset('tips'))
tips['total_bill'].explore.detect_outliers(
    method='quantile',
    threshold=0.05
)
Outliers in "total_bill"

Method    Threshold  Total Points  Outliers Count  Outliers Percentage  Bounds
QUANTILE  0.05       244           26              10.66%               [9.56, 38.1]
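An outlier share of roughly 10% at a threshold of 0.05 is consistent with cutting at the 5th and 95th percentiles. The sketch below implements that reading in plain pandas; the interpretation of `threshold` is an assumption, and `quantile_outliers` is a hypothetical helper rather than frameon's code.

```python
import numpy as np
import pandas as pd


def quantile_outliers(s: pd.Series, threshold: float = 0.05):
    """Flag values below the `threshold` quantile or above the `1 - threshold` quantile."""
    lower = s.quantile(threshold)
    upper = s.quantile(1 - threshold)
    mask = (s < lower) | (s > upper)
    return mask, (lower, upper)


# Toy data: the integers 1..100 as floats.
values = pd.Series(np.arange(1, 101, dtype=float))
mask, (lower, upper) = quantile_outliers(values, threshold=0.05)

print(f"bounds=[{lower:.2f}, {upper:.2f}], outliers={int(mask.sum())}")
```

With pandas' default linear interpolation the bounds for this toy series are 5.95 and 95.05, so five values are flagged in each tail.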

Anomalies by Categories

anomalies_by_categories() - Analyzes how anomalies are distributed across all categorical columns of the parent DataFrame.

from frameon import load_dataset, FrameOn as fo

tips = fo(load_dataset('tips'))
tips['total_bill'].explore.anomalies_by_categories(
    anomaly_type='outlier',
    method='quantile',
    threshold=0.05
)
Outliers distribution across categories (method: quantile, threshold: 0.05)

Column  Category  Total  Anomaly  Anomaly Rate  Total %  Anomaly %  % Diff
sex     Male      157    19       12.1%         64.3%    73.1%      8.7%
smoker  Yes       93     12       12.9%         38.1%    46.2%      8.0%
day     Fri       19     3        15.8%         7.8%     11.5%      3.8%
time    Lunch     68     8        11.8%         27.9%    30.8%      2.9%
day     Sat       87     10       11.5%         35.7%    38.5%      2.8%
day     Thur      62     7        11.3%         25.4%    26.9%      1.5%
time    Dinner    176    18       10.2%         72.1%    69.2%      -2.9%
smoker  No        151    14       9.3%          61.9%    53.8%      -8.0%
day     Sun       76     6        7.9%          31.1%    23.1%      -8.1%
sex     Female    87     7        8.0%          35.7%    26.9%      -8.7%
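The columns of this table can be read off the Male row: 157 of 244 rows is about 64.3% of all rows, 19 of 26 outliers is about 73.1% of all anomalies, and the difference is the reported 8.7%. The sketch below reproduces that arithmetic in plain pandas for one categorical column; `anomaly_share_by_category` is a hypothetical helper and the toy data is invented.

```python
import pandas as pd


def anomaly_share_by_category(df: pd.DataFrame, mask: pd.Series, column: str) -> pd.DataFrame:
    """Compare each category's share of all rows with its share of anomalous rows."""
    total = df[column].value_counts()
    # Categories with no anomalies get a count of 0 instead of NaN.
    anomaly = df.loc[mask, column].value_counts().reindex(total.index, fill_value=0)
    out = pd.DataFrame({"total": total, "anomaly": anomaly})
    out["anomaly_rate"] = out["anomaly"] / out["total"]  # anomalies within the category
    out["total_pct"] = out["total"] / len(df)            # category share of all rows
    out["anomaly_pct"] = out["anomaly"] / mask.sum()     # category share of all anomalies
    out["pct_diff"] = out["anomaly_pct"] - out["total_pct"]
    return out.sort_values("pct_diff", ascending=False)


df = pd.DataFrame({"sex": ["M"] * 6 + ["F"] * 4})
# Three anomalies among the "M" rows and one among the "F" rows.
mask = pd.Series([True, True, True, False, False, False, True, False, False, False])
table = anomaly_share_by_category(df, mask, "sex")
print(table)
```

A positive `pct_diff` means the category is over-represented among the anomalies relative to its size, which is how the rows above are ordered.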

Anomalies Over Time

anomalies_over_time() - Plots anomalies over time using resampling.

from frameon import load_dataset, FrameOn as fo

taxis = fo(load_dataset('taxis'))
fig = taxis['payment'].explore.anomalies_over_time(
    anomaly_type='missing',
    time_column='pickup',
    freq='1D'
)
fig.show()
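The resampling step behind such a plot can be sketched in plain pandas. The example below only builds the per-day counts of missing values and leaves out the plotting; the toy data is invented, and whether frameon plots counts or rates is not stated in the source.

```python
import pandas as pd

# Toy stand-in for the taxis data (values are illustrative).
trips = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2019-03-01 08:00", "2019-03-01 12:00",
        "2019-03-02 09:00", "2019-03-04 10:00",
    ]),
    "payment": ["cash", None, None, "credit card"],
})

# 1 where the payment is missing, indexed by pickup time.
missing_flag = trips.set_index("pickup")["payment"].isna().astype(int)

# Sum the flags per calendar day; days without rows get a count of 0.
daily_missing = missing_flag.resample("1D").sum()
print(daily_missing)
```

Note that 2019-03-03 has no trips at all, yet it still appears in the result with a count of 0, which keeps the time axis continuous.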

Detect Window Outliers

detect_window_outliers() - Detects and analyzes outliers in rolling windows of time series data.

from frameon import load_dataset, FrameOn as fo

superstore = fo(load_dataset('superstore'))
superstore['Sales'].explore.detect_window_outliers(
    time_column='Order Date',
    window=10,
    resample_freq='W',
    agg_func='mean',
    method='confidence',
    threshold=0.05
)
Outliers in "Sales"

Method      Threshold  Total Points  Outliers Count  Outliers Percentage  Bounds (mean)
CONFIDENCE  0.05       209           84              40.19%               [159, 294]
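The general idea of window-based detection (resample, aggregate, then compare each point with bounds derived from a rolling window) can be sketched as follows. The source does not say how the 'confidence' method computes its bounds, so this example deliberately swaps in a simpler, different rule: a band of the rolling mean plus or minus `k` rolling standard deviations over the preceding points. `rolling_band_outliers` and `k` are hypothetical names.

```python
import numpy as np
import pandas as pd


def rolling_band_outliers(s: pd.Series, window: int = 10, k: float = 2.0) -> pd.Series:
    """Flag points outside mean +/- k*std of the previous `window` points."""
    # shift(1) excludes the current point from its own reference window.
    prior = s.shift(1).rolling(window)
    lower = prior.mean() - k * prior.std()
    upper = prior.mean() + k * prior.std()
    # Comparisons against NaN bounds (first `window` points) evaluate to False.
    return (s < lower) | (s > upper)


# Toy daily series: 30 weeks alternating between 100 and 102, with one spike week.
weekly_levels = np.tile([100.0, 102.0], 15)
weekly_levels[20] = 200.0
days = pd.date_range("2024-01-01", periods=30 * 7, freq="D")
daily = pd.Series(np.repeat(weekly_levels, 7), index=days)

# Resample to weekly means, analogous to resample_freq='W' and agg_func='mean'.
weekly = daily.resample("W").mean()
mask = rolling_band_outliers(weekly, window=10, k=2.0)
print(weekly[mask])
```

Only the spike week is flagged here. The weeks after it are not, because the spike inflates the standard deviation of their reference windows, a known side effect of bounds that are not robust to earlier outliers.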

Plot Rolling Anomaly Rate

plot_rolling_anomaly_rate() - Calculates and visualizes the rolling rate of the specified anomalies in a time series.

from frameon import load_dataset, FrameOn as fo

taxis = fo(load_dataset('taxis'))
taxis['payment'].explore.plot_rolling_anomaly_rate(
    anomaly_type='missing',
    time_column='pickup',
    window=30,
)
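A rolling anomaly rate is the mean of a 0/1 anomaly flag over a moving window. The sketch below computes it in plain pandas for missing values, without plotting. It assumes a window counted in rows of the time-sorted data; whether frameon's `window` is measured in rows or in time units is not stated in the source.

```python
import pandas as pd

# Toy stand-in for the taxis data, deliberately out of order (values are illustrative).
trips = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2019-03-03", "2019-03-01", "2019-03-02",
        "2019-03-04", "2019-03-06", "2019-03-05",
    ]),
    "payment": ["cash", "credit card", None, "cash", None, None],
})

# Sort by time first so the window covers consecutive observations.
ordered = trips.sort_values("pickup").set_index("pickup")

# 1.0 where the payment is missing, 0.0 otherwise.
is_missing = ordered["payment"].isna().astype(float)

# Share of missing values in each window of 3 consecutive rows.
rolling_rate = is_missing.rolling(window=3).mean()
print(rolling_rate)
```

The first two entries are NaN because a full window of three rows is not yet available; the rate then rises to two thirds at the end, where two of the last three payments are missing.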