Series-level Exploration Examples#

Simple examples of series exploration methods.

Used Libraries#

Some of the methods below return Plotly figures. The examples set Plotly's default renderer for notebook output:

import plotly.io as pio
pio.renderers.default = "notebook"

Basic Info Method#

info() - Shows summary statistics for the series.

For a numeric column:

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
titanic['age'].explore.info()

Summary Statistics for "age" (Type: Float)
Summary		Percentiles		Detailed Stats		Value Counts
Total	714 (80%)	Max	80	Mean	29.70	24	30 (3%)
Missing	177 (20%)	99%	65.87	Trimmed Mean (10%)	29.27	22	27 (3%)
Distinct	88 (10%)	95%	56	Mode	24	18	26 (3%)
Non-Duplicate	16 (2%)	75%	38	Range	79.58	28	25 (3%)
Duplicates	802 (90%)	50%	28	IQR	17.88	30	25 (3%)
Dup. Values	72 (8%)	25%	20.12	Std	14.53	19	25 (3%)
Zeros	---	5%	4	MAD	13.34	21	24 (3%)
Negative	---	1%	1	Kurt	0.18	25	23 (3%)
Memory Usage	<1 Mb	Min	0.42	Skew	0.39	36	22 (2%)
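
To make a few of the rows above concrete, here is a minimal sketch in plain Python, independent of frameon, of how such statistics can be computed. It assumes that "Total" counts non-missing values (714 + 177 = 891 in the output above) and that quartiles use linear interpolation. The helper name `summarize` and the sample values are hypothetical, and frameon's own implementation may differ.

```python
import statistics

def summarize(values):
    """Return a few summary statistics for a list that may contain None."""
    present = [v for v in values if v is not None]
    # Quartiles with linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "total": len(present),                  # non-missing values
        "missing": len(values) - len(present),  # None entries
        "distinct": len(set(present)),
        "mean": statistics.fmean(present),
        "median": median,
        "range": max(present) - min(present),
        "iqr": q3 - q1,
        "std": statistics.stdev(present),       # sample standard deviation
    }

# Hypothetical sample, not taken from the titanic dataset.
sample = [22.0, 38.0, None, 26.0, 35.0, 35.0, None, 54.0]
stats = summarize(sample)
print(stats)
```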

For a categorical column:

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
titanic['class'].explore.info()

Summary Statistics for "class" (Type: Categorical)
Summary		Text Metrics		Value Counts
Total Values	891 (100%)	Avg Word Count	1.0	Third	491 (55%)
Missing Values	---	Max Length (chars)	6.0	First	216 (24%)
Empty Strings	---	Avg Length (chars)	5.2	Second	184 (21%)
Distinct Values	3 (<1%)	Median Length (chars)	5.0
Non-Duplicates	---	Min Length (chars)	5.0
Exact Duplicates	888 (99%)	Most Common Length	5 (79.3%)
Fuzzy Duplicates	888 (99%)	Avg Digit Ratio	0.00
Values with Duplicates	3 (<1%)
Memory Usage	<1 Mb

For a datetime column:

from frameon import load_dataset, FrameOn as fo

superstore = fo(load_dataset('superstore'))
superstore['Order Date'].explore.info()

Summary Statistics for "Order Date" (Type: Datetime)
Summary		Data Quality Stats		Temporal Stats
First date	2014-01-03	Values	9.99k (100%)	Missing Years	---
Last date	2017-12-30	Zeros	---	Missing Months	---
Avg Days Frequency	0.15	Missings	---	Missing Weeks	---
Min Days Interval	0	Distinct	1.24k (12%)	Missing Days	221 (15%)
Max Days Interval	4	Duplicates	8.76k (88%)	Weekend Percentage	33.7%
Memory Usage	<1 Mb	Dup. Values	1.12k (11%)	Most Common Weekday	Monday

For a text column:

from frameon import load_dataset, FrameOn as fo

reviews = fo(load_dataset('reviews'))
reviews['Text'].explore.info()

Summary Statistics for "Text" (Type: Text)
Summary		Text Metrics
Total Values	1.00k (100%)	Avg Word Count	10.2
Missing Values	---	Max Length (chars)	149.0
Empty Strings	---	Avg Length (chars)	55.2
Distinct Values	990 (99%)	Median Length (chars)	48.0
Non-Duplicates	980 (98%)	Min Length (chars)	11.0
Exact Duplicates	10 (1%)	Most Common Length	13 (2.3%)
Fuzzy Duplicates	13 (1%)	Avg Digit Ratio	0.00
Values with Duplicates	10 (1%)
Memory Usage	<1 Mb

Detect Anomalies#

detect_anomalies() - Detects anomalies in the series using the specified method.

Returns a boolean mask in which True marks the anomalous values in the series.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
mask = titanic['age'].explore.detect_anomalies(
    anomaly_type='missing'
)
mask.head()

  False
  False
  False
  False
  False
Name: age, dtype: bool
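
The idea of a boolean mask for missing values can be sketched in plain Python without frameon. The helper `missing_mask` and the sample values are hypothetical and only illustrate the concept. The sketch treats both None and NaN as missing, which is an assumption about what "missing" covers.

```python
import math

def missing_mask(values):
    """Return a list of booleans; True marks a missing entry (None or NaN)."""
    return [
        v is None or (isinstance(v, float) and math.isnan(v))
        for v in values
    ]

# Hypothetical values, not taken from the titanic dataset.
ages = [22.0, float("nan"), 26.0, None, 35.0]
mask = missing_mask(ages)

# The mean of a boolean mask is the share of anomalous entries.
missing_rate = sum(mask) / len(mask)
print(mask, missing_rate)
```

A mask like this can then be used to filter rows or to compute the anomaly share, as the last lines show.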

Detect Outliers#

detect_outliers() - Detects outliers in the series using statistical and machine learning methods.

from frameon import load_dataset, FrameOn as fo

tips = fo(load_dataset('tips'))
tips['total_bill'].explore.detect_outliers(
    method='quantile',
    threshold=0.05
)

Outliers in "total_bill"
Method	Threshold	Total Points	Outliers Count	Outliers Percentage	Bounds
QUANTILE	0.05	244	26	10.66%	[9.56, 38.1]
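
The output above is consistent with a two-sided rule: values below the 5% quantile or above the 95% quantile are flagged, which gives roughly 10% outliers. The following is a minimal sketch of that rule in plain Python. It is an inference from the output, not frameon's confirmed implementation. The helper `quantile_outliers` and the sample data are hypothetical, and the sketch assumes that `1 / threshold` is a whole number.

```python
import statistics

def quantile_outliers(values, threshold=0.05):
    """Flag values below the `threshold` quantile or above the `1 - threshold` quantile."""
    n = round(1 / threshold)  # e.g. 0.05 -> 20 equal-probability groups
    # Cut points with linear interpolation; the first and last are the bounds.
    cuts = statistics.quantiles(values, n=n, method="inclusive")
    lower, upper = cuts[0], cuts[-1]
    mask = [v < lower or v > upper for v in values]
    return mask, (lower, upper)

# Hypothetical data: the integers 1..21.
bills = list(range(1, 22))
mask, bounds = quantile_outliers(bills, threshold=0.05)
print(bounds, sum(mask))
```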

Anomalies by Categories#

anomalies_by_categories() - Analyzes how anomalies are distributed across all categorical columns in the parent DataFrame.

from frameon import load_dataset, FrameOn as fo

tips = fo(load_dataset('tips'))
tips['total_bill'].explore.anomalies_by_categories(
    anomaly_type='outlier',
    method='quantile',
    threshold=0.05
)

Outliers distribution across categories (method: quantile, threshold: 0.05)
Column	Category	Total	Anomaly	Anomaly Rate	Total %	Anomaly %	% Diff
sex	Male	157	19	12.1%	64.3%	73.1%	8.7%
smoker	Yes	93	12	12.9%	38.1%	46.2%	8.0%
day	Fri	19	3	15.8%	7.8%	11.5%	3.8%
time	Lunch	68	8	11.8%	27.9%	30.8%	2.9%
day	Sat	87	10	11.5%	35.7%	38.5%	2.8%
day	Thur	62	7	11.3%	25.4%	26.9%	1.5%
time	Dinner	176	18	10.2%	72.1%	69.2%	-2.9%
smoker	No	151	14	9.3%	61.9%	53.8%	-8.0%
day	Sun	76	6	7.9%	31.1%	23.1%	-8.1%
sex	Female	87	7	8.0%	35.7%	26.9%	-8.7%
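
The percentage columns in this table can be reproduced from the counts. The sketch below recomputes the "sex / Male" row, assuming these definitions, which match the printed values: Anomaly Rate = Anomaly / Total, Total % = Total / all rows (244), Anomaly % = Anomaly / all anomalies (26), and % Diff = Anomaly % minus Total %. The function name `category_row` is hypothetical.

```python
def category_row(cat_total, cat_anomalies, n_total, n_anomalies):
    """Recompute the percentage columns for one category (values in percent)."""
    anomaly_rate = cat_anomalies / cat_total * 100   # share of the category that is anomalous
    total_pct = cat_total / n_total * 100            # category share of all rows
    anomaly_pct = cat_anomalies / n_anomalies * 100  # category share of all anomalies
    return {
        "anomaly_rate": round(anomaly_rate, 1),
        "total_pct": round(total_pct, 1),
        "anomaly_pct": round(anomaly_pct, 1),
        "pct_diff": round(anomaly_pct - total_pct, 1),
    }

# Counts taken from the "sex / Male" row above: 157 rows, 19 outliers,
# out of 244 rows and 26 outliers overall.
row = category_row(157, 19, 244, 26)
print(row)
```

A positive % Diff means the category is over-represented among the anomalies relative to its share of the data. A negative value means it is under-represented.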

Anomalies Over Time#

anomalies_over_time() - Plots anomalies over time, resampled to the given frequency.

from frameon import load_dataset, FrameOn as fo

taxis = fo(load_dataset('taxis'))
fig = taxis['payment'].explore.anomalies_over_time(
    anomaly_type='missing',
    time_column='pickup',
    freq='1D'
)
fig.show()
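
The resampling step behind such a plot can be sketched in plain Python: group the records by calendar day and count the missing values in each group. This is only an illustration of daily aggregation, not frameon's implementation. The helper `daily_missing_counts` and the sample records are hypothetical. Unlike a true resample, days with no records at all do not appear in the result.

```python
from collections import defaultdict
from datetime import datetime

def daily_missing_counts(timestamps, values):
    """Count missing (None) values per calendar day, ordered by date."""
    counts = defaultdict(int)
    for ts, value in zip(timestamps, values):
        # Adding 0 for present values keeps days without anomalies in the result.
        counts[ts.date()] += int(value is None)
    return dict(sorted(counts.items()))

# Hypothetical records, not taken from the taxis dataset.
pickups = [
    datetime(2019, 3, 1, 8),
    datetime(2019, 3, 1, 21),
    datetime(2019, 3, 2, 9),
    datetime(2019, 3, 3, 10),
]
payments = ["cash", None, "credit card", None]

daily = daily_missing_counts(pickups, payments)
for day, count in daily.items():
    print(day, count)
```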

Detect Window Outliers#

detect_window_outliers() - Detects and analyzes outliers in rolling windows of time series data.

from frameon import load_dataset, FrameOn as fo

superstore = fo(load_dataset('superstore'))
superstore['Sales'].explore.detect_window_outliers(
    time_column='Order Date',
    window=10,
    resample_freq='W',
    agg_func='mean',
    method='confidence',
    threshold=0.05
)

Outliers in "Sales"
Method	Threshold	Total Points	Outliers Count	Outliers Percentage	Bounds (mean)
CONFIDENCE	0.05	209	84	40.19%	[159, 294]
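
The general idea of window-based outlier detection can be sketched as follows: each point is compared with bounds derived from the preceding window of values. The sketch uses a simple "rolling mean plus or minus k standard deviations" rule as a stand-in. It is not the 'confidence' method shown above, whose exact definition is not given here. The helper `rolling_outliers`, the parameter `k`, and the sample data are hypothetical.

```python
import statistics

def rolling_outliers(series, window=10, k=2.0):
    """Flag points that deviate more than k standard deviations
    from the mean of the preceding `window` values."""
    flags = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            # Not enough history yet; leave the point unflagged.
            flags.append(False)
            continue
        center = statistics.fmean(history)
        spread = statistics.stdev(history)
        flags.append(abs(value - center) > k * spread)
    return flags

# Hypothetical weekly means with one obvious spike.
weekly_means = [200, 210, 190, 205, 195, 600, 200, 198]
flags = rolling_outliers(weekly_means, window=4, k=2.0)
print(flags)
```

Note that a spike inflates the standard deviation of the windows that follow it, so points shortly after an outlier are harder to flag with this simple rule.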

Plot Rolling Anomaly Rate#

plot_rolling_anomaly_rate() - Calculates and visualizes the rolling rate of the specified anomalies in a time series.

from frameon import load_dataset, FrameOn as fo

taxis = fo(load_dataset('taxis'))
taxis['payment'].explore.plot_rolling_anomaly_rate(
    anomaly_type='missing',
    time_column='pickup',
    window=30
)
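
A rolling anomaly rate is the share of anomalous points within a trailing window. The following plain-Python sketch computes it from a boolean mask. The helper `rolling_rate` and the sample mask are hypothetical. The sketch counts the window in observations, which is an assumption: frameon may interpret `window` differently, for example in time units.

```python
def rolling_rate(mask, window):
    """Share of True values in each trailing window of `window` observations.

    Positions without a full window yield None.
    """
    rates = []
    for i in range(len(mask)):
        if i + 1 < window:
            rates.append(None)
        else:
            chunk = mask[i + 1 - window:i + 1]
            rates.append(sum(chunk) / window)
    return rates

# Hypothetical anomaly mask, ordered by time.
mask = [False, True, False, False, True, True]
rates = rolling_rate(mask, window=3)
print(rates)
```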