Series-level Exploration Examples#
Simple examples of series exploration methods.
Used Libraries#
Some of the methods below produce Plotly figures, so the examples start with this renderer setup:
import plotly.io as pio
pio.renderers.default = "notebook"
Basic Info Method#
info() - Shows summary statistics for the series.
For a numeric column:
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic['age'].explore.info()
| Summary | | Percentiles | | Detailed Stats | | Value Counts | |
|---|---|---|---|---|---|---|---|
| Total | 714 (80%) | Max | 80 | Mean | 29.70 | 24 | 30 (3%) |
| Missing | 177 (20%) | 99% | 65.87 | Trimmed Mean (10%) | 29.27 | 22 | 27 (3%) |
| Distinct | 88 (10%) | 95% | 56 | Mode | 24 | 18 | 26 (3%) |
| Non-Duplicate | 16 (2%) | 75% | 38 | Range | 79.58 | 28 | 25 (3%) |
| Duplicates | 802 (90%) | 50% | 28 | IQR | 17.88 | 30 | 25 (3%) |
| Dup. Values | 72 (8%) | 25% | 20.12 | Std | 14.53 | 19 | 25 (3%) |
| Zeros | --- | 5% | 4 | MAD | 13.34 | 21 | 24 (3%) |
| Negative | --- | 1% | 1 | Kurt | 0.18 | 25 | 23 (3%) |
| Memory Usage | <1 Mb | Min | 0.42 | Skew | 0.39 | 36 | 22 (2%) |
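To make a few of the reported figures easier to interpret, here is a minimal pandas sketch that computes comparable statistics on a small made-up series. It does not use frameon and is not a description of how `info()` is implemented. The definitions used are assumptions, although the Range and IQR values in the table above are consistent with max minus min (80 - 0.42) and 75% minus 25% (38 - 20.12).

```python
import numpy as np
import pandas as pd

# Small made-up sample with one missing value (not the Titanic data).
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0], name="age")

valid = ages.dropna()
q25, q50, q75 = valid.quantile([0.25, 0.50, 0.75])

summary = {
    "total": int(valid.count()),        # non-missing values
    "missing": int(ages.isna().sum()),  # missing values
    "distinct": int(valid.nunique()),   # unique non-missing values
    "median": q50,
    "iqr": q75 - q25,                   # interquartile range
    "range": valid.max() - valid.min(),
    "std": valid.std(),                 # sample standard deviation (pandas default, ddof=1)
    "skew": valid.skew(),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Percentiles here use pandas' default linear interpolation; a different interpolation rule would shift the quartiles slightly on small samples.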
For a categorical column:
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic['class'].explore.info()
| Summary | | Text Metrics | | Value Counts | |
|---|---|---|---|---|---|
| Total Values | 891 (100%) | Avg Word Count | 1.0 | Third | 491 (55%) |
| Missing Values | --- | Max Length (chars) | 6.0 | First | 216 (24%) |
| Empty Strings | --- | Avg Length (chars) | 5.2 | Second | 184 (21%) |
| Distinct Values | 3 (<1%) | Median Length (chars) | 5.0 | | |
| Non-Duplicates | --- | Min Length (chars) | 5.0 | | |
| Exact Duplicates | 888 (99%) | Most Common Length | 5 (79.3%) | | |
| Fuzzy Duplicates | 888 (99%) | Avg Digit Ratio | 0.00 | | |
| Values with Duplicates | 3 (<1%) | | | | |
| Memory Usage | <1 Mb | | | | |
For a datetime column:
from frameon import load_dataset, FrameOn as fo
superstore = fo(load_dataset('superstore'))
superstore['Order Date'].explore.info()
| Summary | | Data Quality Stats | | Temporal Stats | |
|---|---|---|---|---|---|
| First date | 2014-01-03 | Values | 9.99k (100%) | Missing Years | --- |
| Last date | 2017-12-30 | Zeros | --- | Missing Months | --- |
| Avg Days Frequency | 0.15 | Missings | --- | Missing Weeks | --- |
| Min Days Interval | 0 | Distinct | 1.24k (12%) | Missing Days | 221 (15%) |
| Max Days Interval | 4 | Duplicates | 8.76k (88%) | Weekend Percentage | 33.7% |
| Memory Usage | <1 Mb | Dup. Values | 1.12k (11%) | Most Common Weekday | Monday |
For a text column:
from frameon import load_dataset, FrameOn as fo
reviews = fo(load_dataset('reviews'))
reviews['Text'].explore.info()
| Summary | | Text Metrics | |
|---|---|---|---|
| Total Values | 1.00k (100%) | Avg Word Count | 10.2 |
| Missing Values | --- | Max Length (chars) | 149.0 |
| Empty Strings | --- | Avg Length (chars) | 55.2 |
| Distinct Values | 990 (99%) | Median Length (chars) | 48.0 |
| Non-Duplicates | 980 (98%) | Min Length (chars) | 11.0 |
| Exact Duplicates | 10 (1%) | Most Common Length | 13 (2.3%) |
| Fuzzy Duplicates | 13 (1%) | Avg Digit Ratio | 0.00 |
| Values with Duplicates | 10 (1%) | | |
| Memory Usage | <1 Mb | | |
Detect Anomalies#
detect_anomalies() - Detects anomalies in the series using the specified method.
Returns a boolean mask in which True marks the anomalous values.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
mask = titanic['age'].explore.detect_anomalies(
    anomaly_type='missing'
)
mask.head()
0 False
1 False
2 False
3 False
4 False
Name: age, dtype: bool
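A boolean mask like this is typically used to count or select the flagged rows. The sketch below uses plain pandas on made-up data and assumes that, for missing values, the mask behaves like `Series.isna()`; that equivalence is an assumption, not something stated by the library.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for a column with gaps.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="age")

# Assumed stand-in for the mask returned above: True where the value is missing.
mask = ages.isna()

n_missing = int(mask.sum())          # number of flagged rows
share_missing = float(mask.mean())   # fraction of flagged rows
clean = ages[~mask]                  # rows that were not flagged

print(n_missing, share_missing)
print(clean)
```

Because the mask shares the index of the series, it can also be used to filter the parent DataFrame row by row.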
Detect Outliers#
detect_outliers() - Detects outliers in the series using statistical and machine-learning methods.
from frameon import load_dataset, FrameOn as fo
tips = fo(load_dataset('tips'))
tips['total_bill'].explore.detect_outliers(
    method='quantile',
    threshold=0.05
)
| Method | Threshold | Total Points | Outliers Count | Outliers Percentage | Bounds |
|---|---|---|---|---|---|
| QUANTILE | 0.05 | 244 | 26 | 10.66% | [9.56, 38.1] |
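The output above flags roughly 10% of the points, which is consistent with cutting 5% from each tail. The following sketch shows that reading of a quantile rule in plain pandas. The helper `quantile_outliers` is hypothetical, and treating `threshold` as a per-tail quantile is an assumption rather than the library's documented behavior.

```python
import numpy as np
import pandas as pd


def quantile_outliers(values: pd.Series, threshold: float = 0.05):
    """Flag values outside the [threshold, 1 - threshold] quantile band.

    Hypothetical helper: assumes `threshold` is applied to each tail.
    """
    lower = values.quantile(threshold)
    upper = values.quantile(1 - threshold)
    mask = (values < lower) | (values > upper)
    return mask, (lower, upper)


# Made-up data: the integers 1..100 as floats.
bills = pd.Series(np.arange(1, 101, dtype=float), name="total_bill")
mask, (lower, upper) = quantile_outliers(bills, threshold=0.05)

print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print(f"outliers: {int(mask.sum())} of {len(bills)}")
```

With strict inequalities, values exactly on a bound are not flagged, so ties at the quantiles can make the flagged share slightly smaller than twice the threshold.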
Anomalies by Categories#
anomalies_by_categories() - Analyzes how anomalies are distributed across all categorical columns of the parent DataFrame.
from frameon import load_dataset, FrameOn as fo
tips = fo(load_dataset('tips'))
tips['total_bill'].explore.anomalies_by_categories(
    anomaly_type='outlier',
    method='quantile',
    threshold=0.05
)
| Column | Category | Total | Anomaly | Anomaly Rate | Total % | Anomaly % | % Diff |
|---|---|---|---|---|---|---|---|
| sex | Male | 157 | 19 | 12.1% | 64.3% | 73.1% | 8.7% |
| smoker | Yes | 93 | 12 | 12.9% | 38.1% | 46.2% | 8.0% |
| day | Fri | 19 | 3 | 15.8% | 7.8% | 11.5% | 3.8% |
| time | Lunch | 68 | 8 | 11.8% | 27.9% | 30.8% | 2.9% |
| day | Sat | 87 | 10 | 11.5% | 35.7% | 38.5% | 2.8% |
| day | Thur | 62 | 7 | 11.3% | 25.4% | 26.9% | 1.5% |
| time | Dinner | 176 | 18 | 10.2% | 72.1% | 69.2% | -2.9% |
| smoker | No | 151 | 14 | 9.3% | 61.9% | 53.8% | -8.0% |
| day | Sun | 76 | 6 | 7.9% | 31.1% | 23.1% | -8.1% |
| sex | Female | 87 | 7 | 8.0% | 35.7% | 26.9% | -8.7% |
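The columns in this report can be read as simple ratios. For the Male row, 19/157 gives the 12.1% anomaly rate, 157/244 gives the 64.3% share of all rows, 19/26 gives the 73.1% share of all anomalies, and % Diff is the difference between the last two. The sketch below reproduces that arithmetic with pandas on made-up data; the column names are illustrative and the code is not taken from frameon.

```python
import pandas as pd

# Made-up data: one categorical column and a precomputed anomaly flag.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Male", "Female", "Male"],
    "is_anomaly": [True, False, True, False, False, True, True, False],
})
df["flag"] = df["is_anomaly"].astype(int)

total_rows = len(df)
total_anomalies = df["flag"].sum()

report = df.groupby("sex")["flag"].agg(total="count", anomaly="sum")
report["anomaly_rate"] = report["anomaly"] / report["total"] * 100   # anomalies within the category
report["total_pct"] = report["total"] / total_rows * 100             # category share of all rows
report["anomaly_pct"] = report["anomaly"] / total_anomalies * 100    # category share of all anomalies
report["pct_diff"] = report["anomaly_pct"] - report["total_pct"]     # over- or under-representation
report = report.sort_values("pct_diff", ascending=False)

print(report.round(1))
```

A positive difference means the category holds a larger share of the anomalies than of the data as a whole, which is why the table above is sorted from the most over-represented category to the most under-represented one.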
Anomalies Over Time#
anomalies_over_time() - Plots the number of anomalies over time, resampled to the given frequency.
from frameon import load_dataset, FrameOn as fo
taxis = fo(load_dataset('taxis'))
fig = taxis['payment'].explore.anomalies_over_time(
    anomaly_type='missing',
    time_column='pickup',
    freq='1D'
)
fig.show()
Detect Window Outliers#
detect_window_outliers() - Detects and analyzes outliers within rolling windows of time series data.
from frameon import load_dataset, FrameOn as fo
superstore = fo(load_dataset('superstore'))
superstore['Sales'].explore.detect_window_outliers(
    time_column='Order Date',
    window=10,
    resample_freq='W',
    agg_func='mean',
    method='confidence',
    threshold=0.05
)
| Method | Threshold | Total Points | Outliers Count | Outliers Percentage | Bounds (mean) |
|---|---|---|---|---|---|
| CONFIDENCE | 0.05 | 209 | 84 | 40.19% | [159, 294] |
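As a rough illustration of the general idea (resample, aggregate, then compare each point with bounds derived from a trailing window), the sketch below uses normal-approximation bounds of mean ± z·std over the previous 10 weekly means. This is only one possible reading: the bound formula, the use of a trailing window, and the mapping from `threshold` to `z` are all assumptions, and the library's `confidence` method may work differently.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd

# Made-up daily sales with one injected spike.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=200, freq="D")
sales = pd.Series(rng.normal(100, 10, size=200), index=dates, name="Sales")
sales.iloc[150] = 400

threshold = 0.05
window = 10
z = NormalDist().inv_cdf(1 - threshold / 2)  # about 1.96 for a two-sided 5% level

# Aggregate to weekly means, then build bounds from the previous `window` weeks.
weekly = sales.resample("W").mean()
rolling = weekly.rolling(window, min_periods=window)
center = rolling.mean().shift(1)  # shift so a week is not compared against itself
spread = rolling.std().shift(1)
lower = center - z * spread
upper = center + z * spread

# Weeks without a full trailing window have NaN bounds and are never flagged.
mask = (weekly < lower) | (weekly > upper)

print(weekly[mask])
```

Shifting the rolling statistics by one period keeps the point under test out of its own reference window, so a single spike cannot widen the bounds that are used to judge it.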
Plot Rolling Anomaly Rate#
plot_rolling_anomaly_rate() - Calculates and visualizes the rolling rate of the specified anomaly type in a time series.
from frameon import load_dataset, FrameOn as fo
taxis = fo(load_dataset('taxis'))
taxis['payment'].explore.plot_rolling_anomaly_rate(
    anomaly_type='missing',
    time_column='pickup',
    window=30,
)
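A rolling anomaly rate can be thought of as the mean of a 0/1 anomaly flag over a sliding window. The sketch below shows that calculation with pandas on made-up data. Treating `window` as a number of observations (rather than days) is an assumption, and the plotting step is omitted.

```python
import pandas as pd

# Made-up daily observations with some missing payment values.
pickup = pd.date_range("2019-03-01", periods=8, freq="D")
payment = pd.Series(
    ["cash", None, None, "card", "cash", None, "card", "cash"],
    index=pickup,
    name="payment",
)

# 1.0 where the value is missing, 0.0 otherwise.
is_missing = payment.isna().astype(float)

# Share of missing values among the last 4 observations.
rate = is_missing.rolling(window=4).mean()

print(rate)
```

The first `window - 1` positions are NaN because the window is not yet full; passing `min_periods=1` to `rolling` would produce early estimates at the cost of noisier values.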