Series-level Preprocessing Examples

Simple examples of series preprocessing methods.

Used Libraries

Some of the methods below rely on the following library and renderer setup:

import plotly.io as pio
pio.renderers.default = "notebook"

Categorical Conversion

to_categorical() - Convert a numerical series to categorical using the specified method.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
rules = {
    "Child": lambda x: x <= 12,
    "Young": lambda x: x <= 25,
    "Adult": lambda x: x < 60,
    "Elderly": lambda x: x >= 60,
    "Missing Age": "default"
}
titanic['age_cat'] = titanic['age'].preproc.to_categorical(
    method='rules',
    rules=rules
)
             Count
Adult          387
Young          232
Missing Age    177
Child           69
Elderly         26
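
The counts above appear consistent with the rules being checked in order, with the first matching label winning and the "default" label catching everything else, including missing ages ("Young" holds 232 rows rather than also absorbing the 69 children). The sketch below reproduces that behavior in plain pandas. It is an assumption about the semantics, not frameon's implementation, and `categorize_by_rules` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def categorize_by_rules(series, rules):
    """Label each value with the first rule it satisfies (hypothetical helper)."""
    # The entry whose value is not callable (here the string "default")
    # supplies the fallback label.
    default = next((label for label, rule in rules.items() if not callable(rule)), None)
    predicates = [(label, rule) for label, rule in rules.items() if callable(rule)]

    def label_for(value):
        # Rules are tried in insertion order; NaN fails every comparison
        # and therefore falls through to the default label.
        for label, predicate in predicates:
            if predicate(value):
                return label
        return default

    return series.map(label_for)


rules = {
    "Child": lambda x: x <= 12,
    "Young": lambda x: x <= 25,
    "Adult": lambda x: x < 60,
    "Elderly": lambda x: x >= 60,
    "Missing Age": "default",
}
ages = pd.Series([5, 12, 20, 40, 60, 75, np.nan])
age_cat = categorize_by_rules(ages, rules)
print(age_cat.value_counts())
```

Because the order of the rules matters under this reading, moving "Young" ahead of "Child" would leave the "Child" category empty.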

Text Processing

normalize_string_series() - Normalize a Series of strings with cleaning and standardization options.

from frameon import FrameOn as fo
import pandas as pd

data = {
    "text": [
        "  Hello, how are you?  ",
        "RANDOM TEXT IN UPPERCASE",
        "café and naïve - special characters!",
        "one two.three,four",
        "example with 'quotes' and — dashes",
    ]
}
df = fo(pd.DataFrame(data))
df['cleaned_text'] = df['text'].preproc.normalize_string_series(
    case_format='title'
)
df
   text                                  cleaned_text
0  Hello, how are you?                   Hello How Are You
1  RANDOM TEXT IN UPPERCASE              Random Text In Uppercase
2  café and naïve - special characters!  Cafe And Naive Special Characters
3  one two.three,four                    One Two Three Four
4  example with 'quotes' and — dashes    Example With Quotes And Dashes
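
The output above suggests four steps: accents are stripped, punctuation becomes whitespace, repeated whitespace is collapsed, and the result is title-cased. A minimal sketch of those steps using only the standard library and pandas follows. The function name `normalize_text` is hypothetical, and the exact order and set of steps inside frameon may differ.

```python
import re
import unicodedata

import pandas as pd


def normalize_text(value):
    """Strip accents and punctuation, collapse whitespace, title-case."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every character that is not a word character or whitespace.
    without_punct = re.sub(r"[^\w\s]", " ", without_accents)
    # split() with no arguments trims the ends and collapses whitespace runs.
    return " ".join(without_punct.split()).title()


texts = pd.Series([
    "  Hello, how are you?  ",
    "café and naïve - special characters!",
    "one two.three,four",
])
cleaned = texts.map(normalize_text)
print(cleaned.tolist())
```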

Numeric Transformations

transform_numeric() - Apply numeric transformations to a series.

from frameon import load_dataset, FrameOn as fo

diamonds = fo(load_dataset('diamonds'))

diamonds['price_boxcox'] = diamonds['price'].preproc.transform_numeric(
    method='boxcox',
    show_dist=True
)
0         326
1         326
2         327
3         334
4         335
         ... 
53935    2757
53936    2757
53937    2757
53938    2757
53939    2757
Name: price, Length: 53940, dtype: int64
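
For reference, the Box-Cox transform maps a strictly positive value x to (x^λ - 1) / λ, and to log(x) when λ = 0. The sketch below applies it with a fixed λ using NumPy. How frameon chooses λ is not stated here; libraries such as SciPy (`scipy.stats.boxcox`) estimate it by maximum likelihood when it is not given, so the fixed value below is only an assumption for illustration, and `box_cox` is a hypothetical helper name.

```python
import numpy as np


def box_cox(values, lam):
    """Box-Cox transform with a caller-supplied lambda (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The limit of (x**lam - 1) / lam as lam approaches 0 is log(x).
        return np.log(x)
    return (x ** lam - 1.0) / lam


prices = np.array([326, 327, 334, 335, 2757])
log_prices = box_cox(prices, lam=0.0)    # identical to np.log(prices)
sqrt_like = box_cox(prices, lam=0.5)     # a milder, square-root-like compression
print(log_prices.round(3), sqrt_like.round(3))
```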

Missing Value Handling

fill_missing_by_category() - Fill missing values using category-based strategies.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
print(f"Missing values in age before handling: {titanic['age'].isna().sum()}")
titanic['age'] = titanic['age'].preproc.fill_missing_by_category(
    category_columns=['sex', 'class'],
    strategy='simple'
)
print(f"Missing values in age after handling: {titanic['age'].isna().sum()}")
Missing values in age before handling: 177
Missing values in age after handling: 0
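
One common way to fill by category is to replace each missing value with a statistic of its own group. The sketch below uses the group median via pandas `groupby(...).transform`. Whether `strategy='simple'` uses the median, the mean, or something else is not stated here, so the median is an assumption, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# transform() returns one value per row, aligned with the original index,
# so it can be passed directly to fillna().
group_median = passengers.groupby(["sex", "class"])["age"].transform("median")
filled_age = passengers["age"].fillna(group_median)
print(filled_age.tolist())
```

A group whose values are all missing has no median, so its rows would stay missing under this approach; a global fallback would be needed for such groups.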

impute_missing() - Missing value imputation on specified numerical columns

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
print(f"Missing values in age before handling: {titanic['age'].isna().sum()}")
titanic['age'] = titanic['age'].preproc.impute_missing(
    method='iterative',
)
print(f"Missing values in age after handling: {titanic['age'].isna().sum()}")
Missing values in age before handling: 177
Missing values in age after handling: 0
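
Iterative imputation generally means predicting each incomplete column from the others and repeating until the estimates settle. The sketch below shows the idea for two numeric columns with a simple linear fit. It is not frameon's implementation: which predictor columns and which model `method='iterative'` uses are not stated here, and `iterative_impute` is a hypothetical helper restricted to two columns for brevity.

```python
import numpy as np
import pandas as pd


def iterative_impute(df, n_rounds=10):
    """Chained linear-regression imputation for a two-column numeric frame."""
    if df.shape[1] != 2:
        raise ValueError("this sketch supports exactly two columns")
    missing = df.isna()
    # Start from column means, then refine each column from the other.
    filled = df.fillna(df.mean())
    for _ in range(n_rounds):
        for target in df.columns:
            predictor = next(col for col in df.columns if col != target)
            observed = ~missing[target]
            if observed.all():
                continue
            # Fit target ~ predictor on rows where the target was observed.
            slope, intercept = np.polyfit(
                filled.loc[observed, predictor], filled.loc[observed, target], deg=1
            )
            # Overwrite only the originally missing entries.
            filled.loc[~observed, target] = (
                slope * filled.loc[~observed, predictor] + intercept
            )
    return filled


data = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, np.nan, 51.86],
})
imputed = iterative_impute(data)
print(imputed)
```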

Category Analysis

calc_target_category_share() - Calculate the proportional share of a target category within grouped data, with support for time-based resampling.

from frameon import load_dataset, FrameOn as fo

superstore = fo(load_dataset('superstore'))
los_angeles_share = superstore['City'].preproc.calc_target_category_share(
    target_category='Los Angeles',
    group_columns=['Order Date', 'Category'],
    resample_freq='ME',
)
los_angeles_share.head()
   Order Date  Category         target_share
0  2014-01-31  Furniture        0.050000
1  2014-01-31  Office Supplies  0.040000
2  2014-01-31  Technology       0.000000
3  2014-02-28  Furniture        0.000000
4  2014-02-28  Office Supplies  0.032258
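
The share in each row can be read as the fraction of records in that month and category whose city equals the target. A rough pandas equivalent is sketched below on made-up data. It groups by calendar month with `dt.to_period("M")` rather than a month-end resample, so the date column shows periods such as 2014-01 instead of month-end dates, and this reading of "share" is an assumption.

```python
import pandas as pd

orders = pd.DataFrame({
    "Order Date": pd.to_datetime([
        "2014-01-03", "2014-01-20", "2014-01-25", "2014-02-10", "2014-02-11",
    ]),
    "Category": ["Furniture", "Furniture", "Technology", "Furniture", "Furniture"],
    "City": ["Los Angeles", "Seattle", "Seattle", "Los Angeles", "Los Angeles"],
})

# True where the row belongs to the target category.
is_target = orders["City"].eq("Los Angeles")
month = orders["Order Date"].dt.to_period("M").rename("month")

# The mean of a boolean column is the fraction of True values in each group.
share = (
    is_target.groupby([month, orders["Category"]])
    .mean()
    .rename("target_share")
    .reset_index()
)
print(share)
```

Month and category combinations with no rows at all do not appear in this result; they would need to be added explicitly if a complete grid is required.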

Validation Methods

check_group_counts() - Analyze group statistics to assess viability for missing value imputation.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
titanic['age'].preproc.check_group_counts(
    category_columns=['sex', 'class'],
)
============================ Group Analysis Report =============================
Grouping columns: sex, class
Value column: age

Total groups:                            6
Groups with missing values:              100.0%
Groups with ALL values missing:          0.0%
Total missing values:                    177
Missing in non-empty groups:             177

---------------------------- Group Size Statistics -----------------------------
Mean group size:               119.0
Median group size:             100.0
Minimum group size:            74
Maximum group size:            253
Standard deviation:            66.6

-------------------------- Missing Value Distribution --------------------------
Groups with 1 missing value:   0.0%
Groups with 2-5 missing values: 16.7%
Groups with 5+ missing values: 83.3%

---------------- Threshold Analysis (only groups with missings) ----------------
Groups with 5+ elements:       100.0%
Groups with 10+ elements:      100.0%
Groups with 20+ elements:      100.0%
Groups with 30+ elements:      100.0%
Groups with 40+ elements:      100.0%
Groups with 50+ elements:      100.0%
================================================================================
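
The main figures in such a report can be approximated with a single pandas aggregation. The group size statistics above seem to count non-missing values only: a mean of 119.0 across 6 groups gives 714, which matches 891 rows minus 177 missing ages. The sketch below follows that reading on made-up data; it is an approximation of the report, not frameon's code.

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female", "female", "female"],
    "class": ["First", "First", "First", "Third", "Third", "Third", "First", "First"],
    "age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0, 35.0, 28.0],
})

stats = (
    passengers.assign(is_missing=passengers["age"].isna())
    .groupby(["sex", "class"])
    .agg(non_missing=("age", "count"), missing=("is_missing", "sum"))
)

summary = {
    "total_groups": len(stats),
    "groups_with_missing_pct": round((stats["missing"] > 0).mean() * 100, 1),
    "groups_all_missing_pct": round((stats["non_missing"] == 0).mean() * 100, 1),
    "total_missing": int(stats["missing"].sum()),
    "mean_group_size": float(stats["non_missing"].mean()),
    "min_group_size": int(stats["non_missing"].min()),
}
print(stats)
print(summary)
```

Groups with few non-missing values give unstable medians or means, which is why a check like this is useful before filling by category.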