Series-level Preprocessing Examples#

Simple examples of series preprocessing methods.

Used Libraries#

The following methods selectively use these libraries:

import plotly.io as pio
pio.renderers.default = "notebook"

Categorical Conversion#

to_categorical() - Convert a numerical series to a categorical one using the specified method.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
rules = {
    "Child": lambda x: x <= 12,
    "Young": lambda x: x <= 25,
    "Adult": lambda x: x < 60,
    "Elderly": lambda x: x >= 60,
    "Missing Age": "default"
}
titanic['age_cat'] = titanic['age'].preproc.to_categorical(
    method='rules',
    rules=rules
)

	Count
Adult	387
Young	232
Missing Age	177
Child	69
Elderly	26
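The "Missing Age" count equals the 177 missing ages reported in the missing-value examples, which suggests that the rules are checked in dictionary order, that the first matching rule wins, and that the "default" entry catches values matching no rule. That evaluation order is an assumption, not something this page states. The plain-Python sketch below only illustrates the idea; `categorize` is a hypothetical helper and not part of frameon.

```python
# Sketch of first-match rule evaluation (assumed behaviour, not frameon's code).
rules = {
    "Child": lambda x: x <= 12,
    "Young": lambda x: x <= 25,
    "Adult": lambda x: x < 60,
    "Elderly": lambda x: x >= 60,
    "Missing Age": "default",
}


def categorize(value, rules):
    """Return the label of the first rule that matches, else the default label."""
    default_label = None
    for label, rule in rules.items():
        if rule == "default":
            default_label = label
            continue
        if rule(value):
            return label
    return default_label


# NaN fails every comparison, so it falls through to the default label.
ages = [4, 12, 13, 25, 40, 60, float("nan")]
labels = [categorize(age, rules) for age in ages]
print(labels)
```

Under this reading the order of the rules matters: "Young" only needs `x <= 25` because every value up to 12 has already been claimed by "Child".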

Text Processing#

normalize_string_series() - Normalize a Series of strings with cleaning and standardization options.

from frameon import FrameOn as fo
import pandas as pd

data = {
    "text": [
        "  Hello, how are you?  ",
        "RANDOM TEXT IN UPPERCASE",
        "café and naïve - special characters!",
        "one two.three,four",
        "example with 'quotes' and — dashes",
    ]
}
df = fo(pd.DataFrame(data))
df['cleaned_text'] = df['text'].preproc.normalize_string_series(
    case_format='title'
)
df

	text	cleaned_text
0	Hello, how are you?	Hello How Are You
1	RANDOM TEXT IN UPPERCASE	Random Text In Uppercase
2	café and naïve - special characters!	Cafe And Naive Special Characters
3	one two.three,four	One Two Three Four
4	example with 'quotes' and — dashes	Example With Quotes And Dashes
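The output above is what one would get from stripping accents, replacing punctuation with spaces, collapsing whitespace, and title-casing. The standard-library sketch below reproduces those results for the sample strings. It is an approximation under that assumption, not frameon's implementation, and `normalize_text` is a hypothetical name.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Approximate the cleaning shown above using only the standard library."""
    # Decompose accented characters, then drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a word character or whitespace with a space.
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    # Collapse runs of whitespace, trim the ends, and title-case the result.
    return " ".join(no_punct.split()).title()


samples = [
    "  Hello, how are you?  ",
    "café and naïve - special characters!",
    "one two.three,four",
]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```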

Numeric Transformations#

transform_numeric() - Apply a numeric transformation (Box-Cox in this example) to a series.

from frameon import load_dataset, FrameOn as fo

diamonds = fo(load_dataset('diamonds'))

diamonds['price_log'] = diamonds['price'].preproc.transform_numeric(
    method='boxcox',
    show_dist=True
)

       326
       326
       327
       334
       335
         ... 
  2757
  2757
  2757
  2757
  2757
Name: price, Length: 53940, dtype: int64
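For reference, the Box-Cox transformation maps a positive value x to (x^λ - 1) / λ when λ is non-zero and to ln(x) when λ is zero. This page does not say how `transform_numeric` chooses λ; libraries commonly estimate it from the data by maximum likelihood. The sketch below uses fixed λ values purely as an illustrative assumption, and `box_cox` is a hypothetical helper, not a frameon function.

```python
import math


def box_cox(x: float, lmbda: float) -> float:
    """Box-Cox transform of one positive value for a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1) / lmbda


# A few of the prices visible in the output above.
prices = [326, 327, 334, 335, 2757]
as_log = [box_cox(p, 0.0) for p in prices]    # lambda = 0 is the natural log
as_sqrt = [box_cox(p, 0.5) for p in prices]   # lambda = 0.5 is a shifted, scaled square root
print(as_log)
print(as_sqrt)
```

With λ = 0 the transformed column is simply the log price; any other λ gives a different power transform.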

Missing Value Handling#

fill_missing_by_category() - Fill missing values using category-based strategies.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
print(f"Missings in age before handling: {titanic['age'].isna().sum()}")
titanic['age'] = titanic['age'].preproc.fill_missing_by_category(
    category_columns=['sex', 'class'],
    strategy='simple'
)
print(f"Missings in age after handling: {titanic['age'].isna().sum()}") 

Missings in age before handling: 177
Missings in age after handling: 0
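The idea behind category-based filling can be sketched in plain Python: compute one fill value per group from the observed values, then use it for the gaps in that group. What `strategy='simple'` actually computes is not stated on this page. The group median used below is an assumption, the rows are made-up toy data, and `fill_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median

# Made-up toy rows; None marks a missing age.
rows = [
    {"sex": "male", "class": "Third", "age": 22.0},
    {"sex": "male", "class": "Third", "age": None},
    {"sex": "male", "class": "Third", "age": 30.0},
    {"sex": "female", "class": "First", "age": 38.0},
    {"sex": "female", "class": "First", "age": None},
]


def fill_by_group(rows, keys, column):
    """Fill missing values in `column` with the median of the row's group."""
    observed = defaultdict(list)
    for row in rows:
        if row[column] is not None:
            observed[tuple(row[k] for k in keys)].append(row[column])
    fill_values = {group: median(values) for group, values in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[column] is None:
            # Stays None when the whole group has no observed value.
            new_row[column] = fill_values.get(tuple(row[k] for k in keys))
        filled.append(new_row)
    return filled


filled = fill_by_group(rows, ["sex", "class"], "age")
print([row["age"] for row in filled])
```

A group with no observed values cannot supply a fill value, which is why checking group sizes before filling is useful.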

impute_missing() - Impute missing values in a numerical series using the chosen method.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
print(f"Missings in age before handling: {titanic['age'].isna().sum()}")
titanic['age'] = titanic['age'].preproc.impute_missing(
    method='iterative',
)
print(f"Missings in age after handling: {titanic['age'].isna().sum()}") 

Missings in age before handling: 177
Missings in age after handling: 0

Category Analysis#

calc_target_category_share() - Calculate the proportional share of a target category within grouped data, with support for time-based resampling.

from frameon import load_dataset, FrameOn as fo

superstore = fo(load_dataset('superstore'))
los_angeles_share = superstore['City'].preproc.calc_target_category_share(
    target_category='Los Angeles',
    group_columns=['Order Date', 'Category'],
    resample_freq='ME',
)
los_angeles_share.head()

	Order Date	Category	target_share
0	2014-01-31	Furniture	0.050000
1	2014-01-31	Office Supplies	0.040000
2	2014-01-31	Technology	0.000000
3	2014-02-28	Furniture	0.000000
4	2014-02-28	Office Supplies	0.032258
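The shares above look like count ratios (0.032258 is 1/31, for example): within each month and category, the number of rows whose city is the target divided by the number of all rows. That reading is an assumption. The sketch below computes such a ratio in plain Python on made-up orders, with `target_share` as a hypothetical helper and the month identified by (year, month) rather than a month-end timestamp.

```python
from collections import Counter
from datetime import date

# Made-up toy orders: (order date, category, city).
orders = [
    (date(2014, 1, 6), "Furniture", "Los Angeles"),
    (date(2014, 1, 20), "Furniture", "Seattle"),
    (date(2014, 1, 9), "Technology", "Houston"),
    (date(2014, 2, 3), "Furniture", "Los Angeles"),
    (date(2014, 2, 14), "Furniture", "Los Angeles"),
]


def target_share(orders, target):
    """Share of rows with city == target per (year, month, category)."""
    totals = Counter()
    hits = Counter()
    for order_date, category, city in orders:
        key = (order_date.year, order_date.month, category)
        totals[key] += 1
        if city == target:
            hits[key] += 1
    # Counter returns 0 for groups where the target never appears.
    return {key: hits[key] / totals[key] for key in sorted(totals)}


shares = target_share(orders, "Los Angeles")
for key, share in shares.items():
    print(key, share)
```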

Validation Methods#

check_group_counts() - Analyze group statistics to assess viability for missing value imputation.

from frameon import load_dataset, FrameOn as fo

titanic = fo(load_dataset('titanic'))
titanic['age'].preproc.check_group_counts(
    category_columns=['sex', 'class'],
)

============================ Group Analysis Report =============================
Grouping columns: sex, class
Value column: age

Total groups:                            6
Groups with missing values:              100.0%
Groups with ALL values missing:          0.0%
Total missing values:                    177
Missing in non-empty groups:             177

---------------------------- Group Size Statistics -----------------------------
Mean group size:               119.0
Median group size:             100.0
Minimum group size:            74
Maximum group size:            253
Standard deviation:            66.6

-------------------------- Missing Value Distribution --------------------------
Groups with 1 missing value:   0.0%
Groups with 2-5 missing values: 16.7%
Groups with 5+ missing values: 83.3%

---------------- Threshold Analysis (ontly groups with missings)----------------
Groups with 5+ elements:       100.0%
Groups with 10+ elements:      100.0%
Groups with 20+ elements:      100.0%
Groups with 30+ elements:      100.0%
Groups with 40+ elements:      100.0%
Groups with 50+ elements:      100.0%
================================================================================
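The headline figures of such a report come down to per-group counts. The sketch below derives a few of them in plain Python on made-up rows. It assumes that group size counts observed (non-missing) values only, which is an inference from the report rather than a documented fact, and `group_missing_summary` is a hypothetical helper.

```python
# Made-up toy rows; None marks a missing age.
rows = [
    {"sex": "male", "class": "Third", "age": 22.0},
    {"sex": "male", "class": "Third", "age": None},
    {"sex": "male", "class": "Third", "age": 30.0},
    {"sex": "female", "class": "First", "age": 38.0},
    {"sex": "female", "class": "First", "age": None},
    {"sex": "female", "class": "Second", "age": 27.0},
    {"sex": "male", "class": "First", "age": None},
]


def group_missing_summary(rows, keys, column):
    """Count observed and missing values per group and summarise them."""
    counts = {}  # group -> [observed, missing]
    for row in rows:
        group = tuple(row[k] for k in keys)
        observed_missing = counts.setdefault(group, [0, 0])
        observed_missing[1 if row[column] is None else 0] += 1

    n_groups = len(counts)
    with_missing = sum(1 for obs, mis in counts.values() if mis > 0)
    all_missing = sum(1 for obs, mis in counts.values() if obs == 0)
    return {
        "total_groups": n_groups,
        "pct_groups_with_missing": 100 * with_missing / n_groups,
        "pct_groups_all_missing": 100 * all_missing / n_groups,
        "total_missing": sum(mis for obs, mis in counts.values()),
        "min_observed": min(obs for obs, mis in counts.values()),
        "max_observed": max(obs for obs, mis in counts.values()),
    }


summary = group_missing_summary(rows, ["sex", "class"], "age")
print(summary)
```

A non-zero share of groups with all values missing would mean that group-based filling cannot cover every gap.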