Series-level Preprocessing Examples#
Simple examples of series preprocessing methods.
Used Libraries#
Some of the methods below rely on the following libraries:
import plotly.io as pio
pio.renderers.default = "notebook"
Categorical Conversion#
to_categorical() - Converts a numerical series to a categorical one using the specified method.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
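# Each label maps to a predicate on the value; "default" marks the fallback label.
# The counts below suggest the rules are checked in order (first match wins).
rules = {
    "Child": lambda x: x <= 12,
    "Young": lambda x: x <= 25,
    "Adult": lambda x: x < 60,
    "Elderly": lambda x: x >= 60,
    "Missing Age": "default"
}
titanic['age_cat'] = titanic['age'].preproc.to_categorical(
    method='rules',
    rules=rules
)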
| | Count |
|---|---|
| Adult | 387 |
| Young | 232 |
| Missing Age | 177 |
| Child | 69 |
| Elderly | 26 |
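To make the rule semantics concrete, here is a minimal sketch in plain pandas of first-match-wins categorization. It is not frameon's implementation: the helper `categorize_by_rules` and its `default_label` parameter are hypothetical names introduced for illustration, and the ordering behaviour is an assumption inferred from the counts above.

```python
import pandas as pd

def categorize_by_rules(series, rules, default_label):
    """Label each value with the first rule it satisfies (hypothetical helper)."""
    def label_for(value):
        # Missing values never reach the predicates; they get the default label.
        if pd.isna(value):
            return default_label
        for label, predicate in rules.items():
            if predicate(value):
                return label
        return default_label
    return series.map(label_for).astype("category")

# Toy ages chosen to hit each boundary; None becomes NaN.
ages = pd.Series([4, 12, 13, 25, 40, 60, None])
age_rules = {
    "Child": lambda x: x <= 12,
    "Young": lambda x: x <= 25,
    "Adult": lambda x: x < 60,
    "Elderly": lambda x: x >= 60,
}
age_cat = categorize_by_rules(ages, age_rules, default_label="Missing Age")
print(age_cat.tolist())
```

Because dictionaries preserve insertion order, a 10-year-old matches "Child" before the broader "Young" rule is ever tested.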
Text Processing#
normalize_string_series() - Normalizes a Series of strings with cleaning and standardization options.
from frameon import FrameOn as fo
import pandas as pd
data = {
    "text": [
        " Hello, how are you? ",
        "RANDOM TEXT IN UPPERCASE",
        "café and naïve - special characters!",
        "one two.three,four",
        "example with 'quotes' and — dashes",
    ]
}
df = fo(pd.DataFrame(data))
df['cleaned_text'] = df['text'].preproc.normalize_string_series(
    case_format='title'
)
df
| | text | cleaned_text |
|---|---|---|
| 0 | Hello, how are you? | Hello How Are You |
| 1 | RANDOM TEXT IN UPPERCASE | Random Text In Uppercase |
| 2 | café and naïve - special characters! | Cafe And Naive Special Characters |
| 3 | one two.three,four | One Two Three Four |
| 4 | example with 'quotes' and — dashes | Example With Quotes And Dashes |
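The cleaning steps visible in the table (accent removal, punctuation stripping, whitespace collapsing, title casing) can be approximated with the standard library alone. The sketch below is an assumption about what such a pipeline might look like, not the library's actual code; `normalize_text` is a hypothetical name.

```python
import re
import unicodedata

def normalize_text(text):
    """Approximate the normalization shown above (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every character that is not a word character or whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", unaccented)
    # Collapse runs of whitespace and trim both ends.
    collapsed = re.sub(r"\s+", " ", no_punct).strip()
    return collapsed.title()

samples = [
    " Hello, how are you? ",
    "café and naïve - special characters!",
    "one two.three,four",
]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```

Note that `\w` also keeps underscores, so a stricter pipeline would need a different character class.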
Numeric Transformations#
transform_numeric() - Applies a numeric transformation to a series (Box-Cox in this example).
from frameon import load_dataset, FrameOn as fo
diamonds = fo(load_dataset('diamonds'))
diamonds['price_log'] = diamonds['price'].preproc.transform_numeric(
    method='boxcox',
    show_dist=True
)
0 326
1 326
2 327
3 334
4 335
...
53935 2757
53936 2757
53937 2757
53938 2757
53939 2757
Name: price, Length: 53940, dtype: int64
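For readers unfamiliar with the Box-Cox transform, the following sketch applies its standard formula for a fixed lambda using only the standard library. How frameon chooses lambda is not stated in this page, so `boxcox_transform` and its `lmbda` argument are illustrative assumptions; `scipy.stats.boxcox` estimates lambda by maximum likelihood when none is supplied.

```python
import math

def boxcox_transform(values, lmbda):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        # The lambda -> 0 limit of (x**lambda - 1) / lambda is log(x).
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1) / lmbda for v in values]

# First five prices from the output above.
prices = [326, 326, 327, 334, 335]
log_prices = boxcox_transform(prices, lmbda=0)
sqrt_like_prices = boxcox_transform(prices, lmbda=0.5)
print(log_prices)
print(sqrt_like_prices)
```

With lambda equal to 0 the transform reduces to a natural log, which is one plausible reading of the `price_log` column name.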
Missing Value Handling#
fill_missing_by_category() - Fills missing values using category-based strategies.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
print(f"Missings in age before handling: {titanic['age'].isna().sum()}")
titanic['age'] = titanic['age'].preproc.fill_missing_by_category(
    category_columns=['sex', 'class'],
    strategy='simple'
)
print(f"Missings in age after handling: {titanic['age'].isna().sum()}")
Missings in age before handling: 177
Missings in age after handling: 0
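A category-based fill can be sketched in plain pandas by replacing each missing value with a statistic of its own group. The example below uses the group median on made-up data; whether `strategy='simple'` uses the median, the mean, or something else is not stated here, so treat the choice of statistic as an assumption.

```python
import pandas as pd

# Toy data, not the Titanic dataset.
passengers = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "age": [40.0, None, 50.0, 20.0, 30.0, None],
})

# transform() returns one value per row, aligned with the original index,
# so it can be passed directly to fillna().
group_median = passengers.groupby(["sex", "class"])["age"].transform("median")
passengers["age_filled"] = passengers["age"].fillna(group_median)
print(passengers)
```

A group whose values are all missing has no median, so its rows would stay missing under this approach.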
impute_missing() - Imputes missing values in a numerical series using the specified method.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
print(f"Missings in age before handling: {titanic['age'].isna().sum()}")
titanic['age'] = titanic['age'].preproc.impute_missing(
    method='iterative',
)
print(f"Missings in age after handling: {titanic['age'].isna().sum()}")
Missings in age before handling: 177
Missings in age after handling: 0
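Iterative imputation models each incomplete feature as a function of the other features and refines the estimates over several rounds. The sketch below shows the idea with scikit-learn's `IterativeImputer` on a made-up two-column array; it is an assumption that frameon's `method='iterative'` behaves similarly, and which helper columns it uses is not documented on this page.

```python
import numpy as np
# This import is required to expose the still-experimental IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix: column 0 plays the role of age, column 1 of a fully
# observed helper feature. The values are illustrative only.
features = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [np.nan, 7.92],
    [35.0, 53.10],
    [np.nan, 8.05],
    [54.0, 51.86],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = imputer.fit_transform(features)
print(imputed)
```

Observed entries are returned unchanged; only the missing cells receive model-based estimates.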
Category Analysis#
calc_target_category_share() - Calculate the proportional share of a target category within grouped data, with support for time-based resampling.
from frameon import load_dataset, FrameOn as fo
superstore = fo(load_dataset('superstore'))
los_angeles_share = superstore['City'].preproc.calc_target_category_share(
    target_category='Los Angeles',
    group_columns=['Order Date', 'Category'],
    resample_freq='ME',
)
los_angeles_share.head()
| | Order Date | Category | target_share |
|---|---|---|---|
| 0 | 2014-01-31 | Furniture | 0.050000 |
| 1 | 2014-01-31 | Office Supplies | 0.040000 |
| 2 | 2014-01-31 | Technology | 0.000000 |
| 3 | 2014-02-28 | Furniture | 0.000000 |
| 4 | 2014-02-28 | Office Supplies | 0.032258 |
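The share calculation amounts to averaging a boolean "is this the target category" flag within each group. Here is a minimal pandas sketch on made-up orders; it groups by calendar month via `dt.to_period("M")` rather than resampling to month-end timestamps, and unlike the output above it does not add rows for month and category combinations that have no orders.

```python
import pandas as pd

# Toy orders, not the superstore dataset.
orders = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2014-01-05", "2014-01-20", "2014-01-25", "2014-02-03", "2014-02-14"]
    ),
    "Category": ["Furniture", "Furniture", "Technology", "Furniture", "Furniture"],
    "City": ["Los Angeles", "Seattle", "Seattle", "Los Angeles", "Los Angeles"],
})

# True where the row belongs to the target category.
is_target = orders["City"].eq("Los Angeles")
month = orders["Order Date"].dt.to_period("M")

# The mean of a boolean flag is the fraction of True values in each group.
share = (
    is_target.groupby([month, orders["Category"]])
    .mean()
    .rename("target_share")
    .reset_index()
)
print(share)
```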
Validation Methods#
check_group_counts() - Analyze group statistics to assess viability for missing value imputation.
from frameon import load_dataset, FrameOn as fo
titanic = fo(load_dataset('titanic'))
titanic['age'].preproc.check_group_counts(
    category_columns=['sex', 'class'],
)
============================ Group Analysis Report =============================
Grouping columns: sex, class
Value column: age
Total groups: 6
Groups with missing values: 100.0%
Groups with ALL values missing: 0.0%
Total missing values: 177
Missing in non-empty groups: 177
---------------------------- Group Size Statistics -----------------------------
Mean group size: 119.0
Median group size: 100.0
Minimum group size: 74
Maximum group size: 253
Standard deviation: 66.6
-------------------------- Missing Value Distribution --------------------------
Groups with 1 missing value: 0.0%
Groups with 2-5 missing values: 16.7%
Groups with 5+ missing values: 83.3%
---------------- Threshold Analysis (only groups with missings) ----------------
Groups with 5+ elements: 100.0%
Groups with 10+ elements: 100.0%
Groups with 20+ elements: 100.0%
Groups with 30+ elements: 100.0%
Groups with 40+ elements: 100.0%
Groups with 50+ elements: 100.0%
================================================================================
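The core figures in such a report (group sizes, missing counts per group, fully empty groups) can be reproduced with a pandas groupby. The sketch below runs on made-up data and covers only those basic counts; the column names `group_size`, `missing`, and `all_missing` are illustrative choices, not frameon output fields.

```python
import pandas as pd

# Toy data, not the Titanic dataset.
passengers = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "class": ["First", "First", "Third", "Third", "Third"],
    "age": [40.0, None, None, 20.0, 30.0],
})

# "size" counts all rows in a group, including those with a missing age.
summary = passengers.groupby(["sex", "class"])["age"].agg(
    group_size="size",
    missing=lambda s: s.isna().sum(),
)
# A group with every value missing cannot supply a fill statistic.
summary["all_missing"] = summary["missing"] == summary["group_size"]
print(summary)
```

Groups flagged as `all_missing`, or groups with very few observed values, are the ones where group-based imputation is least reliable.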