Preprocessing Data In Pandas
Pandas
Scale
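Transform every data column so that its values range from 0 to 1.
# Min-max scaling: subtract the column minimum, then divide by the column range.
scale = lambda x: (x - x.min()) / (x.max() - x.min())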
df_scale = df.apply(scale)
df_scale.describe()
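To check that a scaled column really spans 0 to 1, it helps to try min-max scaling on a tiny frame, including a column with negative values. The sketch below is illustrative only: `df_demo`, `min_max`, and the column names are hypothetical and not part of the original data.

```python
import pandas as pd

# Hypothetical toy data; column "b" contains a negative value.
df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [-10.0, 0.0, 10.0]})


def min_max(col: pd.Series) -> pd.Series:
    """Map a column linearly onto the interval [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())


df_demo_scaled = df_demo.apply(min_max)
print(df_demo_scaled.describe().loc[["min", "max"]])
```

Subtracting the minimum first is what pins the lower end at 0; dividing by the maximum alone only pins the upper end at 1.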
Normalize
Transform every data column to have mean 0 and standard deviation 1.
normalize = lambda x: (x - x.mean()) / x.std()
df_norm = df.apply(normalize)
df_norm.describe()
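One detail worth knowing here: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` uses the population form (`ddof=0`), so the two can differ slightly on small data. The following is a minimal sketch on hypothetical toy data; `df_demo` and `z_score` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})


def z_score(col: pd.Series, ddof: int = 1) -> pd.Series:
    """Center a column on its mean and divide by its standard deviation."""
    return (col - col.mean()) / col.std(ddof=ddof)


df_sample = df_demo.apply(z_score)                               # ddof=1, the pandas default
df_population = df_demo.apply(lambda col: z_score(col, ddof=0))  # ddof=0

# Each result has unit standard deviation under its own convention.
print(df_sample.std(ddof=1))
print(df_population.std(ddof=0))
```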
Standardize
First normalize the data, then scale from 0 to 1.
df_stdz = df.apply(normalize).apply(scale)
# If the result contains NaN or inf, a column probably has zero standard deviation, so normalizing divided by zero.
# Use the line below (it requires `import numpy as np`) to replace those values with 0 before scaling.
#df_stdz = df.apply(normalize).replace([-np.inf, np.inf], np.nan).fillna(0).apply(scale)
df_stdz.describe()
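The divide-by-zero case is easy to reproduce with a constant column, whose standard deviation is 0. The sketch below uses hypothetical toy data (`df_demo` and its column names are assumptions) and redefines the z-score lambda locally so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second column never changes.
df_demo = pd.DataFrame({"varies": [1.0, 2.0, 3.0], "constant": [5.0, 5.0, 5.0]})

z_score = lambda x: (x - x.mean()) / x.std()

# The constant column becomes 0 / 0, which pandas reports as NaN rather than raising.
df_bad = df_demo.apply(z_score)

# Treat inf as missing, then fill every missing value with 0.
df_clean = df_bad.replace([-np.inf, np.inf], np.nan).fillna(0)
print(df_clean)
```

Filling with 0 is only one choice; dropping constant columns before normalizing is another reasonable option, since they carry no information for most models.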
scikit-learn
Using scikit-learn, do the same thing.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_stdz = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
df_stdz.describe()
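As a sanity check, the scikit-learn result can be compared against a hand-written min-max computation in pandas. This is a sketch on hypothetical toy data; `df_demo`, `by_hand`, and `by_sklearn` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data for illustration.
df_demo = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [-5.0, 0.0, 15.0]})

# Min-max scaling written out by hand.
by_hand = df_demo.apply(lambda x: (x - x.min()) / (x.max() - x.min()))

# The same transformation through scikit-learn, wrapped back into a DataFrame.
by_sklearn = pd.DataFrame(
    MinMaxScaler().fit_transform(df_demo),
    columns=df_demo.columns,
    index=df_demo.index,
)

matches = np.allclose(by_hand, by_sklearn)
print(matches)
```

Because min-max scaling is unaffected by a prior shift and positive rescaling of each column, normalizing first and then min-max scaling gives the same result as min-max scaling the raw data.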
Now look into the other built-in preprocessors in scikit-learn, and consider how to handle outliers.
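As one possible starting point (not the only way to handle outliers), scikit-learn's `RobustScaler` centers each column on its median and scales by the interquartile range, so a single extreme value has less influence than with `MinMaxScaler`. The sketch below uses hypothetical toy data with one obvious outlier.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical toy data: the last value is an outlier.
df_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 1000.0]})

minmax = MinMaxScaler().fit_transform(df_demo).ravel()
robust = RobustScaler().fit_transform(df_demo).ravel()

# With min-max scaling, the outlier squeezes the ordinary values close to 0.
print(minmax)
# With robust scaling, the ordinary values keep a usable spread around the median.
print(robust)
```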