Session 11
Naive Bayes
Objective: Understand the Naive Bayes model as a good starting point for benchmarking.
import pandas as pd
from pgmpy.models import DiscreteBayesianNetwork
import os
import warnings
warnings.filterwarnings("ignore")
ruta = os.path.join('..', 'data', 'weather.csv')
df = pd.read_csv(ruta)
for col in df.select_dtypes(include=['object', 'string']).columns:
    df[col] = df[col].astype('category')
df.head()
| | Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | High | False | No |
| 1 | Sunny | Hot | High | True | No |
| 2 | Overcast | Hot | High | False | Yes |
| 3 | Rain | Mild | High | False | Yes |
| 4 | Rain | Cool | Normal | False | Yes |
df.Play.value_counts()
Play
Yes 9
No 5
Name: count, dtype: int64
target = "Play"
features = df.columns.tolist()
features.remove(target)
print("Features:", features)
print("Target:", target)
Features: ['Outlook', 'Temperature', 'Humidity', 'Windy']
Target: Play
For Naive Bayes we can keep using pgmpy's DiscreteBayesianNetwork class to define the model, train it, and make predictions.
Note the independencies: here, the observed variables are independent of one another given the parent node (the target variable).
In code, this means that the parent node (\(y\)) is connected to every other variable as a direct child.
weather_model = DiscreteBayesianNetwork([
    ("Play", "Outlook"),
    ("Play", "Humidity"),
    ("Play", "Windy"),
    ("Play", "Temperature")
])
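As a sketch of what this star-shaped structure encodes, the joint distribution factorizes into the class prior times one conditional term per attribute. The symbols \(C\) (class) and \(X_i\) (attributes) are the usual generic notation, written out here for the weather variables:

```latex
P(C, X_1, \dots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C)

P(\text{Play}, \text{Outlook}, \text{Temperature}, \text{Humidity}, \text{Windy})
  = P(\text{Play})\, P(\text{Outlook} \mid \text{Play})\, P(\text{Temperature} \mid \text{Play})\,
    P(\text{Humidity} \mid \text{Play})\, P(\text{Windy} \mid \text{Play})
```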
# Train | test
train_df = df.sample(frac=0.7, random_state=42)
test_df = df.drop(train_df.index)
train_df.shape, test_df.shape
((10, 5), (4, 5))
The fit method trains the Naive Bayes model.
During this step, pgmpy estimates the probability distributions needed to make predictions:
The class prior \(P(C)\)
The conditional probability of each attribute \(P(X_i \mid C)\)
All of these quantities are computed by Maximum Likelihood Estimation (MLE), that is, by counting frequencies in the data.
weather_model.fit(train_df)
<pgmpy.models.DiscreteBayesianNetwork.DiscreteBayesianNetwork at 0x7f35ac9e9550>
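To make "counting frequencies" concrete, here is a minimal sketch using plain pandas rather than pgmpy. It uses only the five rows shown by df.head() (not the actual training split), so the numbers are illustrative; the names `sample`, `prior`, and `cond` are introduced just for this example.

```python
import pandas as pd

# First five rows of the weather data, as displayed by df.head()
sample = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Play": ["No", "No", "Yes", "Yes", "Yes"],
})

# Prior P(C): relative frequency of each class
prior = sample["Play"].value_counts(normalize=True)

# Conditional P(Outlook | C): each column (one per class) sums to 1
cond = pd.crosstab(sample["Outlook"], sample["Play"], normalize="columns")

print(prior)
print(cond)
```

Any Outlook value that never co-occurs with a class gets probability 0 in its column, which is the issue that smoothing addresses.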
print(weather_model.get_cpds(target))
+-----------+-----+
| Play(No) | 0.4 |
+-----------+-----+
| Play(Yes) | 0.6 |
+-----------+-----+
for cpd in weather_model.get_cpds():
    print(cpd)
    print("#-----------------------------#")
+-----------+-----+
| Play(No) | 0.4 |
+-----------+-----+
| Play(Yes) | 0.6 |
+-----------+-----+
#-----------------------------#
+-------------------+----------+---------------------+
| Play | Play(No) | Play(Yes) |
+-------------------+----------+---------------------+
| Outlook(Overcast) | 0.0 | 0.5 |
+-------------------+----------+---------------------+
| Outlook(Rain) | 0.5 | 0.3333333333333333 |
+-------------------+----------+---------------------+
| Outlook(Sunny) | 0.5 | 0.16666666666666666 |
+-------------------+----------+---------------------+
#-----------------------------#
+------------------+----------+--------------------+
| Play | Play(No) | Play(Yes) |
+------------------+----------+--------------------+
| Humidity(High) | 0.75 | 0.3333333333333333 |
+------------------+----------+--------------------+
| Humidity(Normal) | 0.25 | 0.6666666666666666 |
+------------------+----------+--------------------+
#-----------------------------#
+--------------+----------+---------------------+
| Play | Play(No) | Play(Yes) |
+--------------+----------+---------------------+
| Windy(False) | 0.25 | 0.8333333333333334 |
+--------------+----------+---------------------+
| Windy(True) | 0.75 | 0.16666666666666666 |
+--------------+----------+---------------------+
#-----------------------------#
+-------------------+----------+--------------------+
| Play | Play(No) | Play(Yes) |
+-------------------+----------+--------------------+
| Temperature(Cool) | 0.25 | 0.3333333333333333 |
+-------------------+----------+--------------------+
| Temperature(Hot) | 0.75 | 0.3333333333333333 |
+-------------------+----------+--------------------+
| Temperature(Mild) | 0.0 | 0.3333333333333333 |
+-------------------+----------+--------------------+
#-----------------------------#
Now that we have the CPDs, we can call predict on the test set test_df.
predictions = weather_model.predict(test_df.drop(columns=[target]))
predictions
| | Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 3 | Rain | Mild | High | False | Yes |
| 6 | Overcast | Cool | Normal | True | Yes |
| 7 | Sunny | Mild | High | False | Yes |
| 10 | Sunny | Mild | Normal | True | Yes |
# Mask of correct predictions
predictions['Play'].values == test_df['Play'].values
array([ True, True, False, True])
# Accuracy
(predictions['Play'].values == test_df['Play'].values).mean()
np.float64(0.75)
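For reference, the decision rule behind these predictions can be written as picking the class with the largest posterior. This is the standard Naive Bayes (MAP) rule, stated with the same \(C\) and \(X_i\) notation used for the CPDs; the evidence term \(P(x_1, \dots, x_n)\) is the same for every class, so it can be dropped from the comparison:

```latex
\hat{c} = \arg\max_{c} P(c \mid x_1, \dots, x_n)
        = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c)
```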
Note: Laplace smoothing
# 1) see the zero (MLE)
print(weather_model.get_cpds('Outlook')) # P(Overcast|No) = 0
+-------------------+----------+---------------------+
| Play | Play(No) | Play(Yes) |
+-------------------+----------+---------------------+
| Outlook(Overcast) | 0.0 | 0.5 |
+-------------------+----------+---------------------+
| Outlook(Rain) | 0.5 | 0.3333333333333333 |
+-------------------+----------+---------------------+
| Outlook(Sunny) | 0.5 | 0.16666666666666666 |
+-------------------+----------+---------------------+
# 2) probabilities with MLE -> they come out as 0/1 (overconfidence)
weather_model.predict_probability(test_df.drop(columns=[target]))
| | Play_No | Play_Yes |
|---|---|---|
| 3 | 0.0 | 1.0 |
| 6 | 0.0 | 1.0 |
| 7 | 0.0 | 1.0 |
| 10 | 0.0 | 1.0 |
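The 0/1 values can be reproduced by hand. The sketch below takes test row 3 (Outlook=Rain, Temperature=Mild, Humidity=High, Windy=False) and multiplies the MLE entries copied from the CPD tables printed earlier; the helper `posterior` is a hypothetical name for this illustration, not a pgmpy function.

```python
# MLE CPD entries copied from the printed tables, in the order
# Outlook=Rain, Temperature=Mild, Humidity=High, Windy=False
prior = {"No": 0.4, "Yes": 0.6}
likelihoods = {
    "No": [0.5, 0.0, 0.75, 0.25],
    "Yes": [1 / 3, 1 / 3, 1 / 3, 5 / 6],
}


def posterior(prior, likelihoods):
    """Multiply prior and likelihoods per class, then normalize."""
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for p in likelihoods[c]:
            score *= p
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


post = posterior(prior, likelihoods)
print(post)
```

A single zero factor, here P(Temperature=Mild | No) = 0, wipes out the whole product for that class, no matter what the other attributes say.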
# 3) retrain with Laplace (add-one) smoothing
# In the pgmpy version used here, fit(..., prior_type="K2") fails with
# "TypeError: DiscreteBayesianNetwork.fit() got an unexpected keyword argument 'prior_type'",
# so the CPDs are estimated through BayesianEstimator directly.
# Assumption: the installed version exposes get_parameters(prior_type=...).
from pgmpy.estimators import BayesianEstimator
weather_smooth = DiscreteBayesianNetwork([("Play","Outlook"),("Play","Humidity"),
                                          ("Play","Windy"),("Play","Temperature")])
smooth_estimator = BayesianEstimator(weather_smooth, train_df)
weather_smooth.add_cpds(*smooth_estimator.get_parameters(prior_type="K2")) # K2 = add-1
print(weather_smooth.get_cpds('Outlook')) # no zeros anymore
+-------------------+---------------------+--------------------+
| Play | Play(No) | Play(Yes) |
+-------------------+---------------------+--------------------+
| Outlook(Overcast) | 0.14285714285714285 | 0.4444444444444444 |
+-------------------+---------------------+--------------------+
| Outlook(Rain) | 0.42857142857142855 | 0.3333333333333333 |
+-------------------+---------------------+--------------------+
| Outlook(Sunny) | 0.42857142857142855 | 0.2222222222222222 |
+-------------------+---------------------+--------------------+
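These values are consistent with the usual add-one estimate, where \(N_{x,c}\) is the number of training rows with attribute value \(x\) and class \(c\), \(N_c\) is the number of rows of class \(c\), and \(K\) is the number of possible values of the attribute (this notation is introduced here for illustration). With 4 "No" rows in the training split and 3 Outlook values, the former zero becomes 1/7:

```latex
\hat{P}(X = x \mid C = c) = \frac{N_{x,c} + 1}{N_c + K}

\hat{P}(\text{Outlook} = \text{Overcast} \mid \text{Play} = \text{No})
  = \frac{0 + 1}{4 + 3} = \frac{1}{7} \approx 0.1429
```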
weather_smooth.predict_probability(test_df.drop(columns=[target])) # smoothed probabilities
| | Play_No | Play_Yes |
|---|---|---|
| 3 | 0.237213 | 0.762787 |
| 6 | 0.218679 | 0.781321 |
| 7 | 0.318091 | 0.681909 |
| 10 | 0.456418 | 0.543582 |
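As a sanity check, the smoothed posterior for test row 3 can be recomputed from raw counts. The counts below are reconstructed from the MLE tables (4 "No" and 6 "Yes" rows in the training split), and the sketch assumes the pseudo-count is also added to the class prior; `smoothed_posterior` and `alpha` are hypothetical names for this illustration.

```python
# Counts reconstructed from the MLE CPD tables, for the evidence
# Outlook=Rain, Temperature=Mild, Humidity=High, Windy=False
class_counts = {"No": 4, "Yes": 6}
evidence_counts = {
    "No": [2, 0, 3, 1],
    "Yes": [2, 2, 2, 5],
}
n_states = [3, 3, 2, 2]  # number of values of each attribute, same order


def smoothed_posterior(alpha):
    """Posterior over classes with additive smoothing (alpha=0 is plain MLE)."""
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    scores = {}
    for c, n_c in class_counts.items():
        # Smoothed class prior
        score = (n_c + alpha) / (n_total + alpha * n_classes)
        # Smoothed conditional for each observed attribute value
        for count, k in zip(evidence_counts[c], n_states):
            score *= (count + alpha) / (n_c + alpha * k)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


mle_post = smoothed_posterior(alpha=0)
laplace_post = smoothed_posterior(alpha=1)
print(mle_post)
print(laplace_post)
```

With alpha=1 the "No" posterior comes out near 0.2372, in line with the first row of the table above, while alpha=0 gives the hard 0/1 result of plain MLE.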
# 4) accuracy: unchanged (0.75) -> what changes are the PROBABILITIES
pred_s = weather_smooth.predict(test_df.drop(columns=[target]))
(pred_s['Play'].values == test_df['Play'].values).mean()
np.float64(0.75)
5.1 Example with the NaiveBayes class
Here is another way to do it: train with the NaiveBayes class and then build an equivalent DiscreteBayesianNetwork.
from sklearn.model_selection import train_test_split
from pgmpy.models import NaiveBayes, DiscreteBayesianNetwork
from pgmpy.inference import VariableElimination
target = "Play"
train, test = train_test_split(df, test_size=0.3, random_state=42)
First we instantiate the NaiveBayes model and train it on the training set.
# train the NB model
naive_bayes = NaiveBayes()
naive_bayes.fit(train, parent_node=target)
INFO:pgmpy: Datatype (N=numerical, C=Categorical Unordered, O=Categorical Ordered) inferred from data:
{'Outlook': 'C', 'Temperature': 'C', 'Humidity': 'C', 'Windy': 'N', 'Play': 'C'}
# build the DiscreteBayesianNetwork model
bayesian_network = DiscreteBayesianNetwork(
    [(target, f) for f in df.columns if f != target]
)
# add the CPDs learned by naive_bayes (naive_bayes.cpds)
for cpd in naive_bayes.cpds:
    bayesian_network.add_cpds(cpd)
bayesian_network.check_model()
True
# Instantiate the inference engine
infer = VariableElimination(bayesian_network)
# Prediction (MAP "by hand" // equivalent to predict)
y_true = []
y_pred = []
for _, row in test.iterrows():
    evidence = row.drop(target).to_dict()
    q = infer.map_query(variables=[target], evidence=evidence, show_progress=False)
    y_true.append(row[target])
    y_pred.append(q[target])
# accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_true, y_pred))
Accuracy: 0.8
6. Conclusions
Naive Bayes is simple, yet very effective.
Its strength comes from the conditional independence assumption.
Training the model amounts to counting frequencies.
Classifying consists of evaluating the posterior probability of each class and choosing the largest.
It is scalable and fast, and it often works well on problems with many attributes and little data.