Session 11

Naive Bayes

Objective: Understand the Naive Bayes model as a solid baseline for benchmarking.

import pandas as pd
from pgmpy.models import DiscreteBayesianNetwork
import os
import warnings
warnings.filterwarnings("ignore")
ruta = os.path.join('..', 'data', 'weather.csv')
df = pd.read_csv(ruta)

for col in df.select_dtypes(include=['object', 'string']).columns:
    df[col] = df[col].astype('category')
df.head()
    Outlook Temperature Humidity  Windy Play
0     Sunny         Hot     High  False   No
1     Sunny         Hot     High   True   No
2  Overcast         Hot     High  False  Yes
3      Rain        Mild     High  False  Yes
4      Rain        Cool   Normal  False  Yes
df.Play.value_counts()
Play
Yes    9
No     5
Name: count, dtype: int64
target = "Play"
features = df.columns.tolist()
features.remove(target)

print("Features:", features)
print("Target:", target)
Features: ['Outlook', 'Temperature', 'Humidity', 'Windy']
Target: Play

For Naive Bayes we can keep using pgmpy's DiscreteBayesianNetwork class to define the model, train it, and make predictions.

Note the independencies: the observed variables are conditionally independent of one another given the parent node (the target variable).

In code, this means the parent node (\(y\)) is connected to every other variable as a direct child.
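
The structure described above corresponds to the usual Naive Bayes factorization of the joint distribution. As a short worked statement (the symbols \(x_1, \dots, x_n\) for the features are notation assumed here for illustration):

```latex
P(y, x_1, \dots, x_n) = P(y) \prod_{i=1}^{n} P(x_i \mid y)
\qquad\Longrightarrow\qquad
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

For the weather model, \(y\) is Play and the \(x_i\) are Outlook, Temperature, Humidity, and Windy.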

weather_model = DiscreteBayesianNetwork([
    ("Play", "Outlook"),
    ("Play", "Humidity"),
    ("Play", "Windy"),
    ("Play", "Temperature")
])
# Train | test
train_df = df.sample(frac=0.7, random_state=42)
test_df = df.drop(train_df.index)
train_df.shape, test_df.shape
((10, 5), (4, 5))

The fit method trains the Naive Bayes model.
During this step, pgmpy estimates the probability distributions needed to make predictions:

  • The class prior \(P(C)\)

  • The conditional probability of each attribute given the class, \(P(X_i \mid C)\)

All of these quantities are computed by Maximum Likelihood Estimation (MLE), that is, by counting relative frequencies in the data.
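
To make "counting frequencies" concrete, here is a minimal sketch in plain Python, independent of pgmpy. The (Play, Outlook) pairs are an assumption: they are reconstructed from the class prior and the Outlook CPD printed in this session for the 10 training rows, not read from train_df.

```python
from collections import Counter

# (Play, Outlook) pairs reconstructed from the printed CPDs for the 10 training
# rows (4 "No", 6 "Yes"). Illustrative only; the real rows live in train_df.
rows = (
    [("No", "Rain")] * 2 + [("No", "Sunny")] * 2
    + [("Yes", "Overcast")] * 3 + [("Yes", "Rain")] * 2 + [("Yes", "Sunny")] * 1
)

class_counts = Counter(play for play, _ in rows)
pair_counts = Counter(rows)

# Class prior P(C): relative frequency of each class.
prior = {c: n / len(rows) for c, n in class_counts.items()}

# Conditional P(Outlook | C): joint count divided by the class count.
# Counter returns 0 for pairs that never occur, which is where MLE zeros come from.
outlook_states = ["Overcast", "Rain", "Sunny"]
conditional = {
    c: {s: pair_counts[(c, s)] / class_counts[c] for s in outlook_states}
    for c in class_counts
}

print(prior)
print(conditional["No"])
print(conditional["Yes"])
```

Any state that never co-occurs with a class gets probability exactly zero under this estimate.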

weather_model.fit(train_df)
<pgmpy.models.DiscreteBayesianNetwork.DiscreteBayesianNetwork at 0x7f35ac9e9550>
print(weather_model.get_cpds(target))
+-----------+-----+
| Play(No)  | 0.4 |
+-----------+-----+
| Play(Yes) | 0.6 |
+-----------+-----+
for cpd in weather_model.get_cpds():
    print(cpd)
    print("#-----------------------------#")
+-----------+-----+
| Play(No)  | 0.4 |
+-----------+-----+
| Play(Yes) | 0.6 |
+-----------+-----+
#-----------------------------#
+-------------------+----------+---------------------+
| Play              | Play(No) | Play(Yes)           |
+-------------------+----------+---------------------+
| Outlook(Overcast) | 0.0      | 0.5                 |
+-------------------+----------+---------------------+
| Outlook(Rain)     | 0.5      | 0.3333333333333333  |
+-------------------+----------+---------------------+
| Outlook(Sunny)    | 0.5      | 0.16666666666666666 |
+-------------------+----------+---------------------+
#-----------------------------#
+------------------+----------+--------------------+
| Play             | Play(No) | Play(Yes)          |
+------------------+----------+--------------------+
| Humidity(High)   | 0.75     | 0.3333333333333333 |
+------------------+----------+--------------------+
| Humidity(Normal) | 0.25     | 0.6666666666666666 |
+------------------+----------+--------------------+
#-----------------------------#
+--------------+----------+---------------------+
| Play         | Play(No) | Play(Yes)           |
+--------------+----------+---------------------+
| Windy(False) | 0.25     | 0.8333333333333334  |
+--------------+----------+---------------------+
| Windy(True)  | 0.75     | 0.16666666666666666 |
+--------------+----------+---------------------+
#-----------------------------#
+-------------------+----------+--------------------+
| Play              | Play(No) | Play(Yes)          |
+-------------------+----------+--------------------+
| Temperature(Cool) | 0.25     | 0.3333333333333333 |
+-------------------+----------+--------------------+
| Temperature(Hot)  | 0.75     | 0.3333333333333333 |
+-------------------+----------+--------------------+
| Temperature(Mild) | 0.0      | 0.3333333333333333 |
+-------------------+----------+--------------------+
#-----------------------------#

Now that we have the CPDs, we can call predict on the test set test_df.

predictions = weather_model.predict(test_df.drop(columns=[target]))
predictions
     Outlook Temperature Humidity  Windy Play
3       Rain        Mild     High  False  Yes
6   Overcast        Cool   Normal   True  Yes
7      Sunny        Mild     High  False  Yes
10     Sunny        Mild   Normal   True  Yes
# Mask of correct predictions
predictions['Play'].values == test_df['Play'].values
array([ True,  True, False,  True])
# Accuracy
(predictions['Play'].values == test_df['Play'].values).mean()
np.float64(0.75)

Caution: Laplace smoothing

# 1) look at the zero (MLE)
print(weather_model.get_cpds('Outlook'))  # P(Overcast | No) = 0
+-------------------+----------+---------------------+
| Play              | Play(No) | Play(Yes)           |
+-------------------+----------+---------------------+
| Outlook(Overcast) | 0.0      | 0.5                 |
+-------------------+----------+---------------------+
| Outlook(Rain)     | 0.5      | 0.3333333333333333  |
+-------------------+----------+---------------------+
| Outlook(Sunny)    | 0.5      | 0.16666666666666666 |
+-------------------+----------+---------------------+
# 2) probabilities under MLE -> they come out as 0/1 (overconfidence)
weather_model.predict_probability(test_df.drop(columns=[target]))
    Play_No  Play_Yes
3       0.0       1.0
6       0.0       1.0
7       0.0       1.0
10      0.0       1.0
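
The 0/1 probabilities can be traced by hand. The sketch below recomputes the posterior for the test row with index 3 (Rain, Mild, High, not windy) using the MLE values from the CPD tables printed earlier. The dictionary layout is only an illustration, not pgmpy's internal representation.

```python
# MLE parameters copied from the printed CPD tables (thirds and sixths written exactly).
prior = {"No": 0.4, "Yes": 0.6}
cpd = {
    "Outlook":     {"Rain": {"No": 0.5,  "Yes": 1 / 3}},
    "Temperature": {"Mild": {"No": 0.0,  "Yes": 1 / 3}},
    "Humidity":    {"High": {"No": 0.75, "Yes": 1 / 3}},
    "Windy":       {False:  {"No": 0.25, "Yes": 5 / 6}},
}

# Test row with index 3: Rain, Mild, High, not windy.
evidence = {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Windy": False}

# Unnormalized posterior: P(C) * prod_i P(x_i | C)
score = dict(prior)
for var, value in evidence.items():
    for c in score:
        score[c] *= cpd[var][value][c]

# Normalize so the class scores sum to 1.
total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}
print(posterior)
```

Because P(Mild | No) = 0, the whole product for "No" collapses to zero regardless of the other factors, so all the probability mass goes to "Yes".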
# 3) retrain with Laplace (add-one) smoothing
# Note: in the pgmpy version used here, fit() rejects prior_type with a TypeError,
# so the estimator is called directly. This assumes the installed release exposes
# BayesianEstimator(model, data).get_parameters(prior_type=...).
from pgmpy.estimators import BayesianEstimator
weather_smooth = DiscreteBayesianNetwork([("Play","Outlook"),("Play","Humidity"),
                                          ("Play","Windy"),("Play","Temperature")])
smooth_cpds = BayesianEstimator(weather_smooth, train_df).get_parameters(prior_type="K2")  # K2 = add-1
weather_smooth.add_cpds(*smooth_cpds)
print(weather_smooth.get_cpds('Outlook'))  # no zeros anymore
+-------------------+---------------------+--------------------+
| Play              | Play(No)            | Play(Yes)          |
+-------------------+---------------------+--------------------+
| Outlook(Overcast) | 0.14285714285714285 | 0.4444444444444444 |
+-------------------+---------------------+--------------------+
| Outlook(Rain)     | 0.42857142857142855 | 0.3333333333333333 |
+-------------------+---------------------+--------------------+
| Outlook(Sunny)    | 0.42857142857142855 | 0.2222222222222222 |
+-------------------+---------------------+--------------------+
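
The smoothed Outlook table can be reproduced with add-one arithmetic. This is a minimal sketch, assuming the per-class counts implied by the MLE table (4 "No" rows and 6 "Yes" rows in the training set); the helper name add_one is hypothetical.

```python
from fractions import Fraction

# Outlook counts per class, reconstructed from the MLE table (4 "No" rows, 6 "Yes" rows).
counts = {
    "No":  {"Overcast": 0, "Rain": 2, "Sunny": 2},
    "Yes": {"Overcast": 3, "Rain": 2, "Sunny": 1},
}

def add_one(state_counts, alpha=1):
    """Return smoothed probabilities (count + alpha) / (total + alpha * n_states)."""
    total = sum(state_counts.values()) + alpha * len(state_counts)
    return {s: Fraction(n + alpha, total) for s, n in state_counts.items()}

smoothed = {c: add_one(sc) for c, sc in counts.items()}
for c, probs in smoothed.items():
    print(c, {s: float(p) for s, p in probs.items()})
```

For example, P(Overcast | No) becomes (0 + 1) / (4 + 3) = 1/7, which agrees with the 0.142857 in the printed table.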
weather_smooth.predict_probability(test_df.drop(columns=[target]))  # smoothed probabilities
    Play_No  Play_Yes
3   0.237213  0.762787
6   0.218679  0.781321
7   0.318091  0.681909
10  0.456418  0.543582
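
As a sanity check on these numbers, the posterior for test row 3 can be recomputed by hand. The sketch assumes that the K2 prior adds one pseudo-count to every cell, including the class prior, and uses per-class counts reconstructed from the MLE tables; the helper smooth is hypothetical.

```python
# Per-class counts reconstructed from the MLE tables (10 training rows: 4 "No", 6 "Yes").
class_counts = {"No": 4, "Yes": 6}
counts = {
    "Outlook":     {"No":  {"Overcast": 0, "Rain": 2, "Sunny": 2},
                    "Yes": {"Overcast": 3, "Rain": 2, "Sunny": 1}},
    "Temperature": {"No":  {"Cool": 1, "Hot": 3, "Mild": 0},
                    "Yes": {"Cool": 2, "Hot": 2, "Mild": 2}},
    "Humidity":    {"No":  {"High": 3, "Normal": 1},
                    "Yes": {"High": 2, "Normal": 4}},
    "Windy":       {"No":  {False: 1, True: 3},
                    "Yes": {False: 5, True: 1}},
}

def smooth(n, total, n_states, alpha=1):
    """Add-alpha estimate: (n + alpha) / (total + alpha * n_states)."""
    return (n + alpha) / (total + alpha * n_states)

# Test row with index 3: Rain, Mild, High, not windy.
evidence = {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Windy": False}

n_rows = sum(class_counts.values())
score = {}
for c, n_c in class_counts.items():
    # Smoothed class prior, then one smoothed factor per observed attribute.
    p = smooth(n_c, n_rows, len(class_counts))
    for var, value in evidence.items():
        states = counts[var][c]
        p *= smooth(states[value], n_c, len(states))
    score[c] = p

total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}
print({c: round(p, 6) for c, p in posterior.items()})
```

Under these assumptions the result agrees with the first row of the table above (about 0.237 for "No" and 0.763 for "Yes"): the zero factor is gone, so "No" keeps a nonzero share of the probability.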
# 4) accuracy: unchanged (0.75) -> what changes are the PROBABILITIES
pred_s = weather_smooth.predict(test_df.drop(columns=[target]))
(pred_s['Play'].values == test_df['Play'].values).mean()
np.float64(0.75)

5.1 Example with the NaiveBayes class

Here is another way to do it: train with the NaiveBayes class, then build an equivalent DiscreteBayesianNetwork.

from sklearn.model_selection import train_test_split
from pgmpy.models import NaiveBayes, DiscreteBayesianNetwork
from pgmpy.inference import VariableElimination
target = "Play"
train, test = train_test_split(df, test_size=0.3, random_state=42)

First we instantiate the NaiveBayes model and train it on the training set.

# train the NB model
naive_bayes = NaiveBayes()
naive_bayes.fit(train, parent_node=target)
INFO:pgmpy: Datatype (N=numerical, C=Categorical Unordered, O=Categorical Ordered) inferred from data: 
 {'Outlook': 'C', 'Temperature': 'C', 'Humidity': 'C', 'Windy': 'N', 'Play': 'C'}
# build the DiscreteBayesianNetwork model
bayesian_network = DiscreteBayesianNetwork(
    [(target, f) for f in df.columns if f != target]
)
# add the CPDs learned by the NaiveBayes model (naive_bayes.cpds)
for cpd in naive_bayes.cpds:
    bayesian_network.add_cpds(cpd)

bayesian_network.check_model()
True
# Instantiate the inference engine
infer = VariableElimination(bayesian_network)
# Prediction (MAP query "by hand", equivalent to predict)
y_true = []
y_pred = []

for _, row in test.iterrows():
    evidence = row.drop(target).to_dict()
    q = infer.map_query(variables=[target], evidence=evidence, show_progress=False)
    y_true.append(row[target])
    y_pred.append(q[target])
# accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_true, y_pred))
Accuracy: 0.8

6. Conclusions

  • Naive Bayes is simple, yet often very effective.

  • Its strength comes from the conditional independence assumption, which reduces the model to a class prior plus one small table per attribute.

  • Training the model amounts to counting frequencies.

  • Classifying consists of evaluating the posterior probability of each class and picking the largest.

  • It is scalable and fast, and it often works well on problems with many attributes and little data.