Session 11

Naive Bayes

Objective: Understand the Naive Bayes model as a good baseline for benchmarking.

import pandas as pd
from pgmpy.models import DiscreteBayesianNetwork
import os
import warnings
warnings.filterwarnings("ignore")

ruta = os.path.join('..', 'data', 'weather.csv')
df = pd.read_csv(ruta)

for col in df.select_dtypes(include=['object', 'string']).columns:
    df[col] = df[col].astype('category')

df.head()

	Outlook	Temperature	Humidity	Windy	Play
0	Sunny	Hot	High	False	No
1	Sunny	Hot	High	True	No
2	Overcast	Hot	High	False	Yes
3	Rain	Mild	High	False	Yes
4	Rain	Cool	Normal	False	Yes

df.Play.value_counts()

Play
Yes    9
No     5
Name: count, dtype: int64

target = "Play"
features = df.columns.tolist()
features.remove(target)

print("Features:", features)
print("Target:", target)

Features: ['Outlook', 'Temperature', 'Humidity', 'Windy']
Target: Play

For Naive Bayes we can keep using pgmpy's DiscreteBayesianNetwork class to define the model, train it, and make predictions.

Note the independence structure: here the observed variables are conditionally independent of one another given the parent node (the target variable).

In code, this means that the parent node ($y$) has every other variable as a direct child.

weather_model = DiscreteBayesianNetwork([
    ("Play", "Outlook"),
    ("Play", "Humidity"),
    ("Play", "Windy"),
    ("Play", "Temperature")
])
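As a sketch of what this edge list encodes, writing $C$ for the class (Play) and $X_1, \dots, X_n$ for the features (Outlook, Humidity, Windy, Temperature), the joint distribution factorizes into one prior and one conditional per feature:

$$P(C, X_1, \dots, X_n) = P(C)\prod_{i=1}^{n} P(X_i \mid C)$$

Each factor on the right corresponds to one CPD of the network: the root node contributes $P(C)$ and every edge from Play to a feature contributes one $P(X_i \mid C)$.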

# Train | test
train_df = df.sample(frac=0.7, random_state=42)
test_df = df.drop(train_df.index)
train_df.shape, test_df.shape

((10, 5), (4, 5))

The fit method trains the Naive Bayes model.
During this step, pgmpy estimates the probability distributions needed for prediction:

The class prior, $P(C)$
The conditional probability of each feature given the class, $P(X_i \mid C)$

All of these quantities are computed by Maximum Likelihood Estimation (MLE), that is, by counting relative frequencies in the data.
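To make the "counting frequencies" idea concrete, here is a minimal sketch with plain pandas. The `toy` DataFrame and the variable names are hypothetical and are not the weather.csv rows used in this session; the point is only that the MLE estimates are normalized counts.

```python
import pandas as pd

# Hypothetical toy data (illustration only, not the session's dataset).
toy = pd.DataFrame({
    "Humidity": ["High", "High", "Normal", "Normal", "High", "Normal"],
    "Play":     ["No",   "No",   "Yes",    "Yes",    "Yes",  "Yes"],
})

# Class prior P(C): relative frequency of each class.
prior = toy["Play"].value_counts(normalize=True)

# Conditional P(X_i | C): count feature values within each class and
# normalize so that every class column sums to 1.
cond = pd.crosstab(toy["Humidity"], toy["Play"], normalize="columns")

print(prior)
print(cond)
```

In this toy sample, Humidity = Normal never occurs together with Play = No, so its estimated conditional probability is exactly zero.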

weather_model.fit(train_df)

<pgmpy.models.DiscreteBayesianNetwork.DiscreteBayesianNetwork at 0x7e0d58e156a0>

print(weather_model.get_cpds(target))

+-----------+-----+
| Play(No)  | 0.4 |
+-----------+-----+
| Play(Yes) | 0.6 |
+-----------+-----+

for cpd in weather_model.get_cpds():
    print(cpd)
    print("#-----------------------------#")

+-----------+-----+
| Play(No)  | 0.4 |
+-----------+-----+
| Play(Yes) | 0.6 |
+-----------+-----+
#-----------------------------#
+-------------------+----------+---------------------+
| Play              | Play(No) | Play(Yes)           |
+-------------------+----------+---------------------+
| Outlook(Overcast) | 0.0      | 0.5                 |
+-------------------+----------+---------------------+
| Outlook(Rain)     | 0.5      | 0.3333333333333333  |
+-------------------+----------+---------------------+
| Outlook(Sunny)    | 0.5      | 0.16666666666666666 |
+-------------------+----------+---------------------+
#-----------------------------#
+------------------+----------+--------------------+
| Play             | Play(No) | Play(Yes)          |
+------------------+----------+--------------------+
| Humidity(High)   | 0.75     | 0.3333333333333333 |
+------------------+----------+--------------------+
| Humidity(Normal) | 0.25     | 0.6666666666666666 |
+------------------+----------+--------------------+
#-----------------------------#
+--------------+----------+---------------------+
| Play         | Play(No) | Play(Yes)           |
+--------------+----------+---------------------+
| Windy(False) | 0.25     | 0.8333333333333334  |
+--------------+----------+---------------------+
| Windy(True)  | 0.75     | 0.16666666666666666 |
+--------------+----------+---------------------+
#-----------------------------#
+-------------------+----------+--------------------+
| Play              | Play(No) | Play(Yes)          |
+-------------------+----------+--------------------+
| Temperature(Cool) | 0.25     | 0.3333333333333333 |
+-------------------+----------+--------------------+
| Temperature(Hot)  | 0.75     | 0.3333333333333333 |
+-------------------+----------+--------------------+
| Temperature(Mild) | 0.0      | 0.3333333333333333 |
+-------------------+----------+--------------------+
#-----------------------------#

Now that we have the CPDs, we can call predict on the test set test_df.

predictions = weather_model.predict(test_df.drop(columns=[target]))

predictions

	Outlook	Temperature	Humidity	Windy	Play
3	Rain	Mild	High	False	Yes
6	Overcast	Cool	Normal	True	Yes
7	Sunny	Mild	High	False	Yes
10	Sunny	Mild	Normal	True	Yes

# Boolean mask of correct predictions
predictions['Play'].values == test_df['Play'].values

array([ True,  True, False,  True])

# Accuracy
(predictions['Play'].values == test_df['Play'].values).mean()

np.float64(0.75)

Caution: Laplace smoothing

# 1) inspect the zero (MLE)
print(weather_model.get_cpds('Outlook')) # P(Overcast|No) = 0

+-------------------+----------+---------------------+
| Play              | Play(No) | Play(Yes)           |
+-------------------+----------+---------------------+
| Outlook(Overcast) | 0.0      | 0.5                 |
+-------------------+----------+---------------------+
| Outlook(Rain)     | 0.5      | 0.3333333333333333  |
+-------------------+----------+---------------------+
| Outlook(Sunny)    | 0.5      | 0.16666666666666666 |
+-------------------+----------+---------------------+

# 2) probabilities under MLE -> they come out as 0/1 (overconfidence)
weather_model.predict_probability(test_df.drop(columns=[target]))

	Play_No	Play_Yes
3	0.0	1.0
6	0.0	1.0
7	0.0	1.0
10	0.0	1.0
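The 0/1 probabilities can be traced by hand. The sketch below multiplies the MLE CPD entries printed earlier in this session for test row 7 (Sunny, Mild, High, False). The dictionary names are illustrative and not part of pgmpy.

```python
# CPD values copied from the MLE tables printed earlier in this session.
prior = {"No": 0.4, "Yes": 0.6}
likelihood = {
    "No":  {"Outlook=Sunny": 0.5, "Temperature=Mild": 0.0,
            "Humidity=High": 0.75, "Windy=False": 0.25},
    "Yes": {"Outlook=Sunny": 1 / 6, "Temperature=Mild": 1 / 3,
            "Humidity=High": 1 / 3, "Windy=False": 5 / 6},
}

# Unnormalized posterior per class: P(C) * prod_i P(x_i | C)
score = {}
for c in prior:
    s = prior[c]
    for p in likelihood[c].values():
        s *= p
    score[c] = s

# Normalize so the two class scores sum to 1.
total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}
print(posterior)
```

A single zero factor, here P(Temperature=Mild | No) = 0, wipes out the whole product for that class, so all the probability mass ends up on the other class regardless of the remaining evidence.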

# 3) retrain with Laplace (add-one) smoothing
# fit() in this pgmpy version rejects prior_type (TypeError), so the estimator
# is called directly; this assumes BayesianEstimator.get_parameters accepts prior_type.
from pgmpy.estimators import BayesianEstimator
weather_smooth = DiscreteBayesianNetwork([("Play","Outlook"),("Play","Humidity"),
                                          ("Play","Windy"),("Play","Temperature")])
estimator = BayesianEstimator(weather_smooth, train_df)
weather_smooth.add_cpds(*estimator.get_parameters(prior_type="K2"))  # K2 = add-1

print(weather_smooth.get_cpds('Outlook'))  # no zeros anymore

+-------------------+---------------------+--------------------+
| Play              | Play(No)            | Play(Yes)          |
+-------------------+---------------------+--------------------+
| Outlook(Overcast) | 0.14285714285714285 | 0.4444444444444444 |
+-------------------+---------------------+--------------------+
| Outlook(Rain)     | 0.42857142857142855 | 0.3333333333333333 |
+-------------------+---------------------+--------------------+
| Outlook(Sunny)    | 0.42857142857142855 | 0.2222222222222222 |
+-------------------+---------------------+--------------------+
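The smoothed entries can be reproduced by hand as a sanity check. This sketch assumes the per-class counts of Outlook implied by the MLE table earlier in this section (the 10 training rows split into 4 "No" and 6 "Yes") and applies the add-one formula (count + 1) / (class total + number of states). The variable names are illustrative.

```python
# Counts of Outlook = [Overcast, Rain, Sunny] within each class of the
# training set, recovered from the MLE CPD (assumption: 4 "No", 6 "Yes" rows).
counts = {"No": [0, 2, 2], "Yes": [3, 2, 1]}

alpha = 1  # add-one pseudo-count per state

smoothed = {}
for c, n in counts.items():
    denom = sum(n) + alpha * len(n)  # class total + one pseudo-count per state
    smoothed[c] = [(k + alpha) / denom for k in n]

print(smoothed["No"])   # 1/7, 3/7, 3/7
print(smoothed["Yes"])  # 4/9, 3/9, 2/9
```

The results agree with the printed table: the former zero for Overcast given No becomes 1/7, and every column still sums to 1.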

weather_smooth.predict_probability(test_df.drop(columns=[target]))  # smoothed probabilities

	Play_No	Play_Yes
3	0.237213	0.762787
6	0.218679	0.781321
7	0.318091	0.681909
10	0.456418	0.543582

# 4) accuracy: unchanged (0.75) -> what changes are the PROBABILITIES
pred_s = weather_smooth.predict(test_df.drop(columns=[target]))
(pred_s['Play'].values == test_df['Play'].values).mean()

np.float64(0.75)

5.1 Example with the `NaiveBayes` class

Here is another way to do it: train with the NaiveBayes class and then build an equivalent DiscreteBayesianNetwork.

from sklearn.model_selection import train_test_split
from pgmpy.models import NaiveBayes, DiscreteBayesianNetwork
from pgmpy.inference import VariableElimination

target = "Play"
train, test = train_test_split(df, test_size=0.3, random_state=42)

We first instantiate the NaiveBayes model and train it on the training set.

# train the NB model
naive_bayes = NaiveBayes()
naive_bayes.fit(train, parent_node=target)

INFO:pgmpy: Datatype (N=numerical, C=Categorical Unordered, O=Categorical Ordered) inferred from data: 
 {'Outlook': 'C', 'Temperature': 'C', 'Humidity': 'C', 'Windy': 'N', 'Play': 'C'}

# build the equivalent DiscreteBayesianNetwork
bayesian_network = DiscreteBayesianNetwork(
    [(target, f) for f in df.columns if f != target]
)

# add the CPDs learned by naive_bayes
for cpd in naive_bayes.cpds:
    bayesian_network.add_cpds(cpd)

bayesian_network.check_model()

True

# Instantiate the inference engine
infer = VariableElimination(bayesian_network)

# Prediction (MAP query "by hand"; equivalent to predict)
y_true = []
y_pred = []

for _, row in test.iterrows():
    evidence = row.drop(target).to_dict()
    q = infer.map_query(variables=[target], evidence=evidence, show_progress=False)
    y_true.append(row[target])
    y_pred.append(q[target])

# accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_true, y_pred))

Accuracy: 0.8

6. Conclusions

Naive Bayes is simple, yet often very effective.
Its power comes from the conditional independence assumption, which reduces the model to one class prior and one small table per feature.
Training the model amounts to counting frequencies.
Classifying consists of evaluating the posterior probability of each class and choosing the largest.
It is scalable and fast, and it tends to work well on problems with many features and little data.
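Written out as a sketch, with $C$ the class and $x_1, \dots, x_n$ the observed feature values, the classification rule summarized above is:

$$\hat{c} = \arg\max_{c} \; P(C = c)\prod_{i=1}^{n} P(X_i = x_i \mid C = c)$$

The normalizing constant of the posterior is the same for every class, so it can be dropped when only the most probable class is needed.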