14. Transformer Implementation (Korean Sentiment Classification Model)

슬픈 수달이 2026. 3. 7. 15:24

Implementation (Korean Sentiment Classification Model)

Data Preprocessing

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, LayerNormalization, Dropout, Add, Input
from tensorflow.keras.optimizers import Adam
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the CSV file
dataframe = pd.read_csv('./dataset/sentiment_data.csv')

# Extract the sentences and labels
sentences = dataframe['sentence'].astype(str).tolist()
labels = dataframe['label'].tolist()
# The model outputs a sigmoid value and the loss is binary_crossentropy,
# so convert the labels to float32
y = np.array(labels, dtype=np.float32)

The dataset consists of English sentences that express sentiment. A label of 1 marks a positive sentence and 0 a negative one.

index	sentence	label
0	I love this product	1
1	Absolutely fantastic!	1
2	Would definitely recommend	1
3	Great value for money	1
4	I will buy this again	1

The binary labels are converted to float32 so that they match the sigmoid output and the BCE (binary cross-entropy) loss.

embedding_dim = 128
max_len = 10

tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
# Build the vocabulary
tokenizer.fit_on_texts(sentences)
# Convert each sentence to an array of word IDs
sequences = tokenizer.texts_to_sequences(sentences)
word_index = tokenizer.word_index
# +1 because index 0 is reserved for padding
vocab_size = len(word_index) + 1

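Tokenization uses the Tokenizer class provided by TensorFlow's Keras API.
- tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>') → defines the tokenizer; any word that is not in the vocabulary is mapped to <OOV>
- tokenizer.fit_on_texts(sentences) → scans the sentences and assigns an index to each word
- sequences = tokenizer.texts_to_sequences(sentences) → converts each sentence to an integer array using those indices
- word_index = tokenizer.word_index → retrieves the word-to-index dictionary built by fit_on_texts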
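To make the indexing steps above concrete, here is a minimal pure-Python sketch of the same idea. It is a simplification, not the Keras implementation: the regex-based word splitting and the helper names build_word_index and texts_to_sequences are assumptions made for illustration. As with the Keras Tokenizer, index 0 is left free for padding, the OOV token takes index 1, and the remaining words are numbered by descending frequency.

```python
import re
from collections import Counter

def split_words(text):
    # Simplified tokenization (assumption): lowercase, keep letters, digits, apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts, oov_token="<OOV>"):
    # Index 0 is reserved for padding, index 1 for the OOV token
    counts = Counter(word for text in texts for word in split_words(text))
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():  # most frequent words get the smallest indices
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    # Unknown words fall back to the OOV index
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in split_words(text)] for text in texts]

word_index = build_word_index(["I love this product", "I love it"])
print(word_index)
print(texts_to_sequences(["I love cats"], word_index))  # "cats" is unseen, so it maps to 1
```

Any word that was not seen while building the vocabulary ends up as index 1, which is why sentences full of unseen words carry little usable signal at prediction time.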

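# data.shape: (number of sentences, 10)
data = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding='post')

pad_sequences(sequences, maxlen=max_len, padding='post') → padding: every sentence is fixed to length max_len (10), and shorter sentences are filled with 0 at the end. Sentences longer than max_len are truncated; with the default truncating='pre', the tokens at the beginning are the ones dropped.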
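The padding and truncation behaviour described above can be sketched in a few lines of plain Python. The helper name pad_post is hypothetical; it mimics padding='post' together with the default truncating='pre', and assumes maxlen is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the front when too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the tail with padding
    return padded

print(pad_post([[5, 6, 7], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

The short sequence gains trailing zeros, while the long one loses its first two tokens.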

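# stratify: keep the label ratio the same in the training and validation sets
# Pass the float32 array y (not the raw labels list) so the targets match the loss
X_train, X_val, y_train, y_val = train_test_split(data, y, test_size=0.2, random_state=2026, stratify=y)

The data is split into training and validation sets.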

Positional Encoding

Adding position information to the Transformer

# Positional encoding
def get_positional_encoding(max_len, d_model):
    pos_enc = np.zeros((max_len, d_model), dtype=np.float32)
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i already steps over the even dimensions (i = 2k), so the exponent is i / d_model
            angle = pos / (10000 ** (i / d_model))
            # Even dimensions use sin, odd dimensions use cos (as in the Transformer paper)
            pos_enc[pos, i] = np.sin(angle)
            if i + 1 < d_model:
                pos_enc[pos, i + 1] = np.cos(angle)
    return pos_enc

positional_encoding = get_positional_encoding(max_len, embedding_dim)

pos_enc matrix: a matrix of shape (maximum sentence length, embedding dimension) is created.

for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i already steps over the even dimensions (i = 2k), so the exponent is i / d_model
            angle = pos / (10000 ** (i / d_model))
            # Even dimensions use sin, odd dimensions use cos (as in the Transformer paper)
            pos_enc[pos, i] = np.sin(angle)
            if i + 1 < d_model:
                pos_enc[pos, i + 1] = np.cos(angle)

The encoding is computed for each word position (even dimensions → sin, odd dimensions → cos). Each sin/cos pair shares the same frequency.

$$ PE_{(pos,2i)}=\sin \left( \frac{pos}{10000^{2i/d_{model}}}\right) ,\quad PE_{(pos,2i+1)}=\cos \left( \frac{pos}{10000^{2i/d_{model}}}\right) $$

Following this formula, the period changes from one dimension pair to the next.
See the paper: [1706.03762] Attention Is All You Need
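As a cross-check of the formula, the same table can be built without Python loops. The following is a minimal NumPy sketch; the function name sinusoidal_positional_encoding is an assumption and is not part of the original post. The array even_dims holds the values 2i from the formula, so each sin/cos pair shares one frequency.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len, dtype=np.float32)[:, np.newaxis]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2, dtype=np.float32)[np.newaxis, :]  # values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)            # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(10, 128)
print(pe.shape)
print(pe[0, :4])  # position 0: sin terms are 0, cos terms are 1
```

Because each pair uses the same angle, the squares of the sin and cos entries of a pair sum to 1 at every position, which is a quick sanity check for any implementation.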

Multi-head Attention Layer

# Multi-head attention layer
class MultiHeadSelfAttentionalLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads, key_dim, dropout_rate=0.0):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads = num_heads,
            key_dim = key_dim,
            dropout = dropout_rate
        )
        self.norm = LayerNormalization(epsilon=1e-6)

    def call(self, x, training=None):
        attn = self.mha(query=x, value=x, key=x, training=training)
        return self.norm(x + attn)

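The class inherits from tf.keras.layers.Layer to implement the Transformer's self-attention.

__init__

num_heads: number of heads in multi-head attention
key_dim: dimension of the vectors used in each head
dropout: dropout rate, to prevent overfitting
LayerNorm: layer normalization
- Layer normalization normalizes the input vector with its mean and variance. The denominator is the square root of the variance σ²; if it were 0, a division-by-zero error could occur, so a very small epsilon (1e-5 to 1e-7) is added inside the square root.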
$$ \hat {x}=\frac{x-\mu }{\sqrt{\sigma ^2+\epsilon }} $$
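The role of epsilon is easiest to see on an input whose variance is exactly 0. Below is a minimal NumPy sketch of the normalization formula only; the function name layer_norm is an assumption, and the learnable scale and offset that Keras LayerNormalization also applies are left out.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize over the last axis (the feature dimension)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

constant = np.full((1, 4), 3.0)             # variance is exactly 0
print(layer_norm(constant))                 # finite zeros, thanks to epsilon

varied = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(varied))                   # roughly zero mean and unit variance
```

Without epsilon the constant input would compute 0 / 0 and produce NaN values.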

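call

Self-attention is performed with the layers defined above.
In self-attention, the tokens within the same input sequence attend to one another, so x is passed as q, k, and v.
Finally, the original input x is added to the attention output (a residual connection) before normalization, which improves training stability.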

    def call(self, x, padding_mask=None, training=None):
        attn_mask = None
        if padding_mask is not None:
            # Expand to the shape MultiHeadAttention expects:
            # (batch, seq) -> (batch, 1, seq), broadcast along the query-length axis
            attn_mask = padding_mask[:, tf.newaxis, :]
        attn = self.mha(query=x, value=x, key=x, training=training, attention_mask=attn_mask)
        return self.norm(x + attn)

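Adding a padding_mask argument to call, so that padding is ignored, can improve performance because padded positions no longer receive attention weight.

The padding_mask is passed on as attention_mask so that padded positions are ignored when attention is computed.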
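The effect of the mask on the attention weights can be illustrated with a single-head NumPy sketch. This is a simplification under stated assumptions: there are no learned query/key/value projections, there is only one head, and the function name masked_self_attention_weights is hypothetical. Masked key positions are pushed to a large negative score before the softmax.

```python
import numpy as np

def masked_self_attention_weights(x, padding_mask):
    """x: (seq, d) token vectors; padding_mask: (seq,) bool, True for real tokens."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                                  # scaled dot-product scores
    scores = np.where(padding_mask[np.newaxis, :], scores, -1e9)   # hide padded key positions
    scores = scores - scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)           # row-wise softmax

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
mask = np.array([True, True, False, False])  # last two positions are padding
weights = masked_self_attention_weights(x, mask)
print(weights.round(3))  # columns 2 and 3 are 0 in every row
```

Each row still sums to 1, but all of the weight is spread over the real tokens only.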

Application

import keras  # Keras 3; provides keras.ops used below

num_heads = 8
key_dim = embedding_dim // num_heads

if embedding_dim % num_heads != 0:
    raise ValueError('embedding_dim must be divisible by num_heads.')

inputs = Input(shape=(max_len,), dtype=tf.int32)
# padding_mask = tf.not_equal(inputs, 0)
# True for real tokens, False for padding (index 0)
padding_mask = keras.ops.not_equal(inputs, 0)

x = Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True)(inputs)
# Add the positional encoding; (max_len, embedding_dim) broadcasts over the batch axis
x = x + positional_encoding

x = MultiHeadSelfAttentionalLayer(num_heads=num_heads, key_dim=key_dim, dropout_rate=0.1)(
    x, padding_mask=padding_mask
)

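There are 8 attention heads, and the key dimension is 128 / 8 = 16.
The input is a fixed-length sequence of integers, that is, the integer index arrays built from the sentences above (padding is 0).
Embedding: converts each word index into a vector of size embedding_dim; index 0 is masked (padding mask).
The positional encoding is added, and then multi-head attention is applied.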

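# Pooling also takes the padding mask into account
gap = GlobalAveragePooling1D()
x = gap(x, mask=padding_mask)

The whole sequence is averaged and compressed into a single vector (pooling); padded positions are excluded from the average.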
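What masked average pooling computes can be sketched with NumPy as follows. The function name masked_mean is an assumption, and the guard against an all-padding row is an addition made for this sketch rather than something taken from the post.

```python
import numpy as np

def masked_mean(x, mask):
    """x: (batch, seq, dim) float array; mask: (batch, seq) bool, True for real tokens."""
    m = mask.astype(x.dtype)[..., np.newaxis]   # (batch, seq, 1)
    summed = (x * m).sum(axis=1)                # padded positions contribute 0
    counts = np.maximum(m.sum(axis=1), 1.0)     # avoid dividing by 0 for all-padding rows
    return summed / counts

x = np.array([[[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]]])  # third position is padding
mask = np.array([[True, True, False]])
print(masked_mean(x, mask))  # [[3. 6.]]
```

A plain mean over all three positions would be pulled toward the padded vector, which is why the mask is passed to the pooling layer.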

x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()

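Dense(128, relu): one hidden layer with 128 units and the ReLU nonlinearity.
Because this is binary classification, the output layer uses a sigmoid activation, which yields a probability between 0 and 1.
Connecting the inputs and outputs completes the model.
- Optimizer: Adam
- Loss function: BCE
- Metric: accuracy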
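To see how the sigmoid output and the BCE loss fit together, here is a small pure-Python sketch. The helper names sigmoid and binary_cross_entropy are assumptions, and the clipping constant eps is only there to keep the logarithm finite.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log never sees 0
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(sigmoid(0.0))                                    # 0.5
print(binary_cross_entropy([1.0, 0.0], [0.9, 0.1]))    # confident and correct: small loss
print(binary_cross_entropy([1.0, 0.0], [0.1, 0.9]))    # confident and wrong: large loss
```

The loss grows sharply when the predicted probability is far from the float label, which is what drives the sigmoid output toward 0 or 1 during training.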

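model.summary()

InputLayer
- Input: (None, 10) → the batch size is variable (None), the sentence length is 10
Embedding
- Output: (None, 10, 128) → each word is converted to a 128-dimensional vector
- Parameters: 28,672 → vocab_size × embedding_dim (224 × 128)
Add
- Embedding vectors + positional encoding
NotEqual
- Checks which input positions are 0 (padding) to build the mask
MultiHeadSelfAttentionalLayer
- Output: (None, 10, 128)
- Parameters: 66,304 → query, key, value, and output projections (4 × (128 × 128 + 128) = 66,048) plus LayerNormalization (2 × 128 = 256)
GlobalAveragePooling1D
- Output: (None, 128) → the whole sequence is averaged into a single vector
- The mask is applied → padding is excluded
Dense (128, relu)
- Output: (None, 128)
- Parameters: 16,512 → 128 × 128 + 128 (bias)
Dropout
- Output: (None, 128) → prevents overfitting
Dense (1, sigmoid)
- Output: (None, 1) → final probability (binary classification)
- Parameters: 129 → 128 × 1 + 1 (bias)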

Model: "functional"

Total params: 111,617 (436.00 KB)

Trainable params: 111,617 (436.00 KB)

Non-trainable params: 0 (0.00 B)
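The totals reported by the summary can be reproduced with plain arithmetic. In the sketch below, vocab_size = 224 is inferred from the Embedding count (28,672 / 128) rather than read from the dataset, and the variable names are chosen only for illustration.

```python
embedding_dim = 128
num_heads = 8
key_dim = embedding_dim // num_heads      # 16
vocab_size = 28_672 // embedding_dim      # 224, inferred from the summary

embedding_params = vocab_size * embedding_dim
# Query, key, and value projections: kernel (128, 8, 16) plus bias (8, 16) each
qkv_params = 3 * (embedding_dim * num_heads * key_dim + num_heads * key_dim)
# Output projection: kernel (8, 16, 128) plus bias (128,)
out_proj_params = num_heads * key_dim * embedding_dim + embedding_dim
layer_norm_params = 2 * embedding_dim     # scale and offset
attention_block_params = qkv_params + out_proj_params + layer_norm_params
dense_hidden_params = embedding_dim * 128 + 128
dense_output_params = 128 * 1 + 1

total_params = (embedding_params + attention_block_params
                + dense_hidden_params + dense_output_params)
print(attention_block_params)  # 66304
print(total_params)            # 111617
```

The pooling, dropout, and positional-encoding addition contribute no trainable parameters, so these four terms account for the full total.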

Training and Testing

history = model.fit(X_train, np.array(y_train), epochs=10, batch_size=16, validation_data=(X_val, np.array(y_val)))

Training (10 epochs, batch size 16)

sample_texts = ["I absolutely love this!", "I can't stand this product"]
sample_sequences = tokenizer.texts_to_sequences(sample_texts)
sample_data = tf.keras.preprocessing.sequence.pad_sequences(sample_sequences, maxlen=max_len, padding='post')
predictions = model.predict(sample_data)

for i, text in enumerate(sample_texts):
    print(f"Text: {text}")
    print(f"Prediction: {'Positive' if predictions[i] > 0.5 else 'Negative'}")

# 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 129ms/step
# Text: I absolutely love this!
# Prediction: Positive
# Text: I can't stand this product
# Prediction: Negative

On these sample texts, the model predicts the sentiment correctly.

sample_texts = ["This is the best thing I’ve ever bought!", "I’m extremely happy with the results.", "The service was terrible and rude.", "I regret buying this, it’s a complete waste of money."]
sample_sequences = tokenizer.texts_to_sequences(sample_texts)
sample_data = tf.keras.preprocessing.sequence.pad_sequences(sample_sequences, maxlen=max_len, padding='post')
predictions = model.predict(sample_data)

for i, text in enumerate(sample_texts):
    print(f"Text: {text}")
    print(f"Prediction: {'Positive' if predictions[i] > 0.5 else 'Negative'}")
# 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 200ms/step
# Text: This is the best thing I’ve ever bought!
# Prediction: Negative
# Text: I’m extremely happy with the results.
# Prediction: Positive
# Text: The service was terrible and rude.
# Prediction: Negative
# Text: I regret buying this, it’s a complete waste of money.
# Prediction: Negative

However, the model is not perfect: the first text is misclassified as Negative.
