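Text Vectorization

Vectorization (= embedding: the step where text becomes a tensor)

Word level (RNN architecture)

A chronic weakness of static word-level vectors is that they cannot distinguish homonyms: a word gets the same vector regardless of its context.
ex) [바람], [분다] ("the wind blows") vs. [바람], [핀다] ("to have an affair"): the first token 바람 maps to the same vector in both, so the two senses cannot be told apart.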
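
The homonym problem can be sketched with a plain lookup table. This is only an illustration: the table `static_embeddings`, its values, and the helper `embed` are hypothetical and not taken from any trained model.

```python
# Hypothetical static embedding table: one fixed vector per token.
static_embeddings = {
    "바람": [0.2, -0.1, 0.7],
    "분다": [0.5, 0.3, 0.0],
    "핀다": [-0.4, 0.6, 0.1],
}

def embed(tokens):
    """Look up each token independently; the context is never consulted."""
    return [static_embeddings[t] for t in tokens]

wind = embed(["바람", "분다"])    # "the wind blows"
affair = embed(["바람", "핀다"])  # "to have an affair"

# The vector for 바람 is identical in both phrases, so the two senses collapse.
same_vector = wind[0] == affair[0]
print(same_vector)
```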

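One-hot encoding

James -> [1, 0, 0, 0, 0, 0, 0] (the tensor has one slot for every distinct input word)
is -> [0, 1, 0, 0, 0, 0, 0]
working -> ....
at ->
disney ->
in ->
london ->
Disadvantages:

The feature vector is very large and extremely sparse (mostly zeros).
It carries no information about relationships between words: every word is defined as its own independent category.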
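
A minimal sketch of both drawbacks, using the seven example words. The helper names `one_hot` and `dot` are introduced here for illustration only.

```python
words = ["James", "is", "working", "at", "disney", "in", "london"]

# Assign each distinct word a slot, in order of first appearance.
word_to_index = {}
for w in words:
    word_to_index.setdefault(w, len(word_to_index))

def one_hot(word):
    """Return a vector with a single 1 at the word's slot."""
    vec = [0] * len(word_to_index)
    vec[word_to_index[word]] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(one_hot("James"))  # [1, 0, 0, 0, 0, 0, 0]
print(one_hot("is"))     # [0, 1, 0, 0, 0, 0, 0]

# Sparsity: exactly one non-zero entry out of len(word_to_index).
# Independence: any two distinct words have dot product 0,
# so no notion of "related" words can be read from the vectors.
print(dot(one_hot("disney"), one_hot("london")))  # 0
```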

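Word2Vec, GloVe, FastText

Learn from token lists -> produce vectors that reflect relationships between words and their meaning
James -> [ 0.5, 0.1, 0.1 ]
is -> [ 0.1, 0.9, 0.1 ]
...
Advantages

Feature size is adjustable: a small dimension can be chosen (it does not have to equal the number of tokens).
Feature values are real numbers rather than only 0 and 1.
Similarity between words can be computed.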

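Sentence level

BOW (bag of words)

Built during preprocessing.
The token frequencies are taken to be the meaning of the sentence. In each example, the first line is the sentence and the second line is the vocabulary.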
James | is | working | at | Disney | in | London
James | Chris | is | going | to | working | at | Disney | in | London

1, 0, 1, 0, 0, 1, 1, 1, 1, 1 (this tensor is the BOW: one count per vocabulary slot)

Chris | is | going | to | working | in | London
James | Chris | is | going | to | working | at | Disney | in | London

0, 1, 1, 1, 1, 1, 0, 0, 1, 1

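Disadvantages

A single sentence contains very few of the tokens in the whole vocabulary, so most features are 0
-> sparse! The vector is large relative to the information it carries, which is inefficient.
Token order is not captured, and token frequency alone can distort the meaning.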
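
The BOW construction and the loss of word order can be sketched as follows. The helper `bow` and the reordered sentence are assumptions made for illustration.

```python
from collections import Counter

vocab = ["James", "Chris", "is", "going", "to", "working", "at", "Disney", "in", "London"]

def bow(sentence, vocab):
    """Count how often each vocabulary token occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[token] for token in vocab]

s1 = bow("James is working at Disney in London", vocab)
s2 = bow("Chris is going to working in London", vocab)
print(s1)  # [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
print(s2)  # [0, 1, 1, 1, 1, 1, 0, 0, 1, 1]

# Order is lost: a reordered (hypothetical) sentence maps to the same vector.
a = bow("James is in London", vocab)
b = bow("in London is James", vocab)
print(a == b)  # True
```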

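TF-IDF (term frequency - inverse document frequency): term frequency * inverse document frequency

Adds a weight to BOW: token frequency * inverse document frequency -> reduces the distortion caused by raw counts
Document frequency: look at how many documents in the collection contain the token, and multiply the token's frequency by a weight derived from that number.

EX) Take the token's document frequency (n), invert it (1/n), and multiply it by the term frequency. In practice the log-scaled form log(N/n) is used, where N is the number of documents.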
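
A small sketch of the weighting idea, assuming the log-scaled form log(N / DF) without smoothing. The toy corpus and the function names are hypothetical.

```python
import math

# Hypothetical toy corpus (not from the notes).
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]
N = len(docs)

def document_frequency(token):
    """Number of documents that contain the token."""
    return sum(token in doc.split() for doc in docs)

def idf(token):
    # Only valid for tokens that occur in at least one document (DF > 0).
    return math.log(N / document_frequency(token))

def tfidf(token, doc):
    return doc.split().count(token) * idf(token)

print(idf("the"))               # 0.0: a token found in every document carries no weight
print(idf("sat") < idf("cat"))  # True: rarer tokens get a larger weight
```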

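LLM-based embedding

Uses a model (trained on large-scale text) to choose the features
What it solves: the embedding can reflect complex semantic information such as relationships between features and their importance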

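Vectorization practice: word level

One-hot encoding as a function

from collections import defaultdict

import numpy as np

def one_hot_encoder(word_ls):
    # {word: index}: which column should each word be assigned to?
    # When a new word arrives, the current length of the dict becomes its index.
    word2id_dics = defaultdict(lambda: len(word2id_dics))

    # Assign an index each time a new word is seen.
    for word in word_ls:
        word2id_dics[word]

    # Allocate the matrix with zeros: (number of words, number of distinct words)
    one_hot_vectors = np.zeros((len(word_ls), len(word2id_dics)))

    # Set only the target positions to 1.
    for idx, word in enumerate(word_ls):
        value_idx = word2id_dics[word]
        one_hot_vectors[idx, value_idx] = 1

    return one_hot_vectors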

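Word2Vec

Computing similarity between words?

강아지 (dog) - 고양이 (cat) - 책상 (desk)
강아지 - 고양이 (relatively similar) - 책상 (relatively different)

Method

Vectorize the words (so that meaning is well reflected)
Compute similarity between the vectors (cosine similarity)

Multiply the two vectors element by element and sum the terms (the dot product), then divide by the product of the two vector lengths.

Example

강아지: [ 0.007, 0.003, -0.017],
고양이: [ 0.006, 0.008, -0.02],
책상: [-0.003, -0.014, 0.003]
강아지-고양이 dot product (the numerator of the cosine similarity)

0.007*0.006 + 0.003*0.008 + -0.017*-0.02 = 0.000406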
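
To finish the calculation, the dot product is divided by the product of the two vector lengths. A minimal sketch with the three example vectors; the helper names `dot` and `cosine` are introduced here for illustration.

```python
import math

vectors = {
    "강아지": [0.007, 0.003, -0.017],   # dog
    "고양이": [0.006, 0.008, -0.02],    # cat
    "책상": [-0.003, -0.014, 0.003],    # desk
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

dog_cat = cosine(vectors["강아지"], vectors["고양이"])
dog_desk = cosine(vectors["강아지"], vectors["책상"])

print(round(dog_cat, 3))   # about 0.975: nearly the same direction
print(round(dog_desk, 3))  # negative: the vectors point in opposing directions
```

The raw dot product (0.000406) is tiny only because the vectors are short; after normalization the 강아지-고양이 pair turns out to be highly similar.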

# Assumption: `sentences` is a list of tokenized sentences (a list of token lists)
# defined earlier in the notebook; it is not shown in these notes.
import numpy as np

# 1. Build the word list; remove duplicates using a set.
words = list(set([word for sentence in sentences for word in sentence]))

# - word -> index dict
word_to_idx = {word: i for i, word in enumerate(words)}

# - index -> word dict
idx_to_word = {i: word for i, word in enumerate(words)}

# 2. Initialize the embedding matrix (in a real model these values are learned)
# Random values stand in for trained weights here.
vocab_size = len(words)
embedding_size = 5  # configurable
word_vectors = np.random.randn(vocab_size, embedding_size) * 0.01  # example values

# 3. Cosine similarity function
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)  # magnitude: Euclidean length of the vector
    norm2 = np.linalg.norm(vec2)  # magnitude
    return dot_product / (norm1 * norm2)

# 4. Function that computes the similarity between two words
# input: two words
# Uses the word_to_idx dict to find each word's row in word_vectors
# Applies cosine_similarity() to the two vectors
def word_similarity(word1, word2):
    if word1 in word_to_idx and word2 in word_to_idx:
        vec1 = word_vectors[word_to_idx[word1]]
        vec2 = word_vectors[word_to_idx[word2]]
        return cosine_similarity(vec1, vec2)
    return None

print(word_similarity('강아지를','고양이를'))

# 5. Find the most similar words

def most_similar(word, top_n=3):
    if word not in word_to_idx:
        return []

    word_vec = word_vectors[word_to_idx[word]]
    print(word, word_vec)
    similarities = []

    for w in words:
        if w != word:
            w_vec = word_vectors[word_to_idx[w]]
            sim = cosine_similarity(word_vec, w_vec)
            print(word, w, sim)
            similarities.append((w, sim))  # (word, similarity)

    # Sort the (word, similarity) pairs by similarity, highest first
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n]

print("\nWords most similar to '고양이를':")
for word, sim in most_similar("고양이를"):
    print(f"{word}: {sim:.4f}")

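# 6. The actual vectorization process using an embedding vocabulary

Remove noise/stopwords and tokenize
Decide the target token length (sentence length)
Build the sequence dictionary: which index does each word get?

Build the text-to-sequence dict and the sequence-to-text dict

Convert sentences to sequences

Pad the sequences to the target length

Sentences longer than the target length are truncated.
Sentences shorter than the target length are filled with the padding token (0).

Build the embedding vocabulary: map each sequence index to an embedding value

Create the embedding matrix from a trained embedding model

(sentence sequence) * embedding vocabulary, i.e. look up each index's row in the embedding matrix

Embedding complete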
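
The steps above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the two sentences, `MAX_LEN`, `EMBED_DIM`, and the random matrix that stands in for weights from a trained embedding model. Noise and stopword removal are omitted.

```python
import random

sentences = ["오늘 원숭이를 봤어", "동물원에서 원숭이에게 바나나를 줬어 바나나를"]
MAX_LEN = 4     # hypothetical target length
EMBED_DIM = 3   # hypothetical embedding size
PAD = 0         # index 0 is reserved for the padding token

# 1) Tokenize (whitespace only).
tokenized = [s.split() for s in sentences]

# 2) Text-to-sequence and sequence-to-text dicts (real tokens start at 1).
text_to_seq = {}
for tokens in tokenized:
    for t in tokens:
        text_to_seq.setdefault(t, len(text_to_seq) + 1)
seq_to_text = {i: t for t, i in text_to_seq.items()}

# 3) Sentence -> sequence, truncated or padded to MAX_LEN.
def to_padded_sequence(tokens):
    seq = [text_to_seq[t] for t in tokens][:MAX_LEN]
    return seq + [PAD] * (MAX_LEN - len(seq))

sequences = [to_padded_sequence(tokens) for tokens in tokenized]

# 4) Embedding matrix: random stand-in for trained weights; row 0 (padding) is zeros.
random.seed(0)
embedding_matrix = [[0.0] * EMBED_DIM] + [
    [random.uniform(-0.01, 0.01) for _ in range(EMBED_DIM)] for _ in text_to_seq
]

# 5) Lookup: every index selects its row, giving MAX_LEN x EMBED_DIM per sentence.
embedded = [[embedding_matrix[i] for i in seq] for seq in sequences]

print(sequences)  # [[1, 2, 3, 0], [4, 5, 6, 7]]
```

The first sentence is padded with one 0, and the fifth token of the second sentence is cut off.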

Vectorization practice: sentence level

BoW

from collections import defaultdict

docs = ['오늘 동물원에서 원숭이를 봤어',
        '오늘 동물원에서 코끼리를 봤어 봤어',
        '동물원에서 원숭이에게 바나나를 줬어 바나나를']

# Whitespace tokenization
words = [stc.split() for stc in docs]

# Assign an index to each unique token
word2id = defaultdict(lambda: len(word2id))
for stc in words:
    for w in stc:
        word2id[w]

# Build the BoW vectors
BoW_ls = []
for stc in words:
    tensor = [0 for _ in range(len(word2id))]
    for w in stc:
        tensor[word2id[w]] += 1
    BoW_ls.append(tensor)

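TF-IDF

Summary

N: total number of documents
TF: frequency of the token within this document
DF: number of documents that contain the token
IDF: the inverse of DF on a log scale = log(N / DF), where DF = n_t
Importance = TF * IDF (roughly, TF discounted by DF)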

from math import log

import pandas as pd
from IPython.display import display

# There are four documents.
docs = [
    'I like a red apple',
    'the color of the banana is yellow',
    'long and yellow sweet banana',
    'I like fruits. especially apple and banana'
]
N = len(docs)  # total number of documents

# Build the vocabulary
# ['I', 'a', 'and', 'apple', 'banana', 'color', 'especially', 'fruits.',
#  'is', 'like', 'long', 'of', 'red', 'sweet', 'the', 'yellow']
vocab = list({word for sentence in docs for word in sentence.split()})
vocab.sort()

def tf(t, d):  # count how often term t occurs in document d
    words_set = d.split(' ')  # split the document into words
    return words_set.count(t)  # frequency of term t

def idf(t):  # how common term t is across all documents
    df = 0  # number of documents that contain term t
    for doc in docs:
        # Whole-token match; `t in doc` would also match substrings such as 'a' in 'banana'.
        df += t in doc.split()
    return log(N / (df + 1))  # IDF with +1 smoothing in the denominator

def tfidf(t, d):  # importance
    return tf(t, d) * idf(t)  # product of TF and IDF

# Build the TF table
result = []
for idx1, sentence in enumerate(docs):
    freq = []
    for voca in vocab:
        freq.append(tf(voca, sentence))
    result.append(freq)

display(pd.DataFrame(result, columns=vocab))

# Build the IDF table
IDF = [idf(voca) for voca in vocab]
display(pd.DataFrame(IDF, index=vocab, columns=["IDF"]))

# Build the TF-IDF table
result = []
for idx1, sentence in enumerate(docs):
    freq = []
    for voca in vocab:
        freq.append(tfidf(voca, sentence))
    result.append(freq)

display(pd.DataFrame(result, columns=vocab))

Vectorization library practice

Word: One-Hot Encoder

Word tokens

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Example data array
word_ls = ['원숭이','바나나','사과','사과']
values = np.array(word_ls)
print(values)

# Create the one-hot encoder object
onehot_enc = OneHotEncoder(sparse_output=False)

# Reshape the array
temp = values.reshape(len(values), 1)  # convert to an n x 1 matrix
print("\nReshaped array:")
print(temp)

# Run one-hot encoding:
# the input (temp) must be an n x 1 matrix.
# Columns follow sorted category order (Korean alphabetical: 바나나, 사과, 원숭이).
onehot_result = onehot_enc.fit_transform(temp)
print("\nOne-hot encoding result:")
print(onehot_result)

One-hot encoding result: [[0. 0. 1.] [1. 0. 0.] [0. 1. 0.] [0. 1. 0.]]

Sentence

# Two sentences used in this exercise
sentences = [
    "James is working at Disney in London",
    "Emily was studying at Oxford in England"
]

print("Original sentences:")
for i, sentence in enumerate(sentences):
    print(f"{i+1}. {sentence}")

# Extract every word with CountVectorizer
vectorizer = CountVectorizer(binary=True)
vectorizer.fit(sentences)  # learn the vocabulary from the two sentences
vocabulary = vectorizer.get_feature_names_out()  # tokens are lowercased by default
print("\nVocabulary containing every word:")
print(vocabulary)

# Reshape the words into a 2D array (required by OneHotEncoder)
words_array = np.array(vocabulary).reshape(-1, 1)
print(words_array)

# One-hot encode the words themselves with OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(words_array)
encoder

# From here on, pass each word in [[word]] form.
# encoded = encoder.transform([[word]])

# @title Exercise
# Print the one-hot encoding for each word of each sentence
oh_sentences = []

for sentence in sentences:
    oh_sentence = []

    words = sentence.lower().split()
    for word in words:
        word_array = np.array([[word]])
        # print(word_array)
        encoded = encoder.transform(word_array)
        # print(encoded[0])
        stc = list(encoded[0].astype(int))
        oh_sentence.append(stc)

    print(sentence)
    print(oh_sentence)
    print()
    oh_sentences.append(oh_sentence)

oh_sentences

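Word: other methods

1) Word2Vec

Developers: Tomas Mikolov and a Google research team
Developed: 2013
Paper: "Efficient Estimation of Word Representations in Vector Space" (ICLR 2013)
Neural-network-based word embedding model
Focuses on local context (learns from surrounding words)
Provides two algorithms: CBOW and Skip-gram
CBOW: the surrounding context words are given as input and the model must predict the center word. The weights learned for each word during this task become its vector. Skip-gram does the reverse and predicts the context words from the center word.
Trains online (streaming over the text), so it can handle large corpora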
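
How the two algorithms frame their training examples can be sketched as below. This only builds the (input, target) pairs; the neural network and its training are not shown, and `training_pairs` is a hypothetical helper name.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, center_word) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center

tokens = "James is working at Disney in London".split()

# CBOW: input = context words, target = center word.
cbow = list(training_pairs(tokens, window=1))

# Skip-gram: input = center word, target = each context word.
skipgram = [(center, ctx) for context, center in cbow for ctx in context]

print(cbow[1])       # (['James', 'working'], 'is')
print(skipgram[:3])  # [('James', 'is'), ('is', 'James'), ('is', 'working')]
```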

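2) GloVe

Developers: Jeffrey Pennington, Richard Socher, Christopher Manning (Stanford University research team)
Developed: 2014
Paper: "GloVe: Global Vectors for Word Representation" (EMNLP 2014)
Hybrid model combining matrix factorization with local context information
Uses both global co-occurrence statistics and local context
Explicitly analyzes the word-word co-occurrence matrix
Relationships between vectors are tied more consistently to semantic similarity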
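
The co-occurrence statistics that GloVe starts from can be sketched as raw window counts. This is a simplification under stated assumptions: the corpus, the window size, and the helper `cooccurrence` are illustrative, and GloVe's own weighting and vector-fitting steps are not shown.

```python
from collections import Counter

corpus = [
    "James is working at Disney in London".split(),
    "Emily was studying at Oxford in England".split(),
]

def cooccurrence(corpus, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in corpus:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if i != j:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence(corpus)
print(counts[("at", "in")])         # 2: the pair is within the window in both sentences
print(counts[("James", "London")])  # 0: too far apart to co-occur
```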

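3) FastText

Developers: Facebook AI Research (FAIR) team
Developed: 2016
Papers:
"Enriching Word Vectors with Subword Information" (TACL 2017)
"Bag of Tricks for Efficient Text Classification" (EACL 2017)
Uses subword information in addition to the whole word:
Each word is decomposed into character n-grams.
Example: the word "apple" is decomposed into subwords such as "<ap", "app", "ppl", "ple", "le>".
The final word vector is the sum of these subword vectors.
Handles OOV (out-of-vocabulary) words:
New words absent from the training data can still get a vector through subword decomposition.
Representations are robust to typos and morphological variation.
Effective for compound words and words with prefixes/suffixes.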
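
The subword decomposition can be sketched with a fixed n of 3. The helper `char_ngrams` is a hypothetical name, and the real library's range of n-gram lengths and its hashing of subwords are not modeled here.

```python
def char_ngrams(word, n=3):
    """Wrap the word in boundary markers and slide a window of n characters."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# An unseen word such as "apples" still shares most of its subwords with "apple",
# which is why a vector can be composed for it from known subword vectors.
shared = set(char_ngrams("apple")) & set(char_ngrams("apples"))
print(sorted(shared))  # ['<ap', 'app', 'ple', 'ppl']
```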

Document: BOW

# Visualization
from IPython import display as ICD
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create the object: it tokenizes each sentence it is given.
my_Token = lambda x : x.split()
count_vect = CountVectorizer(tokenizer = my_Token)

# Build the vocabulary
count_vect.fit(docs)
vocab = count_vect.get_feature_names_out()
# print(vocab)

# Vectorize (the vocabulary was already learned by fit above)
BoW = count_vect.transform(docs)
# print(BoW.toarray())

# Output
for i in range(len(docs)) :
    print("Document {} : {}".format(i, docs[i]))
    ICD.display(pd.DataFrame([BoW.toarray()[i]], columns=vocab))
    print('\n\n')

Document: TF-IDF

docs = ['오늘 동물원에서 원숭이를 봤어',
        '오늘 동물원에서 코끼리를 봤어 봤어',
        '동물원에서 원숭이에게 바나나를 줬어 바나나를']

# Embedding (sentence -> tokens -> numbers)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
doc_tfidf = tfidf.fit_transform(docs)
doc_tfidf.toarray()

# Visualization