๋ฐ˜์‘ํ˜•
Notice
Recent Posts
Recent Comments
Link
ยซ   2025/05   ยป
์ผ ์›” ํ™” ์ˆ˜ ๋ชฉ ๊ธˆ ํ† 
1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
Tags more
Archives
Today
Total
๊ด€๋ฆฌ ๋ฉ”๋‰ด

사람과 AI (People and AI)

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๊ธฐ๋ฒ•: >TF-IDF์™€ Word2Vec ๊ฒฝํ—˜ ๊ธฐ๋ฐ˜ ์™„์ „ ์ •๋ฆฌ ๋ณธ๋ฌธ


8353cc 2025. 4. 16. 08:52
๋ฐ˜์‘ํ˜•
TF-IDF์™€ Word2Vec ๊ฒฝํ—˜ ๊ธฐ๋ฐ˜ ์™„์ „ ์ •๋ฆฌ

📝 An NLP Practitioner's Retrospective: Conquering TF-IDF and Word2Vec

Hello! Today I want to write up TF-IDF and Word2Vec, both of which I used in a real text classification project. I remember how lost I felt when I first started with natural language processing, so I'm writing this in the hope that it helps others on the same path, even a little.

1. What Is Text Vectorization?

Text is not numeric, so a machine learning model cannot process it directly. Words or sentences therefore have to be represented as vectors first. Two of the most common techniques for this are TF-IDF and Word2Vec.

2. Types

  • TF-IDF (Term Frequency-Inverse Document Frequency)
  • Word2Vec (word embedding)
    • CBOW (Continuous Bag of Words)
    • Skip-gram

3. ์ข…๋ฅ˜๋ณ„ ๊ฐœ๋…๊ณผ ์›๋ฆฌ

๐ŸŸฉ TF-IDF

๋ฌธ์„œ ๋‚ด ์ž์ฃผ ๋“ฑ์žฅํ•˜์ง€๋งŒ ์ „์ฒด ๋ฌธ์„œ์—์„œ๋Š” ๋“œ๋ฌธ ๋‹จ์–ด์— ๋†’์€ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

  • TF: ๋‹จ์–ด ๋นˆ๋„
  • IDF: ์—ญ๋ฌธ์„œ ๋นˆ๋„ (์ „์ฒด ๋ฌธ์„œ์—์„œ ๋“œ๋ฌธ ๋‹จ์–ด์— ๋†’์€ ์ ์ˆ˜)
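
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand. It assumes the plain textbook form, tf = count / document length and idf = log(N / df); the function name `tf_idf` and the toy English documents are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = count / doc_length and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, invented for this example.
docs = [
    ["weather", "is", "nice", "weather"],
    ["rain", "is", "coming"],
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "is" appears in every document, so its IDF is log(2/2) = 0 and its weight drops to zero, while "weather" is both frequent in the first document and absent from the second, so it gets the highest weight there.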

🟦 Word2Vec

Word2Vec is an embedding technique that represents each word as a dense vector of fixed size. It learns semantic similarity between words from the contexts in which they appear.

  • CBOW: predicts the center word from its surrounding words
  • Skip-gram: predicts the surrounding words from the center word
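
The difference between the two variants is easiest to see in the training examples each one would build from a sentence. The sketch below only constructs those (input, target) pairs for a fixed window. It does not train anything, and real Word2Vec implementations add further steps on top, such as negative sampling. The function names and the sample sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: the neighbors jointly predict the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, center))
    return pairs

tokens = ["the", "weather", "is", "nice", "today"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

With `window=1`, Skip-gram yields one pair per neighbor (for example `("weather", "the")` and `("weather", "is")`), whereas CBOW yields one pair per position with all neighbors grouped (for example `(["the", "is"], "weather")`).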

4. Basic Code Examples

TF-IDF

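from sklearn.feature_extraction.text import TfidfVectorizer

# Two short Korean sentences: "The weather is nice today" / "It will rain tomorrow"
texts = ["오늘 날씨가 좋다", "내일은 비가 온다"]
tfidf = TfidfVectorizer()
# Learn the vocabulary and IDF weights, then build the document-term matrix
tfidf_matrix = tfidf.fit_transform(texts)
print(tfidf.get_feature_names_out())  # column order of the matrix
print(tfidf_matrix.toarray())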

Word2Vec

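from gensim.models import Word2Vec

# Pre-tokenized sentences: ["today", "weather", "nice"], ["tomorrow", "rain", "comes"]
sentences = [["오늘", "날씨", "좋다"], ["내일", "비", "온다"]]
# sg=1 selects Skip-gram (sg=0 would be CBOW); min_count=1 keeps every word in this tiny corpus
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)
print(model.wv["날씨"])  # 100-dimensional vector for "weather"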

5. Lessons from Real-World Use

In a news article classification project I worked on, I started with TF-IDF, which was quick and simple to set up and performed well. However, when topics shared a lot of vocabulary it struggled to tell them apart, so I decided meaning-based classification was needed and brought in Word2Vec.

After introducing Word2Vec, word similarity was reflected in the features and accuracy improved by 6-8%. I was especially surprised by how well it separated sentences from similar topics.

6. Pros and Cons

| Technique | Pros | Cons |
|---|---|---|
| TF-IDF | Fast and intuitive; works with sparse matrices | Ignores context and word order; captures little meaning |
| Word2Vec | Expresses meaning-based similarity; reflects context | Requires training; vector size needs tuning |

7. When to Use Each

  • TF-IDF: when you want to build a model quickly, or when the text is simple
  • Word2Vec: for tasks where semantic similarity and context matter

8. ์ฃผ์˜์‚ฌํ•ญ

  • TF-IDF๋Š” ๋‹จ์–ด ๋นˆ๋„๊ฐ€ ๊ทน๋‹จ์ ์œผ๋กœ ํ•œ์ชฝ์— ์น˜์šฐ์น˜๋ฉด ์ž˜ ์ž‘๋™ํ•˜์ง€ ์•Š์Œ
  • Word2Vec์€ ๋ง๋ญ‰์น˜๊ฐ€ ์ถฉ๋ถ„ํžˆ ์ปค์•ผ ์˜๋ฏธ ์žˆ๋Š” ๋ฒกํ„ฐ๊ฐ€ ์ƒ์„ฑ๋จ
  • Word2Vec์€ ๋ฐ˜๋“œ์‹œ ํ† ํฐํ™” + ์ •์ œ๋œ ๋ฐ์ดํ„ฐ์—์„œ ํ•™์Šตํ•ด์•ผ ์•ˆ์ •๋จ

9. Wrapping Up

TF-IDF and Word2Vec each have clear strengths and weaknesses, and depending on the situation you can pick one or combine the two. I have in fact built vectors with both, concatenated them, and fed the result into a DNN. What matters is understanding the nature of your data before choosing an approach!

If this was helpful, please share your own experience in the comments 😊
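
Concatenation here simply means placing the two per-document feature vectors end to end before passing them to the network. The sketch below shows the idea with plain lists. The values and the helper name `concat_features` are made up, and with real data something like `numpy.hstack` or `scipy.sparse.hstack` would more likely be used on the full matrices.

```python
def concat_features(tfidf_vector, embedding_vector):
    """Join the two per-document feature vectors end to end."""
    return list(tfidf_vector) + list(embedding_vector)

# Hypothetical per-document features, invented for illustration.
tfidf_vector = [0.0, 0.35, 0.17, 0.0]  # one weight per vocabulary term
embedding_vector = [2.0, 0.0, 1.0]     # e.g. an averaged Word2Vec vector

features = concat_features(tfidf_vector, embedding_vector)
print(len(features), features)
```

The combined vector has as many dimensions as the vocabulary size plus the embedding size, so the input layer of the DNN would need to match that total.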

๋ฐ˜์‘ํ˜•