Data house

๋„ค์ด๋ฒ„ ์˜ํ™” ํ‰์  ๊ฐ์„ฑ ๋ถ„์„ - ํ•œ๊ธ€ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ ๋ณธ๋ฌธ

Data Science/Machine Learning

๋„ค์ด๋ฒ„ ์˜ํ™” ํ‰์  ๊ฐ์„ฑ ๋ถ„์„ - ํ•œ๊ธ€ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ

l._.been 2023. 7. 31. 19:21
Preparation 🏃🏻‍♀️
  • Pipeline: data EDA -> data cleaning (missing values / outliers) -> TF-IDF -> modeling -> prediction
  • Dataset: https://github.com/e9t/nsmc
 


 

๋ฐ์ดํ„ฐ EDA

 

1. Load the downloaded data in a Jupyter notebook

  • read_csv: pass the path where the data is stored
  • sep: the delimiter (use '\t' for tab-separated files)
import warnings
warnings.filterwarnings(action = 'ignore')

import pandas as pd
train_df = pd.read_csv('../data/nsmc/ratings_train.txt', sep='\t')
train_df.head(3)
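If you don't have the nsmc files on disk, the same `sep='\t'` behavior can be checked on an in-memory string; the rows below are hypothetical stand-ins for `ratings_train.txt`:

```python
import pandas as pd
from io import StringIO

# A tiny tab-separated stand-in for ratings_train.txt (hypothetical rows)
tsv = 'id\tdocument\tlabel\n1\tgood movie\t1\n2\tboring\t0\n'
df = pd.read_csv(StringIO(tsv), sep='\t')
print(df.shape)  # (2, 3)
```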

 

2. ๊ธ๋ถ€์ • ๋ฐ์ดํ„ฐ์…‹์˜ ์–‘์„ ๋น„๊ตํ•˜๊ธฐ

  • ๊ธ๋ถ€์ • ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ๋น„์Šทํ•œ์ง€ check  --> ๋งŒ์•ฝ ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜•์ด ์žˆ๋‹ค๋ฉด oversampling / undersampling ๊ณ ๋ คํ•ด์•ผ ํ•จ
  • value_counts: ํ•ด๋‹น ์—ด์— ์–ด๋–ค ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๊ณ  ๋ช‡๊ฐœ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ
  • ๊ฒฐ๊ณผ ๊ฐ’์„ ๋ณด์•„ํ•˜๋‹ˆ, ๊ธ์ • ๋ฐ์ดํ„ฐ(1)์™€ ๋ถ€์ • ๋ฐ์ดํ„ฐ(0)๊ฐ€ ๊ฑฐ์˜ 5:5๋น„์œจ๋กœ ์ž˜ ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
# ํ•™์Šต๋ฐ์ดํ„ฐ์˜ ๋ ˆ์ด๋ธ”
train_df['label'].value_counts()
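Dividing the counts by the total turns the check into an explicit ratio; this sketch uses a hypothetical label series in place of `train_df['label']`:

```python
import pandas as pd

# Hypothetical labels standing in for train_df['label']
labels = pd.Series([1, 0, 1, 0, 0, 1, 1, 0])
counts = labels.value_counts()
ratio = counts / counts.sum()  # fraction of each class
print(ratio)  # 0.5 for each class here
```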

 

๋ฐ์ดํ„ฐ ์ •์ œ(๊ฒฐ์ธก์น˜ / ์ด์ƒ์น˜)

 

1. Check for null values (missing values)

  • info: shows each column's name, dtype, and non-null count, so nulls are easy to spot
  • The output below shows 5 null values in the document column.
# document has 5 null values
train_df.info()
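`isnull().sum()` gives the null count per column directly, without reading it off the `info()` output; a toy frame with one missing review illustrates the idea:

```python
import pandas as pd

# Toy frame mirroring the structure of train_df, with one missing review
df = pd.DataFrame({'id': [1, 2, 3],
                   'document': ['fun', None, 'dull'],
                   'label': [1, 0, 0]})
print(df.isnull().sum())  # document: 1, the other columns: 0
```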

 

2. Remove digits from the document column

  • The output of the code below shows the data contains digits that are irrelevant to judging sentiment, so they should be removed!
# set_option - control how many rows and columns head() prints
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

print(train_df['document'].head(100))

 

  • fillna(' '): replaces missing values with a space (' ')
  • re: the built-in regex (regular expression) module
  • re.sub(r'\d+', ' ', x): replaces every run of digits ('\d+') in the string with a space
  • drop('id', axis= ?, inplace= ?): drops the id column; axis=1 drops a column, axis=0 drops a row. With inplace=True the dataframe is modified in place (and the method returns None); with inplace=False the original is untouched and a new dataframe is returned.
import re

train_df = train_df.fillna(' ')

# re - replace digits with spaces
train_df['document'] = train_df['document'].apply(lambda x : re.sub(r'\d+', ' ', x))

# re - replace nulls and digits in the test reviews with spaces
# \d+ - one or more consecutive digits become a single space
test_df = pd.read_csv('../data/nsmc/ratings_test.txt', sep='\t')
test_df = test_df.fillna(' ')
test_df['document'] = test_df['document'].apply(lambda x: re.sub(r'\d+', ' ', x))

# drop the id column
train_df.drop('id', axis=1, inplace=True)
test_df.drop('id', axis=1, inplace=True)
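The `\d+` substitution can be sanity-checked on a single made-up review before applying it to the whole column:

```python
import re

# Hypothetical review containing score digits
review = '์˜ํ™” 10์  ์ค‘ 9์ !'
cleaned = re.sub(r'\d+', ' ', review)
print(cleaned)  # '์˜ํ™”  ์  ์ค‘  ์ !' - each digit run collapsed to one space
```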

 

 

Using TF-IDF

 

1. Build a konlpy-based morpheme tokenizer function & TF-IDF

  • tw_tokenizer(): a function that tokenizes text into morphemes with morphs()
# Twitter was renamed Okt in recent konlpy versions; Okt avoids the deprecation warning
from konlpy.tag import Okt
twitter = Okt()
def tw_tokenizer(text):
    # tokenize the input text into morphemes and return them as a list
    tokens_ko = twitter.morphs(text)
    return tokens_ko
tw_tokenizer('์˜ค๋Š˜์€ ๋‚ ์”จ๊ฐ€ ์ข‹์Šต๋‹ˆ๋‹ค')

  • TfidfVectorizer: the scikit-learn class that turns text into TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# ngram_range = (1,2) : unigrams and bigrams, e.g. go, back, go back, go home
# ngram_range = (1,1) : unigrams only, e.g. go, back, home
# min_df = 3 : ignore terms appearing in fewer than 3 documents (df = document frequency)
# max_df = 0.9 : ignore terms appearing in more than 90% of documents (e.g. in 9+ of 10 documents)
tfidf_vect = TfidfVectorizer(tokenizer = tw_tokenizer, ngram_range = (1,2), min_df=3, max_df=0.9)
tfidf_vect.fit(train_df['document'])
tfidf_matrix_train = tfidf_vect.transform(train_df['document'])
print(tfidf_matrix_train.shape)
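The effect of `ngram_range=(1,2)` is easiest to see on a tiny corpus; this sketch swaps the konlpy tokenizer for a plain whitespace split (and uses `min_df=1` because the corpus is so small):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['go home now', 'go back home', 'stay home now']
# str.split stands in for tw_tokenizer; min_df=1 since the toy corpus is tiny
vect = TfidfVectorizer(tokenizer=str.split, ngram_range=(1, 2), min_df=1)
X = vect.fit_transform(docs)
print(X.shape)                   # 3 documents x (5 unigrams + 5 bigrams)
print(sorted(vect.vocabulary_))  # unigrams and bigrams together
```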

 

๋ชจ๋ธ๋ง

 

1.  ๋ถ„๋ฅ˜ ๋ชจ๋ธ (LogisticRegression) & GridSearchCV ์‚ฌ์šฉํ•˜์—ฌ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”

  • ๋ฒ ์ด์Šค ๋ชจ๋ธ๋กœ Logistic Regression์„ ์‚ฌ์šฉํ•˜๊ณ  ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ตœ๋Œ€์น˜๋กœ ๋งž์ถ”๊ณ ์ž GridsearchCV๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์–ป๊ณ ์ž ํ–ˆ๋‹ค.
  • C: ๊ทœ์ œ๋ฅผ ์ •ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ์ธ๋ฐ GridSearchCV์— ๋„ฃ์„ ๊ฒฝ์šฐ ๊ฐ€์žฅ ๊ฒฐ๊ณผ๊ฐ€ ์ข‹์•˜๋˜ ๊ฐ’์ด ๋ฐ˜ํ™˜๋œ๋‹ค
  • ์•„๋ž˜ ๊ฒฐ๊ณผ์—์„œ๋Š” C๊ฐ€ 4.5์ผ๋•Œ 0.8922 ์ •ํ™•๋„๋ฅผ ๋ณด์˜€๋‹ค.
# Logistic Regression ์ด์šฉ ๊ฐ์„ฑ ๋ถ„์„ Classification ์ˆ˜ํ–‰
lg_clf = LogisticRegression(random_state = 0)

# Parameter C ์ตœ์ ํ™” - GridSearchCV
params = {'C': [1, 3.5, 4.5, 5.5, 10]}
grid_cv = GridSearchCV(lg_clf, param_grid=params, cv= 3, scoring='accuracy', verbose = 1)

grid_cv.fit(tfidf_matrix_train, train_df['label'])
print(grid_cv.best_params_, round(grid_cv.best_score_, 4))
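The same `GridSearchCV` pattern runs in seconds on a synthetic dataset, which is handy for checking the setup before the full TF-IDF matrix; the data below is generated, not the nsmc corpus:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(LogisticRegression(random_state=0, max_iter=1000),
                    param_grid={'C': [0.1, 1, 10]},
                    cv=3, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```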

 

2. test ๋ฐ์ดํ„ฐ TF-IDFํ™” ํ•˜๊ธฐ &  ๋ชจ๋ธ ์ •ํ™•๋„ ์ธก์ •

  • accuracy_score: ์ •ํ™•๋„ 
  • ์•„๋ž˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์•„ํ•˜๋‹ˆ, GridSearchCV๋กœ ์–ป์€ ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์šฉํ•œ Logistic Regression์˜ ์ •ํ™•๋„๋Š” ์•ฝ 86% ์ด๋‹ค.
from sklearn.metrics import accuracy_score

# ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์ ์šฉํ•œ TfidfVectorizer๋ฅผ ์ด์šฉ
# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ TF-IDF ๊ฐ’์œผ๋กœ Feature ๋ณ€ํ™˜
tfidf_matrix_test = tfidf_vect.transform(test_df['document'])

# ์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ํ•™์Šต๋œ classifier๋ฅผ ๊ทธ๋Œ€๋กœ ์ด์šฉ
best_estimator = grid_cv.best_estimator_
preds = best_estimator.predict(tfidf_matrix_test)

print('Logistic Regression ์ •ํ™•๋„: ', accuracy_score(test_df['label'], preds) )
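`accuracy_score` is simply the fraction of correct predictions; a quick check with made-up labels (4 of 6 correct):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions: 4 of 6 match
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))  # 4/6 = 0.666...
```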

 

Prediction

 

1. ๋ชจ๋ธ๋ง ํ‰๊ฐ€ - ๊ธฐ์กด ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ

  • document์˜ 150๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•ด๋ณธ ๊ฒฐ๊ณผ, ๊ธ์ •(1)์œผ๋กœ ์ž˜ ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ๋‹ค.
test_df['document'][150]

best_estimator.predict(tfidf_vect.transform([test_df['document'][150]]))

  • document์˜ 2๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ชจ๋ธ์˜ ํ‰๊ฐ€ํ•ด๋ณธ ๊ฒฐ๊ณผ, ๋ถ€์ •(0)์œผ๋กœ ์ž˜ ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ๋‹ค.
test_df['document'][2]

best_estimator.predict(tfidf_vect.transform([test_df['document'][2]]))

 

2. ๋ชจ๋ธ๋ง ํ‰๊ฐ€ - ๋‚ด๊ฐ€ ๋งŒ๋“  ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ

  • ๊ฒฐ๊ณผ๊ฐ€ ์žฌ๋ฐŒ๊ฒŒ ๋‚˜์™”๋‹ค ใ…‹ใ…‹ ๋‚ด ์ƒ๊ฐ๋Œ€๋กœ ๋ชจ๋ธ์ด ๊ธ๋ถ€์ •์„ ์ž˜ ํŒ๋‹จํ•ด๊ณ  ์žˆ๋‹ค ใ…‹ใ…‹ใ…‹
  • ์ผ๋ถ€๋Ÿฌ ๊ฐ์ •์„ ํŒ๋‹จํ•˜์ง€ ๋ชปํ•˜๊ฒŒ ๊ผฌ์•„์„œ ๋ƒˆ๋Š”๋ฐ ๋‚˜๋ฆ„ ๊ดœ์ฐฎ์€ ํŒ๋‹จ์ด ๋‚˜์™€์„œ ๋†€๋ž๋‹ค ใ…Žใ……ใ…Ž
text = '์˜ํ™” ์žฌ๋ฏธ์—†๋„ค์š”'
text2 = '์˜ํ™”๊ฐ€ ๋‚ด ์Šคํƒ€์ผ์€ ์•„๋‹ˆ์—ˆ์ง€๋งŒ ๋ฐฐ์šฐ๊ฐ€ ์—ฐ๊ธฐ๋ฅผ ์ž˜ ์‚ด๋ ค์„œ ๋ณผ๋งŒ ํ–ˆ๋‹ค'
text3 = '๐Ÿ˜•'
if best_estimator.predict(tfidf_vect.transform([text])) == 0:
    print(f'"{text}" -> ๋ถ€์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text]))[0][0], 2)*100}% ์ž…๋‹ˆ๋‹ค')
else:
    print(f'"{text}" -> ๊ธ์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text]))[0][1], 2)*100}% ์ž…๋‹ˆ๋‹ค')

if best_estimator.predict(tfidf_vect.transform([text2])) == 0:
    print(f'"{text2}" -> ๋ถ€์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text2]))[0][0], 2)*100}% ์ž…๋‹ˆ๋‹ค')
else:
    print(f'"{text2}" -> ๊ธ์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text2]))[0][1], 2)*100}% ์ž…๋‹ˆ๋‹ค')

if best_estimator.predict(tfidf_vect.transform([text3])) == 0:
    print(f'"{text3}" -> ๋ถ€์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text3]))[0][0], 2)*100}% ์ž…๋‹ˆ๋‹ค')
else:
    print(f'"{text3}" -> ๊ธ์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text3]))[0][1], 2)*100}% ์ž…๋‹ˆ๋‹ค')
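The three near-identical if/else blocks above can be folded into one helper; in this sketch a toy vectorizer and classifier stand in for `tfidf_vect` and `best_estimator`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def predict_sentiment(text, vectorizer, model):
    """Return a one-line verdict with the winning class probability."""
    vec = vectorizer.transform([text])
    proba = model.predict_proba(vec)[0]
    label = model.predict(vec)[0]
    word, prob = ('negative', proba[0]) if label == 0 else ('positive', proba[1])
    return f'"{text}" -> likely {word} ({round(prob * 100)}%)'

# Toy English model standing in for tfidf_vect / best_estimator
vect = TfidfVectorizer()
clf = LogisticRegression()
X = vect.fit_transform(['good great fun', 'bad awful boring',
                        'great acting', 'awful plot'])
clf.fit(X, [1, 0, 1, 0])
print(predict_sentiment('great fun acting', vect, clf))
```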

 

3. ๋ชจ๋ธ ์˜ˆ์ธก ๊ฒฐ๊ณผ

  • ์•„๊นŒ TF-IDF ํ–ˆ๋˜ test_df์„ ๋ชจ๋ธ์— ๋„ฃ์–ด ์–ป์€ ๊ฒฐ๊ณผ ๊ฐ’๋“ค์„ ์˜ˆ์˜๊ฒŒ dataframe์œผ๋กœ ์ •๋ฆฌํ•ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค
result = pd.DataFrame({
    'document': test_df['document'],
    'test': test_df['label'],
    'pred': preds
})
result
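A dataframe like this also makes it easy to pull out the misclassified reviews for inspection; here with a toy `result` frame (one wrong row) in place of the real one:

```python
import pandas as pd

# Toy result frame with one misclassified row, same columns as above
result = pd.DataFrame({'document': ['fun', 'dull', 'so-so'],
                       'test': [1, 0, 0],
                       'pred': [1, 0, 1]})
wrong = result[result['test'] != result['pred']]
print(len(wrong))  # 1 misclassified review
```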