Data house

๋„ค์ด๋ฒ„ ์˜ํ™” ํ‰์  ๊ฐ์„ฑ ๋ถ„์„ - ํ•œ๊ธ€ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ ๋ณธ๋ฌธ

Data Science/Machine Learning

๋„ค์ด๋ฒ„ ์˜ํ™” ํ‰์  ๊ฐ์„ฑ ๋ถ„์„ - ํ•œ๊ธ€ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ

l._.been 2023. 7. 31. 19:21
Preparation 🏃🏻‍♀️
  • Pipeline: data EDA -> data cleaning (missing values / outliers) -> TF-IDF -> modeling -> prediction
  • Dataset: https://github.com/e9t/nsmc
 


 

๋ฐ์ดํ„ฐ EDA

 

1. Load the downloaded data in a Jupyter notebook

  • read_csv: pass the path where the data is stored
  • sep: the delimiter (use '\t' for tab-separated files)
import warnings
warnings.filterwarnings(action = 'ignore')

import pandas as pd
train_df = pd.read_csv('../data/nsmc/ratings_train.txt', sep='\t')
train_df.head(3)
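If you don't have the nsmc files on disk, the same `sep='\t'` behavior can be checked on an in-memory string; the rows below are hypothetical stand-ins for `ratings_train.txt`:

```python
import pandas as pd
from io import StringIO

# A tiny tab-separated stand-in for ratings_train.txt (hypothetical rows)
tsv = 'id\tdocument\tlabel\n1\tgood movie\t1\n2\tboring\t0\n'
df = pd.read_csv(StringIO(tsv), sep='\t')
print(df.shape)  # (2, 3)
```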

 

2. ๊ธ๋ถ€์ • ๋ฐ์ดํ„ฐ์…‹์˜ ์–‘์„ ๋น„๊ตํ•˜๊ธฐ

  • ๊ธ๋ถ€์ • ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ๋น„์Šทํ•œ์ง€ check  --> ๋งŒ์•ฝ ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜•์ด ์žˆ๋‹ค๋ฉด oversampling / undersampling ๊ณ ๋ คํ•ด์•ผ ํ•จ
  • value_counts: ํ•ด๋‹น ์—ด์— ์–ด๋–ค ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๊ณ  ๋ช‡๊ฐœ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ
  • ๊ฒฐ๊ณผ ๊ฐ’์„ ๋ณด์•„ํ•˜๋‹ˆ, ๊ธ์ • ๋ฐ์ดํ„ฐ(1)์™€ ๋ถ€์ • ๋ฐ์ดํ„ฐ(0)๊ฐ€ ๊ฑฐ์˜ 5:5๋น„์œจ๋กœ ์ž˜ ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
# ํ•™์Šต๋ฐ์ดํ„ฐ์˜ ๋ ˆ์ด๋ธ”
train_df['label'].value_counts()
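Dividing the counts by the total turns the check into an explicit ratio; this sketch uses a hypothetical label series in place of `train_df['label']`:

```python
import pandas as pd

# Hypothetical labels standing in for train_df['label']
labels = pd.Series([1, 0, 1, 0, 0, 1, 1, 0])
counts = labels.value_counts()
ratio = counts / counts.sum()  # fraction of each class
print(ratio)  # 0.5 for each class here
```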

 

๋ฐ์ดํ„ฐ ์ •์ œ(๊ฒฐ์ธก์น˜ / ์ด์ƒ์น˜)

 

1. Check for null values (missing values)

  • info: shows each column's name, dtype, and non-null count, so nulls are easy to spot
  • The output below shows 5 null values in the document column.
# document has 5 null values
train_df.info()
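`isnull().sum()` gives the null count per column directly, without reading it off the `info()` output; a toy frame with one missing review illustrates the idea:

```python
import pandas as pd

# Toy frame mirroring the structure of train_df, with one missing review
df = pd.DataFrame({'id': [1, 2, 3],
                   'document': ['fun', None, 'dull'],
                   'label': [1, 0, 0]})
print(df.isnull().sum())  # document: 1, the other columns: 0
```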

 

2. Remove digits from the document column

  • The output of the code below shows the data contains digits that are irrelevant to judging sentiment, so they should be removed!
# set_option - control how many rows and columns head() prints
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

print(train_df['document'].head(100))

 

  • fillna(' '): replaces missing values with a space (' ')
  • re: the built-in regex (regular expression) module
  • re.sub(r'\d+', ' ', x): replaces every run of digits ('\d+') in the string with a space
  • drop('id', axis= ?, inplace= ?): drops the id column; axis=1 drops a column, axis=0 drops a row. With inplace=True the dataframe is modified in place (and the method returns None); with inplace=False the original is untouched and a new dataframe is returned.
import re

train_df = train_df.fillna(' ')

# re - replace digits with spaces
train_df['document'] = train_df['document'].apply(lambda x : re.sub(r'\d+', ' ', x))

# re - replace nulls and digits in the test reviews with spaces
# \d+ - one or more consecutive digits become a single space
test_df = pd.read_csv('../data/nsmc/ratings_test.txt', sep='\t')
test_df = test_df.fillna(' ')
test_df['document'] = test_df['document'].apply(lambda x: re.sub(r'\d+', ' ', x))

# drop the id column
train_df.drop('id', axis=1, inplace=True)
test_df.drop('id', axis=1, inplace=True)
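The `\d+` substitution can be sanity-checked on a single made-up review before applying it to the whole column:

```python
import re

# Hypothetical review containing score digits
review = '์˜ํ™” 10์  ์ค‘ 9์ !'
cleaned = re.sub(r'\d+', ' ', review)
print(cleaned)  # '์˜ํ™”  ์  ์ค‘  ์ !' - each digit run collapsed to one space
```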

 

 

Using TF-IDF

 

1. Build a konlpy-based morpheme tokenizer function & TF-IDF

  • tw_tokenizer(): a function that tokenizes text into morphemes with morphs()
# Twitter was renamed Okt in recent konlpy versions; Okt avoids the deprecation warning
from konlpy.tag import Okt
twitter = Okt()
def tw_tokenizer(text):
    # tokenize the input text into morphemes and return them as a list
    tokens_ko = twitter.morphs(text)
    return tokens_ko
tw_tokenizer('์˜ค๋Š˜์€ ๋‚ ์”จ๊ฐ€ ์ข‹์Šต๋‹ˆ๋‹ค')

  • TfidfVectorizer: the scikit-learn class that turns text into TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# ngram_range = (1,2) : unigrams and bigrams, e.g. go, back, go back, go home
# ngram_range = (1,1) : unigrams only, e.g. go, back, home
# min_df = 3 : ignore terms appearing in fewer than 3 documents (df = document frequency)
# max_df = 0.9 : ignore terms appearing in more than 90% of documents (e.g. in 9+ of 10 documents)
tfidf_vect = TfidfVectorizer(tokenizer = tw_tokenizer, ngram_range = (1,2), min_df=3, max_df=0.9)
tfidf_vect.fit(train_df['document'])
tfidf_matrix_train = tfidf_vect.transform(train_df['document'])
print(tfidf_matrix_train.shape)
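The effect of `ngram_range=(1,2)` is easiest to see on a tiny corpus; this sketch swaps the konlpy tokenizer for a plain whitespace split (and uses `min_df=1` because the corpus is so small):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['go home now', 'go back home', 'stay home now']
# str.split stands in for tw_tokenizer; min_df=1 since the toy corpus is tiny
vect = TfidfVectorizer(tokenizer=str.split, ngram_range=(1, 2), min_df=1)
X = vect.fit_transform(docs)
print(X.shape)                   # 3 documents x (5 unigrams + 5 bigrams)
print(sorted(vect.vocabulary_))  # unigrams and bigrams together
```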

 

๋ชจ๋ธ๋ง

 

1.  ๋ถ„๋ฅ˜ ๋ชจ๋ธ (LogisticRegression) & GridSearchCV ์‚ฌ์šฉํ•˜์—ฌ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”

  • ๋ฒ ์ด์Šค ๋ชจ๋ธ๋กœ Logistic Regression์„ ์‚ฌ์šฉํ•˜๊ณ  ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ตœ๋Œ€์น˜๋กœ ๋งž์ถ”๊ณ ์ž GridsearchCV๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์–ป๊ณ ์ž ํ–ˆ๋‹ค.
  • C: ๊ทœ์ œ๋ฅผ ์ •ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ์ธ๋ฐ GridSearchCV์— ๋„ฃ์„ ๊ฒฝ์šฐ ๊ฐ€์žฅ ๊ฒฐ๊ณผ๊ฐ€ ์ข‹์•˜๋˜ ๊ฐ’์ด ๋ฐ˜ํ™˜๋œ๋‹ค
  • ์•„๋ž˜ ๊ฒฐ๊ณผ์—์„œ๋Š” C๊ฐ€ 4.5์ผ๋•Œ 0.8922 ์ •ํ™•๋„๋ฅผ ๋ณด์˜€๋‹ค.
# Logistic Regression ์ด์šฉ ๊ฐ์„ฑ ๋ถ„์„ Classification ์ˆ˜ํ–‰
lg_clf = LogisticRegression(random_state = 0)

# Parameter C ์ตœ์ ํ™” - GridSearchCV
params = {'C': [1, 3.5, 4.5, 5.5, 10]}
grid_cv = GridSearchCV(lg_clf, param_grid=params, cv= 3, scoring='accuracy', verbose = 1)

grid_cv.fit(tfidf_matrix_train, train_df['label'])
print(grid_cv.best_params_, round(grid_cv.best_score_, 4))
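The same `GridSearchCV` pattern runs in seconds on a synthetic dataset, which is handy for checking the setup before the full TF-IDF matrix; the data below is generated, not the nsmc corpus:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(LogisticRegression(random_state=0, max_iter=1000),
                    param_grid={'C': [0.1, 1, 10]},
                    cv=3, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```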

 

2. test ๋ฐ์ดํ„ฐ TF-IDFํ™” ํ•˜๊ธฐ &  ๋ชจ๋ธ ์ •ํ™•๋„ ์ธก์ •

  • accuracy_score: ์ •ํ™•๋„ 
  • ์•„๋ž˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์•„ํ•˜๋‹ˆ, GridSearchCV๋กœ ์–ป์€ ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์šฉํ•œ Logistic Regression์˜ ์ •ํ™•๋„๋Š” ์•ฝ 86% ์ด๋‹ค.
from sklearn.metrics import accuracy_score

# ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์ ์šฉํ•œ TfidfVectorizer๋ฅผ ์ด์šฉ
# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ TF-IDF ๊ฐ’์œผ๋กœ Feature ๋ณ€ํ™˜
tfidf_matrix_test = tfidf_vect.transform(test_df['document'])

# ์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ํ•™์Šต๋œ classifier๋ฅผ ๊ทธ๋Œ€๋กœ ์ด์šฉ
best_estimator = grid_cv.best_estimator_
preds = best_estimator.predict(tfidf_matrix_test)

print('Logistic Regression ์ •ํ™•๋„: ', accuracy_score(test_df['label'], preds) )
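`accuracy_score` is simply the fraction of correct predictions; a quick check with made-up labels (4 of 6 correct):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions: 4 of 6 match
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))  # 4/6 = 0.666...
```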

 

Prediction

 

1. ๋ชจ๋ธ๋ง ํ‰๊ฐ€ - ๊ธฐ์กด ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ

  • document์˜ 150๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•ด๋ณธ ๊ฒฐ๊ณผ, ๊ธ์ •(1)์œผ๋กœ ์ž˜ ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ๋‹ค.
test_df['document'][150]

best_estimator.predict(tfidf_vect.transform([test_df['document'][150]]))

  • document์˜ 2๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ชจ๋ธ์˜ ํ‰๊ฐ€ํ•ด๋ณธ ๊ฒฐ๊ณผ, ๋ถ€์ •(0)์œผ๋กœ ์ž˜ ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ๋‹ค.
test_df['document'][2]

best_estimator.predict(tfidf_vect.transform([test_df['document'][2]]))

 

2. ๋ชจ๋ธ๋ง ํ‰๊ฐ€ - ๋‚ด๊ฐ€ ๋งŒ๋“  ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ

  • ๊ฒฐ๊ณผ๊ฐ€ ์žฌ๋ฐŒ๊ฒŒ ๋‚˜์™”๋‹ค ใ…‹ใ…‹ ๋‚ด ์ƒ๊ฐ๋Œ€๋กœ ๋ชจ๋ธ์ด ๊ธ๋ถ€์ •์„ ์ž˜ ํŒ๋‹จํ•ด๊ณ  ์žˆ๋‹ค ใ…‹ใ…‹ใ…‹
  • ์ผ๋ถ€๋Ÿฌ ๊ฐ์ •์„ ํŒ๋‹จํ•˜์ง€ ๋ชปํ•˜๊ฒŒ ๊ผฌ์•„์„œ ๋ƒˆ๋Š”๋ฐ ๋‚˜๋ฆ„ ๊ดœ์ฐฎ์€ ํŒ๋‹จ์ด ๋‚˜์™€์„œ ๋†€๋ž๋‹ค ใ…Žใ……ใ…Ž
text = '์˜ํ™” ์žฌ๋ฏธ์—†๋„ค์š”'
text2 = '์˜ํ™”๊ฐ€ ๋‚ด ์Šคํƒ€์ผ์€ ์•„๋‹ˆ์—ˆ์ง€๋งŒ ๋ฐฐ์šฐ๊ฐ€ ์—ฐ๊ธฐ๋ฅผ ์ž˜ ์‚ด๋ ค์„œ ๋ณผ๋งŒ ํ–ˆ๋‹ค'
text3 = '๐Ÿ˜•'
if best_estimator.predict(tfidf_vect.transform([text])) == 0:
    print(f'"{text}" -> ๋ถ€์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text]))[0][0], 2)*100}% ์ž…๋‹ˆ๋‹ค')
else:
    print(f'"{text}" -> ๊ธ์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text]))[0][1], 2)*100}% ์ž…๋‹ˆ๋‹ค')

if best_estimator.predict(tfidf_vect.transform([text2])) == 0:
    print(f'"{text2}" -> ๋ถ€์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text2]))[0][0], 2)*100}% ์ž…๋‹ˆ๋‹ค')
else:
    print(f'"{text2}" -> ๊ธ์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text2]))[0][1], 2)*100}% ์ž…๋‹ˆ๋‹ค')

if best_estimator.predict(tfidf_vect.transform([text3])) == 0:
    print(f'"{text3}" -> ๋ถ€์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text3]))[0][0], 2)*100}% ์ž…๋‹ˆ๋‹ค')
else:
    print(f'"{text3}" -> ๊ธ์ •์ผ ๊ฐ€๋Šฅ์„ฑ์ด {round(best_estimator.predict_proba(tfidf_vect.transform([text3]))[0][1], 2)*100}% ์ž…๋‹ˆ๋‹ค')
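The three near-identical if/else blocks above can be folded into one helper; in this sketch a toy vectorizer and classifier stand in for `tfidf_vect` and `best_estimator`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def predict_sentiment(text, vectorizer, model):
    """Return a one-line verdict with the winning class probability."""
    vec = vectorizer.transform([text])
    proba = model.predict_proba(vec)[0]
    label = model.predict(vec)[0]
    word, prob = ('negative', proba[0]) if label == 0 else ('positive', proba[1])
    return f'"{text}" -> likely {word} ({round(prob * 100)}%)'

# Toy English model standing in for tfidf_vect / best_estimator
vect = TfidfVectorizer()
clf = LogisticRegression()
X = vect.fit_transform(['good great fun', 'bad awful boring',
                        'great acting', 'awful plot'])
clf.fit(X, [1, 0, 1, 0])
print(predict_sentiment('great fun acting', vect, clf))
```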

 

3. ๋ชจ๋ธ ์˜ˆ์ธก ๊ฒฐ๊ณผ

  • ์•„๊นŒ TF-IDF ํ–ˆ๋˜ test_df์„ ๋ชจ๋ธ์— ๋„ฃ์–ด ์–ป์€ ๊ฒฐ๊ณผ ๊ฐ’๋“ค์„ ์˜ˆ์˜๊ฒŒ dataframe์œผ๋กœ ์ •๋ฆฌํ•ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค
result = pd.DataFrame({
    'document': test_df['document'],
    'test': test_df['label'],
    'pred': preds
})
result
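A dataframe like this also makes it easy to pull out the misclassified reviews for inspection; here with a toy `result` frame (one wrong row) in place of the real one:

```python
import pandas as pd

# Toy result frame with one misclassified row, same columns as above
result = pd.DataFrame({'document': ['fun', 'dull', 'so-so'],
                       'test': [1, 0, 0],
                       'pred': [1, 0, 1]})
wrong = result[result['test'] != result['pred']]
print(len(wrong))  # 1 misclassified review
```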