text_lab2


print(classification_report(stem_test.target, prediction))

Accuracy score:  0.7184185149469624
              precision    recall  f1-score   support

           0       0.98      0.87      0.93       390
           1       0.58      0.99      0.73       396
           2       1.00      0.05      0.09       251

    accuracy                           0.72      1037
   macro avg       0.85      0.64      0.58      1037
weighted avg       0.83      0.72      0.65      1037
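The gap between the macro and weighted averages in this report comes from class 2, whose recall is only 0.05. As a minimal sketch, the two recall averages can be recomputed from the per-class rows. The inputs are the rounded values printed in the report, so the results agree with it only to two decimals, and the helper names are illustrative.

```python
# Per-class recall and support, copied from the classification report above.
recall = {0: 0.87, 1: 0.99, 2: 0.05}
support = {0: 390, 1: 396, 2: 251}


def macro_average(values):
    """Unweighted mean over classes: every class counts equally."""
    return sum(values.values()) / len(values)


def weighted_average(values, weights):
    """Mean over classes, weighted by the number of test samples per class."""
    total = sum(weights.values())
    return sum(values[c] * weights[c] for c in values) / total


macro_recall = macro_average(recall)
weighted_recall = weighted_average(recall, support)
print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The macro average is pulled down by the weak class 2, while the weighted average stays close to the overall accuracy because class 2 has the fewest test samples.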

Since the metric values were higher with stemming, the models below all use stemming.

15) TF weighting with stemming and stop-word removal; number of informative features 10000:

# Counts limited to the 10000 most frequent terms, TF weighting (IDF disabled), naive Bayes
text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000, stop_words='english')),
    ('tf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)

print('Accuracy score: ', accuracy_score(prediction, stem_test.target))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.8505303760848602
              precision    recall  f1-score   support

           0       0.94      0.93      0.93       390
           1       0.75      0.96      0.84       396
           2       0.96      0.57      0.71       251

    accuracy                           0.85      1037
   macro avg       0.88      0.82      0.83      1037
weighted avg       0.87      0.85      0.84      1037
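The 'tf' step in this pipeline is TfidfTransformer(use_idf=False). With the other parameters left at their defaults (norm='l2', sublinear_tf=False), that reduces to scaling each document's count vector to unit Euclidean length. The sketch below reproduces that arithmetic in plain Python; tf_l2 and the two count rows are illustrative and not part of the lab code.

```python
import math


def tf_l2(counts):
    """Scale one document's raw term counts to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        return [0.0 for _ in counts]  # empty document: keep the all-zero row
    return [c / norm for c in counts]


# Hypothetical count rows for two documents over a three-term vocabulary.
short_doc = [3, 4, 0]
long_doc = [30, 40, 0]  # same term proportions, ten times longer

print(tf_l2(short_doc))
print(tf_l2(long_doc))
```

Both rows map to the same weights, so under TF weighting the classifier sees term proportions rather than document length.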

Since the metric values were higher with stop-word removal, the models below all use stop-word removal.

16) TF weighting with stemming and stop-word removal; number of informative features 5000:

# Same pipeline as model 15, with the vocabulary limited to 5000 terms
text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=5000, stop_words='english')),
    ('tf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)

print('Accuracy score: ', accuracy_score(prediction, stem_test.target))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.8698167791706847
              precision    recall  f1-score   support

           0       0.93      0.94      0.93       390
           1       0.79      0.95      0.86       396
           2       0.94      0.64      0.76       251

    accuracy                           0.87      1037
   macro avg       0.89      0.84      0.85      1037
weighted avg       0.88      0.87      0.87      1037
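Models 15 through 18 differ only in max_features, which tells CountVectorizer to keep the given number of terms with the highest total frequency across the corpus. The following is a rough sketch of that selection under simplifying assumptions: whitespace tokenization instead of the vectorizer's token pattern, an alphabetical tie-break chosen for this sketch, and a made-up toy corpus. top_terms is a hypothetical helper.

```python
from collections import Counter


def top_terms(documents, max_features):
    """Return the max_features most frequent terms across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(term for term, _ in ranked[:max_features])


# Hypothetical toy corpus, not the lab data.
docs = [
    "ball game team ball",
    "game ball score",
    "ball orbit",
]
vocab = top_terms(docs, max_features=2)
print(vocab)
```

A smaller limit discards rare terms, which is one plausible reason the 5000-feature model behaves differently from the larger ones on this data set.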

17) TF weighting with stemming and stop-word removal; number of informative features 15000:

# Same pipeline as model 15, with the vocabulary limited to 15000 terms
text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=15000, stop_words='english')),
    ('tf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)

print('Accuracy score: ', accuracy_score(prediction, stem_test.target))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.8418514946962391
              precision    recall  f1-score   support

           0       0.95      0.92      0.93       390
           1       0.73      0.96      0.83       396
           2       0.96      0.53      0.68       251

    accuracy                           0.84      1037
   macro avg       0.88      0.80      0.82      1037
weighted avg       0.87      0.84      0.83      1037

18) TF weighting with stemming and stop-word removal; number of informative features 20000:

# Same pipeline as model 15, with the vocabulary limited to 20000 terms
text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=20000, stop_words='english')),
    ('tf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)

print('Accuracy score: ', accuracy_score(prediction, stem_test.target))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.840887174541948
              precision    recall  f1-score   support

           0       0.95      0.92      0.93       390
           1       0.73      0.97      0.83       396
           2       0.97      0.51      0.67       251

    accuracy                           0.84      1037
   macro avg       0.88      0.80      0.81      1037
weighted avg       0.87      0.84      0.83      1037

Since the metric values were highest with 5000 informative features, the best TF-weighted model is the one with stemming, stop-word removal, and 5000 informative features.
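The choice can be checked directly from the four accuracy scores printed for models 15 to 18. This short sketch copies those scores into a dictionary keyed by max_features (the variable names are illustrative) and picks the setting with the highest accuracy.

```python
# Accuracy of the TF-weighted models, keyed by max_features,
# copied from the outputs of models 15 to 18.
accuracy_by_features = {
    10000: 0.8505303760848602,
    5000: 0.8698167791706847,
    15000: 0.8418514946962391,
    20000: 0.840887174541948,
}

best_features = max(accuracy_by_features, key=accuracy_by_features.get)

for n_features, accuracy in sorted(accuracy_by_features.items()):
    print(f"max_features={n_features:>5}: accuracy={accuracy:.4f}")
print("best max_features:", best_features)
```

Within these four runs, accuracy falls steadily as the feature limit grows from 5000 to 20000.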

Let us compare the results obtained with the best model for each of the three weighting methods:

Weighting method   Accuracy   Precision          Recall             F1-score
and best model                macro   weighted   macro   weighted   macro   weighted
No.                           avg     avg        avg     avg        avg     avg

Count (No. 6)      0.9200     0.92    0.92       0.91    0.92       0.92    0.92
TF-IDF (No. 10)    0.8833     0.90    0.89       0.86    0.88       0.87    0.88
TF (No. 16)        0.8698     0.89    0.88       0.84    0.87       0.85    0.87

The table shows that the highest metric values are achieved by model No. 6: Count weighting with stemming, stop-word removal, and 20000 informative features.
