University: National Research University "MPEI"
Course: Fundamentals of Text Data Analysis
File: text_lab2.pdf (366.82 KB), added 05.07.2022 by dron4ik

lizard

naturalist

hystaspes

armageddon

used

armageddon

financi

obviou

financi

grin

aladdin

com

hystaspes

nth

chb

grin

jesus

but

somat

aladdin

somat

nth

With stemming

№	Count, without stop words	Count, with stop words	TF, without stop words	TF, with stop words	TF-IDF, without stop words	TF-IDF, with stop words

thi

the

thi

tempera

thi

tempera

thrush

use

nitya

and

align

use

align

indians

know

indians

ani

honesti

infanticid

that

know

infanticid

ani

televis

peopl

new

eyebal

doe

honesti

new

for

doe

televis

new

eyebal

edu

you

edu

yogurt

just

yogurt

are

just

gold

offer

gold

know

not

peopl

wis

good

test

doe

have

sale

test

say

analog

good

thi

pleas

analog

peopl

authur

just

offer

authur

ncr

time

with

think

nth

veri

wis

onli

ncr

make

nth

say

good

hill

includ

delusion

think

time

delusion

sale

anoth

make

includ

norman

think

hill

For class 0 (misc.forsale):

Without stemming
№	Count, without stop words	Count, with stop words	TF, without stop words	TF, with stop words	TF-IDF, without stop words	TF-IDF, with stop words
1	00	the	sincer	tone	sincer	epistl
2	new	for	netcom	epistl	netcom	tone
3	sale	and	olympu	adjust	olympu	adjust
4	50	to	00	trip	somat	trip
5	10	of	somat	harass	hystaspes	harass
6	dos	in	chb	noon	smartdrug	shaw
7	offer	00	discoveri	impact	chb	00
8	shipping	it	ramada	ima	ramada	impact
9	20	is	hystaspes	obviou	kuhn	noon
10	price	you	smartdrug	word	discoveri	obviou
11	25	with	uucp	youth	difficulti	youth
12	15	or	kuhn	forest	uucp	word
13	condition	have	linux	shaw	linux	forest
14	good	are	difficulti	grin	alejandro	nordenberg

used

all

depriv

agreement

ima

alejandro

trade

attent

motorcycl

edu

new

macintosh

soar

asking

this

attent

motorcycl

lizard

macintosh

mail

that

firmli

accton

temporari

pollut

interested

sale

lawrenc

allergic

bogu

grin

With stemming

№	Count, without stop words	Count, with stop words	TF, without stop words	TF, with stop words	TF-IDF, without stop words	TF-IDF, with stop words

the

sale

tempera

offer

eyebal

new

for

offer

eyebal

thi

tempera

thi

and

thi

align

sale

align

sale

use

thrush

use

thrush

offer

new

honesti

new

honesti

use

includ

nitya

sell

includ

infanticid

includ

infanticid

sell

indians

ship

sharon

pleas

nth

price

you

ship

wis

condit

nitya

ship

with

price

yogurt

nth

condit

gold

pleas

yogurt

pleas

have

mail

sharon

game

wis

sell

ask

hill

look

gold

game

are

drive

indians

all

analog

sirri

condit

test

new

lynch

card

smartwrit

good

thi

look

ask

hormon

will

edu

ahead

manual

lynch

For class 1 (sci.med):

Without stemming
№	Count, without stop words	Count, with stop words	TF, without stop words	TF, with stop words	TF-IDF, without stop words	TF-IDF, with stop words
1	edu	the	definit	tone	jhu	tone
2	don	of	jhu	trip	difficulti	noon
3	people	to	difficulti	noon	kuhn	trip
4	health	and	kuhn	ima	pip	ima
5	use	in	pip	adjust	invalid	impact
6	medical	is	invalid	impact	definit	adjust
7	like	that	thank	harass	fist	tomographi
8	know	it	mouth	tomographi	threw	harass
9	com	for	threw	youth	sponsorship	youth
10	time	you	fist	epistl	mouth	trade
11	just	this	deem	trade	deem	armageddon
12	patients	are	apocalypt	armageddon	apocalypt	agreement
13	new	be	sponsorship	agreement	speak	naturalist

think

with

predispos

naturalist

farrar

epistl

disease

not

farrar

word

supposedli

miner

good

have

firmli

forest

music

aladdin

msg

speak

nth

thank

forest

food

bro

aladdin

dat

word

years

huxley

miner

huxley

nth

doctor

but

music

bet

desk

bet

With stemming

№	Count, without stop words	Count, with stop words	TF, without stop words	TF, with stop words	TF-IDF, without stop words	TF-IDF, with stop words

thi

the

thi

tempera

thi

tempera

thrush

nitya

use

nitya

thrush

and

ani

indians

know

indians

edu

know

align

ani

infanticid

ani

use

infanticid

use

align

medic

edu

honesti

doe

televis

that

televis

honesti

patient

for

doe

yogurt

just

yogurt

time

peopl

eyebal

peopl

authur

peopl

you

think

authur

think

test

health

thi

time

test

diseas

analog

know

are

just

analog

pitt

ncr

year

have

effect

gold

veri

gold

diseas

with

msg

ncr

gordon

eyebal

food

not

food

wis

patient

mistaken

com

gordon

delusion

soon

delusion

caus

diseas

norman

intellect

anoth

doe

bank

anoth

surrend

wis

think

caus

mistaken

geb

norman

For class 2 (talk.religion.misc):

Without stemming
№	Count, without stop words	Count, with stop words	TF, without stop words	TF, with stop words	TF-IDF, without stop words	TF-IDF, with stop words
1	god	the	financi	tone	financi	tone
2	people	of	pip	trip	pip	noon
3	jesus	to	inner	noon	inner	trip
4	don	and	definit	adjust	invalid	tomographi
5	bible	that	invalid	tomographi	jhu	adjust
6	just	is	thank	ima	definit	ima
7	christian	in	jhu	harass	kuhn	youth
8	think	you	bundl	youth	bundl	harass
9	know	it	kuhn	impact	thank	impact
10	say	not	sized	naturalist	deem	naturalist
11	does	for	deem	agreement	ct	formul

did

firmli

trade

audibl

fauci

good

this

armageddon

astonish

armageddon

are

audibl

epistl

buoyanc

trade

life

buoyanc

aladdin

firmli

agreement

way

have

astonish

forest

threw

aladdin

believe

with

warranti

word

sized

toxin

said

was

sin

bet

sin

wealth

point

threw

formul

kremlin

forest

time

they

prize

wealth

john

bet

With stemming

№	Count, without stop words	Count, with stop words	TF, without stop words	TF, with stop words	TF-IDF, without stop words	TF-IDF, with stop words

thi

the

thi

tempera

thi

tempera

thrush

nitya

god

nitya

christian

thrush

and

christian

align

god

televis

christian

that

televis

say

align

peopl

indians

jesu

say

honesti

did

yogurt

say

you

jesu

yogurt

know

honesti

think

infanticid

peopl

infanticid

doe

not

did

ncr

jesu

ncr

think

know

authur

just

authur

did

for

just

analog

doe

got

know

are

doe

test

think

camera

believ

thi

eyebal

franc

moral

gold

analog

bibl

have

believ

delusion

believ

test

just

ani

anoth

point

anoth

with

moral

wis

use

delusion

use

bloodlet

good

vol

onli

vol

ani

tesla

The tables show that with Count weighting and no stop-word removal, the most frequent words are prepositions, conjunctions, particles, and pronouns. Removing stop words brings more meaningful words to the top of the list.

Without stemming, Count weighting with stop-word removal gives better results than TF or TF-IDF, with or without stop-word removal.

With stemming and stop-word removal, each of the Count, TF, and TF-IDF weightings gives a reasonably good result.
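The difference between the three weightings can be illustrated on a toy corpus. The sketch below is not the report's code: the three documents are hypothetical, and ranking terms by their weight summed over all documents is an assumption about how the tables were built. The formulas follow the scikit-learn defaults as I understand them (smoothed IDF, L2 normalization).

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not the report's data).
docs = [
    "the camera is for sale and the price is good",
    "the doctor said the disease is rare",
    "the bible and the church",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Count: raw term counts per document.
counts = [Counter(doc) for doc in tokenized]

# Document frequency and smoothed IDF: ln((1 + n) / (1 + df)) + 1.
df = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}


def l2_normalize(weights):
    """Scale a term -> weight mapping to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)


# TF: L2-normalized counts. TF-IDF: counts times IDF, then L2-normalized.
tf = [l2_normalize(dict(c)) for c in counts]
tfidf = [l2_normalize({t: n * idf[t] for t, n in c.items()}) for c in counts]


def top_terms(rows, k=3):
    """Rank terms by their weight summed over all documents (an assumption)."""
    totals = Counter()
    for row in rows:
        totals.update(row)
    return [t for t, _ in totals.most_common(k)]


print("Count :", top_terms(counts))
print("TF    :", top_terms(tf))
print("TF-IDF:", top_terms(tfidf))
```

In this toy corpus "the" occurs in every document, so its IDF is exactly 1.0, the minimum under this formula, while a word found in a single document gets a larger IDF.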

7. Implementing a Naive Bayes classifier with a Pipeline and determining, from the quality metrics (recall, precision, F1 score, and accuracy), which preprocessing gives the best classification results.

Characteristics studied:

− Stemming: applied or not applied

− Stop words: removed or kept

− Number of informative terms (max_features)

− Weighting: Count, TF, TF-IDF
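For orientation, the full space spanned by these four characteristics can be enumerated as below. This is only a sketch: the dictionary keys are hypothetical names, and the max_features values are the ones tried in the runs of this section. The report itself does not evaluate every combination; it fixes one characteristic at a time.

```python
from itertools import product

# Values of each studied characteristic (key names are illustrative).
stemming = [False, True]
stop_words = [None, "english"]
max_features = [5000, 10000, 15000, 20000]
weighting = ["count", "tf", "tfidf"]

grid = [
    {"stemming": s, "stop_words": sw, "max_features": mf, "weighting": w}
    for s, sw, mf, w in product(stemming, stop_words, max_features, weighting)
]

print(len(grid))  # 2 * 2 * 4 * 3 = 48
```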

1) Count weighting without stemming and without stop-word removal; number of informative features: 10000:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# twenty_train / twenty_test: data splits assumed loaded earlier in the report.
text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
prediction = text_clf.predict(twenty_test.data)
print('Accuracy score: ', accuracy_score(twenty_test.target, prediction))
print(classification_report(twenty_test.target, prediction))

Accuracy score:  0.90549662487946
	precision	recall	f1-score	support
0	0.99	0.91	0.94	390
1	0.89	0.91	0.90	396
2	0.82	0.90	0.86	251
accuracy			0.91	1037
macro avg	0.90	0.90	0.90	1037
weighted avg	0.91	0.91	0.91	1037
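The columns of a report like the one above can be derived by hand from the true and predicted labels. The following is a minimal pure-Python sketch on hypothetical labels, not the report's data; the function name is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Per-class precision/recall/F1 plus accuracy and averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall,
                            "f1": f1, "support": tp + fn}
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Macro average: plain mean over classes.
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(per_class)
    # Weighted average: mean over classes weighted by support.
    weighted_f1 = sum(m["f1"] * m["support"] for m in per_class.values()) / len(pairs)
    return per_class, accuracy, macro_f1, weighted_f1


# Hypothetical labels for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 1, 2]

per_class, accuracy, macro_f1, weighted_f1 = classification_metrics(y_true, y_pred)
print(per_class[1], accuracy)
```

Here class 1 absorbs one misclassified example from each of the other classes, so its recall is 1.0 but its precision drops to 0.5.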

2) Count weighting with stemming and without stop-word removal; number of informative features: 10000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.9122468659594986
	precision	recall	f1-score	support
0	0.99	0.92	0.95	390
1	0.89	0.92	0.91	396
2	0.84	0.88	0.86	251
accuracy			0.91	1037
macro avg	0.91	0.91	0.91	1037
weighted avg	0.92	0.91	0.91	1037

Since the metric values were higher with stemming, the models considered below use stemming.
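Forms such as "thi", "doe", "veri", "ani", "onli", "honesti", and "jesu" in the stemmed tables are consistent with a Porter-style stemmer, although this section does not show which stemmer was used. The sketch below implements only a simplified subset of the Porter rules (steps 1a and 1c, with vowels restricted to a, e, i, o, u), so it is an illustration of where those forms come from, not a replacement for a real stemmer.

```python
VOWELS = set("aeiou")


def stem_subset(word):
    """Simplified subset of Porter stemming: steps 1a and 1c only."""
    w = word.lower()
    if len(w) <= 2:
        # Very short words are left unchanged (a common convention).
        return w
    # Step 1a: plural-like endings.
    if w.endswith("sses"):
        w = w[:-2]          # caresses -> caress
    elif w.endswith("ies"):
        w = w[:-2]          # ponies -> poni
    elif w.endswith("ss"):
        pass                # caress -> caress
    elif w.endswith("s"):
        w = w[:-1]          # cats -> cat
    # Step 1c: trailing y -> i when the rest of the word contains a vowel.
    if w.endswith("y") and any(ch in VOWELS for ch in w[:-1]):
        w = w[:-1] + "i"
    return w


for token in ["this", "does", "very", "any", "only", "honesty", "jesus"]:
    print(token, "->", stem_subset(token))
```

Because ordinary words like "this" and "does" lose their final "s", the stemmed vocabulary contains tokens that no longer match the entries of an English stop-word list written for unstemmed text.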

3) Count weighting with stemming and stop-word removal; number of informative features: 10000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000, stop_words='english')),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.9170684667309547
	precision	recall	f1-score	support
0	0.97	0.93	0.95	390
1	0.90	0.93	0.91	396
2	0.87	0.88	0.88	251
accuracy			0.92	1037
macro avg	0.91	0.91	0.91	1037
weighted avg	0.92	0.92	0.92	1037

Since the metric values were higher with stop-word removal, the models considered below use stop-word removal.

4) Count weighting with stemming and stop-word removal; number of informative features: 5000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=5000, stop_words='english')),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.9122468659594986
	precision	recall	f1-score	support
0	0.96	0.93	0.95	390
1	0.89	0.92	0.91	396
2	0.87	0.88	0.87	251
accuracy			0.91	1037
macro avg	0.91	0.91	0.91	1037
weighted avg	0.91	0.91	0.91	1037

5) Count weighting with stemming and stop-word removal; number of informative features: 15000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=15000, stop_words='english')),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.9180327868852459
	precision	recall	f1-score	support
0	0.97	0.93	0.95	390
1	0.89	0.93	0.91	396
2	0.88	0.88	0.88	251
accuracy			0.92	1037
macro avg	0.91	0.91	0.91	1037
weighted avg	0.92	0.92	0.92	1037

6) Count weighting with stemming and stop-word removal; number of informative features: 20000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=20000, stop_words='english')),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.9199614271938283
	precision	recall	f1-score	support
0	0.98	0.93	0.95	390
1	0.89	0.94	0.91	396
2	0.88	0.88	0.88	251
accuracy			0.92	1037
macro avg	0.92	0.91	0.92	1037
weighted avg	0.92	0.92	0.92	1037

Since the metric values were highest with 20000 informative features, the best model with Count weighting uses stemming and stop-word removal with 20000 informative features.
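The max_features setting limits the vocabulary to the terms that occur most often across the whole corpus. A rough pure-Python sketch of that truncation is shown below; the function name and the toy documents are hypothetical, and the handling of ties here (first seen wins) is not guaranteed to match scikit-learn.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, max_features=None):
    """Keep the max_features most frequent terms and index them alphabetically."""
    totals = Counter(t for doc in tokenized_docs for t in doc)
    kept = [term for term, _ in totals.most_common(max_features)]
    return {term: idx for idx, term in enumerate(sorted(kept))}


# Hypothetical tokenized documents.
toy_docs = [["sale", "price", "sale"], ["price", "offer"], ["sale", "rare"]]

print(build_vocabulary(toy_docs, max_features=2))
print(build_vocabulary(toy_docs))
```

With a small limit, rare but class-specific terms are dropped first, which is one way a larger vocabulary can help a count-based model.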

7) TF-IDF weighting without stemming and without stop-word removal; number of informative features: 10000:

from sklearn.feature_extraction.text import TfidfTransformer

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
prediction = text_clf.predict(twenty_test.data)
print('Accuracy score: ', accuracy_score(twenty_test.target, prediction))
print(classification_report(twenty_test.target, prediction))

Accuracy score:  0.8187078109932497
	precision	recall	f1-score	support
0	0.97	0.91	0.94	390
1	0.69	0.97	0.81	396
2	0.95	0.44	0.60	251
accuracy			0.82	1037
macro avg	0.87	0.77	0.78	1037
weighted avg	0.86	0.82	0.81	1037
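Multinomial Naive Bayes is formulated for term counts, but the same arithmetic accepts fractional weights such as TF or TF-IDF values. The sketch below is a hand-written version with Laplace smoothing (alpha = 1.0), using hypothetical toy data and function names; it is meant to show the scoring rule, not to reproduce scikit-learn's implementation.

```python
import math
from collections import defaultdict


def train_multinomial_nb(X, y, alpha=1.0):
    """X: list of {term: weight} dicts; y: list of class labels."""
    vocab = sorted({t for row in X for t in row})
    class_docs = defaultdict(int)
    class_term = defaultdict(lambda: defaultdict(float))
    for row, label in zip(X, y):
        class_docs[label] += 1
        for term, weight in row.items():
            class_term[label][term] += weight
    model = {}
    for label, n in class_docs.items():
        # Smoothed denominator: total class weight plus alpha per vocabulary term.
        total = sum(class_term[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n / len(y)),
            "log_lik": {t: math.log((class_term[label][t] + alpha) / total)
                        for t in vocab},
        }
    return model


def predict(model, row):
    """Pick the class with the highest log prior + weighted log likelihood."""
    scores = {
        label: p["log_prior"] + sum(w * p["log_lik"][t]
                                    for t, w in row.items() if t in p["log_lik"])
        for label, p in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training data: class 0 ~ for-sale posts, class 1 ~ medical posts.
X_train = [
    {"sale": 2, "price": 1},
    {"offer": 1, "sale": 1},
    {"patient": 2, "disease": 1},
    {"doctor": 1, "disease": 1},
]
y_train = [0, 0, 1, 1]

nb = train_multinomial_nb(X_train, y_train)
print(predict(nb, {"sale": 1, "offer": 1}))
print(predict(nb, {"disease": 1.5, "doctor": 0.5}))  # fractional weights
```

Terms unseen during training are skipped at prediction time, which corresponds to a vectorizer with a fixed vocabulary.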

8) TF-IDF weighting with stemming and without stop-word removal; number of informative features: 10000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.819672131147541
	precision	recall	f1-score	support
0	0.99	0.91	0.95	390
1	0.69	0.98	0.81	396
2	0.95	0.42	0.58	251
accuracy			0.82	1037
macro avg	0.88	0.77	0.78	1037
weighted avg	0.86	0.82	0.81	1037

9) TF-IDF weighting with stemming and stop-word removal; number of informative features: 10000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000, stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.8727097396335584
	precision	recall	f1-score	support
0	0.95	0.94	0.94	390
1	0.79	0.96	0.86	396
2	0.95	0.63	0.76	251
accuracy			0.87	1037
macro avg	0.89	0.84	0.86	1037
weighted avg	0.89	0.87	0.87	1037

10) TF-IDF weighting with stemming and stop-word removal; number of informative features: 5000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=5000, stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.8833172613307618
	precision	recall	f1-score	support
0	0.94	0.95	0.94	390
1	0.81	0.94	0.87	396
2	0.95	0.69	0.80	251
accuracy			0.88	1037
macro avg	0.90	0.86	0.87	1037
weighted avg	0.89	0.88	0.88	1037

11) TF-IDF weighting with stemming and stop-word removal; number of informative features: 15000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=15000, stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.8717454194792671
	precision	recall	f1-score	support
0	0.95	0.94	0.95	390
1	0.78	0.97	0.86	396
2	0.96	0.61	0.75	251
accuracy			0.87	1037
macro avg	0.90	0.84	0.85	1037
weighted avg	0.89	0.87	0.87	1037

12) TF-IDF weighting with stemming and stop-word removal; number of informative features: 20000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=20000, stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
print(classification_report(stem_test.target, prediction))

Accuracy score:  0.8688524590163934
	precision	recall	f1-score	support
0	0.96	0.94	0.95	390
1	0.77	0.97	0.86	396
2	0.96	0.59	0.73	251
accuracy			0.87	1037
macro avg	0.90	0.84	0.85	1037
weighted avg	0.89	0.87	0.86	1037

Since the metric values were highest with 5000 informative features, the best model with TF-IDF weighting uses stemming and stop-word removal with 5000 informative features.
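The choice of max_features for each weighting can be checked directly from the accuracy values printed in runs 3 to 6 and 9 to 12 (stemming with stop-word removal). The sketch below only copies those printed numbers into dictionaries; the helper name is illustrative.

```python
# Accuracy by max_features, copied from the runs reported in this section
# (stemming and stop-word removal in every case).
count_accuracy = {
    5000: 0.9122468659594986,
    10000: 0.9170684667309547,
    15000: 0.9180327868852459,
    20000: 0.9199614271938283,
}
tfidf_accuracy = {
    5000: 0.8833172613307618,
    10000: 0.8727097396335584,
    15000: 0.8717454194792671,
    20000: 0.8688524590163934,
}


def best_setting(results):
    """Return the max_features value with the highest accuracy."""
    return max(results, key=results.get)


for name, results in [("Count", count_accuracy), ("TF-IDF", tfidf_accuracy)]:
    best = best_setting(results)
    spread = max(results.values()) - min(results.values())
    print(f"{name}: best max_features={best}, accuracy spread={spread:.4f}")
```

The spread between the best and worst setting is under 0.01 for Count and about 0.014 for TF-IDF, so the ranking within each weighting rests on small differences, while the gap between the two weightings is several times larger.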

13) TF weighting without stemming and without stop-word removal; number of informative features: 10000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000)),
    ('tf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
prediction = text_clf.predict(twenty_test.data)
print('Accuracy score: ', accuracy_score(twenty_test.target, prediction))
print(classification_report(twenty_test.target, prediction))

Accuracy score:  0.712632594021215
	precision	recall	f1-score	support
0	0.98	0.87	0.92	390
1	0.57	0.99	0.73	396
2	1.00	0.04	0.08	251
accuracy			0.71	1037
macro avg	0.85	0.63	0.57	1037
weighted avg	0.83	0.71	0.64	1037

14) TF weighting with stemming and without stop-word removal; number of informative features: 10000:

text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000)),
    ('tf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(stem_train.data_stem, stem_train.target)
prediction = text_clf.predict(stem_test.data_stem)
print('Accuracy score: ', accuracy_score(stem_test.target, prediction))
