
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf05.sum(axis=0))))

# Sort key: the summed weight, i.e. the second element of each (term, weight) pair.
def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('sincer', 34.54646229673236), ('netcom', 28.077512091566557), ('olympu', 27.6220068617385), ('somat', 19.683849213268186), ('hystaspes_', 17.947106624726292), ('smartdrug', 17.807524062154958), ('chb', 17.466174374675173), ('ramada', 17.231561177853152), ('kuhn', 16.362970323046493), ('discoveri', 16.21071663713713), ('difficulti', 15.453276489726257), ('uucp', 15.255323869019449), ('linux', 15.089472689813165), ('alejandro', 15.062320833913612), ('00', 14.82971578635708), ('attent', 12.458055352101512), ('10', 12.371532745257124), ('lizard', 11.97963864235259), ('', 10.66834389031414), ('bogu', 10.488474646840226)]

Output for class 1 with stop-word removal (TF-IDF):

train_data_tfidf06 = tfidf.transform(train_data06)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf06.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('jhu', 23.092821894269477), ('difficulti', 20.197092979122758), ('kuhn', 19.356281029226984), ('pip', 19.027666528183083), ('invalid', 17.6868260941888), ('definit', 17.280919323911885), ('fist', 15.291576935732573), ('threw', 15.024744303080045), ('sponsorship', 14.669940470329403), ('mouth', 13.952556008148695), ('deem', 13.56887593393774), ('apocalypt', 13.464924000560059), ('speak', 13.146288502979447), ('farrar', 13.132513027424977), ('supposedli', 13.00502890097202), ('music', 12.54453937826036), ('thank', 12.25625888741643), ('dat', 12.196856540591606), ('huxley', 12.14259453179389), ('desk', 12.08895194966892)]
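Every ranking in this section repeats the same four steps: sum each column of the weighted matrix, pair the sums with the feature names, sort in descending order, and keep the first 20 pairs. A minimal sketch of a reusable helper for that pattern follows. The name `top_terms` and its parameters are assumptions made for illustration and are not part of the original lab code.

```python
from operator import itemgetter


def top_terms(feature_names, column_sums, n=20):
    """Return the n (term, weight) pairs with the largest summed weight.

    feature_names: iterable of terms, in matrix column order.
    column_sums:   iterable of numbers, one summed weight per column.
    """
    pairs = [(name, float(weight)) for name, weight in zip(feature_names, column_sums)]
    # sorted() is stable, so terms with equal weight keep their column order.
    return sorted(pairs, key=itemgetter(1), reverse=True)[:n]
```

With the variables used in this section, a call would look like `top_terms(vect.get_feature_names(), np.ravel(train_data_tfidf06.sum(axis=0)))`. In scikit-learn 1.0 and later, `get_feature_names_out()` replaces `get_feature_names()`, which was removed in version 1.2.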

Output for class 2 with stop-word removal (TF-IDF):

train_data_tfidf07 = tfidf.transform(train_data07)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf07.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('financi', 23.184811350596423), ('pip', 17.063300675140464), ('inner', 14.64452075646358), ('invalid', 14.404957069911507), ('jhu', 12.233946376737656), ('definit', 10.355299086888017), ('kuhn', 10.021111440685743), ('bundl', 9.857105000423246), ('thank', 9.691442384555897), ('deem', 9.36018785626855), ('ct', 8.73899504098878), ('audibl', 8.558498990464523), ('astonish', 7.922011312364973), ('buoyanc', 7.8867422611415305), ('firmli', 7.492502093472581), ('threw', 7.44336427233471), ('sized', 7.133404082434186), ('sin', 6.926598257452059), ('kremlin', 6.606023300797752), ('john', 6.282523102619925)]

Output for the whole sample with stemming (TF-IDF):

tfidf = TfidfTransformer(use_idf=True).fit(train_data_stem)
train_data_tfidf_stem = tfidf.transform(train_data_stem)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tempera', 157.79077188103213), ('thrush', 100.54611617981654), ('nitya', 96.37542952691709), ('align', 86.03873447487518), ('indians', 79.02337798446179), ('infanticid', 73.35370972445031), ('televis', 69.29079256033657), ('honesti', 67.50217104571504), ('eyebal', 62.187454742857405), ('yogurt', 60.030882671777086), ('gold', 41.652825619004545), ('test', 41.330882236154565), ('analog', 40.7325328622444), ('authur', 40.59652077080643), ('ncr', 39.282120778346865), ('wis', 38.75720305808977), ('nth', 37.11687450264329), ('delusion', 35.089057012458646), ('anoth', 32.51051518572325), ('hill', 32.50072703858639)]
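The transformer is fitted once on the whole sample and then applied to each class subset, with `use_idf=True` here and `use_idf=False` in the TF runs further on. The sketch below shows, in plain Python, what those two settings compute under scikit-learn's default options (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`). The function names `fit_idf` and `transform` are assumptions for illustration, and the code works on small dense lists rather than the sparse matrices used in the lab.

```python
import math


def fit_idf(counts):
    """Learn smoothed IDF weights from a count matrix given as a list of rows.

    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, where df(t) is the number
    of rows in which term t occurs at least once.
    """
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]


def transform(counts, idf=None):
    """Scale counts by idf (skipped when idf is None) and L2-normalise each row."""
    weighted = []
    for row in counts:
        scaled = list(row) if idf is None else [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        # Rows without any term stay all-zero instead of dividing by zero.
        weighted.append([v / norm for v in scaled] if norm > 0 else [0.0] * len(scaled))
    return weighted
```

Because every row is scaled to unit length, the column sums reported in this section are not raw counts: a long document contributes no more total weight than a short one. Calling `transform(counts)` without `idf` corresponds to the TF setting, and `transform(subset, fit_idf(whole_sample))` to the TF-IDF setting applied to one class.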

Output for class 0 with stemming (TF-IDF):

train_data_tfidf_stem01 = tfidf.transform(train_data_stem01)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem01.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('eyebal', 36.354530600690474), ('tempera', 32.18255201830497), ('align', 25.509134102049355), ('thrush', 24.052903285410206), ('honesti', 18.144083561394954), ('00', 17.84610110065537), ('infanticid', 17.7669380047153), ('sharon', 17.698630630799315), ('nl', 17.141969733679776), ('nitya', 16.93568280287904), ('nth', 16.554376405683428), ('yogurt', 16.47997584693423), ('wis', 15.110185003208079), ('gold', 14.775121966332373), ('indians', 14.375132496303328), ('sirri', 13.670828710724956), ('mp', 13.510601943112349), ('smartwrit', 13.284301444974792), ('hormon', 12.888871463365266), ('lynch', 12.269116489967567)]

Output for class 1 with stemming (TF-IDF):

train_data_tfidf_stem02 = tfidf.transform(train_data_stem02)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem02.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tempera', 73.19785320181145), ('nitya', 47.894015002175934), ('thrush', 46.72623639639594), ('indians', 42.058831895867094), ('infanticid', 39.21671119663035), ('align', 37.099079684379184), ('televis', 32.92962955950493), ('honesti', 31.595252991003772), ('yogurt', 21.65357298190406), ('authur', 20.06554088557756), ('test', 19.207134738137114), ('analog', 17.989122824343898), ('ncr', 17.534951889388697), ('gold', 16.826066041620084), ('eyebal', 16.725968637581865), ('mistaken', 15.931423595352783), ('delusion', 15.832190268159595), ('anoth', 15.210556211911053), ('wis', 15.036587365870055), ('norman', 14.803083907323126)]

Output for class 2 with stemming (TF-IDF):

train_data_tfidf_stem03 = tfidf.transform(train_data_stem03)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem03.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tempera', 52.41036666091578), ('nitya', 31.5457317218622), ('thrush', 29.76697649801028), ('televis', 28.442502261872875), ('align', 23.430520688446705), ('indians', 22.58941359229138), ('yogurt', 21.897333842938895), ('honesti', 17.76283449331637), ('infanticid', 16.370060523104573), ('ncr', 14.768409319406455), ('authur', 14.030398235121133), ('got', 13.119913791383484), ('camera', 12.560723988055729), ('franc', 12.426242320756325), ('analog', 12.085934748812717), ('test', 11.880798613011809), ('anoth', 11.606394740614364), ('delusion', 11.324976575959763), ('vol', 10.664728033702119), ('tesla', 10.650853317974791)]

Output for the whole sample with stop-word removal and stemming (TF-IDF):

train_data_tfidf_stem04 = tfidf.transform(train_data_stem04)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem04.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('thi', 129.32800919419773), ('wa', 76.97728386002947), ('ha', 53.69445658099784), ('use', 53.51224701143201), ('know', 43.32344772442169), ('like', 42.195836240882116), ('ani', 39.70519916099226), ('doe', 38.21698090733035), ('new', 35.93673401728441), ('just', 35.86760778006493), ('offer', 33.36406397481835), ('good', 32.747739647032375), ('say', 30.273363219549942), ('peopl', 30.205409363344643), ('hi', 29.733312069224763), ('veri', 29.326246895236878), ('make', 29.146452922239888), ('includ', 28.938623574701445), ('sale', 28.28901917050075), ('think', 27.98031918427612)]

Output for class 0 with stop-word removal and stemming (TF-IDF):

train_data_tfidf_stem05 = tfidf.transform(train_data_stem05)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem05.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('offer', 31.300454250625975), ('thi', 30.71294422915274), ('sale', 27.88415855254791), ('use', 26.19807462984792), ('new', 25.12228543439097), ('sell', 24.447046378171603), ('includ', 23.128480232840484), ('ship', 21.646752580610237), ('price', 20.066339982514304), ('condit', 17.982579325953647), ('ha', 16.277814070979513), ('pleas', 16.275703057169903), ('game', 15.77903267968755), ('look', 15.548768824682524), ('drive', 15.353737243885698), ('00', 15.151775703440967), ('like', 14.728363624494344), ('card', 13.724999418206192), ('ask', 13.488615400356453), ('manual', 12.707447575244652)]

Output for class 1 with stop-word removal and stemming (TF-IDF):

train_data_tfidf_stem06 = tfidf.transform(train_data_stem06)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem06.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('thi', 60.31506016065788), ('wa', 36.56870904199805), ('ha', 26.8877664273872), ('know', 23.287296067425668), ('ani', 21.266027064071608), ('use', 19.13538740405915), ('doe', 18.897663436655616), ('like', 17.903345304925175), ('just', 15.018185272421794), ('peopl', 14.586979652841226), ('think', 14.257205933281572), ('diseas', 14.082694098863271), ('pitt', 13.84867776057893), ('veri', 13.825231559868936), ('gordon', 13.784368382302274), ('patient', 13.763866772938552), ('soon', 13.194187022117266), ('intellect', 13.181920085996099), ('surrend', 13.121708345783816), ('geb', 13.047250414750424)]

Output for class 2 with stop-word removal and stemming (TF-IDF):

train_data_tfidf_stem07 = tfidf.transform(train_data_stem07)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf_stem07.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('thi', 38.30000480438703), ('wa', 29.230707163320695), ('christian', 18.278353325066014), ('god', 18.1522753870565), ('say', 15.13522226440012), ('hi', 14.319276389373092), ('did', 12.83520308410148), ('know', 12.72928745538582), ('peopl', 12.715392601434948), ('jesu', 12.166534186801478), ('just', 11.962814393957474), ('doe', 11.693381611698983), ('think', 11.137948615830004), ('ha', 10.52887608263115), ('like', 9.564127311462622), ('believ', 8.826101977033428), ('point', 8.337902894547822), ('use', 8.178784977524954), ('good', 8.079523992870506), ('ani', 7.999518371547461)]

Output for the whole sample (TF):

tf = TfidfTransformer(use_idf=False).fit(train_data)
train_data_tf = tf.transform(train_data)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tone', 410.6466942091253), ('trip', 251.7142371449254), ('noon', 224.7563651678267), ('adjust', 213.50667797912118), ('ima', 171.8087951834235), ('harass', 154.16287320333123), ('epistl', 150.42899781013273), ('impact', 144.3601581889401), ('tomographi', 129.90078758447194), ('youth', 107.74646955575315), ('word', 76.73952240379423), ('forest', 74.76904182863568), ('trade', 73.98236585775292), ('agreement', 72.37175564019299), ('obviou', 70.35049990487211), ('naturalist', 65.67375220517485), ('armageddon', 63.48328583893221), ('grin', 57.567519020817166), ('nth', 54.056582370621975), ('aladdin', 50.53911703806337)]
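A caution on the term labels, offered as an observation rather than a confirmed diagnosis. The top terms in the runs that keep stop words ('tone', 'trip', 'noon', 'adjust') sit alphabetically next to the stop words that lead the Count column of the table in step 6 ('the', 'to', 'of', 'and'). That pattern would arise if `vect.get_feature_names()` came from a different vectorizer than the one that produced the matrix being summed. Which vectorizer `vect` refers to at each step is not shown in this section. The sketch below keeps each vocabulary together with its column sums so that the two cannot drift apart. The class `WeightedRun` and the toy vocabularies and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class WeightedRun:
    """Column sums stored together with the vocabulary that indexes them."""
    label: str
    feature_names: Sequence[str]
    column_sums: Sequence[float]

    def __post_init__(self):
        # Catches vocabularies of a different size, but not same-size mix-ups.
        if len(self.feature_names) != len(self.column_sums):
            raise ValueError("vocabulary size and matrix width differ")

    def top(self, n: int = 20) -> List[Tuple[str, float]]:
        pairs = zip(self.feature_names, self.column_sums)
        return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Toy data: two vocabularies of equal size, both in alphabetical column order.
raw_names = ["and", "of", "the", "to"]
other_names = ["adjust", "noon", "tone", "trip"]
raw_sums = [3.0, 4.0, 9.0, 5.0]

right = WeightedRun("raw", raw_names, raw_sums).top(2)
# Same sums labelled with the other vocabulary: no error, but wrong terms.
wrong = WeightedRun("raw, mislabelled", other_names, raw_sums).top(2)
```

The length check alone does not detect the mix-up when both vocabularies have the same size, for example when every vectorizer is capped at the same number of features. Taking the names from the same object that built the matrix is the safer habit.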

Output for class 0 (TF):

train_data_tf01 = tf.transform(train_data01)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf01.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tone', 96.58390854005752), ('epistl', 95.3261336507378), ('adjust', 72.56618853288931), ('trip', 68.6076816082005), ('harass', 47.77980505637252), ('noon', 46.35046197506701), ('impact', 39.032646000694484), ('ima', 35.699082341690136), ('obviou', 35.138785865799754), ('word', 33.91689638398363), ('youth', 32.98610807152292), ('forest', 31.33059776539425), ('shaw', 24.781641581778313), ('grin', 22.48655967632584), ('agreement', 21.895673322279677), ('trade', 20.766371909080622), ('macintosh', 20.657689863631422), ('motorcycl', 20.374454329778718), ('accton', 20.011925026927194), ('allergic', 19.119756075822647)]

Output for class 1 (TF):

train_data_tf02 = tf.transform(train_data02)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf02.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tone', 188.0767804748418), ('trip', 115.21215212812078), ('noon', 110.67112192210904), ('ima', 89.98315588904559), ('adjust', 89.28622378616487), ('impact', 76.12173829264304), ('harass', 69.91959963457383), ('tomographi', 62.83599770461837), ('youth', 38.86540409957052), ('epistl', 36.81753725915403), ('trade', 33.88799150799945), ('armageddon', 31.780279373041203), ('agreement', 31.084212644712675), ('naturalist', 29.849046600200147), ('word', 27.92515164881687), ('forest', 27.465964484283415), ('nth', 24.889911285715417), ('aladdin', 23.762950253559957), ('miner', 23.631055152617392), ('bet', 22.74486509005074)]

Output for class 2 (TF):

train_data_tf03 = tf.transform(train_data03)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf03.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tone', 125.98600519422573), ('trip', 67.89440340860428), ('noon', 67.73478127065086), ('adjust', 51.6542656600671), ('tomographi', 50.60526399005471), ('ima', 46.12655695268774), ('harass', 36.463468512385006), ('youth', 35.89495738465961), ('impact', 29.205773895602697), ('naturalist', 22.2608827293163), ('agreement', 19.39186967320067), ('trade', 19.328002440672854), ('armageddon', 19.290882428807176), ('epistl', 18.28532690024091), ('aladdin', 16.472927010282717), ('forest', 15.972479578958051), ('word', 14.897474370993647), ('bet', 14.011322719216535), ('formul', 13.872733047811504), ('wealth', 13.807224650294263)]

Output for the whole sample with stop-word removal (TF):

train_data_tf04 = tf.transform(train_data04)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf04.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('definit', 42.19836218494229), ('kuhn', 41.94033888116838), ('netcom', 38.49189557572986), ('jhu', 38.43222500691624), ('invalid', 36.441798363953794), ('pip', 36.20820897355665), ('difficulti', 34.694153609728126), ('sincer', 33.10807332706017), ('firmli', 32.20601101295828), ('thank', 28.946781478743482), ('deem', 26.955539997275125), ('threw', 26.0987325817474), ('olympu', 25.200411794386508), ('uucp', 25.15020882674478), ('00', 22.847118231150784), ('lizard', 22.24617480778731), ('uu', 21.080078313079557), ('financi', 21.00989280692224), ('hystaspes_', 20.947258397082738), ('somat', 20.37436327922064)]

Output for class 0 with stop-word removal (TF):

train_data_tf05 = tf.transform(train_data05)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf05.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('sincer', 32.65483761325764), ('netcom', 26.531791875665423), ('olympu', 24.091705286864663), ('00', 22.292416466161598), ('somat', 20.22797546329431), ('chb', 17.557342544980045), ('discoveri', 16.378840818878814), ('ramada', 15.73734249317387), ('hystaspes_', 15.590855972723451), ('smartdrug', 15.570422707355517), ('uucp', 15.358457141843216), ('kuhn', 15.092333169147988), ('linux', 14.857875711021709), ('difficulti', 14.641828043137563), ('depriv', 14.286874745376672), ('alejandro', 13.878085620513344), ('10', 12.252398950200005), ('attent', 11.955520632053954), ('firmli', 11.362005986278653), ('lawrenc', 11.27035006975232)]

Output for class 1 with stop-word removal (TF):

train_data_tf06 = tf.transform(train_data06)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf06.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('definit', 21.589878916084416), ('jhu', 21.04119998231627), ('difficulti', 19.297138878101315), ('kuhn', 17.73731563284006), ('pip', 17.405873900275285), ('invalid', 15.294268932063357), ('thank', 14.6495413220078), ('mouth', 14.026245731941556), ('threw', 13.778853949223942), ('fist', 13.314589187179175), ('deem', 12.997218833062192), ('apocalypt', 12.908770782465169), ('sponsorship', 12.801233869447662), ('predispos', 12.652476415348302), ('farrar', 12.59139169317015), ('firmli', 12.457034152057558), ('speak', 12.119701910783679), ('bro', 12.043005411898942), ('huxley', 12.043005411898942), ('music', 12.043005411898942)]

Output for class 2 with stop-word removal (TF):

train_data_tf07 = tf.transform(train_data07)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf07.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('financi', 20.434219703353282), ('pip', 15.544106122265065), ('inner', 13.646184382428856), ('definit', 12.819186474289085), ('invalid', 12.485953360589166), ('thank', 11.524269384086313), ('jhu', 11.05933486881957), ('bundl', 9.897003235170516), ('kuhn', 9.110690079180278), ('sized', 8.816862412082582), ('deem', 8.790042992168944), ('firmli', 8.386970874622094), ('ct', 8.334783193358227), ('audibl', 7.864683989743237), ('buoyanc', 7.60326957536585), ('astonish', 7.558513996512791), ('warranti', 7.2329456679772575), ('sin', 6.865274433532172), ('threw', 6.785575289449005), ('prize', 6.488367064561185)]

Output for the whole sample with stemming (TF):

tf = TfidfTransformer(use_idf=False).fit(train_data_stem)
train_data_tf_stem = tf.transform(train_data_stem)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tempera', 401.81008124918867), ('thrush', 246.43616002561686), ('nitya', 220.04264038039668), ('align', 208.92142892485631), ('indians', 171.16656294896478), ('honesti', 150.94494561031672), ('infanticid', 147.44374510244367), ('eyebal', 146.91482648292387), ('televis', 127.33554116137952), ('yogurt', 105.57423200403775), ('gold', 79.221066957697), ('wis', 75.18014089604114), ('test', 72.51020616551847), ('analog', 72.47888994113973), ('authur', 69.19216169860725), ('nth', 68.8353980802429), ('ncr', 66.23585307278694), ('hill', 56.35525307163919), ('delusion', 55.063274509705), ('norman', 53.153453200297555)]

Output for class 0 with stemming (TF):

train_data_tf_stem01 = tf.transform(train_data_stem01)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem01.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tempera', 94.04935491072541), ('eyebal', 92.96362352155147), ('align', 70.79112133765337), ('thrush', 66.92086119240213), ('honesti', 46.73861863339445), ('nitya', 45.20242858920657), ('infanticid', 39.78767019234924), ('indians', 35.220078052626214), ('nth', 34.332996471572976), ('wis', 33.205533346101404), ('yogurt', 32.2425480897156), ('gold', 31.14662753051662), ('sharon', 24.510514206247038), ('hill', 21.988642203421733), ('nl', 21.720431196979277), ('analog', 21.553662886201675), ('test', 20.28116495452412), ('lynch', 20.21743632025493), ('mp', 19.880550608315847), ('ahead', 19.315159932279066)]

Output for class 1 with stemming (TF):

train_data_tf_stem02 = tf.transform(train_data_stem02)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem02.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tempera', 184.1512866766698), ('thrush', 112.87914637294261), ('nitya', 108.444862358752), ('indians', 89.89711469136624), ('align', 87.3596328802749), ('infanticid', 77.22934287223913), ('honesti', 68.46334225840928), ('televis', 61.52419000124554), ('yogurt', 38.0706252510112), ('eyebal', 35.98989783048909), ('authur', 34.32259962899431), ('test', 33.07011122143357), ('analog', 31.14306618264141), ('gold', 30.872613979039862), ('ncr', 29.897706867671953), ('wis', 27.37113007333662), ('delusion', 24.46925300850209), ('norman', 24.413594141530663), ('anoth', 23.21997127767892), ('mistaken', 23.12020789565872)]

Output for class 2 with stemming (TF):

train_data_tf_stem03 = tf.transform(train_data_stem03)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem03.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('tempera', 123.60943966179349), ('thrush', 66.63615246027227), ('nitya', 66.39534943243824), ('align', 50.77067470692797), ('televis', 49.63967938013265), ('indians', 46.04937020497246), ('honesti', 35.74298471851306), ('yogurt', 35.26105866331091), ('infanticid', 30.426732037855363), ('ncr', 22.75197862101842), ('authur', 22.2099437447239), ('analog', 19.78216087229663), ('test', 19.158929989560644), ('eyebal', 17.96130513088313), ('gold', 17.20182544814053), ('delusion', 16.466740215999895), ('anoth', 16.15824068007798), ('wis', 14.603477476603075), ('bloodlet', 13.774900026852691), ('vol', 13.695637430060259)]

Output for the whole sample with stop-word removal and stemming (TF):

train_data_tf_stem04 = tf.transform(train_data_stem04)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem04.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('thi', 119.68411086611744), ('wa', 67.8130647845095), ('use', 53.0711584791655), ('ha', 51.626233026528816), ('like', 41.828280956114455), ('ani', 41.39152359405588), ('know', 37.9050419504865), ('new', 36.165670658450054), ('doe', 33.61151211578544), ('edu', 33.01669813235763), ('just', 33.004399253703504), ('peopl', 32.72368676176806), ('sale', 32.05565812608838), ('pleas', 31.290983063620995), ('offer', 30.698152509386688), ('think', 30.54565526782595), ('hi', 30.38787760294542), ('good', 30.26122588815749), ('time', 28.452666865132706), ('includ', 28.415345939750672)]

Output for class 0 with stop-word removal and stemming (TF):

train_data_tf_stem05 = tf.transform(train_data_stem05)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem05.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('sale', 31.591562426243993), ('offer', 28.805502550285986), ('thi', 28.28394413892743), ('use', 25.857801323838395), ('new', 25.0919622282373), ('includ', 22.68719596527167), ('00', 21.959919582798438), ('sell', 21.361979341324773), ('pleas', 20.843770936340054), ('ship', 19.873997035254465), ('price', 17.411116674387507), ('condit', 17.043775025500484), ('mail', 16.239637850800797), ('ask', 15.690863391115833), ('drive', 15.587622555928782), ('email', 15.549580299988085), ('ha', 15.531062831057048), ('like', 14.588858566159105), ('look', 14.151025390415866), ('edu', 14.086486221599179)]

Output for class 1 with stop-word removal and stemming (TF):

train_data_tf_stem06 = tf.transform(train_data_stem06)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem06.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('thi', 55.971573297030275), ('wa', 32.31353830566072), ('ha', 25.95505265472963), ('ani', 22.33128544375682), ('know', 20.450948164730875), ('use', 19.056635939142186), ('edu', 18.263792739201925), ('like', 17.787831589506578), ('doe', 16.61238212946485), ('peopl', 15.890083890356072), ('think', 15.625262532822678), ('time', 14.685409516342581), ('just', 13.844498079975486), ('effect', 13.09896125699551), ('msg', 12.864695193945531), ('food', 12.85761355532246), ('gordon', 12.661294467723826), ('diseas', 12.49352264416372), ('bank', 12.40319784529103), ('caus', 12.312875825395574)]

Output for class 2 with stop-word removal and stemming (TF):

train_data_tf_stem07 = tf.transform(train_data_stem07)
x = list(zip(vect.get_feature_names(), np.ravel(train_data_tf_stem07.sum(axis=0))))

def SortbyTF(inputStr):
    return inputStr[1]

x.sort(key=SortbyTF, reverse=True)
print(x[:20])

[('thi', 35.42859343015953), ('wa', 25.68486517557042), ('god', 18.952767252233688), ('christian', 17.988216233103145), ('hi', 14.550540659362783), ('peopl', 13.782319501874834), ('say', 13.270867799637418), ('jesu', 12.283127952715487), ('think', 12.093941266157671), ('did', 11.92943972135626), ('know', 11.153315193164616), ('just', 11.057138983847045), ('doe', 10.325487270114326), ('ha', 10.140117540742123), ('like', 9.451590800448765), ('believ', 9.016125161400216), ('ani', 8.342630762560646), ('moral', 8.242981256858124), ('use', 8.156721216184955), ('onli', 7.7708407333348095)]

6. Results of step 5 as tables of the most frequent terms in the training sample and in each class separately.

For the whole sample, without stemming:

No.  Count                             TF                                TF-IDF
     No stop words    With stop words  No stop words    With stop words  No stop words    With stop words
1    00               the              definit          tone             kuhn             tone
2    people           of               kuhn             trip             jhu              trip
3    new              to               netcom           noon             invalid          noon
4    edu              and              jhu              adjust           netcom           adjust
5    don              in               invalid          ima              pip              ima
6    like             is               pip              harass           difficulti       impact
7    good             that             difficulti       epistl           sincer           tomographi
8    just             it               sincer           impact           definit          harass
9    know             for              firmli           tomographi       olympu           epistl
10   10               you              thank            youth            firmli           youth
11   use              this             deem             word             threw            trade
12   god              are              threw            forest           deem             agreement
13   time             with             olympu           trade            uucp             forest
14   think            not              uucp             agreement        thank            word
15   does             have             00               obviou           lizard           naturalist
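Tables of this kind can be assembled directly from the ranked lists printed in step 5 instead of being copied by hand. The following is a minimal sketch under that assumption. The function names `build_rank_table` and `format_table` and the column width are illustrative choices, not part of the original lab code.

```python
def build_rank_table(columns, n=15):
    """Turn ranked (term, weight) lists into table rows.

    columns: dict mapping a column title to a list of (term, weight) pairs,
             already sorted by descending weight.
    Returns (titles, rows); each row is [rank, term_1, ..., term_k], with ''
    where a list is shorter than n.
    """
    titles = list(columns)
    rows = []
    for rank in range(n):
        row = [rank + 1]
        for title in titles:
            ranked = columns[title]
            row.append(ranked[rank][0] if rank < len(ranked) else "")
        rows.append(row)
    return titles, rows


def format_table(titles, rows, width=17):
    """Render the rows as fixed-width plain text, one line per rank."""
    header = "No.".ljust(5) + "".join(title.ljust(width) for title in titles)
    lines = [header.rstrip()]
    for row in rows:
        cells = "".join(str(cell).ljust(width) for cell in row[1:])
        lines.append((str(row[0]).ljust(5) + cells).rstrip())
    return "\n".join(lines)
```

Passing the first 15 entries of each sorted list `x` under titles such as "TF, no stop words" would produce one column per run, which avoids transcription slips between the printed output and the table.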
