
The final grammar module

The only thing remaining is to improve the reporting to include the punctuation characters and the counts. Replace the existing reportStats() function with this:

def reportStats():
    global sentence_count, clause_count
    for p in stop_tokens:
        sentence_count = sentence_count + punctuation_counts[p]
    for c in punctuation_counts.keys():
        clause_count = clause_count + punctuation_counts[c]
    print format % (sys.argv[1], para_count, line_count,
                    sentence_count, clause_count, len(groups))
    print "The following punctuation characters were used:"
    for p in punctuation_counts.keys():
        print "\t%s\t:\t%3d" % (p, punctuation_counts[p])

If you have carefully stitched all the above functions in place you should now be able to type:

C:> python grammar.py myfile.txt

and get a report on the stats for your file myfile.txt (or whatever it's really called). How useful this is to you is debatable, but hopefully reading through the evolution of the code has given you some idea of how to create your own programs. The main thing is to try things out; there is no shame in trying several approaches, and you often learn valuable lessons in the process.

To conclude our course we will rework the grammar module to use OO techniques. In the process you will see how an OO approach results in modules which are even more flexible for the user and more extensible too.

Classes and objects

One of the biggest problems for the user of our module is its reliance on global variables. This means it can only analyze one document at a time; any attempt to handle more than that results in the global values being overwritten.

By moving these globals into a class we can create multiple instances of the class (one per file), and each instance gets its own set of variables. Further, by making the methods sufficiently granular we can create an architecture whereby the creator of a new type of document object can easily modify the search criteria to cater for the rules of the new type (e.g. by rejecting all HTML tags from the word list).
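The payoff of instance variables over globals can be sketched with a toy counter class. (The WordCounter name and its files are hypothetical, purely for illustration; they are not part of our module.)

```python
class WordCounter:
    """Each instance keeps its own counts, so two files
    can be analyzed side by side without clashing."""
    def __init__(self, filename):
        self.filename = filename
        self.word_count = 0

    def count(self, text):
        # state lives on self, not in module-level globals
        self.word_count = self.word_count + len(text.split())

a = WordCounter("first.txt")
b = WordCounter("second.txt")
a.count("one two three")
b.count("four five")
print(a.word_count)   # 3 -- independent of b
print(b.word_count)   # 2
```

With globals, the second call to count would have added to the first file's total; with instances, each document's statistics are isolated.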

Our first attempt at this is:

#! /usr/local/bin/python
################################
# Module: document.py
# Author: A.J. Gauld
# Date: 2000/08/12
# Version: 2.0
################################
# This module provides a Document class which
# can be subclassed for different categories of
# Document (text, HTML, LaTeX etc). Text and HTML are
# provided as samples.
#
# Primary services available include
#  - getCharGroups(),
#  - getWords(),
#  - reportStats().
################################
import sys, string

class Document:
    def __init__(self, filename):
        self.filename = filename
        self.para_count = 1
        self.line_count, self.sentence_count, self.clause_count, self.word_count = 0,0,0,0
        self.alphas = string.letters + string.digits
        self.stop_tokens = ['.','?','!']
        self.punctuation_chars = ['&','(',')','-',';',':',','] + self.stop_tokens
        self.lines = []
        self.groups = []
        self.punctuation_counts = {}
        for c in self.punctuation_chars + self.stop_tokens:
            self.punctuation_counts[c] = 0
        self.format = """%s contains:
%d paragraphs, %d lines and %d sentences.
These in turn contain %d clauses and a total of %d words."""

    def getLines(self):
        try:
            self.infile = open(self.filename, "r")
            self.lines = self.infile.readlines()
        except:
            print "Failed to read file ", self.filename
            sys.exit()

    def getCharGroups(self, lines):
        for line in lines:
            line = line[:-1]   # lose the '\n' at the end
            self.line_count = self.line_count + 1
            if len(line) == 0:   # empty => para break
                self.para_count = self.para_count + 1
            else:
                self.groups = self.groups + string.split(line)

    def getWords(self):
        pass

    def reportStats(self, paras=1, lines=1, sentences=1, words=1, punc=1):
        pass

    def Analyze(self):
        self.getLines()
        self.getCharGroups(self.lines)
        self.getWords()
        self.reportStats()

class TextDocument(Document):
    pass

class HTMLDocument(Document):
    pass

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python document.py <filename>"
        sys.exit()
    else:
        D = Document(sys.argv[1])
        D.Analyze()

Now to implement the class we need to define the getWords method. We could simply copy what we did in the previous version and create a trim method; however, we want the OO version to be easily extensible, so instead we'll break getWords down into a series of steps. Then in subclasses we only need to override the substeps rather than the whole getWords method. This allows a much wider scope for dealing with different types of document.

Specifically, we will add methods to reject groups which we recognise as invalid and to trim unwanted characters from the front and from the back of words. Thus we add three methods to Document and implement getWords in terms of them.

class Document:
    # .... as above
    def getWords(self):
        for i in range(len(self.groups)):
            # strings are immutable, so store the trimmed result back
            w = self.ltrim(self.groups[i])
            self.groups[i] = self.rtrim(w)
        self.removeExceptions()

    def removeExceptions(self):
        pass

    def ltrim(self, word):
        pass

    def rtrim(self, word):
        pass

Notice, however, that we define the bodies with the single statement pass, which does absolutely nothing. Instead we will define how these methods operate for each concrete document type.
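This pattern, a fixed driver method calling overridable substeps, can be sketched with a toy example. (The Report and TextReport names here are hypothetical and are not part of document.py.)

```python
class Report:
    def render(self):
        # the skeleton is fixed here; subclasses fill in the steps
        return self.header() + self.body()

    def header(self):
        pass   # placeholder, overridden in subclasses

    def body(self):
        pass   # placeholder, overridden in subclasses

class TextReport(Report):
    def header(self):
        return "== Stats ==\n"

    def body(self):
        return "42 words\n"

print(TextReport().render())   # == Stats ==  /  42 words
```

The base class never needs to change: each new kind of report only supplies its own header and body, just as each Document subclass will supply its own trim and exception rules.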

Text Document

A text document looks like:

class TextDocument(Document):
    def ltrim(self, word):
        while (len(word) > 0) and (word[0] not in self.alphas):
            ch = word[0]
            if ch in self.punctuation_counts.keys():
                self.punctuation_counts[ch] = self.punctuation_counts[ch] + 1
            word = word[1:]
        return word

    def rtrim(self, word):
        while (len(word) > 0) and (word[-1] not in self.alphas):
            ch = word[-1]
            if ch in self.punctuation_counts.keys():
                self.punctuation_counts[ch] = self.punctuation_counts[ch] + 1
            word = word[:-1]
        return word

    def removeExceptions(self):
        top = len(self.groups)
        n = 0
        while n < top:
            if len(self.groups[n]) == 0:
                del(self.groups[n])
                top = top - 1
            else:
                n = n + 1

The trim methods are virtually identical to our grammar.py module's trim function, simply split into two. The removeExceptions method has been defined to remove blank words.

Notice that I have changed the structure of the latter method to use a while loop instead of the previous for loop. This is because during testing a bug was found: if we deleted elements from the list, the range (calculated at the beginning) still had the original length, and we wound up trying to access members of the list beyond its end. To avoid that we use a while loop, reduce the maximum index each time we remove an element, and only advance the position when we don't delete.
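The bug, and the while-loop fix, can be reproduced in isolation. Here is a minimal sketch using an invented word list:

```python
words = ['', 'alpha', '', 'beta']

# A for loop over range(len(words)) computes the range once, so
# deleting elements mid-loop eventually raises IndexError.
# The while loop below shrinks its bound as the list shrinks:
top = len(words)
n = 0
while n < top:
    if len(words[n]) == 0:
        del words[n]
        top = top - 1     # list got shorter, so lower the bound
    else:
        n = n + 1         # only advance past items we keep

print(words)   # ['alpha', 'beta']
```

Note that n is only incremented when nothing is deleted; after a deletion the next element has slid into position n, so advancing would skip it.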

HTML Document

For HTML we will use a feature of Python that we haven't seen before: regular expressions. These are special string patterns that we can use to find complex strings. Here we use one to remove anything between < and >. This means we need to redefine getWords. The actual stripping of punctuation should be the same as for plain text, so instead of inheriting directly from Document we inherit from TextDocument and reuse its trim methods.
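As a taster, here is a minimal sketch of the non-greedy pattern in action; the sample line is invented for illustration:

```python
import re

tag = re.compile("<.+?>")   # non-greedy: each match stops at the first '>'
line = "<b>bold</b> and <i>italic</i>"

print(tag.sub('', line))    # bold and italic

# A greedy "<.+>" would swallow everything between the
# first '<' and the last '>', wiping out the whole line:
print(re.sub("<.+>", '', line))   # (empty string)
```

The ? after .+ is what makes the pattern non-greedy; without it a single match would span across several tags and take the text between them too.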

Thus HTMLDocument looks like:

class HTMLDocument(TextDocument):
    def removeExceptions(self):
        """ use regular expressions to remove all <.+?> """
        import re
        tag = re.compile("<.+?>")   # use non-greedy re
        L = 0
        while L < len(self.lines):
            if len(self.lines[L]) > 1:   # if it's not blank
                self.lines[L] = tag.sub('', self.lines[L])
                if len(self.lines[L]) == 1:
                    del(self.lines[L])
                else:
                    L = L + 1
            else:
                L = L + 1

    def getWords(self):
        self.removeExceptions()
        for i in range(len(self.groups)):
            w = self.groups[i]
            w = self.ltrim(w)
            self.groups[i] = self.rtrim(w)
        TextDocument.removeExceptions(self)   # now strip empty words

The only thing to note here is the call to self.removeExceptions (which resolves to the HTML version and strips the tags) before trimming, followed by an explicit call to TextDocument.removeExceptions to strip the empty words afterwards. If we had relied on the inherited getWords, it would have called our removeExceptions after trimming, which is not what we want.
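The trick of naming a specific ancestor class to bypass an override can be sketched in isolation. (Base and Derived are hypothetical names, not part of document.py.)

```python
class Base:
    def clean(self):
        return "base clean"

class Derived(Base):
    def clean(self):
        return "derived clean"

    def run(self):
        # self.clean() picks the override in Derived; naming the
        # class explicitly forces the parent's version instead
        return (self.clean(), Base.clean(self))

print(Derived().run())   # ('derived clean', 'base clean')
```

This is exactly what HTMLDocument.getWords does: self.removeExceptions() dispatches to the HTML tag-stripper, while TextDocument.removeExceptions(self) deliberately reaches past it to the blank-word remover.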
