
- •Introduction
- •Introduction - What, Why, Who etc.
- •Why am I writing this?
- •What will I cover
- •Who should read it?
- •Why Python?
- •Other resources
- •Concepts
- •What do I need?
- •Generally
- •Python
- •QBASIC
- •What is Programming?
- •Back to BASICs
- •Let me say that again
- •A little history
- •The common features of all programs
- •Let's clear up some terminology
- •The structure of a program
- •Batch programs
- •Event driven programs
- •Getting Started
- •A word about error messages
- •The Basics
- •Simple Sequences
- •>>> print 'Hello there!'
- •>>>print 6 + 5
- •>>>print 'The total is: ', 23+45
- •>>>import sys
- •>>>sys.exit()
- •Using Tcl
- •And BASIC too...
- •The Raw Materials
- •Introduction
- •Data
- •Variables
- •Primitive Data Types
- •Character Strings
- •String Operators
- •String operators
- •BASIC String Variables
- •Tcl Strings
- •Integers
- •Arithmetic Operators
- •Arithmetic and Bitwise Operators
- •BASIC Integers
- •Tcl Numbers
- •Real Numbers
- •Complex or Imaginary Numbers
- •Boolean Values - True and False
- •Boolean (or Logical) Operators
- •Collections
- •Python Collections
- •List
- •List operations
- •Tcl Lists
- •Tuple
- •Dictionary or Hash
- •Other Collection Types
- •Array or Vector
- •Stack
- •Queue
- •Files
- •Dates and Times
- •Complex/User Defined
- •Accessing Complex Types
- •User Defined Operators
- •Python Specific Operators
- •More information on the Address example
- •More Sequences and Other Things
- •The joy of being IDLE
- •A quick comment
- •Sequences using variables
- •Order matters
- •A Multiplication Table
- •Looping - Or the art of repeating oneself!
- •FOR Loops
- •Here's the same loop in BASIC:
- •WHILE Loops
- •More Flexible Loops
- •Looping the loop
- •Other loops
- •Coding Style
- •Comments
- •Version history information
- •Commenting out redundant code
- •Documentation strings
- •Indentation
- •Variable Names
- •Modular Programming
- •Conversing with the user
- •>>> print raw_input("Type something: ")
- •BASIC INPUT
- •Reading input in Tcl
- •A word about stdin and stdout
- •Command Line Parameters
- •Tcl's Command line
- •And BASIC
- •Decisions, Decisions
- •The if statement
- •Boolean Expressions
- •Tcl branches
- •Case statements
- •Modular Programming
- •What's a Module?
- •Using Functions
- •BASIC: MID$(str$,n,m)
- •BASIC: ENVIRON$(str$)
- •Tcl: llength L
- •Python: pow(x,y)
- •Python: dir(m)
- •Using Modules
- •Other modules and what they contain
- •Tcl Functions
- •A Word of Caution
- •Creating our own modules
- •Python Modules
- •Modules in BASIC and Tcl
- •Handling Files and Text
- •Files - Input and Output
- •Counting Words
- •BASIC and Tcl
- •BASIC Version
- •Tcl Version
- •Handling Errors
- •The Traditional Way
- •The Exceptional Way
- •Generating Errors
- •Tcl's Error Mechanism
- •BASIC Error Handling
- •Advanced Topics
- •Recursion
- •Note: This is a fairly advanced topic and for most applications you don't need to know anything about it. Occasionally, it is so useful that it is invaluable, so I present it here for your study. Just don't panic if it doesn't make sense straight away.
- •What is it?
- •Recursing over lists
- •Object Oriented Programming
- •What is it?
- •Data and Function - together
- •Defining Classes
- •Using Classes
- •Same thing, Different thing
- •Inheritance
- •The BankAccount class
- •The InterestAccount class
- •The ChargingAccount class
- •Testing our system
- •Namespaces
- •Introduction
- •Python's approach
- •And BASIC too
- •Event Driven Programming
- •Simulating an Event Loop
- •A GUI program
- •GUI Programming with Tkinter
- •GUI principles
- •A Tour of Some Common Widgets
- •>>> F = Frame(top)
- •>>>F.pack()
- •>>>lHello = Label(F, text="Hello world")
- •>>>lHello.pack()
- •>>> lHello.configure(text="Goodbye")
- •>>> lHello['text'] = "Hello again"
- •>>> F.master.title("Hello")
- •>>> bQuit = Button(F, text="Quit", command=F.quit)
- •>>>bQuit.pack()
- •>>>top.mainloop()
- •Exploring Layout
- •Controlling Appearance using Frames and the Packer
- •Adding more widgets
- •Binding events - from widgets to code
- •A Short Message
- •The Tcl view
- •Wrapping Applications as Objects
- •An alternative - wxPython
- •Functional Programming
- •What is Functional Programming?
- •How does Python do it?
- •map(aFunction, aSequence)
- •filter(aFunction, aSequence)
- •reduce(aFunction, aSequence)
- •lambda
- •Other constructs
- •Short Circuit evaluation
- •Conclusions
- •Other resources
- •Conclusions
- •A Case Study
- •Counting lines, words and characters
- •Counting sentences instead of lines
- •Turning it into a module
- •getCharGroups()
- •getPunctuation()
- •The final grammar module
- •Classes and objects
- •Text Document
- •HTML Document
- •Adding a GUI
- •Refactoring the Document Class
- •Designing a GUI
- •References
- •Books to read
- •Python
- •BASIC
- •General Programming
- •Object Oriented Programming
- •Other books worth reading are:
- •Web sites to visit
- •Languages
- •Python
- •BASIC
- •Other languages of interest
- •Programming in General
- •Object Oriented Programming
- •Projects to try
- •Topics for further study
The final grammar module
The only thing remaining is to improve the reporting to include the punctuation characters and the counts. Replace the existing reportStats() function with this:
def reportStats():
    global sentence_count, clause_count
    for p in stop_tokens:
        sentence_count = sentence_count + punctuation_counts[p]
    for c in punctuation_counts.keys():
        clause_count = clause_count + punctuation_counts[c]
    print format % (sys.argv[1],
                    para_count, line_count, sentence_count,
                    clause_count, len(groups))
    print "The following punctuation characters were used:"
    for p in punctuation_counts.keys():
        print "\t%s\t:\t%3d" % (p, punctuation_counts[p])
If you have carefully stitched all the above functions in place you should now be able to type:
C:> python grammar.py myfile.txt
and get a report on the stats for your file myfile.txt (or whatever it's really called). How useful this is to you is debatable, but hopefully reading through the evolution of the code has helped you get some idea of how to create your own programs. The main thing is to try things out. There's no shame in trying several approaches; often you learn valuable lessons in the process.
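The two totals that reportStats() builds up can be checked in isolation. The following is only a sketch: the punctuation counts are made-up sample data rather than output from a real file, and print is called with a single formatted string so that the snippet also runs under newer Python versions.

```python
# Sentences are counted from the stop tokens; clauses from all punctuation.
stop_tokens = ['.', '?', '!']

# Made-up sample counts, standing in for what the analysis would collect.
punctuation_counts = {'.': 3, '?': 1, '!': 0, ',': 4, ';': 1}

sentence_count = 0
for p in stop_tokens:
    sentence_count = sentence_count + punctuation_counts[p]

clause_count = 0
for c in punctuation_counts.keys():
    clause_count = clause_count + punctuation_counts[c]

print("%d sentences, %d clauses" % (sentence_count, clause_count))
```

With this sample data the sentence total is 4 and the clause total is 9, because every stop token also ends a clause.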
To conclude our course we will rework the grammar module to use OO techniques. In the process you will see how an OO approach results in modules which are even more flexible for the user and more extensible too.
Classes and objects
One of the biggest problems for the user of our module is the reliance on global variables. This means that it can only analyze one document at a time; any attempt to handle more than that will result in the global values being overwritten.
By moving these globals into a class we can then create multiple instances of the class (one per file) and each instance gets its own set of variables. Further, by making the methods sufficiently granular we can create an architecture whereby it is easy for the creator of a new type of document object to modify the search criteria to cater for the rules of the new type (e.g. by rejecting all HTML tags from the word list).
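To see why per-instance variables solve the overwriting problem, here is a minimal sketch. WordCounter is a hypothetical stand-in invented for this illustration, not part of the module we are building, and it takes its text directly rather than reading a file.

```python
class WordCounter:
    """Hypothetical stand-in for a document class: all state lives on the instance."""

    def __init__(self, text):
        self.text = text
        self.word_count = 0   # would have been a module-level global before

    def analyze(self):
        # Store the result on this instance only.
        self.word_count = len(self.text.split())
        return self.word_count


first = WordCounter("the quick brown fox")
second = WordCounter("hello world")
first.analyze()
second.analyze()

# Each instance keeps its own count; analyzing one does not disturb the other.
print("%d %d" % (first.word_count, second.word_count))
```

Had word_count been a global, the second call to analyze() would have replaced the first result.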
Our first attempt at this is:
#! /usr/local/bin/python
################################
#Module: document.py
#Author: A.J. Gauld
#Date: 2000/08/12
#Version: 2.0
################################
#This module provides a Document class which
#can be subclassed for different categories of
#Document(text, HTML, Latex etc). Text and HTML are
#provided as samples.
#
#Primary services available include
#- getCharGroups(),
#- getWords(),
#- reportStats().
################################
import sys,string

class Document:
    def __init__(self, filename):
        self.filename = filename
        self.para_count = 1
        self.line_count, self.sentence_count, self.clause_count, self.word_count = 0,0,0,0
        self.alphas = string.letters + string.digits
        self.stop_tokens = ['.','?','!']
        self.punctuation_chars = ['&','(',')','-',';',':',','] + self.stop_tokens
        self.lines = []
        self.groups = []
        self.punctuation_counts = {}
        for c in self.punctuation_chars + self.stop_tokens:
            self.punctuation_counts[c] = 0
        self.format = """%s contains:
%d paragraphs, %d lines and %d sentences.
These in turn contain %d clauses and a total of %d words."""

    def getLines(self):
        try:
            self.infile = open(self.filename,"r")
            self.lines = self.infile.readlines()
        except:
            print "Failed to read file ",self.filename
            sys.exit()

    def getCharGroups(self, lines):
        for line in lines:
            line = line[:-1]  # lose the '\n' at the end
            self.line_count = self.line_count + 1
            if len(line) == 0:  # empty => para break
                self.para_count = self.para_count + 1
            else:
                self.groups = self.groups + string.split(line)

    def getWords(self):
        pass

    def reportStats(self, paras=1, lines=1, sentences=1, words=1, punc=1):
        pass

    def Analyze(self):
        self.getLines()
        self.getCharGroups(self.lines)
        self.getWords()
        self.reportStats()

class TextDocument(Document):
    pass

class HTMLDocument(Document):
    pass

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python document.py <filename>"
        sys.exit()
    else:
        D = Document(sys.argv[1])
        D.Analyze()
Now to implement the class we need to define the getWords method. We could simply copy what we did in the previous version and create a trim method; however, we want the OO version to be easily extensible, so instead we'll break getWords down into a series of steps. Then in subclasses we only need to override the substeps and not the whole getWords method. This should allow a much wider scope for dealing with different types of document.
Specifically we will add methods to reject groups which we recognise as invalid, and to trim unwanted characters from the front and from the back. Thus we add three methods to Document and implement getWords in terms of these methods.
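class Document:
    # .... as above
    def getWords(self):
        # store each trimmed word back in the list, otherwise
        # the results of ltrim and rtrim are simply thrown away
        for i in range(len(self.groups)):
            w = self.ltrim(self.groups[i])
            self.groups[i] = self.rtrim(w)
        self.removeExceptions()

    def removeExceptions(self):
        pass

    def ltrim(self,word):
        pass

    def rtrim(self,word):
        pass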
Notice, however, that we define the bodies of the three new methods with the single command pass, which does absolutely nothing. Instead we will define how these methods operate for each concrete document type.
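The idea of fixing the overall algorithm in one method and leaving the individual steps to subclasses can be sketched in miniature. The Trimmer and QuoteTrimmer classes below are hypothetical names invented for this illustration. Unlike the pass bodies above, the default steps here return the word unchanged so that the base class can be run on its own; that is a choice made for the sketch, not a requirement of the technique.

```python
class Trimmer:
    """Hypothetical cut-down document: get_words is fixed, the steps are hooks."""

    def __init__(self, groups):
        self.groups = list(groups)

    def get_words(self):
        # The overall algorithm is written once, in terms of the hooks.
        for i in range(len(self.groups)):
            w = self.ltrim(self.groups[i])
            self.groups[i] = self.rtrim(w)
        self.remove_exceptions()
        return self.groups

    # Default hooks change nothing; subclasses override only what they need.
    def ltrim(self, word):
        return word

    def rtrim(self, word):
        return word

    def remove_exceptions(self):
        pass


class QuoteTrimmer(Trimmer):
    """Overrides the two trim hooks and inherits get_words untouched."""

    def ltrim(self, word):
        return word.lstrip('"')

    def rtrim(self, word):
        return word.rstrip('"')


words = QuoteTrimmer(['"hello', 'world"']).get_words()
print(words)
```

QuoteTrimmer never mentions get_words, yet calling it on an instance runs the subclass's versions of ltrim and rtrim.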
Text Document
A text document looks like:
class TextDocument(Document):
    def ltrim(self,word):
        while (len(word) > 0) and (word[0] not in self.alphas):
            ch = word[0]
            # count the character in the dictionary set up in __init__
            if ch in self.punctuation_counts.keys():
                self.punctuation_counts[ch] = self.punctuation_counts[ch] + 1
            word = word[1:]
        return word

    def rtrim(self,word):
        while (len(word) > 0) and (word[-1] not in self.alphas):
            ch = word[-1]
            if ch in self.punctuation_counts.keys():
                self.punctuation_counts[ch] = self.punctuation_counts[ch] + 1
            word = word[:-1]
        return word

    def removeExceptions(self):
        top = len(self.groups)
        n = 0
        while n < top:
            if (len(self.groups[n]) == 0):
                del(self.groups[n])
                top = top - 1
            else:
                # only move on if nothing was deleted, otherwise
                # the item that slid into position n would be skipped
                n = n+1
The trim functions are virtually identical to our grammar.py module's trim function, but split into two. The removeExceptions function has been defined to remove blank words.
Notice that I have changed the structure of the latter method to use a while loop instead of the previous for. This is because during testing a bug was found whereby if we deleted elements from the list the range (calculated at the beginning) still had the original length and we wound up trying to access members of the list beyond the end. To avoid that we use a while loop and adjust the maximum index each time we remove an element.
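The problem is easy to reproduce on a small list. The two functions below are hypothetical helpers written for this illustration, not part of document.py. In the while version I only advance the index when nothing was deleted, so that consecutive blank entries are all caught.

```python
def remove_blanks_for(groups):
    """Buggy: range() is worked out once, so the index runs past the shrunken list."""
    for n in range(len(groups)):
        if len(groups[n]) == 0:
            del groups[n]
    return groups


def remove_blanks_while(groups):
    """Safe: shrink the limit on each delete and re-test the same position."""
    top = len(groups)
    n = 0
    while n < top:
        if len(groups[n]) == 0:
            del groups[n]
            top = top - 1
        else:
            n = n + 1
    return groups


try:
    remove_blanks_for(['a', '', 'b'])
    failed = False
except IndexError:
    failed = True

cleaned = remove_blanks_while(['a', '', '', 'b'])
print(failed)
print(cleaned)
```

The for version raises an IndexError on its last pass because the list now has two items but the range still runs to index 2.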
HTML Document
For HTML we will use a feature of Python that we haven't seen before: regular expressions. These are special string patterns that we can use for finding complex strings. Here we use them to remove anything between < and >. This means we will need to redefine getWords. The actual stripping of punctuation should be the same as for plain text, so instead of inheriting directly from Document we will inherit from TextDocument and reuse its trim methods.
Thus HTMLDocument looks like:
class HTMLDocument(TextDocument):
    def removeExceptions(self):
        """ use regular expressions to remove all <.+?> """
        import re
        tag = re.compile("<.+?>")  # use non greedy re
        L = 0
        while L < len(self.lines):
            if len(self.lines[L]) > 1:  # if it's not blank
                self.lines[L] = tag.sub('', self.lines[L])
                if len(self.lines[L]) == 1:
                    del(self.lines[L])
                else: L = L+1
            else: L = L+1

    def getWords(self):
        self.removeExceptions()
        for i in range(len(self.groups)):
            w = self.groups[i]
            w = self.ltrim(w)
            self.groups[i] = self.rtrim(w)
        TextDocument.removeExceptions(self)  # now strip empty words
Note 1: The only thing to note here is the call to self.removeExceptions before trimming and then calling TextDocument.removeExceptions. If we had relied on the inherited getWords it would have called our removeExceptions after trimming which we don't want.
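The "non greedy" comment in the code deserves a quick demonstration. The sketch below uses a made-up line of HTML and compares the pattern used above with its greedy cousin; the variable names are mine, chosen just for this illustration.

```python
import re

line = "<p>Hello <b>world</b></p>\n"   # made-up sample line

non_greedy = re.compile("<.+?>")   # each match stops at the first '>'
greedy = re.compile("<.+>")        # one match runs on to the last '>' on the line

stripped = non_greedy.sub('', line)
over_stripped = greedy.sub('', line)

print(repr(stripped))
print(repr(over_stripped))
```

The non greedy pattern removes each tag separately and leaves the words behind, whereas the greedy one swallows everything from the first < to the last > and leaves only the newline. A line of length 1 like that is exactly what the removeExceptions method above treats as empty and deletes.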