
- •Introduction
- •Introduction - What, Why, Who etc.
- •Why am I writing this?
- •What will I cover
- •Who should read it?
- •Why Python?
- •Other resources
- •Concepts
- •What do I need?
- •Generally
- •Python
- •QBASIC
- •What is Programming?
- •Back to BASICs
- •Let me say that again
- •A little history
- •The common features of all programs
- •Let's clear up some terminology
- •The structure of a program
- •Batch programs
- •Event driven programs
- •Getting Started
- •A word about error messages
- •The Basics
- •Simple Sequences
- •>>> print 'Hello there!'
- •>>>print 6 + 5
- •>>>print 'The total is: ', 23+45
- •>>>import sys
- •>>>sys.exit()
- •Using Tcl
- •And BASIC too...
- •The Raw Materials
- •Introduction
- •Data
- •Variables
- •Primitive Data Types
- •Character Strings
- •String Operators
- •String operators
- •BASIC String Variables
- •Tcl Strings
- •Integers
- •Arithmetic Operators
- •Arithmetic and Bitwise Operators
- •BASIC Integers
- •Tcl Numbers
- •Real Numbers
- •Complex or Imaginary Numbers
- •Boolean Values - True and False
- •Boolean (or Logical) Operators
- •Collections
- •Python Collections
- •List
- •List operations
- •Tcl Lists
- •Tuple
- •Dictionary or Hash
- •Other Collection Types
- •Array or Vector
- •Stack
- •Queue
- •Files
- •Dates and Times
- •Complex/User Defined
- •Accessing Complex Types
- •User Defined Operators
- •Python Specific Operators
- •More information on the Address example
- •More Sequences and Other Things
- •The joy of being IDLE
- •A quick comment
- •Sequences using variables
- •Order matters
- •A Multiplication Table
- •Looping - Or the art of repeating oneself!
- •FOR Loops
- •Here's the same loop in BASIC:
- •WHILE Loops
- •More Flexible Loops
- •Looping the loop
- •Other loops
- •Coding Style
- •Comments
- •Version history information
- •Commenting out redundant code
- •Documentation strings
- •Indentation
- •Variable Names
- •Modular Programming
- •Conversing with the user
- •>>> print raw_input("Type something: ")
- •BASIC INPUT
- •Reading input in Tcl
- •A word about stdin and stdout
- •Command Line Parameters
- •Tcl's Command line
- •And BASIC
- •Decisions, Decisions
- •The if statement
- •Boolean Expressions
- •Tcl branches
- •Case statements
- •Modular Programming
- •What's a Module?
- •Using Functions
- •BASIC: MID$(str$,n,m)
- •BASIC: ENVIRON$(str$)
- •Tcl: llength L
- •Python: pow(x,y)
- •Python: dir(m)
- •Using Modules
- •Other modules and what they contain
- •Tcl Functions
- •A Word of Caution
- •Creating our own modules
- •Python Modules
- •Modules in BASIC and Tcl
- •Handling Files and Text
- •Files - Input and Output
- •Counting Words
- •BASIC and Tcl
- •BASIC Version
- •Tcl Version
- •Handling Errors
- •The Traditional Way
- •The Exceptional Way
- •Generating Errors
- •Tcl's Error Mechanism
- •BASIC Error Handling
- •Advanced Topics
- •Recursion
- •Note: This is a fairly advanced topic and for most applications you don't need to know anything about it. Occasionally, it is so useful that it is invaluable, so I present it here for your study. Just don't panic if it doesn't make sense straight away.
- •What is it?
- •Recursing over lists
- •Object Oriented Programming
- •What is it?
- •Data and Function - together
- •Defining Classes
- •Using Classes
- •Same thing, Different thing
- •Inheritance
- •The BankAccount class
- •The InterestAccount class
- •The ChargingAccount class
- •Testing our system
- •Namespaces
- •Introduction
- •Python's approach
- •And BASIC too
- •Event Driven Programming
- •Simulating an Event Loop
- •A GUI program
- •GUI Programming with Tkinter
- •GUI principles
- •A Tour of Some Common Widgets
- •>>> F = Frame(top)
- •>>>F.pack()
- •>>>lHello = Label(F, text="Hello world")
- •>>>lHello.pack()
- •>>> lHello.configure(text="Goodbye")
- •>>> lHello['text'] = "Hello again"
- •>>> F.master.title("Hello")
- •>>> bQuit = Button(F, text="Quit", command=F.quit)
- •>>>bQuit.pack()
- •>>>top.mainloop()
- •Exploring Layout
- •Controlling Appearance using Frames and the Packer
- •Adding more widgets
- •Binding events - from widgets to code
- •A Short Message
- •The Tcl view
- •Wrapping Applications as Objects
- •An alternative - wxPython
- •Functional Programming
- •What is Functional Programming?
- •How does Python do it?
- •map(aFunction, aSequence)
- •filter(aFunction, aSequence)
- •reduce(aFunction, aSequence)
- •lambda
- •Other constructs
- •Short Circuit evaluation
- •Conclusions
- •Other resources
- •Conclusions
- •A Case Study
- •Counting lines, words and characters
- •Counting sentences instead of lines
- •Turning it into a module
- •getCharGroups()
- •getPunctuation()
- •The final grammar module
- •Classes and objects
- •Text Document
- •HTML Document
- •Adding a GUI
- •Refactoring the Document Class
- •Designing a GUI
- •References
- •Books to read
- •Python
- •BASIC
- •General Programming
- •Object Oriented Programming
- •Other books worth reading are:
- •Web sites to visit
- •Languages
- •Python
- •BASIC
- •Other languages of interest
- •Programming in General
- •Object Oriented Programming
- •Projects to try
- •Topics for further study
sentence count = sum of('.', '?', '!')
clause count = sum of all punctuation (very poor definition...)
report paras, lines, sentences, clauses, groups, words.
foreach punctuation char:
    report count
That looks like we could create maybe 4 functions using the natural grouping above. This might help us build a module that could be reused either whole or in part.
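As a rough illustration of the two "sum of" lines in the pseudo code, here is a minimal sketch in modern Python 3 syntax. The sample counts in the dictionary are invented purely for the example, and the variable names simply mirror the ones the module uses later.

```python
# Invented sample data: how often each punctuation character was seen.
punctuation_counts = {'.': 3, '?': 1, '!': 2, ',': 5, ';': 1}
stop_tokens = ['.', '?', '!']

# sentence count = sum of the counts for '.', '?' and '!'
sentence_count = sum(punctuation_counts[c] for c in stop_tokens)

# clause count = sum of all punctuation (the "very poor definition")
clause_count = sum(punctuation_counts.values())

print(sentence_count, clause_count)
```

Keeping the counts in a dictionary keyed by character means both totals fall out of one data structure, which is the approach the module takes.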
Turning it into a module
The key functions are: getCharGroups(infile) and getPunctuation(wordList).
Let's see what we come up with based on the pseudo code...
#############################
# Module: grammar
# Created: A.J. Gauld, 2000,8,12
#
# Function:
#    counts paragraphs, lines, sentences, 'clauses', char groups,
#    words and punctuation for a prose like text file. It assumes
#    that sentences end with [.!?] and paragraphs have a blank line
#    between them. A 'clause' is simply a segment of sentence
#    separated by punctuation (braindead but maybe someday we'll
#    do better!)
#
# Usage: Basic usage takes a filename parameter and outputs all
#    stats. It's really intended that a second module use the
#    functions provided to produce more useful commands.
#############################
import string, sys

############################
# initialise global variables
para_count = 1
line_count, sentence_count, clause_count, word_count = 0,0,0,0
groups = []
punctuation_counts = {}
alphas = string.letters + string.digits
stop_tokens = ['.','?','!']
punctuation_chars = ['&','(',')','-',';',':',','] + stop_tokens
for c in punctuation_chars:
    punctuation_counts[c] = 0
format = """%s contains:
%d paragraphs, %d lines and %d sentences.
These in turn contain %d clauses and a total of %d words."""

############################
# Now define the functions that do the work
def getCharGroups(infile):
    pass

def getPunctuation(wordList):
    pass

def reportStats():
    print format % (sys.argv[1], para_count, line_count,
                    sentence_count, clause_count, word_count)

def Analyze(infile):
    getCharGroups(infile)
    getPunctuation(groups)
    reportStats()

# Make it run if called from the command line (in which
# case the 'magic' __name__ variable gets set to '__main__')
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python grammar.py <filename>"
        sys.exit()
    else:
        Document = open(sys.argv[1],"r")
        Analyze(Document)
Rather than trying to show the whole thing in one long listing, I'll discuss this skeleton and then we will look at each of the 3 significant functions in turn. To make the program work, however, you will need to paste it all together at the end.
The first thing to notice is the commenting at the top. This is common practice to let readers of the file get an idea of what it contains and how it should be used. The version information (author and date) is useful too if comparing results with someone else who may be using a more or less recent version.
The final section relies on a feature of Python: any module run from the command line is given the name "__main__". We can test the special, built-in __name__ variable, and if it is "__main__" we know the module is being run rather than just imported, so we execute the trigger code inside the if.
This trigger code includes a user friendly hint about how the program should be run if no filename is provided, or indeed if too many filenames are provided.
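The trigger code and its argument check can be tried in isolation. The sketch below uses modern Python 3 syntax (print as a function), and the helper names parse_args and main are hypothetical, introduced only for this illustration; they are not part of the grammar module.

```python
import sys

def parse_args(argv):
    """Return the filename if exactly one was given, otherwise None."""
    if len(argv) != 2:        # argv[0] is the script name itself
        return None
    return argv[1]

def main(argv):
    """Print a usage hint or an 'analyzing' message; return a status code."""
    filename = parse_args(argv)
    if filename is None:
        print("Usage: python grammar.py <filename>")
        return 1              # non-zero signals that the call was wrong
    print("Analyzing", filename)
    return 0

if __name__ == "__main__":
    # Reached only when the file is run directly, not when it is imported.
    main(sys.argv)
```

Putting the check in a small function makes it easy to call with different argument lists, for example main(["grammar.py"]) for the "no filename" case.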
Finally notice that the Analyze() function simply calls the other functions in the right order. Again this is quite common practice to allow a user to choose to either use all of the functionality in a straightforward manner (through Analyze()) or to call the low level primitive functions directly.
getCharGroups()
The pseudo code for this segment was:
foreach line in file:
    increment line count
    if line empty:
        increment paragraph count
    split line into character groups
We can implement this in Python with very little extra effort:
# use global counter variables and list of char groups
def getCharGroups(infile):
    global para_count, line_count, groups
    try:
        for line in infile.readlines():
            line_count = line_count + 1
            if len(line) == 1: # only newline => para break
                para_count = para_count + 1
            else:
                groups = groups + string.split(line)
    except:
        print "Failed to read file ", sys.argv[1]
        sys.exit()
Note 1: We have to use the global keyword here to declare the variables which are created outside of the function. If we didn't, then when we assign to them Python would create new variables of the same name local to this function. Changing these local variables would have no effect on the module level (or global) values.
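The effect described in Note 1 is easy to see in a tiny experiment. This is a sketch in modern Python 3 syntax; the names counter, bump_without_global and bump_with_global are made up for the illustration.

```python
counter = 0   # module level (global) variable

def bump_without_global():
    counter = 99          # creates a new local variable; the global is untouched
    return counter

def bump_with_global():
    global counter        # tell Python we mean the module level variable
    counter = counter + 1
    return counter

bump_without_global()
after_local = counter     # the global still holds its original value
bump_with_global()
after_global = counter    # the global has now been incremented
print(after_local, after_global)
```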
Note 2: We have used a try/except clause here to trap any errors, report the failure and exit the program.
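For comparison, here is one way the same counting logic might look in modern Python 3. It is only a sketch under a few assumptions of my own: it returns the results instead of updating globals, it treats any whitespace-only line as a paragraph break, and it reads from an in-memory io.StringIO object so that it can be run without a real file. The name get_char_groups and the sample text are invented for the example.

```python
import io

def get_char_groups(infile):
    """Return (para_count, line_count, groups) for an open text file."""
    para_count, line_count, groups = 1, 0, []
    for line in infile:
        line_count += 1
        if line.strip() == "":         # blank line => paragraph break
            para_count += 1
        else:
            groups.extend(line.split())  # split on whitespace
    return para_count, line_count, groups

sample = io.StringIO("Hello there, world!\n\nSecond paragraph here.\n")
paras, lines, groups = get_char_groups(sample)
print(paras, lines, groups)
```

Returning the values avoids the need for the global keyword, at the cost of the caller having to keep track of the results itself.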
getPunctuation()
This takes a little bit more effort and uses a couple of new features of Python.
The pseudo code looked like:
foreach character group:
    increment group count
    extract punctuation chars into a dictionary - {char:count}
    if no chars left:
        delete group
    else:
        increment word count
My first attempt looked like this:
def getPunctuation(wordList):
    global punctuation_counts
    for item in wordList:
        while item and (item[-1] not in alphas):
            p = item[-1]
            item = item[:-1]
            if p in punctuation_counts.keys():
                punctuation_counts[p] = punctuation_counts[p] + 1
            else:
                punctuation_counts[p] = 1
Notice that this does not include the final if/else clause of the pseudo-code version. I left it off for simplicity and because I felt that in practice very few words containing only punctuation characters would be found. We will however add it to the final version of the code.
Note 1: We have parameterised the wordList so that users of the module can supply their own list rather than being forced to work from a file.
Note 2: We assigned item[:-1] to item. This is known as slicing in Python and the colon simply says treat the index as a range. We could for example have specified item[3:6] to extract item[3], item[4] and item[5] into a new sequence.
The default range is the start or end of the sequence depending on which side of the colon is blank. Thus item[3:] would signify all members of item from item[3] to the end. This is another very useful Python feature. The original item is lost (and duly garbage collected) and the newly created sequence is assigned to item.
Note 3: We use a negative index to extract the last character from item. This is a very useful Python feature. Also we loop in case there are multiple punctuation characters at the end of a group.
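The slicing and negative indexing described in these notes can be tried on a single character group. The sketch below uses modern Python 3 syntax; the sample string is invented, and str.isalnum() stands in for the module's alphas test purely to keep the example self-contained.

```python
item = "(hello)!"

middle = item[3:6]   # the characters at index 3, 4 and 5
tail = item[3:]      # from index 3 to the end
head = item[:-1]     # everything except the last character
last = item[-1]      # the last character

# Loop in case there are several punctuation characters at the end.
stripped = item
while stripped and not stripped[-1].isalnum():
    stripped = stripped[:-1]

print(middle, tail, head, last, stripped)
```

Note that the opening bracket is still there after the loop, which is exactly the problem with trimming only from the right.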
In testing this it became obvious that we need to do the same at the front of a group too, since although closing brackets are detected opening ones aren't! To overcome this problem I will create a new function trim() that will remove punctuation from front and back of a single char group:
#########################################################
# Note trim uses recursion where the terminating condition
# is either 0 or -1. An "InvalidEnd" error is raised for
# anything other than -1, 0 or 2.
#########################################################
def trim(item, end = 2):
    """remove non alphas from left(0), right(-1) or both ends of item"""
    if end not in [0,-1,2]:
        raise ValueError("InvalidEnd")
    if end == 2:
        item = trim(item, 0)
        item = trim(item, -1)
    else:
        while (len(item) > 0) and (item[end] not in alphas):
            ch = item[end]
            if ch in punctuation_counts.keys():
                punctuation_counts[ch] = punctuation_counts[ch] + 1
            if end == 0: item = item[1:]
            if end == -1: item = item[:-1]
    return item   # strings can't be changed in place, so hand back the result
Notice how the use of recursion combined with a defaulted parameter enables us to define a single trim function which by default trims both ends, but by passing in an end value can be made to operate on only one end. The end values are chosen to reflect Python's indexing system: 0 for the left end and -1 for the right. Because Python strings cannot be changed in place, trim returns the trimmed string and the caller stores it. I originally wrote two trim functions, one for each end, but the amount of similarity made me realize that I could combine them using a parameter.
And getPunctuation becomes the nearly trivial:
def getPunctuation(wordList):
    for i in range(len(wordList)):
        wordList[i] = trim(wordList[i])
    # Now delete any empty 'words', working backwards so that
    # a deletion doesn't shift the items still to be checked
    for i in range(len(wordList)-1, -1, -1):
        if len(wordList[i]) == 0:
            del(wordList[i])
Note 1: This now includes the deletion of blank words.
Note 2: In the interests of reusability we might have been better to break trim down into smaller chunks yet. This would have enabled us to create a function for removing a single punctuation character from either front or back of a word and returning the character removed. Then another function would call that one repeatedly to get the end result. However since our module is really about producing statistics from text rather than general text processing that should properly involve creating a separate module which we could then import. But since it would only have the one function that doesn't seem too useful either. So I'll leave it as is!
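To see the trimming idea work end to end, here is a self-contained sketch in modern Python 3 syntax. It departs from the module in ways that are my own assumptions: the counts dictionary is passed in as a parameter rather than held as a global, every punctuation character met is counted (not only the predefined ones), and a new list is returned instead of deleting from the original. The names ALPHAS and get_punctuation and the sample words are invented for the illustration.

```python
import string

ALPHAS = string.ascii_letters + string.digits

def trim(item, counts, end=2):
    """Strip non-alphanumerics from the left (0), right (-1) or both (2) ends."""
    if end not in (0, -1, 2):
        raise ValueError("InvalidEnd")
    if end == 2:
        # Recursive case: trim the left end, then the right end.
        return trim(trim(item, counts, 0), counts, -1)
    while item and item[end] not in ALPHAS:
        ch = item[end]
        counts[ch] = counts.get(ch, 0) + 1     # tally the removed character
        item = item[1:] if end == 0 else item[:-1]
    return item

def get_punctuation(word_list, counts):
    """Return the trimmed, non-empty words; tally punctuation in counts."""
    trimmed = [trim(word, counts) for word in word_list]
    return [word for word in trimmed if word]   # drop empty 'words'

counts = {}
words = get_punctuation(["(Hello,", "world!)", "--", "ok"], counts)
print(words, counts)
```

Building a new list sidesteps the question of deleting items from a list while looping over it.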