

A Case Study

For this case study we are going to expand on the word counting program we developed earlier. We are going to create a program which mimics the Unix wc program in that it outputs the number of lines, words and characters in a file. We will go further than that, however, and also output the number of sentences, clauses, words, letters and punctuation characters in a text file. We will follow the development of this program stage by stage, gradually increasing its capability, then moving it into a module to make it reusable, and finally turning it into an OO implementation for maximum extensibility.

It will be a Python implementation, but at least the initial stages could be written in BASIC or Tcl instead. As we move to the more complex parts we will make increasing use of Python's built-in data structures, and therefore the difficulty of using BASIC will increase, although Tcl will still be an option. Finally, the OO aspects will only apply to Python.

Additional features that could be implemented but will be left as exercises for the reader are:

calculate the FOG index of the text,

calculate the number of unique words used and their frequency,

create a new version which analyses RTF files.

Counting lines, words and characters

Let's revisit the previous word counter:

import string

def numwords(s):
    list = string.split(s)
    return len(list)

inp = open("menu.txt","r")
total = 0

# accumulate totals for each line
for line in inp.readlines():
    total = total + numwords(line)

print "File had %d words" % total

inp.close()

We need to add a line and character count. The line count is easy: since we already loop over each line, we just need a variable to increment on each iteration of the loop. The character count is only marginally harder, since we can iterate over the list of words, adding their lengths into yet another variable.

We also need to make the program more general purpose by reading the name of the file from the command line or, if it is not provided, prompting the user for the name. (An alternative strategy would be to read from standard input, which is what the real wc does.)
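For comparison, here is a minimal sketch of that standard-input alternative (it is not part of the program we build below; it simply pipes text through the word counter the way the real wc does):

import sys, string

total = 0
# read lines from standard input instead of a named file
for line in sys.stdin.readlines():
    total = total + len(string.split(line))

print "Input had %d words" % total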


So the final wc looks like:

import sys, string

# Get the file name either from the command line or the user
if len(sys.argv) != 2:
    name = raw_input("Enter the file name: ")
else:
    name = sys.argv[1]

inp = open(name,"r")

# initialise counters to zero; which also creates the variables
words = 0
lines = 0
chars = 0

for line in inp.readlines():
    lines = lines + 1
    # Break into a list of words and count them
    list = string.split(line)
    words = words + len(list)
    # Use the original line length, which includes spaces etc.
    chars = chars + len(line)

print "%s has %d lines, %d words and %d characters" % (name, lines, words, chars)

inp.close()

If you are familiar with the Unix wc command you know that you can pass it a wild-carded filename to get stats for all matching files as well as a grand total. This program only caters for straight filenames. If you want to extend it to cater for wild cards, take a look at the glob module, build a list of names, and then simply iterate over the file list. You'll need temporary counters for each file, then cumulative counters for the grand totals. Or you could use a dictionary instead...
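One possible sketch of that glob-based extension follows (the "*.txt" pattern is purely illustrative; in practice you would take the pattern from the command line):

import glob, string

total_lines = total_words = total_chars = 0

for name in glob.glob("*.txt"):
    # temporary counters for this file
    lines = words = chars = 0
    inp = open(name,"r")
    for line in inp.readlines():
        lines = lines + 1
        words = words + len(string.split(line))
        chars = chars + len(line)
    inp.close()
    print "%s has %d lines, %d words and %d characters" % (name, lines, words, chars)
    # add this file's counts to the cumulative grand totals
    total_lines = total_lines + lines
    total_words = total_words + words
    total_chars = total_chars + chars

print "Total: %d lines, %d words and %d characters" % (total_lines, total_words, total_chars)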

Counting sentences instead of lines

When I started to think about how we could extend this to count sentences and words rather than 'character groups' as above, my initial idea was to first loop through the file extracting the lines into a list, then loop through each line extracting the words into another list, and finally to process each 'word' to remove extraneous characters.

Thinking about it a little further, it becomes evident that if we simply collect the words and punctuation characters, we can analyse the latter to count sentences, clauses etc. (by defining what we consider a sentence/clause in terms of punctuation items). This means we only need to iterate over the file once and then iterate over the punctuation - a much smaller list. Let's try sketching that in pseudo-code:

foreach line in file:
    increment line count
    if line empty:
        increment paragraph count
    split line into character groups
    foreach character group:
        increment group count
        extract punctuation chars into a dictionary - {char:count}
        if no chars left:
            delete group
        else:
            increment word count
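As a rough Python rendering of that pseudo-code (the use of string.punctuation and a get-with-default dictionary update are illustrative choices here, not necessarily the final design):

import string

lines = paras = groups = words = 0
punct_counts = {}   # {char: count}

inp = open("menu.txt","r")   # assume the same sample file as before
for line in inp.readlines():
    lines = lines + 1
    if string.strip(line) == "":
        paras = paras + 1
        continue
    for group in string.split(line):
        groups = groups + 1
        # extract punctuation chars into the dictionary of counts
        stripped = ""
        for ch in group:
            if ch in string.punctuation:
                punct_counts[ch] = punct_counts.get(ch, 0) + 1
            else:
                stripped = stripped + ch
        # if no chars are left the group was pure punctuation, not a word
        if stripped != "":
            words = words + 1
inp.close()

print lines, paras, groups, words
print punct_counts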
