Introduction

The ability to generate random data that conforms to a particular format can be handy. It's not uncommon for us to want to test our programs with a large amount of data, so that we can find bugs and inefficiencies, especially those that arise from dealing with large data sets; this kind of testing is often called stress testing. Generating large amounts of random data is a tedious task for people, but a task that we can certainly imagine that a computer would be good at. (We talked a lot in lecture about how we should trust our computers to do boring, repetitive work, so that we can concentrate on the truly interesting or difficult stuff; that's why automating tests is so useful, for example.) It's not hard to see that a program that's capable of generating data in a given format would be very useful in these kinds of circumstances.

For this project, you will write a program that randomly generates data in a form described by a grammar. (Grammars, in addition to being a nice way to model this problem, are recurrent in the study of computer science; you're likely to see them again in your studies — more than once.) You will also gain practice implementing a recursive algorithm in Java, a vital skill that you will find to be useful in many ways as you continue your study of computing.

Grammars

A grammar is a collection of substitution rules that describe a set of sentences. Each sentence is a sequence of terminals. Different kinds of sentence fragments are represented by variables, with a rule for each variable specifying how it can be replaced by one of a set of possible sequences of variables and terminals. One of the variables is designated as the start variable, which means that it represents an entire sentence.

An example of a grammar follows. The start variable is A. The variables are A and B, while the terminals are 0, 1, and #.

  • A → 0A1A | B
  • B → #

This grammar states that the variable A can be replaced either with the sequence 0A1A or the variable B, while the variable B can only be replaced with #.

From a conceptual point of view, a grammar can be used to generate strings of terminals in the following manner. (I should point out that this isn't precisely how your program will generate its strings, but your program will do something that has an equivalent effect.)

  • Begin with the start variable.
  • So long as there are still variables that have not been substituted, pick a variable and a rule with that variable on the left-hand side. Replace the variable with the right-hand side of the rule that you chose.

A sequence of substitutions leading from the start variable to a string of terminals is called a derivation. When the leftmost variable is always replaced at each step, the derivation is called a leftmost derivation. The string 00#1#1# can be generated by the grammar above. The following leftmost derivation — which begins with the start variable, with one substitution made at each step — proves that it can be done.

A ⇒ 0A1A ⇒ 00A1A1A ⇒ 00B1A1A ⇒ 00#1A1A ⇒ 00#1B1A ⇒ 00#1#1A ⇒ 00#1#1B ⇒ 00#1#1#

Since 00#1#1# can be generated by the grammar, we would say that the string 00#1#1# is in the language of the grammar. In other words, the language of a grammar is the set of all strings that can be generated from it. (Many grammars, including this one, have an infinite number of strings in their languages. This grammar generates an infinite number of strings since the rule A → 0A1A can be used an arbitrary number of times in a derivation.)

The concept of a grammar will be central to our random sentence generator. A grammar will describe a set of sentences (which may indeed be infinite). Each sentence you generate will be a sequence of words, with the words being the terminals in the grammar. The variables in the grammar will describe sentence fragments, with the start variable describing an entire sentence.

The program

You will write a Java program that takes a grammar file as input and writes a specified number of randomly-generated sentences, one per line, into an output file. The name of the input file, the number of sentences, the name of the start variable, and the name of the output file should be passed as command-line arguments to the program. Your main() method should be placed into a class called Generator, so we will easily be able to figure out how to run it. An example of how your program could be executed from the command line is:

java Generator grammar.grm 10 start sentences.out

This command specifies that the grammar file is a file called grammar.grm. Ten sentences should be generated using the start variable start and written into an output file called sentences.out. Your program is required to take these four command-line arguments in this order (though, of course, their values might be different). If more or fewer than four command-line arguments are provided to your program, it should print an error message and terminate.

The format of the grammar file

One of the inputs to your program will be a grammar file containing one or more rules of the following form.

  • Each rule starts with a left curly brace ("{") on its own line and ends with a right curly brace ("}") on its own line.
  • After the opening brace, the first line of the rule is its left-hand side. Remember that the left-hand side of a rule is a variable. Its name is delimited by brackets ("[" and "]"), which you should drop before storing the variable's name in memory. (The substring() method in the String class will be handy here.)
  • Subsequent lines of the rule are alternative right-hand sides (i.e., different ways of rewriting the left-hand side). Each right-hand side consists of a sequence of variables (whose names are enclosed in brackets) and terminals in any combination, separated by whitespace. The sequence ends with a semicolon, which will always be preceded by at least one space.
  • There may be lines of text outside the braces. These lines are intended to be comments in the grammar and should be ignored by your program.

You may assume that grammar files will always be in precisely this form. You are not required to check for errors in the grammar file.

Here is an example grammar file that generates random Facile programs. Since our program generates sentences that each appear on one line, the word NL is used in place of the newline character that separates lines in a Facile program; you can use find-and-replace in a text editor to solve this program and make it possible to execute the generated Facile programs using your interpreter from Project #3. (The generated programs will not have syntax errors in them, but they may have run-time errors or other problems, such as infinite loops or division by zero.)

Do not make assumptions about variable names because of what you see in the example grammar file. You may assume that the names of variables will never have spaces or tabs in them, but you otherwise may not make assumptions about what the names of the variables will be, how many rules there will be, how many right-hand sides each rule will have, and so on. In general, any grammar file that meets the requirements above should be parsed successfully by your program.

Storing the grammar as a set of objects

From the description of the grammar file, we can come to the following conclusions:

  • The grammar file contains a collection of rules.
  • Each of the rules has a variable and a collection of right-hand sides.
  • A right-hand side is a sequence of terminals and variables.

These facts lead directly to an idea of how to design the set of objects that will be used to store a grammar in memory. I imagine the following classes comprising your design of a grammar.

  • Terminal. Contains only the word that makes up one terminal.
  • Variable. Contains only the name of the variable.
  • RightHandSide. Contains a sequence of Terminals and Variables. "Sequence" implies that the order is important, so I suggest using an ArrayList to store them. (Since you'll want a sequence that can contain Terminals or Variables, but not other kinds of objects, it might make sense for Terminal and Variable to implement an interface called Symbol, which won't need methods, but will "mark" Terminal and Variable as being similar, for the purposes of grouping them together into an ArrayList. If Terminal and Variable each implement Symbol, but nothing else does, an ArrayList will be able to contain Terminals or Variables, but nothing else.)
  • Rule. Contains the name of a variable and a set of RightHandSides. Since you'll need to select a RightHandSide randomly by generating a random number, it makes sense to store these RightHandSides in an ArrayList, though the order of them turns out to be irrelevant.
  • Grammar. Contains the name of the start variable and a collection of Rules. You will often need to search this collection, looking for the Rule for a particular variable. You may use a flat data structure, such as an ArrayList or a linked list, to store the Rules, then do a linear search if you'd like. However, there are some better approaches you could employ, which are discussed in the "Additional challenges" section later in the write-up. (You'll learn more about these better approaches in ICS 23 / CSE 23.)

I'd like you to use this, or a similar, object-oriented approach for storing the grammar in memory, as it will lead to a clean, recursive algorithm for generating sentences. This algorithm is described in the next section.

Generating random sentences from a grammar

Once you've stored your grammar as the set of objects described in the previous section, it is possible to implement a relatively straightforward recursive algorithm to generate random sentences from it. The algorithm revolves around the idea of generating sentence fragments, then putting the fragments together into a complete sentence.

The first step in the implementation is to establish the fact that all of the objects that make up a grammar — Grammars, Rules, RightHandSides, Variables, and Terminals — can have sentence fragments (or complete sentences) generated from them. This concept can be wrapped up in an interface called Generable, which might look something like this:

public interface Generable { public String generate(Grammar g, Random r); }

Grammar, Rule, RightHandSide, Variable, and Terminal should all implement this interface, though, of course, the actual implementation of the generate() method will differ from one class to another. It's necessary for generate() to take a Grammar as a parameter, because, during the process of generating a sentence, it's sometimes necessary to look up the rule in the grammar that corresponds to a particular variable.

Here is a sketch of the sentence generation algorithm:

  • To generate a sentence from a Grammar, call the generate() method on the Grammar object. The Grammar object will then look up the Rule corresponding to the start variable and call generate() on it, passing itself (this) as a parameter.
  • The generate() method for Rule chooses one of the RightHandSides at random, using the Random object passed to it as a parameter, and calls generate() on it.
  • The generate() method for RightHandSide iterates through its terminals and variables from left to right, calling generate() on each, then putting the results together into a sentence fragment.
  • The generate() method for a Variable asks the Grammar for the Rule corresponding to that Variable, then calls generate() on it, returning the result as a sentence fragment.
  • The generate() method for a Terminal simply returns the value of the terminal as a sentence fragment.

The amazing thing about this recursive strategy is that, with relatively little code, you'll be able to ask a grammar to generate its own random sentences.

A caveat about your Symbol interface

The Symbol interface, which should be implemented by both Terminal and Variable as a way to "mark" that both Terminals and Variables are symbols, should also specify that all symbols are generable; in other words, it should be possible to generate a sentence fragment from any kind of symbol, be it a terminal or a variable.

To do this requires one additional piece of syntax that we haven't learned. It's possible for one interface to extend another. In general, if an interface J extends the interface I, J contains all of the methods of I plus the methods that are listed in J. In our case, the Symbol interface does not introduce any new methods, but it should include the generate() method from the Generable interface. We write this in Java as follows:

public interface Symbol extends Generable { }

Because you'll be implementing the Symbol interface in Variable and Terminal, both Variable and Terminal will be required to have a generate() method; it will not be necessary for you to explicitly state that Variable or Terminal implements Generable, because that's implied by the fact that they implement Symbol.

Random number generation

How does random number generation work?

Since the program must generate sentences randomly, it is necessary for us to briefly discuss how random number generators work. Computers cannot generate sequences of genuinely random numbers. Instead, they generate sequence of numbers that satisfy statistical tests of randomness; these numbers are often called pseudo-random numbers. Provided that the algorithm used to generate the sequence is well-chosen, there is little practical difference between random numbers and pseudo-random ones.

A random number generator is an object that generates a sequence of pseudo-random numbers. Here's how it works:

  • Each number in the sequence is determined by evaluating a mathematical function on the previous number in the sequence. So, each time you say "Give me the next pseudo-random number," a calculation is performed on the one you got back last time, with the result returned to you and a copy of it stored so it can be used later to determine the next number in the sequence.
  • In order to start the process, some number needs to be picked that will be used to calculate the opening number in the sequence. This number is called the seed.

The selection of a seed is important. The same mathematical function is applied every time you ask for a random number. If you always picked the same number as the seed, you'd always get the same sequence of pseudo-random numbers. (This is handy when testing some programs that behave randomly, but it's not so good when you want truly random behavior.) So a randomly-selected seed is also necessary. Of course, computers cannot select numbers at random, but a common way to supply a "randomly-selected" seed is to take the current value in the system's clock and use it as the seed. (This isn't truly random, but will be essentially impossible to control precisely, since the system clock is kept in terms of milliseconds.)

In Java, the Random class in the java.util package provides an easy-to-use random number generator that you can use in your program. First, you need to create a Random object, which represents a sequence of pseudo-random numbers. When you construct the object, it initializes the seed using the system clock. Once you've created the Random object, you can ask it for the next pseudo-random number by calling methods such as nextInt(), nextLong(), nextBoolean(), nextDouble(), and so on. An example of using the nextInt() method follows.

Random r = new Random(); // this should only be done once, with the
// resulting Random object being reused throughout
// the execution of the program

int i = r.nextInt(100); // nextInt(100) returns a pseudo-random number in
// the range 0..99 inclusive

One important thing to remember about random number generation as you work on this project: a single Random object represents, in essence, a sequence of random numbers. It's important to create one Random object and use it throughout the program's execution. A good technique for solving this problem is to create your Random object outside of your Grammar and pass it as a parameter in each call you make to generate(). (There are other solutions to this problem, too, but this one makes it possible to test your program even though it's supposed to behave randomly. I'll explain this further in the "Testing" section later in the write-up.)

Why infinite recursion is expected behavior (in some cases)

Grammars are permitted to be recursive. For example, the following grammar is legal:

A → Ax

If you write this grammar in our input format and pass it to your program as input, your program will recurse infinitely and terminate with a StackOverflowError — which is what happens when Java runs out of space to store the "call stack" that keeps track of information about all the methods that are currently executing.

The reason this program will cause your program to crash is simply that there's no way out of the recursion:

  • Suppose A is the start variable.
  • Grammar.generate() will be called.
  • The rule corresponding to A (the start variable) will be looked up.
  • Rule.generate() will be called on that rule.
  • Variable.generate() will be called on the first symbol on the rule's right-hand side, which is the Variable A.
  • The rule corresponding to A will be looked up in the Grammar.
  • Rule.generate() will be called on that rule.
  • Variable.generate() will be called on the first symbol on the rule's right-hand side, which is the Variable A.
  • The rule corresponding to A will be looked up in the Grammar.
  • ... and so on ...

We won't test your program with grammars like this, and it's expected — and, in fact, very difficult to avoid! — that your program to crash when given such grammars. Another way to think about a grammar like this is that the only sentences in its language are infinitely long, so it's not surprising that a program to generate such sentences would recurse infinitely.

Note that not all recursive grammars have this problem:

A → xA | x

This grammar describes an infinite language of sentences, each of which is a sequence of one or more x's. If written in our input format and passed to your program, this grammar should cause your program to generate sentences with various numbers of x's. (Roughly half of them will have one x, roughly one-quarter of them will have two x's, roughly one-eighth of them will have three x's, and so on. Stop and think for a minute about why, on average, the sentences will be distributed this way.)

Testing

The problem with randomness, revisited

In this project, we're confronted with the problem of testing a program that has intentionally random behavior. The randomness that makes the program more useful when it's finished complicates the matter of testing, because it's difficult to know whether the difference between what you expect and what you get is caused by a program bug or (intentionally) random fluctuations in the output.

There is a solution to this problem, which we'll explore in the testing phase of this project. The solution is to set up something called a "mock object." A mock object is one that takes the place of an actual object that does some task that may behave unpredictably, or may require using resources external to the program (e.g., connect to a web site, load information from a database) that will cloud the results of our unit tests with possible failures that are unrelated to the issues we're testing for.

In this case, we'll use a "mock" random number generator called MockRandom. It will behave just like the Random object from Java, in that it will have a nextInt() method. It will also extend Random, so that a MockRandom can be used in place of a Random anywhere we'd like. The difference is that our MockRandom, rather than giving back a sequence of random numbers, will actually give back exactly the sequence of numbers I ask it to. This is useful in a test of our code here, because then we'll know exactly what to expect as output!

Here is a link to the code for MockRandom, which I'm providing for your use. Check out the comment in this file for an example of how it works.

  • MockRandom.java

What you'll need to test

Write a JUnit test of the generate() methods in your various grammar classes. Because so many of the objects are interrelated — Grammars contain Rules, Rules contain RightHandSides, RightHandSides contain Symbols, etc. — it's best to just write one JUnit test case class that tests sentence-generating behavior in total.

Note that you're not required to test the parsing of the input file or writing the output file. It's just necessary to write tests that set up a Grammar (and its various sub-objects) and call generate(). Pass a MockRandom object to generate() in place of an actual Random object, deliberately choosing your sequence of "random" numbers so that you'll know what the result of generating a sentence should be.

Try to think of some interesting grammars to test with; don't just use one and generate two sentences from it and leave it at that. What are some interesting cases you'd need to deal with? Different numbers of right-hand sides in a rule? Recursive rules (but not infinitely recursive rules)? Think of what you might try that will uncover a bug. It's fine to hard-code your grammars directly into your test methods. So, for example, you might write something like this. (You may need to write something slightly different, depending on the design and the names of your various methods.)

Grammar g = new Grammar();
Rule r1 = new Rule("X");
RightHandSide rhs1 = new RightHandSide();
rhs1.addSymbol(new Terminal("Alex"));
rhs1.addSymbol(new Terminal("is"));
rhs1.addSymbol(new Variable("Y"));
r1.addRightHandSide(rhs1);
// ...

Ultimately, you'll want each of your test methods to create a grammar, create a MockRandom with your chosen sequence of "random" numbers, call the generate() method, and assert that the resulting sentence is what you expected it to be.

Academic Honesty!
It is not our intention to break the school's academic policy. Posted solutions are meant to be used as a reference and should not be submitted as is. We are not held liable for any misuse of the solutions. Please see the frequently asked questions page for further questions and inquiries.
Kindly complete the form. Please provide a valid email address and we will get back to you within 24 hours. Payment is through PayPal, Buy me a Coffee or Cryptocurrency. We are a nonprofit organization however we need funds to keep this organization operating and to be able to complete our research and development projects.