Sunday, January 31, 2021

Personalizing Hello World -- Char Arrays, and Giving the User a Menu

[TOC]

Continuing with the idea of greeting to further extend our beachhead, let's say we want the computer to give the user a list of people to greet, and let the user choose who gets greeted from that.

Hold on to your hat, this is a significantly longer and more involved program.


/* Extending the Hello World! greeting beachhead --
** Let the user choose from a list whom the computer should greet.
** This instance the work of Joel Rees.
** Copyright 2021 Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational use.
*/


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>


#define MENU_CT 10
#define MENU_ITEM_LN 12


char menu[ MENU_CT ][ MENU_ITEM_LN + 1 ] =
{
  "Johnny",
  "Ginnie",
  "Marion",
  "Deborah Ann",
  "Howard",
  "Joe",
  "Robin",
  "Dawn",
  "Cornelia Maxi", /* <= Look closely at this! */
  "Nina"
};


void my_puts( char string[] ) /* What's different this time? */
{
  int i;

  for ( i = 0; string[ i ] != '\0'; ++i )
  {
    putchar( string[ i ] );
  }
  putchar( '\n' );
}


/* Convert a number from zero to nine to a digit character. */
int textdigit( int n )  
{
  return n + '0';  /* A trick of ASCII encoding! */
}


/* Convert a number to a text digit and put it on the output device. */
void putdigit( int n )  
{
  putchar( textdigit( n ) );
}


int main( int argc, char * argv[] )
{
  int i;
  int ch;

  my_puts( "From among" );
  for ( i = 0; i < MENU_CT; ++i )
  {
    putchar( '\t' );  putdigit ( i );  putchar( ':' );
    putchar( ' ' );   my_puts( menu[i] );
  }

  my_puts( "Whom should I greet?" );
  ch = getchar();
  while ( !isdigit( ch ) )
  {
    putchar( ch );  putchar( '?' );
    fputs( "Please enter a number from 0 to ", stdout );  putdigit( MENU_CT - 1 ); 
    my_puts( ":" );  /* <= Why do I do it this way? */
    ch = getchar();
  }

  fputs( "Oh-kay, ", stdout );  putchar( ch );  my_puts( "." );
  putchar( '\n' );  putchar( '\n' );  putchar( '\n' );  putchar( '\n' );
  fputs( "Hal-looooooooooooo ", stdout );
  my_puts( menu[ ch - '0' ] );
  return EXIT_SUCCESS;
}

Copy/paste that into your favorite text editor window, or at least one you're comfortable with, and keep it open where you can reference it, and let's work through it.

This program references the ctype library. This library allows you to check characters in the ASCII range, to determine such things as whether they are digits, punctuation, space, etc. It is where we get isdigit(), which we use to check the menu selection, so we #include the header.

#define gives you one way to define constants. For many pre-ANSI compilers, #define constants are the only constants. We'll return to #define later to discuss the differences between macros and constants, but, for now, that's what I use it for here, defining the constant count of menu items, MENU_CT, and the constant maximum number of characters in each, MENU_ITEM_LN.

menu[][] is a two-dimensional array of characters, whose size is defined by the above #define constants, MENU_CT, and MENU_ITEM_LN. And it is a true two-dimensional array, allocating MENU_CT times (MENU_ITEM_LN + 1) bytes of memory space.

I just had to bring our my_puts() function in for sentimental reasons. Or, maybe, so I could show a different way to declare its string parameter. It will be useful to stop and compare this definition with the last one, before continuing.

You may by now be asking about the difference between 

char * string;

and 

char string[];

You may, you know. It's a good thing to ask about. 

Well, when declaring string as a parameter to a function, there isn't any effective difference.

If we were declaring string as, say, a global variable, there would be an important difference, but let's not distract ourselves with that just yet. We have too much ground to cover first.

Moving on, for the moment, let's just assume that textdigit() and putdigit() do what their names imply and the comments say, the one converting a number to a digit character, and the other putting a number on the output device. I'll explain pretty soon. I promise.

(I think the ASCII trick will work for the digits in EBCDIC, as well. I'll have to test it sometime.)

Skipping forward to the main() function, the following lines declare two integer variables called i and ch:

int i;
int ch;

Maybe we need to go on a long detour, here.

------ Side Note on Integers ------

These are not the ideal integers of mathematics that extend in range both directions to infinity. Variables in computers have limited range. (You could say integer variables provide the basis for implementing certain types of a mathematical concept called a ring, but let's not go there today. I'll get there, too, eventually.) 

On a sixteen-bit CPU, they will (probably) have a range of 

(-2¹⁵ .. 2¹⁵ - 1)

or from -32,768 to 32,767. 

On a modern 32-bit CPU, the range will probably be

(-2³¹ .. 2³¹ - 1)

or from -2,147,483,648 to 2,147,483,647. 

(Yes, I am an American, and I use the comma to group columns in numbers. If you are from a country where they use something different, please make the substitution. You can let me know about it in the comments. And, by the way, there are ways to deal with that in standard C libraries. Sort-of.)

On a modern 64-bit CPU, int variables may be 32-bit integers or they may be 64-bit integers, depending on how the compiler architect interprets the CPU resources and whether the sales managers insist on coddling past programmers who hard-wired 32-bit integers into their programs. Or (more likely) depending on compiler switches.

If int is a 64-bit integer, i and ch will be able to take the range

 (-2⁶³ .. 2⁶³ - 1)

or from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Nice big numbers. Pretty close to minus and plus infinity, from a practical point of view.

Now, we're not using even the full range of 8-bit integers, so we could have declared them as short int, or even as char in this program. But, why?

Oh. Wait. Before that, why, you ask, is ch, which sounds like it's going to be a character, declared as an int?

Excellent question. I'll tell you about EOF later, but, for now, I'll just say it's a convention, and it's a good programming habit to make sure your integer variables will always have enough space to hold their values. Remember, char is an integer type, and a sub-range of int, usually a proper sub-range.

Are you interested in the range that char can take on, since I insist it's an integer type? For the usual size of char,

  • signed char: (-2⁷ .. 2⁷ - 1), or -128 to 127. 
  • unsigned char: (0 .. 2⁸ - 1), or 0 to 255.

------ End of Side Note on Integers ------ 

Back to the program. You've seen the for loop before, in my_puts(), but I haven't explained it. 

Hmm. Before I explain the for loop, I should explain the while loop.

Loops are conditional constructs, much like the if selection construct. But, not only do they branch around code, they repeatedly run through code. Of course they repeat. That's why they're called loops.

It's a common misconception, but, as I said previously, conditionals are not functions. In C, they require parentheses for the condition expression, but what is inside is a set of conditions, rather than parameters. 

Also, the if, while, for, and do ... while conditionals never have return values in C. 

And, as I have mentioned, they don't have to have curly brace-enclosed blocks if they only apply to a single statement. But it is usually wiser and less confusing to give them explicit blocks anyway. You often find that you actually wanted more than one statement under the conditional part.

I'm dancing around what a loop is because I don't want to show you the accursed goto. And I don't want to do more hand-compiled assembly language. So, let's look at a theoretical example loop, instead:

start( music );
while ( music_is_playing( party ) )
    dance();

This is going to invite more confusion, I just know it. 

The dancing doesn't stop immediately when the music stops. The loop checks that the music is playing, and then the program dances for a bit. Then it checks again, and then it dances some more. That's the way software works. (This is very important to remember. Many expensive commercial projects have met disaster because a programmer forgot that conditionals are not constantly monitored.)

Let's look at another example:

fill( plate );
while ( food_remains( plate ) )
{
   eat_a_bite( rice );
   eat_a_bite( sashimi );
   eat_a_bite( pickled_ginger );
   eat_a_leaf( shiso );
   eat_a_bite( nattō );
   eat_a_bite( pickled_radish );
}

And the way this is constructed, once we take that first bite of rice in the loop, we will continue on through the sashimi, all the way through the bite of pickled radish, before checking again whether there is food on the plate. (There is a way to break the loop between bites. And there is concurrent execution, which .... Again, later.)

while loops test their condition before entry, so the condition must be prepared -- primed. That's what fill( plate ) and start( music ) do above.

The for loop primes itself.

There is a do ... while () loop where you jump in before testing, but it turns out to be not very useful. I'll explain why later.

We need something concrete to look at before we fall asleep. Computers are good at counting, we hear. Let's try a counting loop:

count = 0;
while ( count < 100000 )  count = count + 1; 
/* Only one statement, no need for braces. Note that trailing semicolon. */

Note that, since 100,000 won't fit in 16 bits, the count variable must be declared to be one of the 32-bit types for the CPU and compiler.

I'd (cough) like to show you how that would look in 6809 assembly language, but the 6809 needs lots of extra instructions to do 32-bit math, and the extra instructions would cloud the issues. So I'll use 68000 assembly language. It looks different, but my comments should clear things up.


                  ; Uses 32 bit .Long instructions.
* count = 0;
 MOVEQ #0,D7      ; Compiler conveniently put count in D7.
* while ( count < 100000 )
_wb_0000
 CMP.L #100000,D7 ; Compare -- subtract 100000 from D7,
                  ; but don't store result.
 BGE _we_0000     ; Branch over increment and loop end
                  ; if D7 is greater than or equal to 100,000.
*   count = count + 1;
 ADD.L #1,D7      ; Add 1 to D7.
 BRA _wb_0000     ; Branch always back to beginning of loop.  
_we_0000
                  ; Code continues here.

(That's fairly well optimized object code. But there is one further optimization to make which would be confusing, so I won't make it. It's also hand-compiled and untested. But it's fairly understandable this way.)

This is a good way to make the computer waste a little time. On the venerable 6809, it would take about a second or so. On the 68000 at mid-1980s speeds, it would take between a fifth and a tenth of a second. On modern CPUs, it would take something in the range of a millisecond, if that. Just a little time.

And the count stops at 100,000.

------ Side Note on Incrementing ------

Adding one to a count happens so much in programs that C has a nice shorthand for it:

++count;

is the same as 

count = count + 1;

Incrementing by other than 1 has a shorthand, as well, and sometimes you want to increment a value after you use it instead of before. We'll look at that later, too.

------ End of Side Note on Incrementing ------

Let's remake that counting loop as a for loop:

for ( count = 0; count < 100000; ++count ) /* Loop body looks empty. */ ;

Notice that there is nothing between the end of the condition expression and the semicolon except for a comment for humans to read and notice that the space is intentionally left blank.  

That's an example of an empty loop. 

In a sense, it isn't really completely empty, because the loop statement itself contains the counting, in addition to the testing. But, again, the only effect you notice is a small bit of time wasted, and count ends at 100000. (Some compilers will helpfully optimize such loops completely out of the program and just set count to 100,000 -- unless you tell them not to because you know you want to waste the computer's time.)

Empty loops have a significant disadvantage. They are easy to misread. If you have some reason to use an empty loop, use a comment to make it clear, and I recommend giving it a full empty block, just to make it really clear:

for ( count = 0; count < 100000; ++count )  {  /* Empty loop! */  }

This for loop is exactly equivalent to the while loop above, plus the statement priming the count. And the code output will be the same, as well.

One final note of clarification, if we want a counting loop that prints the count out, the for version of the loop might look something like this:

for ( count = 0; count < 100; ++count )
   printf( "%d\n", count );

And this for loop is exactly equivalent to the following primed while loop:

count = 0;
while ( count < 100 )
{
   printf( "%d\n", count );
   ++count;
}

We'll be looking more at loops (and printf()) later, but that should be enough to continue with reading my_puts().

It declares a char array, string, as its only parameter. 

In early C, we definitely did not want to copy whole arrays. It took a lot of precious processor time and memory space. So the authors of C decided that an array parameter would be treated the same as a pointer to its first element.

Arrays still usually aren't something you want to make lots of copies of, so this design optimization might not be a bad thing, even in our current world where RAM and processor cycles are cheap. But it does invite confusion, since both a pointer and an array can be used with the indexing operator. Specifically, given

char ch_array [ 10 ] = "A string";
char * ch_ptr = "B string";

in 6809 assembler we would see something like this:

ch_array FCC "A string"
  FCB 0
_s00019 FCC "B string"
  FCB 0
ch_ptr FDB _s00019

so you see that "B string" is stored in an array with one of those odd names that won't be visible in the C source -- thus, anonymous, and ch_ptr is a pointer that is initialized to point to the anonymous string. On the other hand, "A string" is stored directly under the name ch_array, which is very much visible in the C source.

However, unless we overwrite ch_ptr with some other pointer,

ch_array[ 0 ] is 'A',
ch_ptr[ 0 ] is 'B' and
ch_array[ 3 ] and ch_ptr[ 3 ] are both (different) 't's.

This leads to headaches if you aren't careful, but it also means that my_puts() is quite readable. Take the char array that gets passed in and count up through it, looking at each char and putting it out on the output device as we count -- until we reach a 0. And the way the test is arranged, it will see that the char is 0 and stop before outputting it.

I'm going to present both 6809 compiler output and 68000 compiler output. Both are very much not optimized and not tested, but you can read my comments and see how the thing fits together.

6809 first:


* void my_puts( char * string )
_my_puts_6809
* {
*   int i;
  LEAU -2,U     ; Allocate i.
* 
*   for ( i = 0; string[ i ] != '\0'; ++i )
  LDD #0        ; Initialize i.
  STD ,U
_my_puts_loop_beginning
                ; Split stack, no return PC to avoid.
  LDX 2,U       ; Get string pointer.
  LDD ,U        ; Get i.
  LDB D,X       ; Get string[ i ] (destroying i!)
                ; LDB will see 0 for us, no CMP necessary,
                ; but let's refrain from confusing optimizations.
  CMPB #0       ; Is this char 0?
  BEQ _my_puts_loop_end
*   {
*     putchar( string[ i ] );
                ; Even simple optimization would not repeat this.
  LDX 2,U       ; Get string pointer.
  LDD ,U        ; Get i.
  LDB D,X       ; Get string[ i ] (destroying i!)
  CLRA          ; It was unsigned, extend it to 16 bits. 
  PSHU D        ; Push the parameter for putchar().
  JSR _putchar  ; Call putchar().
*   }
  LDD ,U        ; Increment i.
  ADDD #1
  STD ,U
                ; Go back for more.
  BRA _my_puts_loop_beginning
_my_puts_loop_end
*   putchar( '\n' );
  LDD #_C_NEWLINE
  PSHU D
  JSR _putchar
* }
  RTS

Now 68000:


* void my_puts( char * string )
_my_puts_68000
                    ; The compiler has been told to use 32-bit int .
* {
*   int i;
                    ; The compiler will conveniently put i in D7.
* 
*   for ( i = 0; string[ i ] != '\0'; ++i )
  MOVEQ #0,D7       ; Initialize i.
_my_puts_loop_beginning
                    ; Split stack, no return PC to avoid.
  MOVE.L (A6),A0    ; Get string pointer.
  MOVE.B (D7,A0),D0 ; Get string[ i ] 
                    ; MOVE.B will see 0 for us, no CMP necessary,
                    ; but let's refrain from confusing optimizations.
  CMP.B #0,D0       ; Is this char 0?
  BEQ _my_puts_loop_end
*   {
*     putchar( string[ i ] );
  MOVEQ #0,D0       ; Avoid need to extend char to int .
  MOVE.B (D7,A0),D0 ; Get string[ i ] 
  MOVE.L D0,-(A6)   ; Push the parameter for putchar().
  JSR _putchar      ; Call putchar().
*   }
  ADD.L #1,D7       ; Increment i.
                    ; Go back for more.
  BRA _my_puts_loop_beginning
_my_puts_loop_end
*   putchar( '\n' );
  MOVEQ #_C_NEWLINE,D0
  MOVE.L D0,-(A6)   ; Push the parameter for putchar().
  JSR _putchar
* }
  RTS

The reason I give some (hand-)compiled output is to help motivate the idea that C programs are (effectively) performed step-by-step in the order that the source code dictates. This includes the test conditions in conditional constructs. It's part of the rules of the game for C, even though other languages do something different. 

Those languages have rules, as well. Without the promise of order that the rules give, programs would not function.

(Optimization can break this promise, however. More later.)

To understand textdigit(), we need to look at an ASCII chart, or at least the part where the numbers are:

Code (decimal) Character
47 /
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
58 :

Characters are represented by codes inside the computer, and the codes are numbers -- integers, to be specific. You can add numbers to these integers, and the result may be a different character. (Or it may not fall within the table, depending on the number, but we won't worry about that.)

So, if we start with a number from 0 to 9 in the parameter n and we add the code for the character '0' to it, we get a new code for the character version of the number that was in n.

The addition may be more clear if we show the codes in hexadecimal:

Code (decimal) Code (hexadecimal) Character
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :

And then we return the resulting character. 

I'd show the assembly language for this, but it's dead simple. On the 6809, convention will have the compiler load the return value in register D before executing the return from subroutine. On the 68000, it will probably be loaded into D0. Other CPUs will have similar conventions for where to put the return value. There may be better ways, but this is the usual way now.

The putdigit() routine is essentially just semantic sugar. I hope it makes the program easier to understand. It just uses textdigit() to convert the number to a character and uses putchar() to put it on the output device.

That brings us back to main().

The first loop in main() is a for loop, and it formats and prints out the menu array, along with using putdigit() to put out numbers for selecting a name from the menu array.

By keeping the number of menu items to ten or less, we can use our simplified output routines. We'll show how to deal with more later.

The second loop in main() is a while loop, and its purpose is to read characters from the input, and complain and discard them if they are not digits, until it gets a digit.

My odd choice of which routines to use where has something to do with giving you a reason to read through the source in my_puts(), and also something to do with output buffering. (my_puts() ends its output with a newline, and when the output is going to a terminal, which is usually line-buffered, that newline gets the output buffer flushed. Otherwise, we would have no guarantee that the characters we are putting out make it to the screen in time to tell the user what we want to tell him or her. This is something else we will look at later.)

I think the rest of main() is understandable at this point.

Hopefully, you've seen what the bug I planted in the menu does by now. It has to do with allocating enough room for trailing NULs for strings. I'll leave the fix as an exercise, for now.

Here's the screenshot:
 

How long it will take to get the next step up, I don't know. I keep taking on too many projects.

In the meantime, play with what you've learned so far. Fix the bug, of course. Experiment and explore.

The next one is ready sooner than I expected. I decided to show you how to get an overview of the ASCII characters.

[TOC]

Personalizing Hello World -- A Greet Command

[TOC]

Continuing with another version of Hello World! to extend our beachhead, let's say we want the Hello! program to be less general. Specifically, instead of having the computer greet the world, let's write a program that allows the user to tell the computer whom to greet: 


/* A greet command beachhead program
** as a light introduction to command-line parameters.
** This instance the work of Joel Rees,
** Whatever is innovative is copyright 2021, Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>


int main( int argument_count, char *argument_variables[] )
{
  char * whom = "World";

  if ( argument_count > 1 )
  {
    whom = argument_variables[ 1 ];  /* Where did the initial value go? */
  }
  fputs( "Hello ", stdout ); /* Still avoiding printf(). */
  fputs( whom, stdout );
  putchar( '!' );
  putchar( '\n' );
  return EXIT_SUCCESS;
}

Comparing this to the first exercise, we see that we are actually using those command-line parameters. I'd like to have postponed that a bit further because they are a rather confusing beast. And some people who want to follow along want to do so on platforms that don't have command-line parameters under the usual operating system interface. (Such as the original/Classic Mac without MPW, and ROM-based game machine/PCs like the Tandy/TRS-80 Color Computer without Microware's OS-9, etc.) 

But I have reasons. 

For now, just kind-of assume that there is more to them than meets the eye. 

(If your platform won't allow you to follow along, read the explanation, examine the screenshot carefully, and at least consider downloading Cygwin or installing a libre *nix OS so you can actually try these. For these purposes, an old machine sleeping somewhere might work well with NetBSD or a lightweight Linux OS.)

Again, if you are using a K&R (pre-ANSI) compiler like Microware's compiler for OS-9, move the function parameters for main down below the declaration line. Also, shorten the parameter names, since those compilers typically get confused over long names that start too much the same -- which is the real reason argument_count is usually written argc and argument_variables is usually written argv:

int main( argc, argv )
int argc;
char *argv[];
{
/* etc. */
}

And I'm throwing another fastball at you. There is a conditional in this program. Conditionals are another thing you should assume I'm not telling the whole story about here.

But be aware that, while puts(), fputs(), and putchar() are function calls, 

if ( condition )  {  }

is not. Nor is it a function declaration, such as   

void my_puts( char * string )
{
  ...
}

which you might recall from the first exercise.  

It's a test of a condition. If the condition between the parentheses evaluates to true, the stuff between the braces gets done. If not, the stuff between the braces gets jumped over. (The braces aren't required if there is only one statement to be jumped over, but they are advised for a number of reasons. And there is an optional else clause. And the values of true and false need to be discussed. More detail later.)

Note also that arrays are indexed from 0 up to the array size minus 1. Thus, the first element of an array is element array[ 0 ]. And the last is array[ ARRAY_SIZE - 1 ], for a total of ARRAY_SIZE elements.

If you were compiling to M6809 object code and had the compiler output the assembler source, you would see something like the following -- except that I have added explanation. 

(I'm not asking you to learn 6809 assembly language, just giving it as something to hang my comments on.)

I've mixed in the original C source on the comment lines that start with an asterisk. On code lines, everything following a semicolon is my explanatory comments:


* int main( int argument_count, char *argument_variables[] )
s0000 FCC "World"  ; Allocate the string.
 FCB 0             ; NUL terminate it.
s0001 FCC "Hello " ; See above.
 FCB 0

_C_main
* {
*   char * whom = "World";
 LEAU -2,U   ; Allocate the variable whom.
 LDX #s0000  ; Load a pointer to the World string and
 STX ,U      ; store it in whom.
*
*   if ( argument_count > 1 )
 LDD 2,U     ; Get argument_count.
 CMPD #1     ; Compare it to 1.
 BLE _C_main_001  ; If less than or equal to 1, branch to _C_main_001
*   {
*     whom = argument_variables[ 1 ];  /* Where did the initial value go? */
             ; This code is executed if argument_count is 2 or more.
 LDY 4,U     ; Get the pointer to the argument_variables array.
 LDX 2,Y     ; Get the second pointer in the argument_variables array.
 STX ,U      ; Store it in whom.
*   }
_C_main_001
*   fputs( "Hello ", stdout ); /* Still avoiding printf(). */
 LDX #_f_stdout ; Get the file pointer and
 PSHU X      ; save it as a parameter.
 LDX #s0001  ; Get a pointer to the Hello string and
 PSHU X      ; save it as a parameter.
 JSR _fputs  ; Call (jump to subroutine) fputs() --
             ; fputs() cleans up U before returning.
*   fputs( whom, stdout );
 LDX #_f_stdout ; See above.
 PSHU X
 LDX 2,U     ; Get the pointer stored in whom (now just above stdout) and
 PSHU X      ; save it as a parameter.
 JSR _fputs
*   putchar( '!' );
 LDD #'!'
 PSHU D
 JSR _putchar   ; putchar also cleans up the stack after itself.
*   putchar( '\n' );
 LDD #_c_newline
 PSHU D
 JSR _putchar
*   return EXIT_SUCCESS;
 LDD #_v_EXIT_SUCCESS  ; Leave the return value in D.
 LEAU 2,U  ; Clean up the stack before returning.
 RTS  ; And return.
* }

(Unless I say otherwise, all my assembly language examples are hand-compiled and untested. But I'm fairly confident this one will work, with appropriately defined libraries using a properly split stack.)

(If you understand stack frames, note that this code uses a split stack and does not need explicit stack frames. The return PC is on the S stack, out of the way. Thus the parameters are immediately above the local variables.) 

Again, don't worry how well you understood all of that.

Just note that the code produced for the if clause tests argument_count and, if it is 1 or less, skips the following block. If it is 2 (or more), the following block is executed, and the char pointer whom gets overwritten by the second command-line parameter.

Don't assume you know all there is to know about conditionals from that short introduction, any more than you know all about the command-line parameters. Compile it and run it and maybe add some code to get a look at the first entry in argument_variables[] if you're interested and can immediately see how. That's good for now.

I guess we'll get a screen shot of compiling and running this.

Details:

rm greet

deletes a previously compiled version of the program.

ls

as before, lists the files in the current directory. I've saved this version in greet.c, so

cc -Wall -o greet greet.c

will compile the program, with full syntax checking. 

./greet

calls it without parameters. (Except for that first one we haven't looked at yet.)

./greet Harry

calls it with one. (Ergo, two.)

./greet Harry Truman

calls it with two (ergo, three). How would you get a look at the second/third one?

You might be interested to see what is in the first actual command-line parameter. Or you might not be interested. I've mentioned that you could get at it. Can you think of a way to do so? If you do, do you recognize what it contains? 

(The first argument actually varies from platform to platform, but it isn't something the user usually consciously specifies, which is why it isn't usually counted as a parameter. I won't spoil the surprise here, but I will explain later.)

And you might also be interested in looking at the assembly language output for the processor you are using. The command-line option for that on gcc is the -S option, which looks like this: 

cc -S greet.c

You can use the -Wall option as well, like this:

cc -S -Wall greet.c

Either way, that will leave the assembly language source output in a file called greet.s , and you can look at it by bringing it into your favorite text editor, or with the more command, etc.

Where does the string that whom gets initialized with go, by the way? 

Nowhere. But we didn't save the pointer to it anywhere, so it just becomes (more-or-less) inaccessible. It's short enough we don't care too much, especially in this program, but it's just cluttering up memory. 

There's a lot to think about here, so let's keep it short. The next one is going to be pretty long and really deep, when I get it put together.

Before the next one is up, or before you go look at it, play with this one a bit more. Again, explore. See what happens if (whatever gets your curiosity up), and then see if you can find a reason why.

 And the next one is ready now, here. We'll give the user a menu to choose whom to greet.

[TOC]

Monday, January 25, 2021

Looking Deeper into Hello World in C -- char Type and Characters and Strings

[TOC]

I regularly see questions about handling characters in C, regularly enough that it may be time to write a tutorial on the subject.

What is this character thing and what is the C?

C is a very widely used programming language. 

Characters, well, they're a bit hard to pin down, but they have something to do with the letters we write words in -- A, B, C, い、ろ、は、イ、ロ、は、色、匂、散、 etc. Without characters, getting information into and out of computers quickly becomes rather difficult.

Let's look at the archetype of introductory programs:


/* The archetypical "Hello, World!" program.
** This instance the work of Joel Rees,
** but it's too trivial to be copyrightable.
*/


#include <stdio.h>
#include <stdlib.h>

int main( int argc, char *argv[] )
{
  puts( "Hello World!"); /* Sure -- puts() works as well as printf(). */
  return EXIT_SUCCESS;
}

Recapping and explaining the C programming language elements:

Between the /* and the */ is a comment for humans. The compiler (mostly) ignores it.

The #include statements are there to tell the compiler to invoke interface definition headers for standard function libraries, standard IO and standard miscellaneous lib(rary). A linking loader will later link the functions in, and we don't need to think too deeply about them just now.

The line "int main( int argc, char * argv[] )" tells the compiler that this is the main() function of the program, that it returns an integer to the operating system (as main() functions traditionally should), and that it knows about the command-line variables that inform the program of the number ("argc") of parameters passed in from the OS and the content ("argv") of the parameters. (We'll look at the command-line parameters soon.)

If you're using a pre-ANSI C compiler, you may need to change the definition of main() as follows:

int main( argc, argv ) /* You may even need to leave out the parameter names. */
int argc;
char ** argv;   /* char * argv[] should also work in most cases. */
{
   /* Same code as above comes here. */
}

For some compilers/operating systems, the system does not provide the parameter count or array. In such cases, just leave the parameters to main() out completely. (And I'll explain that later, too.)

All of which sounds like technobabble to the beginning C student, but it does become meaningful at some point.

For now, main() is where your program starts. (Essentially.) 

------ Side Note on Code Formatting ------

The curly braces ("{}") define a block of code, and the fact that they come immediately after the line that says this is main(), with no punctuation between, tells the compiler that the stuff between is what the program should do.

A note on those curly braces. Some -- well, a lot of -- people think that the opening brace belongs on the line above where I've put it, like this:

int main( int argc, char *argv[] ) {

It's wrong, but that's their preference and their business. The whole world can be wrong sometimes. In this tutorial, I'm putting the open brace down where I and you can see it.

(You'll need to get used to both ways, and some variations of both, if you try to make a living programming. Don't fuss over it. And if it's easier for you to see the other way, when you copy the programs out of the tutorial put them where it's easier for you to see.)

------ End Side Note on Code Formatting ------

So the block of code that defines what this program does consists of two lines. The second of those tells the program to pass back to the operating system a code that tells the OS that the program exited with success. 

The meat of the program is the line that puts the string "Hello World!" on the output device, which is usually a terminal window these days.

If you've seen the Hello World program in C before, you may have seen it done with printf() instead of puts(). I chose puts() here because it's a much simpler function to explain. I mean, I've already explained it.

Now I can focus on the string of characters in this program. Not the string of characters which comprises the source code of the program, but the string which the program, when compiled and run, should output: the five letters 'H', 'e', 'l', 'l', and 'o', the space which follows, the five letters 'W', 'o', 'r', 'l', and 'd', and the punctuation character which follows, the exclamation mark (or the exclamation point, in some parts of the world):

Hello World!

This is a string of characters, as we say. And puts() puts them out on the output device, whatever the output device is. Here's a sample output when run on a Linux OS, in a terminal emulator:


Now this terminal screen is not all that obvious. In the modern world, you would have bells and whistles and dancing assistants, explaining what the picture shows, and the picture would be a video instead of a still shot. So I'll show you the dancing assistant's script:

In the above screenshot on a typical computer running Ubuntu, you can see me 

  1. moving to the directory where the source code is stored:

cd ダウロード/FBcomp/

("cd" stands for "change directory".) 

  2. listing the contents of the directory:

ls

("ls" stands for "list". There is only one file, the source file Hello_World.c, and you see it listed in the line below the command. In MSWindows command line shells, it would be "dir".) 

  3. issuing the compile command:

cc -Wall -o Hello_World Hello_World.c

("cc" stands for "C compiler". 

"-Wall" stands for "Warn all warnings". 

"-o Hello_World" means "name the executable object file for the program 'Hello_World'". 

"Hello_World.c" is the name of the file containing the source code. It occurs to me now that using a different name for the source and object files would have been a little less confusing.) 

  4. and issuing the command to run the program:

./Hello_World

("./" in a *nix shell says look only in the current local directory.) And you can see the output after the last command:

Hello World!

Hmm. I could do a video of this. It's something to think about. But until I have the time, I'll hope you can follow this well enough. 

------ Side Note on Getting a Compiler ------

To actually compile this and run the programs, you'll need a compiler and some system software that supports it. 

*nix (Linux and BSD OSses, et al., and Plan 9):

If you are on a Linux or BSD or similar libre operating system, you'll have a package manager that will help you install the compiler tools (if they are not already installed), and the web site for your OS distribution (vendor) will have pages on how to check if they are installed, and how to install them if they aren't.

*Mac: 

If you're on a Mac, I understand the current official thing is to get Xcode from the App Store. It looks like Apple will push you to learn Swift, which I suppose is okay, but I can't help you with that. Xcode should allow you to compile C programs with clang. Clang is like GNU cc, but, instead of typing the command "cc" like I show above, you type the command "clang". (Clang can also be used on Linux, and gcc can also be used on Mac, but that requires some setup, and is a topic for another day.)

*Microsoft Windows:

Microsoft's Visual Studio will only continue to push you to remain in Microsoft's world, so I don't recommend that. The Hello World for that world is different from what I describe here, and will send you jumping through hoops to open a window to display it in, which is fine for opening a window just to display a string in, but doesn't really help you start understanding what a string is or what is really happening underneath or how to go to the next step.

Microsoft also has a Windows Subsystem for Linux, which allows you to install full Linux distributions in your MSWindows OS (apparently to run under emulation). I have not used it. I can't recommend either for or against it. But layering more layers over reality never helps you learn about reality. Still, if you just want to get your toes wet, it might be the thing for you.

*Cygwin:

Cygwin allows you to install the Gnu C Compiler tools on MSWindows computers, along with certain other software from the libre software world. If you must run MSWindows, I think I recommend Cygwin.

Instructions for downloading and installing Cygwin on MSWindows can be found at 

Get the GNU C compiler and basic libraries using the installer, after you check the checksum to make sure it downloaded safely. If you have gnupg or another way to check the fingerprint, check the fingerprint, too, so you know it came from the right place.

*Android:

There are several apps to install a C compiler and walled runtime environment on Android. I have not used any of them and can't recommend either for or against, but the layers principle applies. And they take up space you may need for other things. (Space is the primary reason I have not used any of them.) But they may be good for getting your feet wet.

There are also partial Linux systems (like NoRoot) that can allow more than just compilers to be installed, but don't allow full access to the phone. (The walls do help keep the phone somewhat safe.) You'll need to search for an app that is compatible with your phone and Android OS, however.

I have heard that recent Android phones can officially (by Google-approved methods) be turned into full Linux computers, but that seems to be more rumor than fact.

*Downloading Linux and BSD OSses:

Instructions for downloading and installing a full Linux or BSD system to replace or dual-boot with your MSWindows OS can be found at their respective sites. As I mentioned above, once you have one of those installed, getting the compiler and other tools is simply a matter of running the package manager and telling it to install the tools. 

(I'm currently running Ubuntu on an older Panasonic Let's Note with 4G of RAM, using about 100G of the internal SSD.)

Some OSses I have a bit of experience with include

There are other distributions of both Linux and BSD OSses, such as Dragonfly BSD, Arch Linux, OpenSUSE, Fedora (Red Hat), CentOS, Mint OS, and so forth.

*Others: 

There are other options similar to the BSD distributions, such as Minix, Plan 9 (Inferno), and Open Solaris. Your web search engine should find relevant information quickly. 

If you are using an older (classic) Macintosh, Apple's Macintosh Programmer's Workbench (MPW) has a good compiler and a fun, if quirky by today's standards, workbench environment. Codewarrior for the Mac was also good, if quirky in other ways.

Radio Shack/Tandy's venerable 8-bit Color Computer had OS-9/6809, for which a compiler was available from Microware. (Not Microsoft, okay?) It was a pre-ANSI compiler. Other ANSI compatible and pre-ANSI compilers were available for all that gear that is now retro gear, and you can often find those compilers to download. I'll discuss pre-ANSI (K&R) C syntax, but I won't try to deal with Small C.

*Trusting an Alternate OS:

If you are wondering how you can trust one of these alternate OSses, I talk about that a little here:

https://defining-computers.blogspot.com/2019/05/analyzing-mechanized-trust.html 

Some of what I say there, I'm not completely sure is universal, but if you have questions, that rant should give you some good pointers to start researching your questions.

------ End Side Note on Getting a Compiler ------

Back to the Hello World program.

It's a simple program. It also invites some misunderstanding, which is the real reason for this post. While you read this, keep the text editor window where you copied and pasted the source open for reference.

I am now going to tell you something that may have you thinking I'm telling you lies. But Kernighan and Ritchie explain it as well, in their book The C Programming Language. I'm just going to try to make it more obvious.

The programming language C does not inherently support strings of text. No real character type in the language proper, no real character string type, either.

Okay, I said it. 

(The support is indirect, through library functions, and is not nearly as complete in the standard libraries as you want to think.) 

In the program, collecting the "Hello World!" between the quotes into a byte array and terminating it with a NUL byte is, uhm, well, it's part of C, but it isn't.

To explain that somewhat carefully, let's do a few alternate versions of the Hello World program:


/* Not so typical "Hello, World!" program.
** This instance the work of Joel Rees,
** Copyright 2021 Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>

char greeting1[ 32 ] = "Sekai yo! Konnichiwa!";

/* This could be done on a single line.
-- I'm doing it this way for effect.
*/
char greeting2[ 16 ] =
{
  'H',
  'o',
  'l',
  'a',
  ' ',
  'M',
  'u',
  'n',
  'd',
  'o',
  '!',
  '\0', /* <= Look! It's a NUL! */
  0 /* <= Look! It's another NUL! */
}; 

int main( int argc, char *argv[] )
{
  puts( "Hello World!"); /* Sure -- puts() works as well as printf(). */
  puts( greeting1 );
  puts( greeting2 );
  return EXIT_SUCCESS;
}

You can change that for pre-ANSI compilers as described above.

Several questions should come to mind. 

One, why do I use Rōmaji instead of kana or Kanji in the Japanese, and where is the leading inverted exclamation mark in the Spanish?

I'll get to that in a bit. Maybe in another post.

The other, why do I put explicit NUL bytes on the end of the Spanish version?

Right? Those were the two most obvious questions, right?

No?

Okay, let's work through your questions. 

The string that I might have named greeting0, "Hello World!", is automatically collected in an array like greeting1 and greeting2, but not given a name that the programmer can use. 

It's anonymous. 

(If the same identical string occurs elsewhere in the same file, modern compilers will recognize that and only store it once -- unless you tell them not to. Older compilers may not search for identical strings, and just store another copy. But the string doesn't have a name that the C program can directly use.)

Other than that, the three greetings strings are all treated exactly the same way. They are collected as arrays of char, and a trailing 0 (NUL) terminator byte is attached.

(cough.)

Well, sort of. If I had declared them without size, ..., oh, hang on. No, we won't go there quite yet.

Oh. The size. I guess we do need to go there.

So, no, not quite exactly the same. The anonymous string is allocated enough bytes for the text and the trailing NUL (and maybe some extra, at the discretion of the compiler). 

The other two are allocated the number of bytes that the source specifies, thirty-two for greeting1 and sixteen for greeting2 (and maybe some extra, at the discretion of the compiler). And, if there is enough room after the text is stored, the rest of the array is filled with zeroes, effectively putting at least one NUL terminator byte at the end. But only if there is enough room. 

Will the compiler complain if you've declared a size too small for the text specified?

It should. Usually. I mean, yes. Maybe. Usually.

Which is why, when you explicitly declare strings, you usually declare strings like this:

char greeting1prime[] = "Sekai yo! Konnichiwa!";

Or like this:

char * greeting3prime = "Bonjour le monde!";

Here, greeting1prime is an array of bytes like greeting1, but the size allocated is enough bytes for the string plus the trailing NUL (plus extra, if the compiler wants to).

On the other hand, greeting3prime is a pointer to a byte, initialized with the address of the anonymously allocated, NUL terminated array "Bonjour le monde!". In other words, in addition to the string in greeting3prime's case, you've declared a variable of type 

char *

which is a byte pointer -- a variable. You can change what it points to. You can even lose what it was pointing to, if you're not careful.

Hmm. 

Before we go any further, let me explain something.

C has no native character type. 

Again, you're doubting my sanity. I know you are. This whole post is a discussion of characters in C, right?

Not yet. Time for a little history. (This is definitely not an aside.)

Back in the 1970s, when Kernighan and Ritchie and some of their coworkers were playing around with BCPL and the early versions of C, we didn't know nearly as much about how to deal with text in computers as we know now. (And there's still a lot we need to learn.)

Even the size of a byte in a computer was not set. Some had 6-bit bytes, some had 8-bit bytes, and a few had 9-bit bytes. Other sizes also existed; look them up if you want.

Nowadays, we can be sort-of comfortable thinking of a byte as 8 bits. But a byte now is (usually) defined as the smallest addressable unit of memory in a computer.

Which is precisely what C defines the char type as. This was something of a mistake. The type should really have been called byte.

You can alias the char type to byte with

typedef char byte;

Why did the conflation occur? Glad you asked. (You did ask, I hope.)

Back then, in the western world, we didn't really know much about eastern languages, so we didn't really consider them. 

Western languages all used (as we thought) less than 100 characters, and even Japanese had the kana, of which there are only around 50 (depending on what's included). 

And computers were beginning to standardize on 8-bit bytes.

We assumed that Kanji were built in some orderly manner from smaller parts, and that those parts would number less than 250, so that they could also be encoded in bytes.

And we really didn't think about a single encoding that would encompass all languages, like Unicode. 256 was just too much a magical number to ignore.

256 is still a magical number, but we now know that even English actually needs more characters than that. (Some members of the computer typesetting industry of the early 1970s knew that we needed more than 256, but they weren't writing operating systems and programming languages.)

And, anyway, a typeface with more than 256 characters was known to need more computing power than an ordinary office could afford. (The 68000 overcame that barrier, but that was several years after the early K&R C had been defined.)

And that's the simplified version of how characters were conflated with bytes -- a bit historically incomplete, but good enough to help us think beyond the names of types.

char in C is an integer type. It can be either signed or unsigned, according to what the C compiler engineers think works best for a particular family of computers.

And that should be enough for an introduction to characters in C. No, wait. One more version of Hello World! --


/* A non-typical "Hello, World!" program.
** This instance the work of Joel Rees.
** Copyright 2021 Joel Matthew Rees. 
** Permission granted to modify, compile, and run
** for personal and educational use.
*/


#include <stdio.h>
#include <stdlib.h>


/* Implementing our very own puts():
*/
void my_puts( char * string )
{
  int i;

  for ( i = 0; string[ i ] != '\0'; ++i )
  {
    putchar( string[ i ] );
  }
  putchar( '\n' );
}

char greeting1[ 32 ] = "Sekai yo! Konnichiwa!";

/* This could be done on a single line.
-- I'm doing it this way for effect.
*/
char greeting2[ 16 ] =
{
  'H',
  'o',
  'l',
  'a',
  ' ',
  'M',
  'u',
  'n',
  'd',
  'o',
  '!',
  '\0', /* We only need one terminator, 
  ** but even this isn't really necessary here
  ** because we specify a large enough size.
  */
}; 

char * greeting3 = "Bonjour le monde!"; 

char greeting4[] = "Hallo Welt!";

int main( int argc, char *argv[] )
{
  my_puts( "Hello World!"); /* Our own my_puts() instead of puts() this time. */
  my_puts( greeting1 );
  my_puts( greeting2 );
  my_puts( greeting3 );
  my_puts( greeting4 );
  return EXIT_SUCCESS;
}

For pre-ANSI C compilers, change main() as described above, and change my_puts() as follows:

void my_puts( string )
char * string;
{
  /* Code that goes here is the same. */
}

A few things I didn't mention before --

Double quotes are for strings. 

(Well, they're for telling C to automatically collect the text into a char array, and to terminate the array with a NUL if the size isn't specified as just big enough for the text without the NUL. Try that yourself and see what happens, by the way. Something like

char fluke[ 15 ] = "This is a test.";

What did it do when you tried to puts( fluke )?)

The single quotes we used in several places are for individual characters, not for strings. 

(In some compilers, you can actually pack multiple characters in between the single quotes, but I'm not going to try to confuse you by telling you that. Okay? You didn't hear me say that. Okay? Good. You don't want to know what that does in relation to byte order, in particular. ;))

So, 'A' is a single capital A. 

You saw that '\0' is shorthand for a NUL byte, eight bits of zero. Well, zero in eight bits. (No matter how wide it is, zero is still zero, in any byte order, thank heavens.) 

And you saw, in our third version, that '\n' is shorthand for a newline character.

Oh, and "void" is (among other things) for telling the compiler that a particular function doesn't have a return value. Also, control returns from a function at the trailing brace if there is no explicit return.

And you might have noticed, in our my_puts(), that you can use a character pointer variable as if it were the name of an array in many cases. (There are other ways to write my_puts(), but we won't go there quite yet, either.)

I have other things to do tonight, so that's it for now.

It's your turn to think of things you can do with this. Explore. Get results you don't expect. Get a copy of Kernighan & Ritchie's The C Programming Language if you don't already have one, or look it up on the web and figure out why.

(I may or may not write a follow-up to this sometime soon.)

The next step in this tutorial is ready, now. We'll tell the computer whom to greet.

 [TOC]