Showing posts with label char. Show all posts

Wednesday, February 3, 2021

ASCII Table in C -- Char Arrays and Char Pointer Arrays

[TOC]

Let's set Hello World! aside for a while. 

In the last project, we needed to look at a part of the ASCII table to explain the code for the menu selections. I happen to have that part of the table mostly memorized, and I know about the ASCII man page in modern *nix OSses, so I just built the code by hand. But it would be nice to have our own ASCII chart, and it would be cool to let the computer build it for us, right?

Let's give it a try.

First, we'll build a simple ASCII table. I don't have the control code mnemonics completely memorized, so we'll go to Wikipedia's ASCII page and the *nix manual pages mentioned above for reference.

man ascii 

(Using the terminal window's copy function, I pasted the manual page contents into an empty gedit text document window, and used gedit's regular expression search-and-replace to extract the parts I wanted. Very convenient, once you get the hang of it.)

With a little bit of work (a very little bit), I came up with the following simple table generator:


/* A simple program to print the ASCII chart, 
** as part of a tutorial introduction to C.
** This instance the work of Joel Rees,
** Whatever is innovative is copyright 2021, Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>


/* reference: 
== *nix man command: man ascii 
== Also, wikipedia: https://en.wikipedia.org/wiki/ASCII
*/
char *ctrl_code[33] =
{
  "NUL", 	/* '\0': null character */
  "SOH", 	/* ----  start of heading */
  "STX", 	/* ----  start of text */
  "ETX", 	/* ----  end of text */
  "EOT", 	/* ----  end of transmission */
  "ENQ", 	/* ----  enquiry */
  "ACK", 	/* ----  acknowledge(ment) */
  "BEL", 	/* '\a': bell */
  "BS", 	/* '\b': backspace */
  "HT", 	/* '\t': horizontal tab */
  "LF", 	/* '\n': line feed / new line */
  "VT", 	/* '\v': vertical tab */
  "FF", 	/* '\f': form feed */
  "CR", 	/* '\r': carriage ret */
  "SO", 	/* ----  shift out */
  "SI", 	/* ----  shift in */
  "DLE", 	/* ----  data link escape */
  "DC1", 	/* ----  device control 1 / XON */
  "DC2", 	/* ----  device control 2 */
  "DC3", 	/* ----  device control 3 / XOFF */
  "DC4", 	/* ----  device control 4 */
  "NAK", 	/* ----  negative ack. */
  "SYN", 	/* ----  synchronous idle */
  "ETB", 	/* ----  end of trans. blk */
  "CAN", 	/* ----  cancel */
  "EM", 	/* ----  end of medium */
  "SUB", 	/* ----  substitute */
  "ESC", 	/* ----  escape */
  "FS", 	/* ----  file separator */
  "GS", 	/* ----  group separator */
  "RS", 	/* ----  record separator */
  "US",  	/* ----  unit separator */ 
  "SPACE"  	/* ----  space */ 
};

char *del_code = 
  "DEL";	/* ----  delete */


int main( int argc, char *argv[] )
{
  int i;

  for ( i = 0; i < 33; ++i )
  {
    printf( "\t%3d 0x%2x: %s\n", i, i, ctrl_code[ i ] );
  }
  for ( i = 33; i < 127; ++i )
  {
    printf( "\t%3d 0x%2x: %c\n", i, i, i ); 
  }
  printf( "\t%3d 0x%2x: %s\n", 127, 127, del_code );
  return EXIT_SUCCESS;
}

(Again, if you're working on a pre-ANSI C compiler, remember to change the main()  function declaration to the pre-ANSI K&R style:

int main( argc, argv )
int argc;
char *argv[];
{ ...
}

I think most K&R C compilers should compile it with that change.)

Looking through the code, you probably think you recognize what the ctrl_code[] array is, but look closely. It is not a two dimensional array of char. It's an array of char *, and the pointers point to anonymous char arrays of varying length. If we'd written it out with explicit C char array strings, it would look something like this:

char ent00[] = "NUL";    /* '\0': null character */
char ent01[] = "SOH";   /* ----  start of heading */
char ent02[] = "STX";   /* ----  start of text */
...
char ent32[] = "SPACE";      /* ----  space */

char *ctrl_code[] =
{
  ent00, ent01, ent02, ... ent32
};

But C takes care of all of this for us, without the nuisance of all the entNN names.

The advantage of this structure is pretty clear, I think. Well, maybe it's clear. 

In many cases, we can save some memory space because each char string does not have to be as long as the longest plus the byte for the trailing NUL. 

More importantly, we don't have to check whether we've accidentally clipped off that trailing NUL. (Right?)  

(Clear as mud? Well, follow along with me anyway. It does clear up.)

The compiler is free to allocate just enough space for the char array and its trailing NUL, and to take care of the petty details for us. 

All we have to do is remember that it's not really a two dimensional array. It just looks an awful lot like one.

----- Side Note on Memory Usage -----

You may be wondering how much space is actually saved in this particular table by using an array of char * pointers instead of a two-dimensional array of char. It's a good question. Let's calculate it out.

If this array were declared as a two-dimensional array, we'd want the rows long enough to handle the longest mnemonic plus NUL. The longest mnemonic is SPACE, so that's 6 bytes:

char ctrl_code[ 33 ][ 6 ];

Total space is 33 rows times 6 bytes per row, or 198 bytes.

As we've declared it above, it will be 33 times the size of a pointer plus the individual string lengths. If pointers are sixteen bits (16-bit addresses), that's 66 bytes. If pointers are 32 bits (1990s computers, 32-bit addresses), that's double, or 132 bytes. 

On modern (64-bit address) computers, that's 8 bytes per address, or 264 bytes, just for the pointers themselves.

For the individual string lengths, there are 19 three-byte mnemonics, 13 two-byte mnemonics, and 1 five-byte mnemonic. Adding in the NUL, that's

19 x 4 + 13 x 3 + 6 == 76 + 39 + 6 == 121

For the various address sizes:

  • 16-bit: 66 + 121 == 187 bytes (11 bytes saved)
  • 32-bit: 132 + 121 == 253 bytes
  • 64-bit: 264 + 121 == 385 bytes

del_code could go either way, independent of the ctrl_code array. Declared as a pointer to an anonymous array, it consumes 2 bytes for the pointer (on a 16-bit architecture) and four for the anonymous array. We really don't need the pointer pointing to it, and accessing the array directly would use the same syntax. But sometimes you do things for consistency, and it is not always a bad thing to do so.

So, in this case, we really aren't saving space, unless we're working on a retro 16-bit computer, and even then not much. 

 The benefit of not having to worry about the trailing NULs is no small benefit, and the extra memory use does not worry us nearly as much on machines where a few hundred bytes are well less than a millionth of the total available memory. 

----- End Side Note on Memory Usage -----

The source code itself for this control code table gives us a good table of control codes, for reference, of course. But since it is source code, we can use it to make other tables from it.

Anyway, let's look at the source code. Since the source you copied out is way up off the screen, refer to it from the file where you copied it while you read this.

I include SPACE in it for convenience, even though SPACE really isn't classified as a control code by C's libraries, or by the language itself. That's no problem, is it? -- as long as we both remember that I'm playing a small game with semantics.

DEL is way up at the top of the ASCII range, and I don't have anything to pre-define for the visible character range, so DEL is not in the control code table. It gets its own named, NUL-terminated char array. Again, I just have to remember to print its line out after I've done the rest.

I've declared the traditional counter for for loops, i, and I have one loop that is dedicated to the control codes.

This time, I'm using printf() instead of puts() or my_puts(). One reason is that the previous three projects should have gotten us comfortable with some of the details that you miss when using printf(). Another is that we want numeric and textual formatted output, and we aren't ready to write numeric output routines ourselves, and printf() does numeric output and was designed for formatted output. 

A lot of people read printf() as "print file". I forget and read it that way myself from time to time. It's a habit that's catching. But it's not what printf() means. printf() means "print formatted".

And that's what it does. The first parameter is a format string. The parameters after the first are what we want formatted.

The format string for the first and third printf()s is this:

"\t%3d 0x%2x: %s\n"

Working through the format --

\t is the tab character. It'll give us a little regular space on the left.

%d is decimal (base ten) integer (not real or fractional) numeric output. %3d is three columns wide, right-justified. We print the loop counter out here, because the array of control code mnemonics is arranged so that the code is the same as the index, and we are using the loop counter as the index.

The space character that comes next is significant. We output it as-is, along with the zero and lower-case x which follow.

%x is hexadecimal numeric output, and %2x is two columns wide, right justified. (This was a little bit of a mistake. I'll show you how to fix it, below.) Then we use this format to output the loop counter again, so we can see the code in hexadecimal.

Then the colon and the following space are output as-is, and %s just outputs the char array passed in as a string of text. We pass in the mnemonic, and the formatted print is done. The output looks like this:

       0 0x 0: NUL
       1 0x 1: SOH
      ...
      13 0x d: CR
      ...
      31 0x1f: US
      32 0x20: SPACE

The second loop has a slightly different format, but the result is adjusted to the first:

"\t%3d 0x%2x: %c\n"

The first two formats are the same. The third is a char format, which outputs the integer given to it as an (ASCII range) character. All three get the loop counter, so we see the character codes in decimal, then in hexadecimal, then the actual character. It looks like this:

      33 0x21: !
      34 0x22: "
      ...
      ...
      47 0x2f: /
      48 0x30: 0
      49 0x31: 1
      ...
      64 0x40: @
      65 0x41: A
      66 0x42: B
      ...
     125 0x7d: }
     126 0x7e: ~

Then the DEL is output with the same format as the other control characters. The loop counter ends at 127 after the visible character range finishes, so we could have used the counter, but we go ahead and pass it the code for DEL as a literal constant.

     127 0x7f: DEL

To demonstrate that we have quite a bit of flexibility in output formats, I've written a bit more involved table generator, and it follows. It gives a few more examples of ways to use the formatted printing. Also it gives us  a look at the use of struct to organize information:


/* A more involved program to print the ASCII chart, 
** as part of a tutorial introduction to C.
** This instance the work of Joel Rees,
** Whatever is innovative is copyright 2021, Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>


struct ctrl_code_s 
{
  char * mnemonic;
  char * c_esc;
  char * description;
};


/* reference: 
== *nix man command: man ascii 
== Also, wikipedia: https://en.wikipedia.org/wiki/ASCII
*/
struct ctrl_code_s ctrl_code[33] =
{
  {  "NUL",	"'\\0'",	"null character"  },
  {  "SOH",	"----", 	"start of heading"  },
  {  "STX",	"----", 	"start of text"  },
  {  "ETX",	"----", 	"end of text"  },
  {  "EOT",	"----", 	"end of transmission"  },
  {  "ENQ",	"----", 	"enquiry"  },
  {  "ACK",	"----", 	"acknowledge(ment)"  },
  {  "BEL",	"'\\a'",	"bell"  },
  {  "BS",	"'\\b'",	"backspace"  },
  {  "HT",	"'\\t'",	"horizontal tab"  },
  {  "LF",	"'\\n'",	"line feed / new line"  },
  {  "VT",	"'\\v'",	"vertical tab"  },
  {  "FF",	"'\\f'",	"form feed"  },
  {  "CR",	"'\\r'",	"carriage ret"  },
  {  "SO",	"----", 	"shift out"  },
  {  "SI",	"----", 	"shift in"  },
  {  "DLE",	"----", 	"data link escape"  },
  {  "DC1",	"----", 	"device control 1 / XON"  },
  {  "DC2",	"----", 	"device control 2"  },
  {  "DC3",	"----", 	"device control 3 / XOFF"  },
  {  "DC4",	"----", 	"device control 4"  },
  {  "NAK",	"----", 	"negative acknowledgement"  },
  {  "SYN",	"----", 	"synchronous idle"  },
  {  "ETB",	"----", 	"end of transmission block"  },
  {  "CAN",	"----", 	"cancel"  },
  {  "EM",	"----", 	"end of medium"  },
  {  "SUB",	"----", 	"substitute"  },
  {  "ESC",	"----", 	"escape"  },
  {  "FS",	"----", 	"file separator"  },
  {  "GS",	"----", 	"group separator"  },
  {  "RS",	"----", 	"record separator"  },
  {  "US",	"----", 	"unit separator"  }, 
  {  "SPACE",	"----", 	"space"  }
};

struct ctrl_code_s del_code = 
{  "DEL",	"----",	"delete"  };

char ctrl_format[] = "\t%3d 0x%02x: %6s %s %s\n";


int main( int argc, char *argv[] )
{
  int i;

  for ( i = 0; i < 33; ++i )
  {
    printf( ctrl_format, i, i, 
      ctrl_code[ i ].mnemonic, ctrl_code[ i ].c_esc, ctrl_code[ i ].description );
  }
  for ( i = 33; i < 127; ++i )
  {
    printf( "\t%3d 0x%02x: %6c %s %s\n", i, i, i,
      isdigit( i ) ? " DEC" : ( isxdigit( i ) ? " HEX" : "----" ),
      ispunct( i ) ? "punctuation" : ( isalpha( i ) ? "alphabetic" : "numeric" ) ); 
  }
  printf( ctrl_format, 127,127, 
      del_code.mnemonic, del_code.c_esc, del_code.description );
  return EXIT_SUCCESS;
}

(I could have used the first program to output a skeleton source for this one, but I decided to use regular expressions again to extract the various fields, instead.)

The ctrl_code_s struct template has three char * fields -- the mnemonic, the C escape code if there is one, and a more verbal description from the manual page and from Wikipedia. (I extracted the initializations from the source of the first one, again, using the regular expression search-and-replace in gedit.)

The initializations enclose the triplets of anonymous char arrays in curly braces, and we format the source to make it easy to see that the right data goes with the right data. Again, the order of the elements is such that the index of the array of struct ctrl_code_s is the same as the ASCII code.

Some pre-ANSI compilers may not handle this kind of nested initialization. In that case, you may be able to do the initializations without the inner sets of curly braces. It becomes trickier, because such compilers can't help you be sure that the sets are kept together correctly, but it's worth trying. If it doesn't work, you can give up on the array of  struct ctrl_code_s, and use three separate arrays.

(If the compiler doesn't nest initializations and you want to use the array of struct ctrl_code_s anyway, you can set up the three separate initialized arrays and then copy the fields into the uninitialized struct ctrl_code_s. It might be an interesting exercise to do, anyway, to help you get a better handle of what a pointer is and what it points to.) 

Also, some compilers did not support ternary expressions well. If that's the case with your compiler, try the following for the second loop, instead:


  for ( i = 33; i < 127; ++i )
  {
    char * numeric = "----";
    char * char_class = "numeric";

    if ( isdigit( i ) )
    {
      numeric = " DEC";
    }
    else if ( isxdigit( i ) )
    {
      numeric = " HEX";
    }
    if ( ispunct( i ) )
    {
      char_class = "punctuation";
    }
    else if ( isalpha( i ) )
    {
      char_class = "alphabetic";
    }
    printf( "\t%3d 0x%02x: %6c %s %s\n", i, i, i, numeric, char_class );
  }

I've kept the evaluation the same as the ternary expressions, which may help if you're having trouble working those out.

(And if your compiler complains at declaring variables inside nested blocks, you'll need to move the declarations of the variables numeric and char_class up to the main() block where "int i" is declared. But you'll also need to re-initialize them each time through the loop, in the same place they are declared and initialized above.)

One thing you'll notice in the c_esc field initializations is the use of the backslash escape character to escape itself. For NUL's escape sequence, for example, the source code is written

"'\\0'"

What is actually stored in RAM is the single-quoted escape sequence of backslash followed by the ASCII digit zero:

'\0' 

the single quotes acting in the source as ordinary characters within the double-quoted initialization string, but the first backslash still acting as the escape character so you can store control code literals in your code. If we used only one backslash in the initialization for NUL, it would not get the escape sequence, it would get the literal NUL -- a byte of 0. 

(You might be interested in what happens when you print out a NUL character. If you are, give it a try.)

The syntax for accessing the fields of the array of struct ctrl_code_s is the dot syntax that is common for record fields in other languages, so accessing the mnemonic for NUL is

ctrl_code[ 0 ].mnemonic

And, if (for some reason) we wanted the third character of the description of the horizontal tab, the syntax would be

ctrl_code[ 9 ].description[ 2 ]

Examining the source code and the output should give you some confidence in what you are seeing.

Again, DEL gets its own struct ctrl_code_s, not in the main ctrl_code array.

This time, I'm showing that the output format is, in fact, just a NUL-terminated char array, by declaring it before I use it and giving it a name: 

char ctrl_format[] = "\t%3d 0x%02x: %6s %s %s\n";

Other than fixing the format for the hexadecimal field, it's the same as before, but with more %s fields for added information.

I thought about using the same format for the visible character range, but that gets a bit cluttered, so I gave it its own format.

"\t%3d 0x%02x: %6c %s %s\n"

The format for the visible range also adds information fields, just to demonstrate the ctype library and one more conditional construct.

I use the first information field to show whether the character is hexadecimal, decimal, or not numeric, using the functions isdigit() and isxdigit() from the ctype standard library.

The parameter to printf() here is a calculated parameter, using the ternary conditional expression,

condition ? true_case_value : false_case_value

The second informational field is also calculated, using the ctype functions ispunct() and isalpha() called from within the (nested) ternary conditional expression.

And there are no more surprises in the line for DEL.

Compiling it yields no surprises:

And here's the start of the table, when it's run:

----- Side Note on Memory Usage -----  

ctrl_code_s is a struct containing three pointer fields. The pointers alone consume  6, 12, or 24 bytes per entry. Multiply that by 34 (to include SPACE and DEL), and there are 204, 408, or 816 bytes in use just for the pointers. The arrays for the first field consume, from the calculations above, 121 + 4 bytes. The next field is 4 bytes plus NUL, five for each one, times 34, makes it 170 bytes.

I used the program itself to add the description strings up. I'll show how later, but the total of the descriptions is 491. The total for all three fields is 786.

Together with the pointers:

  • 16-bit: 786 + 204 == 990 bytes
  • 32-bit: 786 + 408 == 1194 bytes
  • 64-bit: 786 + 816 == 1602 bytes 

Each field could instead be declared as a constant-length array of char.

Can you work out how that would affect the size of the table by yourself? Compared with constant-length fields, the pointer version above saves about 300 bytes on 16-bit machines and about 100 on 32-bit machines, but uses about 300 more on 64-bit machines.

Oh, why not now? Here's the code to add at the end of main:


 { int sums[ 3 ] = { 0, 0, 0 };
  for ( i = 0; i < 33; ++i )
  { sums[ 0 ] += strlen( ctrl_code[ i ].mnemonic ) + 1;
    sums[ 1 ] += strlen( ctrl_code[ i ].c_esc ) + 1;
    sums[ 2 ] += strlen( ctrl_code[ i ].description ) + 1;
  }
  sums[ 0 ] += strlen( del_code.mnemonic ) + 1;
  sums[ 1 ] += strlen( del_code.c_esc ) + 1;
  sums[ 2 ] += strlen( del_code.description ) + 1;
  printf( "mnemonic: %d, c_esc: %d, description: %d\n",
    sums[ 0 ], sums[ 1 ], sums[ 2 ] );
  printf( "total; %d\n", sums[ 0 ] + sums[ 1 ] + sums[ 2 ] );
  printf( "with pointers on this machine; %ld\n",
    (long) ( sums[ 0 ] + sums[ 1 ] + sums[ 2 ] + 34 * sizeof (struct ctrl_code_s) ) );
 }

You'll need to

#include <string.h>

at the top, for strlen().

Be sure to grab the enclosing curly braces. Probably want to change the name of the program, too, while you're at it. 

Also, be aware that sizeof is an operator, not a function. There is a very good reason for why I use sizeof and strlen() where I use them, which I will explain later. (Or you can look it up now if you want.)

One more thing to be aware of: compilers will often pad structures in places that make code faster, so the discussion I give above of memory use is really about minimum memory use.

----- End Side Note on Memory Usage -----

So, now that we have these two programs, what next?

What can you think of to do with these tables, or with the pieces of the language and the libraries that these two programs use? 

Try it.

Unicode? 

A complete Unicode table would be huge, and, since it has way more than 256 characters in it, it won't fit in the C char type. (Did I mention that before? This is one reason I insist that char is not actually a character type.) I hope to take up a partial Unicode table later, although it might not work with pre-ANSI C compilers and the OSses they run under.

And I'll be working on the next step.

Woops. I forgot about the HTML table. That could be the next project. Why don't you see if you can finish a short program to produce the HTML table for yourself before I can?

[TOC]

Monday, January 25, 2021

Looking Deeper into Hello World in C -- char Type and Characters and Strings

[TOC]

I regularly see questions about handling characters in C, regularly enough that it may be time to write a tutorial on the subject.

What is this character thing and what is the C?

C is a very widely used programming language. 

Characters, well, they're a bit hard to pin down, but they have something to do with the letters we write words in -- A, B, C, い、ろ、は、イ、ロ、は、色、匂、散、 etc. Without characters, getting information into and out of computers quickly becomes rather difficult.

Let's look at the archetype of introductory programs:


/* The archetypical "Hello, World!" program.
** This instance the work of Joel Rees,
** but it's too trivial to be copyrightable.
*/


#include <stdio.h>
#include <stdlib.h>

int main( int argc, char *argv[] )
{
  puts( "Hello World!"); /* Sure -- puts() works as well as printf(). */
  return EXIT_SUCCESS;
}

Recapping and explaining the C programming language elements:

Between the /* and the */ is a comment for humans. The compiler (mostly) ignores it.

The #include statements are there to tell the compiler to invoke interface definition headers for standard function libraries, standard IO and standard miscellaneous lib(rary). A linking loader will later link the functions in, and we don't need to think too deeply about them just now.

The line "int main( int argc, char * argv[] )" tells the compiler that this is the main() function of the program, that it returns an integer to the operating system (as main() functions traditionally should), and that it knows about the command-line variables that inform the program of the number ("argc") of parameters passed in from the OS and the content ("argv") of the parameters. (We'll look at the command-line parameters soon.)

If you're using a pre-ANSI C compiler, you may need to change the definition of main() as follows:

int main( argc, argv ) /* You may even need to leave out the parameter names. */
int argc;
char ** argv;   /* char * argv[] should also work in most cases. */
{
   /* Same code as above comes here. */
}

For some compilers/operating systems, the system does not provide the parameter count or array. In such cases, just leave the parameters to main() out completely. (And I'll explain that later, too.)

Which all sounds like technobabble to the beginning C student, but it does become meaningful at some point.

For now, main() is where your program starts. (Essentially.) 

------ Side Note on Code Formatting ------

The curly braces ("{}") define a block of code, and the fact that they come immediately after the line that says this is main(), with no punctuation between, tells the compiler that the stuff between is what the program should do.

A note on those curly braces. Some -- well, a lot of -- people think that the opening brace belongs on the line above where I've put it, like this:

int main( int argc, char *argv[] ) {

It's wrong, but that's their preferences and their business. The whole world can be wrong sometimes. In this tutorial, I'm putting the open brace down where I and you can see it.

(You'll need to get used to both ways, and some variations of both, if you try to make a living programming. Don't fuss over it. And if it's easier for you to see the other way, when you copy the programs out of the tutorial put them where it's easier for you to see.)

------ End Side Note on Code Formatting ------

So the block of code that defines what this program does consists of two lines. The second of those tells the program to pass back to the operating system a code that tells the OS that the program exited with success. 

The meat of the program is the line that puts the string, "Hello World!" on the output device, which is usually a terminal window these days.

If you've seen the Hello World program in C before, you may have seen it done with printf() instead of puts(). I chose puts() here because it's a much simpler function to explain. I mean, I've already explained it.

Now I can focus on the string of characters in this program. Not the string of characters which comprises the source code of the program, but the string which the program, when compiled and run, should output, the five letters 'H', 'e', 'l', 'l', and 'o', the space which follows, the five letters 'W', 'o', 'r', 'l', and 'd', and the punctuation character which follows, the exclamation mark. (Or, the exclamation point in some parts of the world.):

Hello World!

This is a string of characters, as we say. And puts() puts them out on the output device, whatever the output device is. Here's a sample output when run on a Linux OS, in a terminal emulator:


Now this terminal screen is not all that obvious. In the modern world, you would have bells and whistles and dancing assistants, explaining what the picture shows, and the picture would be a video instead of a still shot. So I'll show you the dancing assistant's script:

In the above screenshot on a typical computer running Ubuntu, you can see me 

  1. moving to the directory where the source code is stored:

cd ダウロード/FBcomp/

("cd" stands for "change directory".)

  2. listing the contents of the directory:

ls

("ls" stands for "list". There is only one file, the source file Hello_World.c, and you see it listed in the line below the command. In MSWindows command line shells, it would be "dir".)

  3. issuing the compile command:

cc -Wall -o Hello_World Hello_World.c

("cc" stands for "C compile".

"-Wall" stands for "Warn all warnings".

"-o Hello_World" means "name the executable object file for the program 'Hello_World'".

"Hello_World.c" is the name of the file containing the source code. It occurs to me now that using a different name for the source and object files would have been a little less confusing.)

  4. and issuing the command to run the program:

./Hello_World

("./" in a *nix shell says look only in the current local directory.) And you can see the output after the last command:

Hello World!

Hmm. I could do a video of this. It's something to think about. But until I have the time, I'll hope you can follow this well enough. 

------ Side Note on Getting a Compiler ------

To actually compile this and run the programs, you'll need a compiler and some system software that supports it. 

*nix (Linux and BSD OSses, et al., and Plan 9):

If you are on a Linux or BSD or similar libre operating system, you'll have a package manager that will help you install the compiler tools (if they are not already installed), and the web site for your OS distribution (vendor) will have pages on how to check if they are installed, and how to install them if they aren't.

*Mac: 

If you're on a Mac, I understand the current official thing is to get XCode from the App store. It looks like Apple will push you to learn Swift, which I suppose is okay, but I can't help you with that. XCode should allow you to compile C programs with clang. Clang is like gnu cc, but, instead of typing the command "cc" like I show above, you type the command "clang". (Clang can also be used on Linux, and gcc can also be used on Mac, but that requires some setup, and is a topic for another day.)

*Microsoft Windows:

Microsoft's Visual Studio will only continue to push you to remain in Microsoft's world, so I don't recommend that. The Hello World for that world is different from what I describe here, and will send you jumping through hoops to open a window to display it in, which is fine for opening a window just to display a string in, but doesn't really help you start understanding what a string is or what is really happening underneath or how to go to the next step.

Microsoft also has a Windows Subsystem for Linux, which allows you to install full Linux distributions in your MSWindows OS (apparently to run under emulation). I have not used it. I can't recommend either for or against it. But layering more layers over reality never helps learn about reality. Still, if you just want to get your toes wet, it might be the thing for you.

*Cygwin:

Cygwin allows you to install the Gnu C Compiler tools on MSWindows computers, along with certain other software from the libre software world. If you must run MSWindows, I think I recommend Cygwin.

Instructions for downloading and installing Cygwin on MSWindows can be found at 

Get the gnu C compiler and basic libraries using the installer, after you check the checksum to make sure it downloaded safely, and, if you have gnupg or another way to check the fingerprint, check the fingerprint so you know it came from the right place.

*Android:

There are several apps to install a C compiler and walled runtime environment on Android. I have not used any of them, and can't recommend either for or against, but the layers principle applies. And they take up space you may need for other things. (Space is the primary reason I have not used any of them.) But they may be good for getting your feet wet.

There are also partial Linux systems (like NoRoot) that can allow more than just compilers to be installed, but don't allow full access to the phone. (The walls do help keep the phone somewhat safe.) You'll need to search for an app that is compatible with your phone and Android OS, however.

I have heard that recent Android Phones can officially (by Google approved methods) be turned into full Linux computers, but that seems to be more rumor than fact.

*Downloading Linux and BSD OSses:

Instructions for downloading and installing a full Linux or BSD system to replace or dual-boot with your MSWindows OS can be found at their respective sites. As I mentioned above, once you have one of those installed, getting the compiler and other tools is simply a matter of running the package manager and telling it to install the tools. 

(I'm currently running Ubuntu on an older Panasonic Let's Note with 4G  RAM, using about 100G of the internal SSD.)

Some OSses I have a bit of experience with include

There are other distributions of both Linux and BSD OSses, such as Dragonfly BSD, Arch Linux, OpenSUSE, Fedora (Red Hat), CentOS, Mint OS, and so forth.

*Others: 

There are other options similar to the BSD distributions, such as Minix, Plan 9 (Inferno), and Open Solaris. Your web search engine should find relevant information quickly. 

If you are using an older (classic) Macintosh, Apple's Macintosh Programmer's Workbench (MPW) has a good compiler and a fun, if quirky by today's standards, workbench environment. Codewarrior for the Mac was also good, if quirky in other ways.

Radio Shack/Tandy's venerable 8-bit Color Computer had OS-9/6809, for which a compiler was available from Microware. (Not Microsoft, okay?) It was a pre-ANSI compiler. Other ANSI compatible and pre-ANSI compilers were available for all that gear that is now retro gear, and you can often find those compilers to download. I'll discuss pre-ANSI (K&R) C syntax, but I won't try to deal with Small C.

*Trusting an Alternate OS:

If you are wondering how you can trust one of these alternate OSses, I talk about that a little here:

https://defining-computers.blogspot.com/2019/05/analyzing-mechanized-trust.html 

Some of what I say there, I'm not completely sure is universal, but if you have questions, that rant should give you some good pointers to start researching your questions.

------ End Side Note on Getting a Compiler ------

Back to the Hello World program.

It's a simple program. It also invites some misunderstanding, which is the real reason for this post. While you read this, keep the text editor window where you copied and pasted the source open for reference.

I am now going to tell you something that may have you thinking I'm telling you lies. But Kernighan and Ritchie explain it as well, in their book The C Programming Language. I'm just going to try to make it more obvious.

The programming language C does not inherently support strings of text. No real character type in the language proper, no real character string type, either.

Okay, I said it. 

(The support is indirect, through library functions, and is not nearly as complete in the standard libraries as you want to think.) 

In the program, collecting the "Hello World!" between the quotes into a byte array and terminating it with a NUL byte is, uhm, well, it's part of C, but it isn't.

To explain that somewhat carefully, let's do a few alternate versions of the Hello World program:


/* Not so typical "Hello, World!" program.
** This instance the work of Joel Rees,
** Copyright 2021 Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>

char greeting1[ 32 ] = "Sekai yo! Konnichiwa!";

/* This could be done on a single line.
-- I'm doing it this way for effect.
*/
char greeting2[ 16 ] =
{
  'H',
  'o',
  'l',
  'a',
  ' ',
  'M',
  'u',
  'n',
  'd',
  'o',
  '!',
  '\0', /* <= Look! It's a NUL! */
  0 /* <= Look! It's another NUL! */
}; 

int main( int argc, char *argv[] )
{
  puts( "Hello World!"); /* Sure -- puts() works as well as printf(). */
  puts( greeting1 );
  puts( greeting2 );
  return EXIT_SUCCESS;
}

You can change that for pre-ANSI compilers as described above.

Several questions should come to mind. 

One, why do I use Rōmaji instead of kana or Kanji in the Japanese, and where is the leading inverted exclamation mark in the Spanish?

I'll get to that in a bit. Maybe in another post.

The other, why do I put explicit NUL bytes on the end of the Spanish version?

Right? Those were the two most obvious questions, right?

No?

Okay, let's work through your questions. 

The string that I might have named greeting0, "Hello World!", is automatically collected in an array like greeting1 and greeting2, but not given a name that the programmer can use. 

It's anonymous. 

(If the same identical string occurs elsewhere in the same file, modern compilers will recognize that and only store it once -- unless you tell them not to. Older compilers may not search for identical strings, and just store another copy. But the string doesn't have a name that the C program can directly use.)

Other than that, the three greeting strings are all treated exactly the same way. They are collected as arrays of char, and a trailing 0 (NUL) terminator byte is attached.

(cough.)

Well, sort of. If I had declared them without size, ..., oh, hang on. No, we won't go there quite yet.

Oh. The size. I guess we do need to go there.

So, no, not quite exactly the same. The anonymous string is allocated enough bytes for the text and the trailing NUL (and maybe some extra, at the discretion of the compiler). 

The other two are allocated the number of bytes that the source specifies, thirty-two for greeting1 and sixteen for greeting2 (and maybe some extra, at the discretion of the compiler). And, if there is enough room after the text is stored, the rest of the array is filled with zeroes, effectively putting at least one NUL terminator byte at the end. But only if there is enough room. 
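If you want to see those sizes for yourself, here is a minimal sketch using sizeof. The helper names count_trailing_zeroes() and report_array() are my own inventions for illustration, not anything from the standard library, and greeting2 is written on one line here just to save space.

```c
#include <stdio.h>
#include <stddef.h>

char greeting1[ 32 ] = "Sekai yo! Konnichiwa!";
char greeting2[ 16 ] = { 'H', 'o', 'l', 'a', ' ', 'M', 'u', 'n', 'd', 'o', '!' };

/* Hypothetical helper: count the zero bytes at the tail end of an array. */
size_t count_trailing_zeroes( char * array, size_t size )
{
  size_t count = 0;

  while ( ( count < size ) && ( array[ size - 1 - count ] == 0 ) )
  {
    ++count;
  }
  return count;
}

/* Hypothetical helper: print the declared size and the zero fill of an array. */
void report_array( char * name, char * array, size_t size )
{
  printf( "%s: %lu bytes, %lu of them trailing zeroes\n", name,
    (unsigned long) size,
    (unsigned long) count_trailing_zeroes( array, size ) );
}
```

Calling report_array( "greeting1", greeting1, sizeof greeting1 ) from main() should report 32 bytes with 11 trailing zeroes, and greeting2 should report 16 bytes with 5. For comparison, sizeof "Hello World!" is 13: twelve characters plus the NUL. Note that sizeof only reports the size the language promises, not any padding the compiler may add behind the scenes.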

Will the compiler complain if you've declared a size too small for the text specified?

It should. Usually. I mean, yes. Maybe. Usually.

Which is why, when you explicitly declare strings, you usually declare strings like this:

char greeting1prime[] = "Sekai yo! Konnichiwa!";

Or like this:

char * greeting3prime = "Bonjour le monde!";

Here, greeting1prime is an array of bytes like greeting1, but the size allocated is enough bytes for the string plus the trailing NUL (plus extra, if the compiler wants to).

On the other hand, greeting3prime is a pointer to a byte, initialized with the address of the anonymously allocated, NUL terminated array "Bonjour le monde!". In other words, in addition to the string in greeting3prime's case, you've declared a variable of type 

char *

which is a byte pointer -- a variable. You can change what it points to. You can even lose what it was pointing to, if you're not careful.
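A minimal sketch of that difference, assuming nothing beyond the two declarations above. The function repoint_greeting() is a hypothetical helper of my own; it hands back the old address so that we don't lose what the pointer was pointing to.

```c
#include <stdio.h>

char greeting1prime[] = "Sekai yo! Konnichiwa!";
char * greeting3prime = "Bonjour le monde!";

/* Hypothetical helper: point the pointer variable at a different string,
** and return what it used to point to, so the old address is not lost.
*/
char * repoint_greeting( char * replacement )
{
  char * previous = greeting3prime;

  greeting3prime = replacement; /* Legal: greeting3prime is a variable. */
  return previous;
}

/* Hypothetical helper: show what each name occupies in memory. */
void report_greetings( void )
{
  printf( "array:   %lu bytes\n", (unsigned long) sizeof greeting1prime );
  printf( "pointer: %lu bytes\n", (unsigned long) sizeof greeting3prime );
}
```

Here sizeof greeting1prime is 22 (the text plus the NUL), while sizeof greeting3prime is just the size of a pointer on your machine. Going the other way, an assignment such as greeting1prime = greeting3prime will not compile, because you can't assign to an array as a whole.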

Hmm. 

Before we go any further, let me explain something.

C has no native character type. 

Again, you're doubting my sanity. I know you are. This whole post is a discussion of characters in C, right?

Not yet. Time for a little history. (This is definitely not an aside.)

Back in the 1970s, when Kernighan and Ritchie and some of their coworkers were playing around with BCPL and the early versions of C, we didn't know nearly as much about how to deal with text in computers as we know now. (And there's still a lot we need to learn.)

Even the size of a byte in a computer was not set. Some had 6-bit bytes, some had 8-bit bytes, and a few had 9-bit bytes. Other sizes also existed; look them up if you want.

Nowadays, we can be sort-of comfortable thinking of a byte as 8 bits. But a byte now is (usually) defined as the smallest addressable unit of memory in a computer.

Which is precisely what C defines the char type as. This was something of a mistake. The type should really have been called byte.

You can alias the char type to byte with

typedef char byte;
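As a rough sketch of what that means in practice: sizeof counts in units of char, so sizeof (char) is 1 by definition, and the CHAR_BIT macro from limits.h tells you how many bits that unit has on your compiler. The helpers bits_in() and report_widths() are hypothetical names of mine.

```c
#include <limits.h>
#include <stdio.h>

typedef char byte;

/* Hypothetical helper: how many bits wide is something that is size bytes? */
unsigned long bits_in( size_t size )
{
  return (unsigned long) size * CHAR_BIT;
}

/* Hypothetical helper: print the widths of a byte and an int. */
void report_widths( void )
{
  printf( "byte: %lu bits\n", bits_in( sizeof (byte) ) );
  printf( "int:  %lu bits\n", bits_in( sizeof (int) ) );
}
```

On most current machines you should expect report_widths() to say a byte is 8 bits, but the language only promises at least 8.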

Why did the conflation occur? Glad you asked. (You did ask, I hope.)

Back then, in the western world, we didn't really know much about eastern languages, so we didn't really consider them. 

Western languages all used (as we thought) less than 100 characters, and even Japanese had the kana, of which there are only around 50 (depending on what's included). 

And computers were beginning to standardize on 8-bit bytes.

We assumed that Kanji were built in some orderly manner from smaller parts, and that those parts would number less than 250, so that they could also be encoded in bytes.

And we really didn't think about a single encoding that would encompass all languages, like Unicode. 256 was just too much a magical number to ignore.

256 is still a magical number, but we now know that even English actually needs more characters than that. (Some members of the computer typesetting industry of the early 1970s knew that we needed more than 256, but they weren't writing operating systems and programming languages.)

And, anyway, a typeface with more than 256 characters was known to need more computing power than an ordinary office could afford. (The 68000 overcame that barrier, but that was several years after the early K&R C had been defined.)

And that's the simplified version of how characters were conflated with bytes -- a bit historically incomplete, but good enough to help us think beyond the names of types.

char in C is an integer type. It can be either signed or unsigned, according to what the C compiler engineers think works best for a particular family of computers.
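Here is a small sketch of char behaving as an integer. Both helper names are hypothetical, and next_letter() assumes an ASCII-style character set in which the letters you give it are numbered consecutively.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if plain char is signed on this compiler,
** 0 if it is unsigned. CHAR_MIN is the smallest value a char can hold.
*/
int char_is_signed( void )
{
  return CHAR_MIN < 0;
}

/* Hypothetical helper: ordinary integer arithmetic on a char.
** Assumes ASCII, where 'A' + 1 is 'B'.
*/
char next_letter( char letter )
{
  return (char) ( letter + 1 );
}
```

With ASCII, next_letter( 'A' ) gives 'B'. Which answer char_is_signed() gives depends on your compiler and processor family, so try it on whatever machines you have access to. In C, even the constant 'A' by itself has type int.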

And that should be enough for an introduction to characters in C. No, wait. One more version of Hello World! --


/* A non-typical "Hello, World!" program.
** This instance the work of Joel Rees.
** Copyright 2021 Joel Matthew Rees. 
** Permission granted to modify, compile, and run
** for personal and educational use.
*/


#include <stdio.h>
#include <stdlib.h>


/* Implementing our very own puts():
*/
void my_puts( char * string )
{
  int i;

  for ( i = 0; string[ i ] != '\0'; ++i )
  {
    putchar( string[ i ] );
  }
  putchar( '\n' );
}

char greeting1[ 32 ] = "Sekai yo! Konnichiwa!";

/* This could be done on a single line.
-- I'm doing it this way for effect.
*/
char greeting2[ 16 ] =
{
  'H',
  'o',
  'l',
  'a',
  ' ',
  'M',
  'u',
  'n',
  'd',
  'o',
  '!',
  '\0', /* We only need one terminator, 
  ** but even this isn't really necessary here
  ** because we specify a large enough size.
  */
}; 

char * greeting3 = "Bonjour le monde!"; 

char greeting4[] = "Hallo Welt!";

int main( int argc, char *argv[] )
{
  my_puts( "Hello World!"); /* Sure -- my_puts() works as well as puts(). */
  my_puts( greeting1 );
  my_puts( greeting2 );
  my_puts( greeting3 );
  my_puts( greeting4 );
  return EXIT_SUCCESS;
}

For pre-ANSI C compilers, change main() as described above, and change my_puts() as follows:

void my_puts( string )
char * string;
{
  /* Code that goes here is the same. */
}

A few things I didn't mention before --

Double quotes are for strings. 

(Well, they're for telling C to automatically collect the text into a char array, and to terminate the array with a NUL if the size isn't specified as just big enough for the text without the NUL. Try that yourself and see what happens, by the way. Something like

char fluke[ 15 ] = "This is a test.";

What did it do when you tried to puts( fluke )?)
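If the result puzzled you, remember that puts() keeps going until it finds a zero byte, wherever that happens to be. Here is a minimal sketch for checking whether an array actually contains a terminator, and for printing it without running off the end. The names has_terminator(), put_bounded(), and safe are mine, for illustration only.

```c
#include <stdio.h>
#include <string.h>

char fluke[ 15 ] = "This is a test."; /* Fifteen characters, no room for a NUL. */
char safe[] = "This is a test.";      /* The compiler counts, and adds the NUL. */

/* Hypothetical helper: returns 1 if there is a NUL somewhere
** in the first size bytes of array, 0 if there is not.
*/
int has_terminator( char * array, size_t size )
{
  return memchr( array, '\0', size ) != NULL;
}

/* Hypothetical helper: print at most size bytes of array, then a newline.
** The precision in "%.*s" tells printf() where to stop.
*/
void put_bounded( char * array, size_t size )
{
  printf( "%.*s\n", (int) size, array );
}
```

has_terminator( fluke, sizeof fluke ) comes back 0, and has_terminator( safe, sizeof safe ) comes back 1. put_bounded( fluke, sizeof fluke ) prints the fifteen characters and stops, whether or not a NUL follows them in memory.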

The single quotes we used in several places are for individual characters, not for strings. 

(In some compilers, you can actually pack multiple characters in between the single quotes, but I'm not going to try to confuse you by telling you that. Okay? You didn't hear me say that. Okay? Good. You don't want to know what that does in relation to byte order, in particular. ;))

So, 'A' is a single capital A. 

You saw that '\0' is shorthand for a NUL byte, eight bits of zero. Well, zero in eight bits. (No matter how wide it is, zero is still zero, in any byte order, thank heavens.) 

And you saw, in our third version, that '\n' is shorthand for a newline character.

Oh, and "void" is (among other things) for telling the compiler that a particular function doesn't have a return value. Also, control returns from a function at the trailing brace if there is no explicit return.

And you might have noticed, in our my_puts(), that you can use a character pointer variable as if it were the name of an array in many cases. (There are other ways to write my_puts(), but we won't go there quite yet, either.)
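To make that concrete without changing how my_puts() works, here is a sketch of one more function that indexes its pointer parameter the same way. The name my_length() is hypothetical. It accepts both the pointer variable greeting3 and the array greeting4 without complaint.

```c
#include <stddef.h>

char * greeting3 = "Bonjour le monde!";
char greeting4[] = "Hallo Welt!";

/* Hypothetical helper: count the characters before the NUL terminator,
** using array-style indexing on a pointer parameter, as my_puts() does.
*/
size_t my_length( char * string )
{
  size_t i = 0;

  while ( string[ i ] != '\0' )
  {
    ++i;
  }
  return i;
}
```

my_length( greeting3 ) is 17 and my_length( greeting4 ) is 11. Inside the function there is no way to tell which kind of name the caller used.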

I have other things to do tonight, so that's it for now.

It's your turn to think of things you can do with this. Explore. Get results you don't expect. Get a copy of Kernighan & Ritchie's The C Programming Language if you don't already have one, or look it up on the web and figure out why.

(I may or may not write a follow-up to this sometime soon.)

The next step in this tutorial is ready, now. We'll tell the computer whom to greet.

 [TOC]