Sunday, November 14, 2021

Getting Free/Libre Software -- The GIMP and LibreOffice as Examples

(This is something of a continuation of this post: https://defining-computers.blogspot.com/2019/05/analyzing-mechanized-trust.html.)

So you have friends who recommend Libre/Free/Open Source software like the GIMP or LibreOffice, but you aren't sure how to get it. So you type "How do I get GIMP" or "download gimp" into your web browser's search box and hit return, and it shows you a bunch of different advice.

In the case of the GIMP, gimp.org shows up pretty high in the results page. Likewise LibreOffice and libreoffice.org. And your friends and I say these are the official sites, so you have a pretty high level of confidence that you know what to do.

But if you look down the list a bit you see download.cnet.com. Aren't they bigger? Shouldn't you trust them more? Or you see something like imagecustomize.com, and for some reason you think that name makes more sense than "the GIMP".

Most of the major libre projects will have URLs based on their project name, and, with libre software, the project's own website is where you want to go to get it. Otherwise, you risk getting something that has been tampered with and compromised and could easily compromise your computer.

Okay, you know that. So you go to the project sites and find the download pages and click the download link or button and your browser kicks up a fit:

***** DANGER DANGER!!! NOT MICROSOFT APPROVED!!! *****

Of course not, you chuckle.

Wait. Yes, of course not. But hold off a moment. Even so, there may be some meaning to that. 

Mouse around the website a bit. Find the instructions for installing and using the software. Find the mailing list for user support. Read a few posts. Get a feel for the community. See if you feel comfortable about the attitude of the developers and the community.

If you're planning on using software, whether you buy the "traditional" kind or download the libre kind, you want to be sure you can ask questions and get answers.

(In truth, the traditional way of "Pay me and trust me!" makes no more sense to me than the libre way of "Use my stuff and if it helps you donate to the project so we can keep making it." If anything, it seems like the libre way should be the usual way, not the pay-and-trust way.)

Anyway, spend a few hours or days or more getting used to the community. Not only does it help you decide whether you trust the community enough to download and use their software, it also helps you recognize the community, in case you get mail claiming to be from the community but telling you strange things.

So, the time to download and install has come, but you might download the software and try to install it, and find your anti-virus/anti-malware program blocking you at some inconvenient point. 

***** DANGER DANGER WARNING WARNING *****
***** MIGHT DO STRANGE THINGS TO YOUR COMPUTER!!! *****

Had something like this come up on the mailing list for the GIMP just now: 

> Report from Trend Micro antivirus during download:
>
> Time: 11/11/2021 16:36
> File: gimp-2.10.28-setup.exe-part
> Threat: TSPY.Win32.TRX.XXPE50FSX016E0002
> Action: Quarantined

Putting quotes around that and doing a web search yields no results. If Trend Micro has a specific record for this, they haven't published it, or it's really new.

Best thing to do is contact them and see if they are willing to share what they have, but this report is similar to one in the issue tracker for the Go language:

https://github.com/golang/go/issues/45191

You'll note the phrase "out of an abundance of caution". I'm afraid I would use other words, such as "laziness", but I don't know if that would or should put your mind at ease. You probably don't know me from Adam.

So, I finally get to the purpose of this rant. I'll tell you what I go through when I prepare to install free/libre software. Read through it once before you try following along:

(1) Do I trust the developer(s)?

I've talked around this above. If I don't trust them, I don't even try to download their software. I go looking for another alternative.

In the case of the GIMP and LibreOffice, I've been watching the community and using the software long enough to trust the developers enough to install it on a new computer if I think it hasn't been tampered with.

(2) So it comes down to detecting tampering.

(2a) Is the download available on HTTPS servers?

The URLs for the website for downloading the GIMP start with HTTPS, starting from here:

https://www.gimp.org/downloads/

If you want it for Mac or MSWindows and it doesn't automagically show you the one for your OS, there's a link or button you can click on to get there.

If you want it for Ubuntu or Debian or some other Linux, or for one of the BSDs, you can probably get it from the packages for your OS, and you won't need these instructions. (There are other instructions for less well-known software and less well-known OSes, but this isn't the place.)

HTTPS (as opposed to unencrypted HTTP) gives a fairly high degree of confidence that the owners and operators of the web site are who they say they are, and that what you download makes it to your computer safely. For many people it's enough. For me, it helps.

In addition, if you have a torrent client, they may provide a link for torrent download. A torrent client checks each piece it receives against hashes recorded in the .torrent file, so a torrent download is a bit more resistant to corruption and tampering in transit than a simple download, for what it's worth (provided you got the .torrent file from the project's own site). It's also supposed to use spare bandwidth instead of prime, and so be friendlier to the Internet.

(2b) Do they make checksums available?

The GIMP makes checksums available, publishing the checksum for the MSWindows download on the download page underneath the download buttons. If you've wandered around enough, you've probably seen it. But you may not have recognized it.

As I write this, the current SHA256 checksum for the GIMP is

2c2e081ce541682be1abdd8bc6df13768ad9482d68000b4a7a60c764d6cec74e

You can use the certutil.exe utility in MSWindows to check that from a Command Prompt or PowerShell window. The command is

certutil -hashfile filename SHA256

Substitute the name of the file for "filename", of course. In the case of the present version of the GIMP, it's "gimp-2.10.28-setup.exe". 

Also, make sure you are in the download directory before you issue the command.
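If you happen to have Python on the machine, the same check can be sketched in a few lines. This is only an illustration, not something the GIMP project provides; the function names sha256_of_file and matches_published are my own inventions. I strip spaces and ignore case when comparing because, as far as I know, some versions of certutil print the digest with spaces between the bytes.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the lowercase hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a large installer does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's digest with a published checksum.

    Whitespace and letter case in the published value are ignored
    (hypothetical helper, for illustration only).
    """
    cleaned = "".join(published_hex.split()).lower()
    return sha256_of_file(path) == cleaned
```

You would call matches_published("gimp-2.10.28-setup.exe", "...") with the checksum copied from the download page in place of the dots, and expect True if the download is intact.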

Below the checksum, the GIMP site gives a link to VirusTotal, which you can use to check whether vendors are blacklisting this particular checksum. Some other projects also do.

But if you do that, copy the entire checksum and use your search engine to go direct to VirusTotal and paste the checksum in. That way, in the very slight chance that you are seeing a spoof of the GIMP's website, you can avoid the possibility of jumping to a spoofed VirusTotal, as well. 

I'm not sure how useful that information will be, but some may find it useful.

Speaking of man-in-the-middle, if you're worried that, in spite of the site using HTTPS, the download page is being spoofed by a man-in-the-middle and the checksum is faked, there is a way to get some confidence that is not the case.

Look around the download page or the installation instruction pages for the link to the mirrors. Look through the list of mirrors and pick one at random.

I happen to be familiar with the XMission mirror in the US, so I'll use that as an example. Search the web for XMission and note the URL. Open the site and copy the domain name:

https://xmission.com

Use right-click to copy (don't jump to) the link in the mirrors list:

https://mirrors.xmission.com/

Note that the xmission.com domain name is the same. Now you can paste the mirroring subdomain name into the URL blank of your browser and go to their libre downloads section and be pretty sure you're safe from everything but guess-ahead DNS poisoning.
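The eyeball comparison of domain names can also be written down as a rule. Here is a minimal Python sketch of it, assuming you have the copied link as text; host_belongs_to is a hypothetical name of mine. It accepts the domain itself or any subdomain of it, and rejects look-alikes that merely contain the trusted name.

```python
from urllib.parse import urlsplit


def host_belongs_to(url, trusted_domain):
    """True if the URL's host is trusted_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    trusted = trusted_domain.lower().lstrip(".")
    # Require an exact match or a match on a dot boundary, so that
    # "notxmission.com" and "xmission.com.example" are both rejected.
    return host == trusted or host.endswith("." + trusted)


print(host_belongs_to("https://mirrors.xmission.com/", "xmission.com"))   # True
print(host_belongs_to("https://xmission.com.example/", "xmission.com"))   # False
```

This only checks the spelling of the link. It does nothing about DNS poisoning, which is a separate problem.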

I'm deliberately refraining from making those links active. If you are worried about man-in-the-middle attacks, you should not trust active links on my blog. I might make them appear to take you to the right place but take you someplace else, instead.

Drill down into the gimp section, into the gimp section of that, into the current version (2.10) and into the windows section of that. For the XMission mirror, the URL you end up at is (currently)

https://mirrors.xmission.com/gimp/gimp/v2.10/windows/

From there, you can download the SHA256SUMS file, save it on your computer, open it with a text editor, and look at the line for the version you downloaded. If the default text editor (probably Notepad) shows the lines without breaks, try the other one (WordPad). 

In the case of the GIMP v. 2.10.28, the line you're looking for would be the line for gimp-2.10.28-setup.exe (currently the last line).

Copy the checksum from the web page and paste it below that line like this:

2c2e081ce541682be1abdd8bc6df13768ad9482d68000b4a7a60c764d6cec74e gimp-2.10.28-setup.exe
2c2e081ce541682be1abdd8bc6df13768ad9482d68000b4a7a60c764d6cec74e

and you can visually check that the checksums are the same.

If you need even more assurance, try one or two more mirrors, and you have two or three witnesses that the checksum is the one the project publishes.
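If pasting lines under each other in a text editor gets tedious, the comparison can be sketched in Python. This assumes you have already saved the text of each mirror's SHA256SUMS file yourself, and that each line is a checksum, some whitespace, and a file name (sometimes with a leading asterisk). The names find_checksum and witnesses_agree are hypothetical.

```python
def find_checksum(sums_text, filename):
    """Return the checksum listed for filename in SHA256SUMS text, or None."""
    for line in sums_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        digest, name = parts
        # Some tools mark binary files with a leading asterisk.
        if name.strip().lstrip("*") == filename:
            return digest.lower()
    return None


def witnesses_agree(published_hex, sums_texts, filename):
    """True if every mirror's SHA256SUMS lists the published checksum."""
    expected = published_hex.strip().lower()
    found = [find_checksum(text, filename) for text in sums_texts]
    return len(found) > 0 and all(d == expected for d in found)
```

Each entry in sums_texts is one witness. A mirror that lacks the file, or lists a different checksum, makes witnesses_agree return False, which is your cue to stop and ask on the mailing list.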

To recap, what I've walked you through is a way to get more than one witness that you got what the GIMP project put up there for you, which, if you trust the project, should be enough to trust the download, even if some random security vendor is too lazy to be sure that it isn't giving false positives on free/libre software.

You can use the same sort of process with LibreOffice and many other Free/Libre/Open Source projects, including Debian OS and Ubuntu OS (although Ubuntu's website and mirrors sometimes make the checksums hard to find).

In LibreOffice's case, below the download link you'll find a link for the torrent, and beside that, a link for info. Clicking on the info link will show you the checksums and list of mirrors. Oh -- A hash is a checksum, so look for the hashes. 

For the mirrors for LibreOffice, I go down the list of mirrors and pick KDDIlabs since I'm sort of familiar with them, and right-click copy the link, then paste it into an open text editor window. I grab the top part of it, https://ftp-srv2.kddilabs.jp/, and copy-paste it to a browser window's URL blank. And I can drill down from there into office, tdf, libreoffice, stable, 7.2.2, win, x86_64, and, oh dear. No checksum files. 

-----

LibreOffice developers are not believers in the many witnesses approach. I'll need to have gnupg and find the signature keys, instead. Some people think that is a better approach, but it does leave you with a chicken-and-egg conundrum. I'll have to finish this part of this post later; I've just run out of time this weekend.

The safest thing to do in this case is, if you have a friend who has a Linux or BSD OS running, have your friend install gnupg on her system, set it up to recognize the keys for libreoffice, download the libreoffice installer, and check the signatures, and then put it on a USB drive for you. Or, you could back up a bit, and have her do the same for the MSWindows version of gnupg, which you could install and set up on your computer, and then you're set to check fingerprints instead of or along with checksums.

-----

As I mentioned above, if you are using Debian or Cygwin, et al., the OS project has packages for all the major libre programs, and you can use the OS's package manager directly, and the package manager handles all the checksum checking for you. Which means you only need to do the above running around the web for multiple witnesses once, when you first download the OS.

(Some OSes make upgrading to a new version of the OS as straightforward as getting a package, but some do not. In the latter case, you may end up looking around for multiple witnesses on the checksum on every OS upgrade, but that isn't very often.)

Some projects (last time I installed Cygwin) do not make checksums available on their mirror sites. I'm not sure why; it may have something to do with the rate of updates and the way they handle download and updates. Cygwin, in particular, is more like the OSses, with its own package manager.

Now that I've walked you through the process, you have no excuse not to try Libre software, right? 

Heh. I know, it's scary. But reading through this again in a few weeks or months should help.


Wednesday, November 3, 2021

Gnome Text Editor gedit, and Regular Expressions

Gedit is the Gnome project's default text editor.

Somewhere over the last twenty years, it became a victim of malicious simplification, and it's hard to get information on advanced features. (I'm still trying to remember how to enable the included "advanced features" with the new UI.)

Since it's hard to find information on gedit's regular expressions, I'm taking notes here:

  • \s matches newline, as well as other whitespace. Matches across end-of-line.
    \S seems to invert that.
  • ^ is line beginning,
    $ is line-end, but line-end can be hidden by \s .
  • \h matches non-newline whitespace, and
    \H inverts that.
  • () Parenthetic groups work, and \1 through \9 refer to the first nine groups in the replacement pattern.
  • | Alternate patterns work, separated by | .
  • [] Brackets collect arbitrary single matches:
    \d is the same as [0-9] .
Repeats:
  • * is 0 or more
  • + is one or more

Thus 

\h+ 

is one or more non-newline whitespace characters.
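For anyone who wants to poke at the difference outside of gedit, here is a small Python sketch. Python's re module does not have \h, so I am standing in for it with the character class [^\S\r\n], which is my own approximation (whitespace that is not a carriage return or line feed) and not exactly what gedit uses. The sample text is made up.

```python
import re

# Approximation of \h for Python's re module (an assumption, see above).
HSPACE = r"[^\S\r\n]"

text = "LDA  A\n  STA B"

# \s+ happily runs across the end of the first line.
print(re.findall(r"\s+", text))       # ['  ', '\n  ', ' ']

# The \h stand-in stops at the newline.
print(re.findall(HSPACE + "+", text))  # ['  ', '  ', ' ']
```

The second result is why \h is the safer choice when a pattern is supposed to stay on one line.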

I'll add notes as it seems appropriate.

Some examples:

  • Insert a semicolon comment character where it was left out of EQU statements:
    match: (\h+EQU)(\h+)(\S+)(\h+)(\S)
    replace: \1\2\3\4; \5
  • Replace LDA A style assembler lines with LDAA, inserting semicolon before comments as you go (but missing lines without comments):
    match: (\h+)(LDA)\h+([AB])(\h+)(\S+)(\h+)(\S)
    replace: \1\2\3\4\5\6; \7
  • Insert semicolon comment characters in LDD lines with comments:
    match: (\h+)(LD|ST)D(\h+)(\S+)(\h+)(\S)
    replace: \1\2D\3\4\5; \6
  • Insert semicolon comment characters in branch (Bcc) lines:
    match: (\h+B\S\S)(\h+)(\S+)(\h+)(\S)
    replace: \1\2\3\4; \5
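To show what the first example (the EQU one) does, here is a sketch of it in Python rather than gedit. Python's re module lacks \h, so the class [^\S\r\n] stands in for it; that substitution is my assumption, and the sample lines are made up for illustration.

```python
import re

# Stand-in for \h, which Python's re module does not support (assumption).
H = r"[^\S\r\n]"

# Same shape as the gedit pattern: EQU, value, then the first
# character of whatever follows the value on the same line.
equ_comment = re.compile(rf"({H}+EQU)({H}+)(\S+)({H}+)(\S)")

with_comment = "_INTWIDTH EQU 2 native integer width"
without_comment = "_ADRWIDTH EQU 2"

print(equ_comment.sub(r"\1\2\3\4; \5", with_comment))
# _INTWIDTH EQU 2 ; native integer width

print(equ_comment.sub(r"\1\2\3\4; \5", without_comment))
# _ADRWIDTH EQU 2
```

Lines without a trailing comment are left alone, because the pattern insists on something after the value. Note that a line that already has its semicolon would get a second one, so run this sort of replacement only once.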


Sunday, August 22, 2021

Hand Compiling C Code to 6809 -- Mixed Output

Source is an extract from calc.c in rpilot from github:

https://github.com/TryItOnline/pilot-rpilot

(which appears to be a fork of rpilot on sourceforge:)

https://sourceforge.net/projects/rpilot/files/rpilot/

I chose the calc() function in calc.c as an exercise to test how difficult the entire project would be. Just this much took me more than eight hours. (Admittedly, it did not have my full attention, and, at sixty, I'm probably not as quick at this as I was at twenty-one.) 

I initially started working on a compile to 6800 code, because that is a more interesting (meaning, harder) target, but I spent an awful lot of time working out glue logic, and the result was tending to look like a list of subroutine calls (which is generally a good indication to write in Forth instead of assembler).

The purpose in this sort of exercise is to prepare for writing a compiler.

Yeah, it's a little hard to read. The C source is in the comment lines. 


*** An exercise in semi-blind hand-compiling C to 6809, 
*** from the rpilot source code.
*** Joel Rees, Amagasaki, Japan, August 2021

*/*
 * calc.c - handle simple mathematical expressions
 * rob - started july.25.2000
 * 
 * updates:
 *   - got around to finishing it - aug.11.2000
 *   - RPilot special code - aug.11.2000
 */

*--------------------

* An include file somewhere would have

* Native integer width for C programs:
_INTWIDTH EQU 2
_ADRWIDTH EQU 2

*--------------------

* #include "rpilot.h"
* #include "rstring.h"
* #include "calc.h"
  INCLUDE calc.h.6809.asm
* #include "var.h"

* #include <string.h>
* #include <stdio.h>
* #include <ctype.h>
* #include <stdlib.h>

* int next_num( char *str, int *pos, int *status );
* char next_tok( char *str, int *pos );
* int find_match( char *str, int pos, char what, char match );
* int read_num( char *str, int *pos );
* int read_var( char *str, int *pos );

* int calc( char *expr, int *status )
expr SET 0
status SET expr+_ADRWIDTH
_PSZ SET status+_ADRWIDTH
calc 
* {
*   int pos = 0;
pos SET 0
*   int num = 0, result = 0;
num SET pos+_INTWIDTH
result SET num+_INTWIDTH
*   char op = 0; 
op SET result+_INTWIDTH
_LOCSZ SET op+1
  LDD #0
  PSHU D
  PSHU D
  PSHU D
  PSHU B

*   total_trim( expr );
  LDX expr+_LOCSZ,U
  PSHU X
  LBSR total_trim
  LEAU _ADRWIDTH,U	; returned pointer unnecessary, balance stack

*   result = next_num( expr, &pos, status );
  LDY status+_LOCSZ,U
  LEAX pos,U
  LDD expr+_LOCSZ,U
  PSHU D,X,Y
  LBSR next_num
  PULU D
  STD result,U

*   while( pos < strlen(expr) ) {
calc_lp00
  LDX expr+_LOCSZ,U
  PSHU X
  LBSR strlen
  LDD pos+_INTWIDTH,U
  CMPD ,U++
  LBGE calc_lx00

*     op = next_tok( expr, &pos );
  LEAX pos,U
  LDD expr+_LOCSZ,U
  PSHU X,Y
  LBSR next_tok
  PULU B
  STB op,U
*     num = next_num( expr, &pos, status );
  LDY status+_LOCSZ,U
  LEAX pos,U
  LDD expr+_LOCSZ,U
  PSHU D,X,Y
  LBSR next_num
  PULU D
  STD num,U

* Directly indexing a jump table from op takes too much memory
*   from 6809 address range.
* Not enough cases to warrant searching in a loop.
*     switch( op ) {
  LDB op,U
* Use value-optimized binary test tree:
* (We don't really need it to be this fast.) 
  CMPB #'+'
  BEQ calc_if00_c01
  BLO calc_if00_t00
  CMPB #'^'
  BEQ calc_if00_c08
  BLO calc_if00_t01
  CMPB #'|'
  BEQ calc_if00_c07
  BRA calc_if00_else
calc_if00_t01
  CMPB #'-'
  BEQ calc_if00_c02
  CMPB #'/'
  BEQ calc_if00_c03
  BRA calc_if00_else
calc_if00_t00
  CMPB #'*'
  BEQ calc_if00_c04
  CMPB #'&'
  BEQ calc_if00_c06
  CMPB #'%'
  BEQ calc_if00_c05
*     case 0 :  // invalid operand
* Code duplicate of default
calc_if00_else
calc_if00_c00
*       *status = CALC_NO_OP;
  LDD #CALC_NO_OP
  STD [status+_LOCSZ,U]
*       return 0;
  LDD #0
  PSHU D
  RTS
*       break;

*     case '+' :
calc_if00_c01
*       result += num;
  LDD result,U
  ADDD num,U
*       break;
  BRA calc_if00_end
*     case '-' :
calc_if00_c02
*       result -= num;
  LDD result,U
  SUBD num,U
*       break;
  BRA calc_if00_end
*     case '/' :
calc_if00_c03
*       result /= num;
  LDX result,U
  LDD num,U
  PSHU D,X
  LBSR IDIV
  PULU D
*       break;
  BRA calc_if00_end
*     case '*' :
calc_if00_c04
*       result *= num;
  LDX result,U
  LDD num,U
  PSHU D,X
  LBSR IMUL
  PULU D
*       break;
  BRA calc_if00_end
*     case '%' :
calc_if00_c05
*       result %= num;
  LDX result,U
  LDD num,U
  PSHU D,X
  LBSR IMOD
  PULU D
*       break;
  BRA calc_if00_end
*     case '&' : 
calc_if00_c06
*       result &= num;
  LDD result,U
  ANDB num+1,U 
  ANDA num,U
*       break;
  BRA calc_if00_end
*     case '|' :
calc_if00_c07
*       result |= num;
  LDD result,U
  ORB num+1,U 
  ORA num,U
*       break;
  BRA calc_if00_end
*     case '^' :
calc_if00_c08
*       result ^= num;
  LDD result,U
  EORB num+1,U 
  EORA num,U
*       break;
  BRA calc_if00_end
*     Code duplicate of case 0
*     default:
* calc_if00_else
*       *status = CALC_BAD_OP;
*       return 0;
*       break;
*     }
calc_if00_end
  STD result,U
  LBRA calc_lp00
*   }
calc_lx00
*   *status = CALC_SUCCESS;
  LDD #CALC_SUCCESS
  STD [status+_LOCSZ,U]
*   return result;
  LDD result,U
  PSHU D
  RTS
* }

* Leave it possible to assemble, if not run:
IMUL
IDIV
IMOD
strlen
next_num
find_match
read_num
next_tok
read_var EQU *
total_trim
  END
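
The offsets from U are the easiest part of this to get wrong, so here is the arithmetic from the SET lines redone as a small Python sketch. It only mirrors the symbols in the source above; param_offset is a hypothetical helper name of mine, and the sketch assumes, as the code does, that the locals are pushed on the U stack below the parameters.

```python
INTWIDTH = 2   # _INTWIDTH
ADRWIDTH = 2   # _ADRWIDTH

# Parameters, as offsets from U on entry to calc.
expr = 0
status = expr + ADRWIDTH

# Locals, as offsets from U after the four PSHU instructions.
pos = 0
num = pos + INTWIDTH
result = num + INTWIDTH
op = result + INTWIDTH
LOCSZ = op + 1          # total bytes of locals


def param_offset(param):
    """Offset of a parameter from U once the locals have been pushed."""
    return param + LOCSZ


print(LOCSZ)                 # 7
print(param_offset(expr))    # 7
print(param_offset(status))  # 9
```

The 7 and 9 appear to agree with the 47 and 49 postbytes for expr+_LOCSZ,U and status+_LOCSZ,U in the assembler output.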

Here's the assembler output, for reference:


                      (  calc.c.6809.asm):00001         *** An exercise in semi-blind hand-compiling C to 6809, 
                      (  calc.c.6809.asm):00002         *** from the rpilot source code.
                      (  calc.c.6809.asm):00003         *** Joel Rees, Amagasaki, Japan, August 2021
                      (  calc.c.6809.asm):00004         
                      (  calc.c.6809.asm):00005         */*
                      (  calc.c.6809.asm):00006          * calc.c - handle simple mathematical expressions
                      (  calc.c.6809.asm):00007          * rob - started july.25.2000
                      (  calc.c.6809.asm):00008          * 
                      (  calc.c.6809.asm):00009          * updates:
                      (  calc.c.6809.asm):00010          *   - got around to finishing it - aug.11.2000
                      (  calc.c.6809.asm):00011          *   - RPilot special code - aug.11.2000
                      (  calc.c.6809.asm):00012          */
                      (  calc.c.6809.asm):00013         
                      (  calc.c.6809.asm):00014         *--------------------
                      (  calc.c.6809.asm):00015         
                      (  calc.c.6809.asm):00016         * An include file somewhere would have
                      (  calc.c.6809.asm):00017         
                      (  calc.c.6809.asm):00018         * Native integer width for C programs:
     0002             (  calc.c.6809.asm):00019         _INTWIDTH EQU 2
     0002             (  calc.c.6809.asm):00020         _ADRWIDTH EQU 2
                      (  calc.c.6809.asm):00021         
                      (  calc.c.6809.asm):00022         *--------------------
                      (  calc.c.6809.asm):00023         
                      (  calc.c.6809.asm):00024         * #include "rpilot.h"
                      (  calc.c.6809.asm):00025         * #include "rstring.h"
                      (  calc.c.6809.asm):00026         * #include "calc.h"
                      (  calc.c.6809.asm):00027           INCLUDE calc.h.6809.asm
                      (  calc.h.6809.asm):00001         */*
                      (  calc.h.6809.asm):00002          * calc.h - header file for the calc package
                      (  calc.h.6809.asm):00003          * rob linwood (rcl211@nyu.edu)
                      (  calc.h.6809.asm):00004          * see README for more information
                      (  calc.h.6809.asm):00005          */
                      (  calc.h.6809.asm):00006         
                      (  calc.h.6809.asm):00007         * #ifndef _calc_h_
                      (  calc.h.6809.asm):00008         * #define _calc_h_
                      (  calc.h.6809.asm):00009         
                      (  calc.h.6809.asm):00010         * #define CALC_SUCCESS 0      /* Indicates success */
     0000             (  calc.h.6809.asm):00011         CALC_SUCCESS EQU 0
                      (  calc.h.6809.asm):00012         * #define CALC_NO_OP 1        /* No mathematical operator in expression */
     0001             (  calc.h.6809.asm):00013         CALC_NO_OP EQU 1
                      (  calc.h.6809.asm):00014         * #define CALC_BAD_OP 2       /* Unknown mathematical operator in expression */
     0002             (  calc.h.6809.asm):00015         CALC_BAD_OP EQU 2
                      (  calc.h.6809.asm):00016         * int calc( char *expr, int *status );
                      (  calc.h.6809.asm):00017         
                      (  calc.h.6809.asm):00018         * #endif
                      (  calc.c.6809.asm):00028         * #include "var.h"
                      (  calc.c.6809.asm):00029         
                      (  calc.c.6809.asm):00030         * #include <string.h>
                      (  calc.c.6809.asm):00031         * #include <stdio.h>
                      (  calc.c.6809.asm):00032         * #include <ctype.h>
                      (  calc.c.6809.asm):00033         * #include <stdlib.h>
                      (  calc.c.6809.asm):00034         
                      (  calc.c.6809.asm):00035         * int next_num( char *str, int *pos, int *status );
                      (  calc.c.6809.asm):00036         * char next_tok( char *str, int *pos );
                      (  calc.c.6809.asm):00037         * int find_match( char *str, int pos, char what, char match );
                      (  calc.c.6809.asm):00038         * int read_num( char *str, int *pos );
                      (  calc.c.6809.asm):00039         * int read_var( char *str, int *pos );
                      (  calc.c.6809.asm):00040         
                      (  calc.c.6809.asm):00041         * int calc( char *expr, int *status )
     0000             (  calc.c.6809.asm):00042         expr SET 0
     0002             (  calc.c.6809.asm):00043         status SET expr+_ADRWIDTH
     0004             (  calc.c.6809.asm):00044         _PSZ SET status+_ADRWIDTH
0000                  (  calc.c.6809.asm):00045         calc 
                      (  calc.c.6809.asm):00046         * {
                      (  calc.c.6809.asm):00047         *   int pos = 0;
     0000             (  calc.c.6809.asm):00048         pos SET 0
                      (  calc.c.6809.asm):00049         *   int num = 0, result = 0;
     0002             (  calc.c.6809.asm):00050         num SET pos+_INTWIDTH
     0004             (  calc.c.6809.asm):00051         result SET num+_INTWIDTH
                      (  calc.c.6809.asm):00052         *   char op = 0; 
     0006             (  calc.c.6809.asm):00053         op SET result+_INTWIDTH
     0007             (  calc.c.6809.asm):00054         _LOCSZ SET op+1
0000 CC0000           (  calc.c.6809.asm):00055           LDD #0
0003 3606             (  calc.c.6809.asm):00056           PSHU D
0005 3606             (  calc.c.6809.asm):00057           PSHU D
0007 3606             (  calc.c.6809.asm):00058           PSHU D
0009 3604             (  calc.c.6809.asm):00059           PSHU B
                      (  calc.c.6809.asm):00060         
                      (  calc.c.6809.asm):00061         *   total_trim( expr );
000B AE47             (  calc.c.6809.asm):00062           LDX expr+_LOCSZ,U
000D 3610             (  calc.c.6809.asm):00063           PSHU X
000F 1700D0           (  calc.c.6809.asm):00064           LBSR total_trim
0012 3342             (  calc.c.6809.asm):00065           LEAU _ADRWIDTH,U      ; returned pointer unnecessary, balance stack
                      (  calc.c.6809.asm):00066         
                      (  calc.c.6809.asm):00067         *   result = next_num( expr, &pos, status );
0014 10AE49           (  calc.c.6809.asm):00068           LDY status+_LOCSZ,U
0017 30C4             (  calc.c.6809.asm):00069           LEAX pos,U
0019 EC47             (  calc.c.6809.asm):00070           LDD expr+_LOCSZ,U
001B 3636             (  calc.c.6809.asm):00071           PSHU D,X,Y
001D 1700C2           (  calc.c.6809.asm):00072           LBSR next_num
0020 3706             (  calc.c.6809.asm):00073           PULU D
0022 ED44             (  calc.c.6809.asm):00074           STD result,U
                      (  calc.c.6809.asm):00075         
                      (  calc.c.6809.asm):00076         *   while( pos < strlen(expr) ) {
0024                  (  calc.c.6809.asm):00077         calc_lp00
0024 AE47             (  calc.c.6809.asm):00078           LDX expr+_LOCSZ,U
0026 3610             (  calc.c.6809.asm):00079           PSHU X
0028 1700B7           (  calc.c.6809.asm):00080           LBSR strlen
002B EC42             (  calc.c.6809.asm):00081           LDD pos+_INTWIDTH,U
002D 10A3C1           (  calc.c.6809.asm):00082           CMPD ,U++
0030 102C00A3         (  calc.c.6809.asm):00083           LBGE calc_lx00
                      (  calc.c.6809.asm):00084         
                      (  calc.c.6809.asm):00085         *     op = next_tok( expr, &pos );
0034 30C4             (  calc.c.6809.asm):00086           LEAX pos,U
0036 EC47             (  calc.c.6809.asm):00087           LDD expr+_LOCSZ,U
0038 3630             (  calc.c.6809.asm):00088           PSHU X,Y
003A 1700A5           (  calc.c.6809.asm):00089           LBSR next_tok
003D 3704             (  calc.c.6809.asm):00090           PULU B
003F E746             (  calc.c.6809.asm):00091           STB op,U
                      (  calc.c.6809.asm):00092         *     num = next_num( expr, &pos, status );
0041 10AE49           (  calc.c.6809.asm):00093           LDY status+_LOCSZ,U
0044 30C4             (  calc.c.6809.asm):00094           LEAX pos,U
0046 EC47             (  calc.c.6809.asm):00095           LDD expr+_LOCSZ,U
0048 3636             (  calc.c.6809.asm):00096           PSHU D,X,Y
004A 170095           (  calc.c.6809.asm):00097           LBSR next_num
004D 3706             (  calc.c.6809.asm):00098           PULU D
004F ED42             (  calc.c.6809.asm):00099           STD num,U
                      (  calc.c.6809.asm):00100         
                      (  calc.c.6809.asm):00101         * Directly indexing a jump table from op takes too much memory
                      (  calc.c.6809.asm):00102         *   from 6809 address range.
                      (  calc.c.6809.asm):00103         * Not enough cases to warrant searching in a loop.
                      (  calc.c.6809.asm):00104         *     switch( op ) {
0051 E646             (  calc.c.6809.asm):00105           LDB op,U
                      (  calc.c.6809.asm):00106         * Use value-optimized binary test tree:
                      (  calc.c.6809.asm):00107         * (We don't really need it to be this fast.) 
0053 C12B             (  calc.c.6809.asm):00108           CMPB #'+'
0055 2730             (  calc.c.6809.asm):00109           BEQ calc_if00_c01
0057 2516             (  calc.c.6809.asm):00110           BLO calc_if00_t00
0059 C15E             (  calc.c.6809.asm):00111           CMPB #'^'
005B 276D             (  calc.c.6809.asm):00112           BEQ calc_if00_c08
005D 2506             (  calc.c.6809.asm):00113           BLO calc_if00_t01
005F C17C             (  calc.c.6809.asm):00114           CMPB #'|'
0061 275F             (  calc.c.6809.asm):00115           BEQ calc_if00_c07
0063 2016             (  calc.c.6809.asm):00116           BRA calc_if00_else
0065                  (  calc.c.6809.asm):00117         calc_if00_t01
0065 C12D             (  calc.c.6809.asm):00118           CMPB #'-'
0067 2724             (  calc.c.6809.asm):00119           BEQ calc_if00_c02
0069 C12F             (  calc.c.6809.asm):00120           CMPB #'/'
006B 2726             (  calc.c.6809.asm):00121           BEQ calc_if00_c03
006D 200C             (  calc.c.6809.asm):00122           BRA calc_if00_else
006F                  (  calc.c.6809.asm):00123         calc_if00_t00
006F C12A             (  calc.c.6809.asm):00124           CMPB #'*'
0071 272D             (  calc.c.6809.asm):00125           BEQ calc_if00_c04
0073 C126             (  calc.c.6809.asm):00126           CMPB #'&'
0075 2743             (  calc.c.6809.asm):00127           BEQ calc_if00_c06
0077 C125             (  calc.c.6809.asm):00128           CMPB #'%'
0079 2732             (  calc.c.6809.asm):00129           BEQ calc_if00_c05
                      (  calc.c.6809.asm):00130         *     case 0 :  // invalid operand
                      (  calc.c.6809.asm):00131         * Code duplicate of default
007B                  (  calc.c.6809.asm):00132         calc_if00_else
007B                  (  calc.c.6809.asm):00133         calc_if00_c00
                      (  calc.c.6809.asm):00134         *       *status = CALC_NO_OP;
007B CC0001           (  calc.c.6809.asm):00135           LDD #CALC_NO_OP
007E EDD809           (  calc.c.6809.asm):00136           STD [status+_LOCSZ,U]
                      (  calc.c.6809.asm):00137         *       return 0;
0081 CC0000           (  calc.c.6809.asm):00138           LDD #0
0084 3606             (  calc.c.6809.asm):00139           PSHU D
0086 39               (  calc.c.6809.asm):00140           RTS
                      (  calc.c.6809.asm):00141         *       break;
                      (  calc.c.6809.asm):00142         
                      (  calc.c.6809.asm):00143         *     case '+' :
0087                  (  calc.c.6809.asm):00144         calc_if00_c01
                      (  calc.c.6809.asm):00145         *       result += num;
0087 EC44             (  calc.c.6809.asm):00146           LDD result,U
0089 E342             (  calc.c.6809.asm):00147           ADDD num,U
                      (  calc.c.6809.asm):00148         *       break;
008B 2045             (  calc.c.6809.asm):00149           BRA calc_if00_end
                      (  calc.c.6809.asm):00150         *     case '-' :
008D                  (  calc.c.6809.asm):00151         calc_if00_c02
                      (  calc.c.6809.asm):00152         *       result -= num;
008D EC44             (  calc.c.6809.asm):00153           LDD result,U
008F A342             (  calc.c.6809.asm):00154           SUBD num,U
                      (  calc.c.6809.asm):00155         *       break;
0091 203F             (  calc.c.6809.asm):00156           BRA calc_if00_end
                      (  calc.c.6809.asm):00157         *     case '/' :
0093                  (  calc.c.6809.asm):00158         calc_if00_c03
                      (  calc.c.6809.asm):00159         *       result /= num;
0093 AE44             (  calc.c.6809.asm):00160           LDX result,U
0095 EC42             (  calc.c.6809.asm):00161           LDD num,U
0097 3616             (  calc.c.6809.asm):00162           PSHU D,X
0099 170046           (  calc.c.6809.asm):00163           LBSR IDIV
009C 3706             (  calc.c.6809.asm):00164           PULU D
                      (  calc.c.6809.asm):00165         *       break;
009E 2032             (  calc.c.6809.asm):00166           BRA calc_if00_end
                      (  calc.c.6809.asm):00167         *     case '*' :
00A0                  (  calc.c.6809.asm):00168         calc_if00_c04
                      (  calc.c.6809.asm):00169         *       result *= num;
00A0 AE44             (  calc.c.6809.asm):00170           LDX result,U
00A2 EC42             (  calc.c.6809.asm):00171           LDD num,U
00A4 3616             (  calc.c.6809.asm):00172           PSHU D,X
00A6 170039           (  calc.c.6809.asm):00173           LBSR IMUL
00A9 3706             (  calc.c.6809.asm):00174           PULU D
                      (  calc.c.6809.asm):00175         *       break;
00AB 2025             (  calc.c.6809.asm):00176           BRA calc_if00_end
                      (  calc.c.6809.asm):00177         *     case '%' :
00AD                  (  calc.c.6809.asm):00178         calc_if00_c05
                      (  calc.c.6809.asm):00179         *       result %= num;
00AD AE44             (  calc.c.6809.asm):00180           LDX result,U
00AF EC42             (  calc.c.6809.asm):00181           LDD num,U
00B1 3616             (  calc.c.6809.asm):00182           PSHU D,X
00B3 17002C           (  calc.c.6809.asm):00183           LBSR IMOD
00B6 3706             (  calc.c.6809.asm):00184           PULU D
                      (  calc.c.6809.asm):00185         *       break;
00B8 2018             (  calc.c.6809.asm):00186           BRA calc_if00_end
                      (  calc.c.6809.asm):00187         *     case '&' : 
00BA                  (  calc.c.6809.asm):00188         calc_if00_c06
                      (  calc.c.6809.asm):00189         *       result &= num;
00BA EC44             (  calc.c.6809.asm):00190           LDD result,U
00BC E443             (  calc.c.6809.asm):00191           ANDB num+1,U 
00BE A442             (  calc.c.6809.asm):00192           ANDA num,U
                      (  calc.c.6809.asm):00193         *       break;
00C0 2010             (  calc.c.6809.asm):00194           BRA calc_if00_end
                      (  calc.c.6809.asm):00195         *     case '|' :
00C2                  (  calc.c.6809.asm):00196         calc_if00_c07
                      (  calc.c.6809.asm):00197         *       result |= num;
00C2 EC44             (  calc.c.6809.asm):00198           LDD result,U
00C4 EA43             (  calc.c.6809.asm):00199           ORB num+1,U 
00C6 AA42             (  calc.c.6809.asm):00200           ORA num,U
                      (  calc.c.6809.asm):00201         *       break;
00C8 2008             (  calc.c.6809.asm):00202           BRA calc_if00_end
                      (  calc.c.6809.asm):00203         *     case '^' :
00CA                  (  calc.c.6809.asm):00204         calc_if00_c08
                      (  calc.c.6809.asm):00205         *       result ^= num;
00CA EC44             (  calc.c.6809.asm):00206           LDD result,U
00CC E843             (  calc.c.6809.asm):00207           EORB num+1,U 
00CE A842             (  calc.c.6809.asm):00208           EORA num,U
                      (  calc.c.6809.asm):00209         *       break;
00D0 2000             (  calc.c.6809.asm):00210           BRA calc_if00_end
                      (  calc.c.6809.asm):00211         *     Code duplicate of case 0
                      (  calc.c.6809.asm):00212         *     default:
                      (  calc.c.6809.asm):00213         * calc_if00_else
                      (  calc.c.6809.asm):00214         *       *status = CALC_BAD_OP;
                      (  calc.c.6809.asm):00215         *       return 0;
                      (  calc.c.6809.asm):00216         *       break;
                      (  calc.c.6809.asm):00217         *     }
00D2                  (  calc.c.6809.asm):00218         calc_if00_end
00D2 ED44             (  calc.c.6809.asm):00219           STD result,U
00D4 16FF4D           (  calc.c.6809.asm):00220           LBRA calc_lp00
                      (  calc.c.6809.asm):00221         *   }
00D7                  (  calc.c.6809.asm):00222         calc_lx00
                      (  calc.c.6809.asm):00223         *   *status = CALC_SUCCESS;
00D7 CC0001           (  calc.c.6809.asm):00224           LDD #CALC_NO_OP
00DA EDD809           (  calc.c.6809.asm):00225           STD [status+_LOCSZ,U]
                      (  calc.c.6809.asm):00226         *   return result;
00DD EC44             (  calc.c.6809.asm):00227           LDD result,U
00DF 3606             (  calc.c.6809.asm):00228           PSHU D
00E1 39               (  calc.c.6809.asm):00229           RTS
                      (  calc.c.6809.asm):00230         * }
                      (  calc.c.6809.asm):00231         
                      (  calc.c.6809.asm):00232         * Leave it possible to assemble, if not run:
00E2                  (  calc.c.6809.asm):00233         IMUL
00E2                  (  calc.c.6809.asm):00234         IDIV
00E2                  (  calc.c.6809.asm):00235         IMOD
00E2                  (  calc.c.6809.asm):00236         strlen
00E2                  (  calc.c.6809.asm):00237         next_num
00E2                  (  calc.c.6809.asm):00238         find_match
00E2                  (  calc.c.6809.asm):00239         read_num
00E2                  (  calc.c.6809.asm):00240         next_tok
     00E2             (  calc.c.6809.asm):00241         read_var EQU *
00E2                  (  calc.c.6809.asm):00242         total_trim
                      (  calc.c.6809.asm):00243           END

Saturday, February 6, 2021

A Tools-oriented In-depth Tutorial Introduction to C


by Joel Matthew Rees

Copyright 2021, Joel Matthew Rees

 

I'm supposed to say something here that sets this tutorial apart from other tutorials you can find in various places on the web.

I'm writing it. That definitely makes it different.

Well, the odd choice of projects draws from my own odd interests, but my focus is on building tools I have found useful in my past programming projects, and in using them to expose the corners that most introductions to the language leave dark.

Most of the tutorial projects will compile and run anywhere an ANSI or K&R compiler can be used. I'll note any exceptions.

Part One -- Basics of C

  1. Looking Deeper into Hello World in C -- char Type and Characters and Strings
  2. Personalizing Hello World -- A Greet Command
  3. Personalizing Hello World -- Char Arrays, and Giving the User a Menu
  4. ASCII Table in C -- Char Arrays and Char Pointer Arrays
  5. TBD

Part Two -- TBD



Wednesday, February 3, 2021

ASCII Table in C -- Char Arrays and Char Pointer Arrays


Let's set Hello World! aside for a while. 

In the last project, we needed to look at a part of the ASCII table to explain the code for the menu selections. I happen to have that part of the table mostly memorized, and I know about the ASCII man page in modern *nix OSses, so I just built the code by hand. But it would be nice to have our own ASCII chart, and it would be cool to let the computer build it for us, right?

Let's give it a try.

First, we'll build a simple ASCII table. I don't have the control code mnemonics completely memorized, so we'll go to Wikipedia's ASCII page and the *nix manual pages mentioned above for reference.

man ascii 

(Using the terminal window's copy function, I pasted the manual page contents into an empty gedit text document window, and used gedit's regular expression search-and-replace to extract the parts I wanted. Very convenient, once you get the hang of it.)

With a little bit of work (a very little bit), I came up with the following simple table generator:


/* A simple program to print the ASCII chart, 
** as part of a tutorial introduction to C.
** This instance the work of Joel Rees,
** Whatever is innovative is copyright 2021, Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>


/* reference: 
== *nix man command: man ascii 
== Also, wikipedia: https://en.wikipedia.org/wiki/ASCII
*/
char *ctrl_code[33] =
{
  "NUL", 	/* '\0': null character */
  "SOH", 	/* ----  start of heading */
  "STX", 	/* ----  start of text */
  "ETX", 	/* ----  end of text */
  "EOT", 	/* ----  end of transmission */
  "ENQ", 	/* ----  enquiry */
  "ACK", 	/* ----  acknowledge(ment) */
  "BEL", 	/* '\a': bell */
  "BS", 	/* '\b': backspace */
  "HT", 	/* '\t': horizontal tab */
  "LF", 	/* '\n': line feed / new line */
  "VT", 	/* '\v': vertical tab */
  "FF", 	/* '\f': form feed */
  "CR", 	/* '\r': carriage ret */
  "SO", 	/* ----  shift out */
  "SI", 	/* ----  shift in */
  "DLE", 	/* ----  data link escape */
  "DC1", 	/* ----  device control 1 / XON */
  "DC2", 	/* ----  device control 2 */
  "DC3", 	/* ----  device control 3 / XOFF */
  "DC4", 	/* ----  device control 4 */
  "NAK", 	/* ----  negative ack. */
  "SYN", 	/* ----  synchronous idle */
  "ETB", 	/* ----  end of trans. blk */
  "CAN", 	/* ----  cancel */
  "EM", 	/* ----  end of medium */
  "SUB", 	/* ----  substitute */
  "ESC", 	/* ----  escape */
  "FS", 	/* ----  file separator */
  "GS", 	/* ----  group separator */
  "RS", 	/* ----  record separator */
  "US",  	/* ----  unit separator */ 
  "SPACE"  	/* ----  space */ 
};

char *del_code = 
  "DEL";	/* ----  delete */


int main( int argc, char *argv[] )
{
  int i;

  for ( i = 0; i < 33; ++i )
  {
    printf( "\t%3d 0x%2x: %s\n", i, i, ctrl_code[ i ] );
  }
  for ( i = 33; i < 127; ++i )
  {
    printf( "\t%3d 0x%2x: %c\n", i, i, i ); 
  }
  printf( "\t%3d 0x%2x: %s\n", 127, 127, del_code );
  return EXIT_SUCCESS;
}

(Again, if you're working on a pre-ANSI C compiler, remember to change the main() function declaration to the pre-ANSI K&R style:

int main( argc, argv )
int argc;
char *argv[];
{ ...
}

I think most K&R C compilers should compile it with that change.)

Looking through the code, you probably think you recognize what the ctrl_code[] array is, but look closely. It is not a two-dimensional array of char. It's an array of char *, and the pointers point to anonymous char arrays of varying length. If we'd written it out with explicit C char array strings, it would look something like this:

char ent00[] = "NUL";   /* '\0': null character */
char ent01[] = "SOH";   /* ----  start of heading */
char ent02[] = "STX";   /* ----  start of text */
...
char ent32[] = "SPACE"; /* ----  space */

char *ctrl_code[] =
{
  ent00, ent01, ent02, ... ent32
};

But C takes care of all of this for us, without the nuisance of all the entNN names.

The advantage of this structure is pretty clear, I think. Well, maybe it's clear. 

In many cases, we can save some memory space because each char string does not have to be as long as the longest plus the byte for the trailing NUL. 

More importantly, we don't have to check whether we've accidentally clipped off that trailing NUL. (Right?)  

(Clear as mud? Well, follow along with me anyway. It does clear up.)

The compiler is free to allocate just enough space for the char array and its trailing NUL, and to take care of the petty details for us. 

All we have to do is remember that it's not really a two-dimensional array. It just looks an awful lot like one.

----- Side Note on Memory Usage -----

You may be wondering how much space is actually saved in this particular table by using an array of char * pointers instead of a two-dimensional array of char. It's a good question. Let's calculate it out.

If this array were declared as a two-dimensional array, we'd want the rows long enough to handle the longest mnemonic plus NUL. The longest mnemonic is SPACE, so that's 6 bytes:

char ctrl_code[ 33][ 6 ];

Total space is 33 rows times 6 bytes per row, or 198 bytes.

As we've declared it above, it will be 33 times the size of a pointer plus the individual string lengths. If pointers are sixteen bits (16-bit addresses), that's 66 bytes. If pointers are 32 bits (1990s computers, 32-bit addresses), that's double, or 132 bytes. 

On modern (64-bit address) computers, that's 8 bytes per address, or 264 bytes, just for the pointers themselves.

For the individual string lengths, there are 19 three-byte mnemonics, 13 two-byte mnemonics, and 1 five-byte mnemonic. Adding in the NUL, that's

19 x 4 + 13 x 3 + 6 == 76 + 39 + 6 == 121

For the various address sizes:

  • 16-bit: 66 + 121 == 187 bytes (11 bytes saved)
  • 32-bit: 132 + 121 == 253 bytes
  • 64-bit: 264 + 121 == 385 bytes

del_code could go either way, independent of the ctrl_code array. Declared as a pointer to an anonymous array, it consumes 2 bytes for the pointer (on a 16-bit architecture) and four for the anonymous array. We really don't need the pointer pointing to it, and accessing the array directly would use the same syntax. But sometimes you do things for consistency, and it is not always a bad thing to do so.

So, in this case, we really aren't saving space, unless we're working on a retro 16-bit computer, and even then not much. 

Not having to worry about the trailing NULs is no small benefit, and the extra memory use does not worry us nearly as much on machines where a few hundred bytes are well less than a millionth of the total available memory.

----- End Side Note on Memory Usage -----

The source code itself for this control code table gives us a good table of control codes, for reference, of course. But since it is source code, we can use it to make other tables from it.

Anyway, let's look at the source code. Since the source you copied out is way up off the screen, refer to it from the file where you copied it while you read this.

I include SPACE in it for convenience, even though SPACE really isn't classified as a control code by C's libraries, or by the language itself. That's no problem, is it? -- as long as we both remember that I'm playing a small game with semantics.

DEL is way up at the top of the ASCII range, and I don't have anything to pre-define for the visible character range, so DEL is not in the control code table. It gets its own named, NUL-terminated char array. Again, I just have to remember to print its line out after I've done the rest.

I've declared the traditional counter for for loops, i, and I have one loop that is dedicated to the control codes.

This time, I'm using printf() instead of puts() or my_puts(). One reason is that the previous three projects should have gotten us comfortable with some of the details that you miss when using printf(). Another is that we want numeric and textual formatted output, and we aren't ready to write numeric output routines ourselves, and printf() does numeric output and was designed for formatted output. 

A lot of people read printf() as "print file". I forget and read it that way myself from time to time. It's a habit that's catching. But it's not what printf() means. printf() means "print formatted".

And that's what it does. The first parameter is a format string. The parameters after the first are what we want formatted.

The format string for the first and third printf()s is this:

"\t%3d 0x%2x: %s\n"

Working through the format --

\t is the tab character. It'll give us a little regular space on the left.

%d is decimal (base ten) integer (not real or fractional) numeric output. %3d is three columns wide, right-justified. We print the loop counter out here, because the array of control code mnemonics is arranged so that the code is the same as the index, and we are using the loop counter as the index.

The space character that comes next is significant. We output it as-is, along with the zero and lower-case x which follow.

%x is hexadecimal numeric output, and %2x is two columns wide, right-justified. (This was a little bit of a mistake. I'll show you how to fix it, below.) Then we use this format to output the loop counter again, so we can see the code in hexadecimal.

Then the colon and the following space are output as-is, and %s just outputs the char array passed in as a string of text. We pass in the mnemonic, and the formatted print is done. The output looks like this:

       0 0x 0: NUL
       1 0x 1: SOH
      ...
      13 0x d: CR
      ...
      31 0x1f: US
      32 0x20: SPACE

The second loop has a slightly different format, but the result is adjusted to the first:

"\t%3d 0x%2x: %c\n"

The first two formats are the same. The third is a char format, which outputs the integer given to it as an (ASCII range) character. All three get the loop counter, so we see the character codes in decimal, then in hexadecimal, then the actual character. It looks like this:

      33 0x21: !
      34 0x22: "
      ...
      ...
      47 0x2f: /
      48 0x30: 0
      49 0x31: 1
      ...
      64 0x40: @
      65 0x41: A
      66 0x42: B
      ...
     125 0x7d: }
     126 0x7e: ~

Then the DEL is output with the same format as the other control characters. The loop counter ends at 127 after the visible character range finishes, so we could have used the counter, but we go ahead and pass it the code for DEL as a literal constant.

     127 0x7f: DEL

To demonstrate that we have quite a bit of flexibility in output formats, I've written a bit more involved table generator, and it follows. It gives a few more examples of ways to use the formatted printing. It also gives us a look at the use of struct to organize information:


/* A more involved program to print the ASCII chart, 
** as part of a tutorial introduction to C.
** This instance the work of Joel Rees,
** Whatever is innovative is copyright 2021, Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>


struct ctrl_code_s 
{
  char * mnemonic;
  char * c_esc;
  char * description;
};


/* reference: 
== *nix man command: man ascii 
== Also, wikipedia: https://en.wikipedia.org/wiki/ASCII
*/
struct ctrl_code_s ctrl_code[33] =
{
  {  "NUL",	"'\\0'",	"null character"  },
  {  "SOH",	"----", 	"start of heading"  },
  {  "STX",	"----", 	"start of text"  },
  {  "ETX",	"----", 	"end of text"  },
  {  "EOT",	"----", 	"end of transmission"  },
  {  "ENQ",	"----", 	"enquiry"  },
  {  "ACK",	"----", 	"acknowledge(ment)"  },
  {  "BEL",	"'\\a'",	"bell"  },
  {  "BS",	"'\\b'",	"backspace"  },
  {  "HT",	"'\\t'",	"horizontal tab"  },
  {  "LF",	"'\\n'",	"line feed / new line"  },
  {  "VT",	"'\\v'",	"vertical tab"  },
  {  "FF",	"'\\f'",	"form feed"  },
  {  "CR",	"'\\r'",	"carriage ret"  },
  {  "SO",	"----", 	"shift out"  },
  {  "SI",	"----", 	"shift in"  },
  {  "DLE",	"----", 	"data link escape"  },
  {  "DC1",	"----", 	"device control 1 / XON"  },
  {  "DC2",	"----", 	"device control 2"  },
  {  "DC3",	"----", 	"device control 3 / XOFF"  },
  {  "DC4",	"----", 	"device control 4"  },
  {  "NAK",	"----", 	"negative acknowledgement"  },
  {  "SYN",	"----", 	"synchronous idle"  },
  {  "ETB",	"----", 	"end of transmission block"  },
  {  "CAN",	"----", 	"cancel"  },
  {  "EM",	"----", 	"end of medium"  },
  {  "SUB",	"----", 	"substitute"  },
  {  "ESC",	"----", 	"escape"  },
  {  "FS",	"----", 	"file separator"  },
  {  "GS",	"----", 	"group separator"  },
  {  "RS",	"----", 	"record separator"  },
  {  "US",	"----", 	"unit separator"  }, 
  {  "SPACE",	"----", 	"space"  }
};

struct ctrl_code_s del_code = 
{  "DEL",	"----",	"delete"  };

char ctrl_format[] = "\t%3d 0x%02x: %6s %s %s\n";


int main( int argc, char *argv[] )
{
  int i;

  for ( i = 0; i < 33; ++i )
  {
    printf( ctrl_format, i, i, 
      ctrl_code[ i ].mnemonic, ctrl_code[ i ].c_esc, ctrl_code[ i ].description );
  }
  for ( i = 33; i < 127; ++i )
  {
    printf( "\t%3d 0x%02x: %6c %s %s\n", i, i, i,
      isdigit( i ) ? " DEC" : ( isxdigit( i ) ? " HEX" : "----" ),
      ispunct( i ) ? "punctuation" : ( isalpha( i ) ? "alphabetic" : "numeric" ) ); 
  }
  printf( ctrl_format, 127,127, 
      del_code.mnemonic, del_code.c_esc, del_code.description );
  return EXIT_SUCCESS;
}

(I could have used the first program to output a skeleton source for this one, but I decided to use regular expressions again to extract the various fields, instead.)

The ctrl_code_s struct template has three char * fields -- the mnemonic, the C escape code if there is one, and a more verbal description from the manual page and from Wikipedia. (I extracted the initializations from the source of the first one, again, using the regular expression search-and-replace in gedit.)

The initializations enclose the triplets of anonymous char arrays in curly braces, and we format the source to make it easy to see that the fields that belong together stay together. Again, the order of the elements is such that the index of the array of struct ctrl_code_s is the same as the ASCII code.

Some pre-ANSI compilers may not handle this kind of nested initialization. In that case, you may be able to do the initializations without the inner sets of curly braces. It becomes trickier, because such compilers can't help you be sure that the sets are kept together correctly, but it's worth trying. If it doesn't work, you can give up on the array of struct ctrl_code_s, and use three separate arrays.

(If the compiler doesn't nest initializations and you want to use the array of struct ctrl_code_s anyway, you can set up the three separate initialized arrays and then copy the fields into the uninitialized struct ctrl_code_s. It might be an interesting exercise to do, anyway, to help you get a better handle of what a pointer is and what it points to.) 

Also, some compilers did not support ternary expressions well. If that's the case with your compiler, try the following for the second loop, instead:


  for ( i = 33; i < 127; ++i )
  {
    char * numeric = "----";
    char * char_class = "numeric";

    if ( isdigit( i ) )
    {
      numeric = " DEC";
    }
    else if ( isxdigit( i ) )
    {
      numeric = " HEX";
    }
    if ( ispunct( i ) )
    {
      char_class = "punctuation";
    }
    else if ( isalpha( i ) )
    {
      char_class = "alphabetic";
    }
    printf( "\t%3d 0x%02x: %6c %s %s\n", i, i, i, numeric, char_class );
  }

I've kept the evaluation the same as the ternary expressions, which may help if you're having trouble working those out.

(And if your compiler complains at declaring variables inside nested blocks, you'll need to move the declarations of the variables numeric and char_class up to the main() block where "int i" is declared. But you'll also need to re-initialize them each time through the loop, in the same place they are declared and initialized above.)

One thing you'll notice in the c_esc field initializations is the use of the backslash escape character to escape itself. For NUL's escape sequence, for example, the source code is written

"'\\0'"

What is actually stored in RAM is the single-quoted escape sequence of backslash followed by the ASCII digit zero:

'\0' 

The single quotes act in the source as ordinary characters within the double-quoted initialization string, but the backslash still acts as the escape character (which is how you store control code literals in your code), so the first backslash escapes the second. If we used only one backslash in the initialization for NUL, the string would not get the text of the escape sequence. It would get the literal NUL -- a byte of 0.

(You might be interested in what happens when you print out a NUL character. If you are, give it a try.)

The syntax for accessing the fields of the array of struct ctrl_code_s is the dot syntax that is common for record fields in other languages, so accessing the mnemonic for NUL is

ctrl_code[ 0 ].mnemonic

And, if (for some reason) we wanted the third character of the description of the horizontal tab, the syntax would be

ctrl_code[ 9 ].description[ 2 ]

Examining the source code and the output should give you some confidence in what you are seeing.

Again, DEL gets its own struct ctrl_code_s, not in the main ctrl_code array.

This time, I'm showing that the output format is, in fact, just a NUL-terminated char array, by declaring it before I use it and giving it a name: 

char ctrl_format[] = "\t%3d 0x%02x: %6s %s %s\n";

Other than fixing the format for the hexadecimal field, it's the same as before, but with more %s fields for added information.

I thought about using the same format for the visible character range, but that gets a bit cluttered, so I gave it its own format.

"\t%3d 0x%02x: %6c %s %s\n"

The format for the visible range also adds information fields, just to demonstrate the ctype library and one more conditional construct.

I use the first information field to show whether the character is hexadecimal, decimal, or not numeric, using the functions isdigit() and isxdigit() from the ctype standard library.

The parameter to printf() here is a calculated parameter, using the ternary conditional expression,

condition ? true_case_value : false_case_value

The second informational field is also calculated, using the ctype functions ispunct() and isalpha() called from within the (nested) ternary conditional expression.

And there are no more surprises in the line for DEL.

Compiling it yields no surprises:

And here's the start of the table, when it's run:

----- Side Note on Memory Usage -----  

ctrl_code_s is a struct containing three pointer fields. The pointers alone consume 6, 12, or 24 bytes per entry. Multiply that by 34 (to include SPACE and DEL), and there are 204, 408, or 816 bytes in use just for the pointers. The arrays for the first field consume, from the calculations above, 121 + 4 bytes, or 125. The next field is 4 bytes plus NUL, five bytes for each entry; times 34, that makes 170 bytes.

I used the program itself to add the description strings up. I'll show how later, but the total of the descriptions is 491. The total for all three fields is 786.

Together with the pointers:

  • 16-bit: 786 + 204 == 990 bytes
  • 32-bit: 786 + 408 == 1194 bytes
  • 64-bit: 786 + 816 == 1602 bytes 

Each field could instead be declared as a constant-length array of char.

Can you work out by yourself how that would affect the size of the table? Compared with constant-length fields, the pointer version saves about 300 bytes on 16-bit machines and about 100 on 32-bit machines, but uses about 300 more on 64-bit machines.

Oh, why not now? Here's the code to add at the end of main(), just before the return statement:


 { int sums[ 3 ] = { 0, 0, 0 };
  for ( i = 0; i < 33; ++i )
  { sums[ 0 ] += strlen( ctrl_code[ i ].mnemonic ) + 1;
    sums[ 1 ] += strlen( ctrl_code[ i ].c_esc ) + 1;
    sums[ 2 ] += strlen( ctrl_code[ i ].description ) + 1;
  }
  sums[ 0 ] += strlen( del_code.mnemonic ) + 1;
  sums[ 1 ] += strlen( del_code.c_esc ) + 1;
  sums[ 2 ] += strlen( del_code.description ) + 1;
  printf( "mnemonic: %d, c_esc: %d, description: %d\n",
    sums[ 0 ], sums[ 1 ], sums[ 2 ] );
  printf( "total: %d\n", sums[ 0 ] + sums[ 1 ] + sums[ 2 ] );
  printf( "with pointers on this machine: %lu\n",
    (unsigned long) ( sums[ 0 ] + sums[ 1 ] + sums[ 2 ] + 34 * sizeof (struct ctrl_code_s) ) );
 }

You'll need to

#include <string.h>

at the top, for strlen().

Be sure to grab the enclosing curly braces. You'll probably want to change the name of the program, too, while you're at it.

Also, be aware that sizeof is an operator, not a function. There is a very good reason for why I use sizeof and strlen() where I use them, which I will explain later. (Or you can look it up now if you want.)

One more thing to be aware of: compilers will often pad structures in places that make code faster, so the discussion I give above of memory use is actually more about minimum memory use.

----- End Side Note on Memory Usage -----

So, now that we have these two programs, what next?

What can you think of to do with these tables, or with the pieces of the language and the libraries that these two programs use? 

Try it.

Unicode? 

A complete Unicode table would be huge, and, since it has way more than 256 characters in it, it won't fit in the C char type. (Did I mention that before? This is one reason I insist that char is not actually a character type.) I hope to take up a partial Unicode table later, although it might not work with pre-ANSI C compilers and the OSses they run under.

And I'll be working on the next step.

Whoops. I forgot about the HTML table. That could be the next project. Why don't you see if you can finish a short program to produce the HTML table for yourself before I can?


Sunday, January 31, 2021

Personalizing Hello World -- Char Arrays, and Giving the User a Menu


Continuing with the idea of greeting to further extend our beachhead, let's say we want the computer to give the user a list of people to greet, and let the user choose who gets greeted from that.

Hold on to your hat, this is a significantly longer and more involved program.


/* Extending the Hello World! greeting beachhead --
** Let the user choose from a list whom the computer should greet.
** This instance the work of Joel Rees.
** Copyright 2021 Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational use.
*/


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>


#define MENU_CT 10
#define MENU_ITEM_LN 12


char menu[ MENU_CT ][ MENU_ITEM_LN + 1 ] =
{
  "Johnny",
  "Ginnie",
  "Marion",
  "Deborah Ann",
  "Howard",
  "Joe",
  "Robin",
  "Dawn",
  "Cornelia Maxi", /* <= Look closely at this! */
  "Nina"
};


void my_puts( char string[] ) /* What's different this time? */
{
  int i;

  for ( i = 0; string[ i ] != '\0'; ++i )
  {
    putchar( string[ i ] );
  }
  putchar( '\n' );
}


/* Convert a number from zero to nine to a digit character. */
int textdigit( int n )  
{
  return n + '0';  /* A trick of ASCII encoding! */
}


/* Convert a number to a text digit and put it on the output device. */
void putdigit( int n )  
{
  putchar( textdigit( n ) );
}


int main( int argc, char * argv[] )
{
  int i;
  int ch;

  my_puts( "From among" );
  for ( i = 0; i < MENU_CT; ++i )
  {
    putchar( '\t' );  putdigit ( i );  putchar( ':' );
    putchar( ' ' );   my_puts( menu[i] );
  }

  my_puts( "Whom should I greet?" );
  ch = getchar();
  while ( !isdigit( ch ) )
  {
    putchar( ch );  putchar( '?' );
    fputs( "Please enter a number from 0 to ", stdout );  putdigit( MENU_CT - 1 ); 
    my_puts( ":" );  /* <= Why do I do it this way? */
    ch = getchar();
  }

  fputs( "Oh-kay, ", stdout );  putchar( ch );  my_puts( "." );
  putchar( '\n' );  putchar( '\n' );  putchar( '\n' );  putchar( '\n' );
  fputs( "Hal-looooooooooooo ", stdout );
  my_puts( menu[ ch - '0' ] );
  return EXIT_SUCCESS;
}

Copy/paste that into your favorite text editor window, or at least one you're comfortable with, and keep it open where you can reference it, and let's work through it.

This program references the ctype library. This library allows you to check characters in the ASCII range, to determine such things as whether they are digits, punctuation, space, etc. It is where we get isdigit(), which we use to check the menu selection, so we #include the header.

#define gives you one way to define constants. For many pre-ANSI compilers, #define constants are the only constants. We'll return to #define later to discuss the differences between macros and constants, but, for now, that's what I use it for here, defining the constant count of menu items, MENU_CT, and the constant maximum number of characters in each, MENU_ITEM_LN.

menu[][] is a two-dimensional array of characters, whose size is defined by the above #define constants, MENU_CT, and MENU_ITEM_LN. And it is a true two-dimensional array, allocating MENU_CT times (MENU_ITEM_LN + 1) bytes of memory space.

I just had to bring our my_puts() function in for sentimental reasons. Or, maybe, so I could show a different way to declare its string parameter. It will be useful to stop and compare this definition with the last one, before continuing.

You may by now be asking about the difference between 

char * string;

and 

char string[];

You may, you know. It's a good thing to ask about. 

Well, when declaring string as a parameter to a function, there isn't any effective difference.

If we were declaring string as, say, a global variable, there would be an important difference, but let's not distract ourselves with that just yet. We have too much ground to cover first.

Moving on, for the moment, let's just assume that textdigit() and putdigit() do what their names imply and the comments say, the one converting a number to a digit character, and the other putting a number on the output device. I'll explain pretty soon. I promise.

(I think the ASCII trick will work for the digits in EBCDIC, as well. I'll have to test it sometime.)

Skipping forward to the main() function, the following lines declare two integer variables called i and ch:

int i;
int ch;

Maybe we need to go on a long detour, here.

------ Side Note on Integers ------

These are not the ideal integers of mathematics that extend in range both directions to infinity. Variables in computers have limited range. (You could say integer variables provide the basis for implementing certain types of a mathematical concept called a ring, but let's not go there today. I'll get there, too, eventually.) 

On a sixteen-bit CPU, they will (probably) have a range of 

(-2^15 .. 2^15 - 1)

or from -32,768 to 32,767. 

On a modern 32-bit CPU, the range will probably be

(-2^31 .. 2^31 - 1)

or from -2,147,483,648 to 2,147,483,647. 

(Yes, I am an American, and I use the comma to group columns in numbers. If you are from a country where they use something different, please make the substitution. You can let me know about it in the comments. And, by the way, there are ways to deal with that in standard C libraries. Sort-of.)

On a modern 64-bit CPU, int variables may be 32-bit integers or they may be 64-bit integers, depending on how the compiler architect interprets the CPU resources and whether the sales managers insist on coddling past programmers who hard-wired 32-bit integers into their programs. Or (more likely) depending on compiler switches.

If int is a 64-bit integer, i and ch will be able to take the range

(-2^63 .. 2^63 - 1)

or from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Nice big numbers. Pretty close to minus and plus infinity, from a practical point of view.

Now, we're not using even the full range of 8-bit integers, so we could have declared them as short int, or even as char in this program. But, why?

Oh. Wait. Before that, why, you ask, is ch, which sounds like it's going to be a character, declared as an int?

Excellent question. I'll tell you about EOF later, but, for now, I'll just say it's a convention, and it's a good programming habit to make sure your integer variables will always have enough space to hold their values. Remember, char is an integer type, and a sub-range of int, usually a proper sub-range.

Are you interested in the range that char can take on, since I insist it's an integer type? For the usual size of char,

  • signed char: (-2^7 .. 2^7 - 1), or -128 to 127. 
  • unsigned char: (0 .. 2^8 - 1), or 0 to 255.

------ End of Side Note on Integers ------ 

Back to the program. You've seen the for loop before, in my_puts(), but I haven't explained it. 

Hmm. Before I explain the for loop, I should explain the while loop.

Loops are conditional constructs, much like the if selection construct. But, not only do they branch around code, they repeatedly run through code. Of course they repeat. That's why they're called loops.

It's a common misconception, but, as I said previously, conditionals are not functions. In C, they require parentheses for the condition expression, but what is inside is a set of conditions, rather than parameters. 

Also, the if, while, for, and do ... while conditionals never have return values in C. 

And, as I have mentioned, they don't have to have curly brace-enclosed blocks if they only apply to a single statement. But it is usually wiser and less confusing to give them explicit blocks anyway. You often find that you actually wanted more than one statement under the conditional part.

I'm dancing around what a loop is because I don't want to show you the accursed goto. And I don't want to do more hand-compiled assembly language. So, let's look at a theoretical example loop, instead:

start( music );
while ( music_is_playing( party ) )
    dance();

This is going to invite more confusion, I just know it. 

The dancing doesn't stop immediately when the music stops. The loop checks that the music is playing, and then the program dances for a bit. Then it checks again, and then it dances some more. That's the way software works. (This is very important to remember. Many expensive commercial projects have met disaster because a programmer forgot that conditionals are not constantly monitored.)

Let's look at another example:

fill( plate );
while ( food_remains( plate ) )
{
   eat_a_bite( rice );
   eat_a_bite( sashimi );
   eat_a_bite( pickled_ginger );
   eat_a_leaf( shiso );
   eat_a_bite( nattō );
   eat_a_bite( pickled_radish );
}

And the way this is constructed, once we take that first bite of rice in the loop, we will continue on through the sashimi, all the way through the bite of pickled radish, before checking again whether there is food on the plate. (There is a way to break the loop between bites. And there is concurrent execution, which .... Again, later.)

while loops test their condition before entry, so the condition must be prepared -- primed. That's what fill( plate ) and start( music ) do above.

The for loop primes itself.

There is a do ... while () loop where you jump in before testing, but it turns out to be not very useful. I'll explain why later.

We need something concrete to look at before we fall asleep. Computers are good at counting, we hear. Let's try a counting loop:

count = 0;
while ( count < 100000 )  count = count + 1; 
/* Only one statement, no need for braces. Note that trailing semicolon. */

Note that, since 100,000 won't fit in 16 bits, the count variable must be declared to be one of the 32-bit types for the CPU and compiler.

I'd (cough) like to show you how that would look in 6809 assembly language, but the 6809 needs lots of extra instructions to do 32-bit math, and the extra instructions would cloud the issues. So I'll use 68000 assembly language. It looks different, but my comments should clear things up.


                  ; Uses 32 bit .Long instructions.
* count = 0;
 MOVEQ #0,D7      ; Compiler conveniently put count in D7.
* while ( count < 100000 )
_wb_0000
 CMP.L #100000,D7 ; Compare -- subtract 100000 from D7,
                  ; but don't store result.
 BGE _we_0000     ; Branch over increment and loop end
                  ; if D7 is greater than or equal to 100,000.
*   count = count + 1;
 ADD.L #1,D7      ; Add 1 to D7.
 BRA _wb_0000     ; Branch always back to beginning of loop.  
_we_0000
                  ; Code continues here.

(That's fairly well optimized object code. But there is one further optimization to make which would be confusing, so I won't make it. It's also hand-compiled and untested. But it's fairly understandable this way.)

This is a good way to make the computer waste a little time. On the venerable 6809, it would take about a second or so. On the 68000 at mid-1980s speeds, it would take between a fifth and a tenth of a second. On modern CPUs, it would take something in the range of a millisecond, if that. Just a little time.

And the count stops at 100,000.

------ Side Note on Incrementing ------

Adding one to a count happens so much in programs that C has a nice shorthand for it:

++count;

is the same as 

count = count + 1;

Incrementing by other than 1 has a shorthand, as well, and sometimes you want to increment a value after you use it instead of before. We'll look at that later, too.

------ End of Side Note on Incrementing ------

Let's remake that counting loop as a for loop:

for ( count = 0; count < 100000; ++count ) /* Loop body looks empty. */ ;

Notice that there is nothing between the end of the condition expression and the semicolon except for a comment for humans to read and notice that the space is intentionally left blank.  

That's an example of an empty loop. 

In a sense, it isn't really completely empty, because the loop statement itself contains the counting, in addition to the testing. But, again, the only effect you notice is a small bit of time wasted, and count ends at 100000. (Some compilers will helpfully optimize such loops completely out of the program and just set count to 100,000 -- unless you tell them not to because you know you want to waste the computer's time.)

Empty loops have a significant disadvantage. They are easy to misread. If you have some reason to use an empty loop, use a comment to make it clear, and I recommend giving it a full empty block, just to make it really clear:

for ( count = 0; count < 100000; ++count )  {  /* Empty loop! */  }

This for loop is exactly equivalent to the while loop above, plus the statement priming the count. And the code output will be the same, as well.

One final note of clarification, if we want a counting loop that prints the count out, the for version of the loop might look something like this:

for ( count = 0; count < 100; ++count )
   printf( "%d\n", count );

And this for loop is exactly equivalent to the following primed while loop:

count = 0;
while ( count < 100 )
{
   printf( "%d\n", count );
   ++count;
}

We'll be looking more at loops (and printf()) later, but that should be enough to continue with reading my_puts().

It declares a char array, string, as its only parameter. 

In early C, we definitely did not want to copy whole arrays. It took a lot of precious processor time and memory space. So the authors of C decided that an array parameter would be treated the same as a pointer to its first element.

Arrays still usually aren't something you want to make lots of copies of, so this design optimization might not be a bad thing, even in our current world where RAM and processor cycles are cheap. But it does invite confusion, since both a pointer and an array can be modified by the indexing operator. Specifically, given

char ch_array [ 10 ] = "A string";
char * ch_ptr = "B string";

in 6809 assembler we would see something like this:

ch_array FCC "A string"
  FCB 0
_s00019 FCC "B string"
  FCB 0
ch_ptr FDB _s00019

so you see that "B string" is stored in an array with one of those odd names that won't be visible in the C source -- thus, anonymous, and ch_ptr is a pointer that is initialized to point to the anonymous string. On the other hand, "A string" is stored directly under the name ch_array, which is very much visible in the C source.

However, unless we overwrite ch_ptr with some other pointer,

ch_array[ 0 ] is 'A',
ch_ptr[ 0 ] is 'B' and
ch_array[ 3 ] and ch_ptr[ 3 ] are both (different) 't's.

This leads to headaches if you aren't careful, but it also means that my_puts() is quite readable. Take the char array that gets passed in and count up it, looking at each char and putting it out on the output device as we count -- until we reach a 0. And the way the test is arranged, it will see that the char is 0 and stop before outputting it.

I'm going to present both 6809 compiler output and 68000 compiler output. Both are very much not optimized and not tested, but you can read my comments and see how the thing fits together.

6809 first:


* void my_puts( char * string )
_my_puts_6809
* {
*   int i;
  LEAU -2,U     ; Allocate i.
* 
*   for ( i = 0; string[ i ] != '\0'; ++i )
  LDD #0        ; Initialize i.
  STD ,U
_my_puts_loop_beginning
                ; Split stack, no return PC to avoid.
  LDX 2,U       ; Get string pointer.
  LDD ,U        ; Get i.
  LDB D,X       ; Get string[ i ] (destroying i!)
                ; LDB will see 0 for us, no CMP necessary,
                ; but let's refrain from confusing optimizations.
  CMPB #0       ; Is this char 0?
  BEQ _my_puts_loop_end
*   {
*     putchar( string[ i ] );
                ; Even simple optimization would not repeat this.
  LDX 2,U       ; Get string pointer.
  LDD ,U        ; Get i.
  LDB D,X       ; Get string[ i ] (destroying i!)
  CLRA          ; It was unsigned, extend it to 16 bits. 
  PSHU D        ; Push the parameter for putchar().
  JSR _putchar  ; Call putchar().
*   }
  LDD ,U        ; Increment i.
  ADDD #1
  STD ,U
                ; Go back for more.
  BRA _my_puts_loop_beginning
_my_puts_loop_end
*   putchar( '\n' );
  LDD #_C_NEWLINE
  PSHU D
  JSR _putchar
* }
  RTS

Now 68000:


* void my_puts( char * string )
_my_puts_68000
                    ; The compiler has been told to use 32-bit int .
* {
*   int i;
                    ; The compiler will conveniently put i in D7.
* 
*   for ( i = 0; string[ i ] != '\0'; ++i )
  MOVEQ #0,D7       ; Initialize i.
_my_puts_loop_beginning
                    ; Split stack, no return PC to avoid.
  MOVE.L (A6),A0    ; Get string pointer.
  MOVE.B (D7,A0),D0 ; Get string[ i ] 
                    ; MOVE.B will see 0 for us, no CMP necessary,
                    ; but let's refrain from confusing optimizations.
  CMP.B #0,D0       ; Is this char 0?
  BEQ _my_puts_loop_end
*   {
*     putchar( string[ i ] );
  MOVEQ #0,D0       ; Avoid need to extend char to int .
  MOVE.B (D7,A0),D0 ; Get string[ i ] 
  MOVE.L D0,-(A6)   ; Push the parameter for putchar().
  JSR _putchar      ; Call putchar().
*   }
  ADD.L #1,D7       ; Increment i.
                    ; Go back for more.
  BRA _my_puts_loop_beginning
_my_puts_loop_end
*   putchar( '\n' );
  MOVEQ #_C_NEWLINE,D0
  MOVE.L D0,-(A6)   ; Push the parameter for putchar().
  JSR _putchar
* }
  RTS

The reason I give some (hand-)compiled output is to help motivate the idea that C programs are (effectively) performed step-by-step in the order that the source code dictates. This includes the test conditions in conditional constructs. It's part of the rules of the game for C, even though other languages do something different. 

Those languages have rules, as well. Without the promise of order that the rules give, programs would not function.

(Optimization can break this promise, however. More later.)

To understand textdigit(), we need to look at an ASCII chart, or at least the part where the numbers are:

Code (decimal) Character
47 /
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
58 :

Characters are represented by codes inside the computer, and the codes are numbers -- integers, to be specific. You can add numbers to these integers, and the result may be a different character. (Or it may not fall within the table, depending on the number, but we won't worry about that.)

So, if we start with a number from 0 to 9 in the parameter n and we add the code for the character '0' to it, we get a new code for the character version of the number that was in n.

The addition may be more clear if we show the codes in hexadecimal:

Code (decimal) Code (hexadecimal) Character
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :

And then we return the resulting character. 

I'd show the assembly language for this, but it's dead simple. On the 6809, convention will have the compiler load the return value in register D before executing the return from subroutine. On the 68000, it will probably be loaded into D0. Other CPUs will have similar conventions for where to put the return value. There may be better ways, but this is the usual way now.

The putdigit() routine is essentially just semantic sugar. I hope it makes the program easier to understand. It just uses textdigit() to convert the number to a character and uses putchar() to put it on the output device.

That brings us back to main().

The first loop in main() is a for loop, and it formats and prints out the menu array, along with using putdigit() to put out numbers for selecting a name from the menu array.

By keeping the number of menu items to ten or less, we can use our simplified output routines. We'll show how to deal with more later.

The second loop in main is a while loop, and its purpose is to read characters from the input, and complain and discard them if they are not numbers, until it gets a number.

My odd choice of which routines to use where has something to do with giving you a reason to read through the source in my_puts(), and also something to do with output buffering. (my_puts() forces the output buffer to be flushed with the newline it puts out. Otherwise, we would have no guarantee that the characters we are putting out make it to the screen in time to tell the user what we want to tell him or her. This is something else we will look at later.)

I think the rest of main() is understandable at this point.

Hopefully, you've seen what the bug I planted in the menu does by now. It has to do with allocating enough room for trailing NULs for strings. I'll leave the fix as an exercise, for now.

Here's the screenshot:
 

How long it will take to get the next step up, I don't know. I keep taking on too many projects.

In the meantime, play with what you've learned so far. Fix the bug, of course. Experiment and explore.

The next one is ready sooner than I expected. I decided to show you how to get an overview of the ASCII characters.

[TOC]

Personalizing Hello World -- A Greet Command

[TOC]

Continuing with another version of Hello World! to extend our beachhead, let's say we want the Hello! program to be less general. Specifically, instead of having the computer greet the world, let's write a program that allows the user to tell the computer whom to greet: 


/* A greet command beachhead program
** as a light introduction to command-line parameters.
** This instance the work of Joel Rees,
** Whatever is innovative is copyright 2021, Joel Matthew Rees.
** Permission granted to modify, compile, and run
** for personal and educational uses.
*/


#include <stdio.h>
#include <stdlib.h>


int main( int argument_count, char *argument_variables[] )
{
  char * whom = "World";

  if ( argument_count > 1 )
  {
    whom = argument_variables[ 1 ];  /* Where did the initial value go? */
  }
  fputs( "Hello ", stdout ); /* Still avoiding printf(). */
  fputs( whom, stdout );
  putchar( '!' );
  putchar( '\n' );
  return EXIT_SUCCESS;
}

Comparing this to the first exercise, we see that we are actually using those command-line parameters. I'd like to have postponed that a bit further because they are a rather confusing beast. And some people who want to follow along want to do so on platforms that don't have command-line parameters under the usual operating system interface. (Such as the original/Classic Mac without MPW, and ROM-based game machine/PCs like the Tandy/TRS-80 Color Computer without Microware's OS-9, etc.) 

But I have reasons. 

For now, just kind-of assume that there is more to them than meets the eye. 

(If your platform won't allow you to follow along, read the explanation, examine the screenshot carefully, and at least consider downloading Cygwin or installing a libre *nix OS so you can actually try these. For these purposes, an old machine sleeping somewhere might work well with NetBSD or a lightweight Linux OS.)

Again, if you are using a K&R (pre-ANSI) compiler like Microware's compiler for OS-9, move the function parameters for main down below the declaration line. Also, shorten the parameter names, since those compilers typically get confused over long names that start too much the same -- which is the real reason argument_count is usually written argc and argument_variables is usually written argv:

int main( argc, argv )
int argc;
char *argv[];
{
/* etc. */
}

And I'm throwing another fastball at you. There is a conditional in this program. Conditionals are another thing you should assume I'm not telling the whole story about here.

But be aware that, while puts(), fputs(), and putchar() are function calls, 

if ( condition )  {  }

is not. Nor is it a function declaration, such as   

void my_puts( char * string )
{
  ...
}

which you might recall from the first exercise.  

It's a test of a condition. If the condition between the parentheses evaluates to true, the stuff between the braces gets done. If not, the stuff between the braces gets jumped over. (The braces aren't required if there is only one statement to be jumped over, but they are advised for a number of reasons. And there is an optional else clause. And the values of true and false need to be discussed. More detail later.)

Note also that arrays are indexed from 0 up to the array size minus 1. Thus, the first element of an array is element array[ 0 ]. And the last is array[ ARRAY_SIZE - 1 ], for a total of ARRAY_SIZE elements.

If you were compiling to M6809 object code and had the compiler output the assembler source, you would see something like the following -- except that I have added explanation. 

(I'm not asking you to learn 6809 assembly language, just giving it as something to hang my comments on.)

I've mixed in the original C source on the comment lines that start with an asterisk. On code lines, everything following a semicolon is my explanatory comments:


* int main( int argument_count, char *argument_variables[] )
s0000 FCC "World"  ; Allocate the string.
 FCB 0             ; NUL terminate it.
s0001 FCC "Hello " ; See above.
 FCB 0

_C_main
* {
*   char * whom = "World";
 LEAU -2,U   ; Allocate the variable whom.
 LDX #s0000  ; Load a pointer to the World string and
 STX ,U      ; store it in whom.
*
*   if ( argument_count > 1 )
 LDD 2,U     ; Get argument_count.
 CMPD #1     ; Compare it to 1.
 BLE _C_main_001  ; If less than or equal to 1, branch to _C_main_001
*   {
*     whom = argument_variables[ 1 ];  /* Where did the initial value go? */
             ; This code is executed if argument_count is 2 or more.
 LDY 4,U     ; Get the pointer to the argument_variables array.
 LDX 2,Y     ; Get the second pointer in the argument_variables array.
 STX ,U      ; Store it in whom.
*   }
_C_main_001
*   fputs( "Hello ", stdout ); /* Still avoiding printf(). */
 LDX #_f_stdout ; Get the file pointer and
 PSHU X      ; save it as a parameter.
 LDX #s0001  ; Get a pointer to the Hello string and
 PSHU X      ; save it as a parameter.
 JSR _fputs  ; Call (jump to subroutine) fputs() --
             ; fputs() cleans up U before returning.
*   fputs( whom, stdout );
 LDX #_f_stdout ; See above.
 PSHU X
 LDX 2,U     ; Get the pointer in whom (now 2 above U) and
 PSHU X      ; save it as a parameter.
 JSR _fputs
*   putchar( '!' );
 LDD #'!'
 PSHU D
 JSR _fputchar  ; putchar also cleans up the stack after itself.
*   putchar( '\n' );
 LDD #_c_newline
 PSHU D
 JSR _fputchar
*   return EXIT_SUCCESS;
 LDD #_v_EXIT_SUCCESS  ; Leave the return value in D.
 LEAU 2,U  ; Clean up the stack before returning.
 RTS  ; And return.
* }

(Unless I say otherwise, all my assembly language examples are hand-compiled and untested. But I'm fairly confident this one will work, with appropriately defined libraries using a properly split stack.)

(If you understand stack frames, note that this code uses a split stack and does not need explicit stack frames. The return PC is on the S stack, out of the way. Thus the parameters are immediately above the local variables.) 

Again, don't worry how well you understood all of that.

Just note that the if clause produces code that tests argument_count and, if it is 1 or less, skips the following block. If it is 2 (or more), the following block is executed, and the char pointer whom gets overwritten by the second command-line parameter.

Don't assume you know all there is to know about conditionals from that short introduction, any more than you know all about the command-line parameters. Compile it and run it and maybe add some code to get a look at the first entry in argument_variables[] if you're interested and can immediately see how. That's good for now.

I guess we'll get a screen shot of compiling and running this.

Details:

rm greet

deletes a previously compiled version of the program.

ls

as before, lists the files in the current directory. I've saved this version in greet.c, so

cc -Wall -o greet greet.c

will compile the program, with full syntax checking. 

./greet

calls it without parameters. (Except for that first one we haven't looked at yet.)

./greet Harry

calls it with one. (Ergo, two.)

./greet Harry Truman

calls it with two (ergo, three). How would you get a look at the second/third one?

You might be interested to see what is in the first actual command-line parameter. Or you might not be interested. I've mentioned that you could get at it. Can you think of a way to do so? If you do, do you recognize what it contains? 

(The first argument actually varies from platform to platform, but it isn't something the user usually consciously specifies, which is why it isn't usually counted as a parameter. I won't spoil the surprise here, but I will explain later.)

And you might also be interested in looking at the assembly language output for the processor you are using. The command-line option for that on gcc is the -S option, which looks like this: 

cc -S greet.c

You can use the -Wall option as well, like this:

cc -S -Wall greet.c

Either way, that will leave the assembly language source output in a file called greet.s , and you can look at it by bringing it into your favorite text editor, or with the more command, etc.

Where does the string that whom gets initialized with go, by the way? 

Nowhere. But we didn't save the pointer to it anywhere, so it just becomes (more-or-less) inaccessible. It's short enough we don't care too much, especially in this program, but it's just cluttering up memory. 

There's a lot to think about here, so let's keep it short. The next one is going to be pretty long and really deep, when I get it put together.

Before the next one is up, or before you go look at it, play with this one a bit more. Again, explore. See what happens if (whatever gets your curiosity up), and then see if you can find a reason why.

 And the next one is ready now, here. We'll give the user a menu to choose whom to greet.

[TOC]