What's the Point?
Writing readable, maintainable code should be a priority in scientific computing. If a body of source code is difficult to understand, it is also difficult to:
- Read the code to learn about what it does.
- Read the code to verify that it is written correctly.
- Write tests to verify the code's behavior.
- Debug, since a user may not understand how the code is supposed to work in the first place.
- Tune the code for better performance, because it is harder to tell whether or not the tuned version is still correct.
- Extend the code with new science.
The costs associated with these difficulties can be substantial, and can easily exceed the cost of computing power needed to run a piece of code, even on an HPC project. Therefore, source code should not be treated as just a program to be fed into a machine. Source code is also a document used to communicate with other human beings.
Because CAM is a community model, the code is shared between many developers and many users all over the world. When writing code for inclusion in CAM, keep in mind that your changes can and will be read by many other people, even after you stop working on that piece of code yourself! Therefore it is important to write code that is clear enough for them to follow.
Top Ten List
Here are some of the most important things to consider when writing code.
1. Document your code.
Names
Names are the single most important form of documentation. Try to use each variable or routine for only one purpose, and pick a name that matches that purpose.
Try to use names that are consistent with other CAM code, especially for short names. For instance, "t" usually refers to temperature, not time. In the CAM physics and chemistry, "i" is usually used as a column index, while "k" is usually a level index.
For each variable that has some physical meaning, include a comment specifying what it is and what the units are (if any). If you have multiple, similar variables (e.g. multiple reference pressures), explain the difference between them in the comment, and give the variables names that help you to remember the difference.
If you are using a number that's a constant, give it a name. If it's already present in CAM's physconst, you should use the physconst value instead of specifying your own. Here is an example:
Hard-coded numbers (avoid):
    a = 3.14159265 * r**2
    rho = p / (287.*t)
Named constants defined locally (better):
    use shr_kind_mod, only: r8 => shr_kind_r8
    real(r8), parameter :: pi = 3.14159265358979_r8
    ! gas constant for dry air [J/kg/K]
    real(r8), parameter :: rair = 287._r8
    area = pi * radius**2
    rho = p / (rair*t)
Constants taken from physconst (best):
    ! physconst parameters use SI units.
    use physconst, only: pi, rair
    area = pi * radius**2
    rho = p / (rair*t)
Notice how we specified "r8" in the second example above. All real numbers in CAM must have an explicit "kind", which is how precision is specified in Fortran. For double precision values, use shr_kind_r8, and for single precision, use shr_kind_r4. (These are usually abbreviated as just "r8" and "r4".)
Organization
Clear organization can also be a form of documentation. It is much easier to understand code when related actions are close together, and unrelated actions are kept in separate routines.
One method for organizing your code is to write routines in a hierarchical manner. First write an outline that specifies, in broad terms, the tasks a routine should perform. You can do this either in comments or in a separate document. Then fill in the routine, writing the code for each task. If any tasks seem to be particularly complex, try extracting those tasks into new subroutines or functions.
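As a rough illustration of the outline-first approach, the sketch below keeps the outline as comments in a top-level routine and moves each task into its own small subroutine. All names (column_update_sketch, update_column, apply_heating, clip_negative_moisture) and the toy "physics" are hypothetical and are not taken from CAM; r8 is defined locally only as a stand-in for shr_kind_r8 so that the example compiles on its own.

```fortran
module column_update_sketch
  ! Hypothetical example: module, routine, and variable names are
  ! illustrative and do not correspond to existing CAM code.
  implicit none
  private
  public :: r8, update_column

  ! Stand-in for shr_kind_r8 so the sketch compiles outside CAM.
  integer, parameter :: r8 = selected_real_kind(12)

contains

  ! Top-level routine: reads like the outline written before coding.
  subroutine update_column(dt, heating, t, q)
    real(r8), intent(in)    :: dt          ! time step [s]
    real(r8), intent(in)    :: heating(:)  ! heating rate [K/s]
    real(r8), intent(inout) :: t(:)        ! temperature [K]
    real(r8), intent(inout) :: q(:)        ! specific humidity [kg/kg]

    ! Task 1: apply the heating tendency to temperature.
    call apply_heating(dt, heating, t)

    ! Task 2: remove unphysical negative moisture.
    call clip_negative_moisture(q)
  end subroutine update_column

  ! Advance temperature by one time step of heating.
  subroutine apply_heating(dt, heating, t)
    real(r8), intent(in)    :: dt          ! time step [s]
    real(r8), intent(in)    :: heating(:)  ! heating rate [K/s]
    real(r8), intent(inout) :: t(:)        ! temperature [K]

    t = t + dt * heating
  end subroutine apply_heating

  ! Set negative specific humidity values to zero.
  subroutine clip_negative_moisture(q)
    real(r8), intent(inout) :: q(:)        ! specific humidity [kg/kg]

    where (q < 0._r8) q = 0._r8
  end subroutine clip_negative_moisture

end module column_update_sketch
```

If one of the tasks later grows more complex, only its subroutine changes, and update_column still reads as the original outline.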
Comments
- Use comments to explain the purpose of the code. What's obvious to you may not be obvious to someone with a different background, so it's better to explain a little too much than too little.
- Use comments to point out assumptions made by the code. For instance, if you're using an approximation that's only valid for temperatures above 150K, be sure to mention this in a comment near the approximation.
- Use comments to document the algorithm used. For instance, if you're using Newton's method to invert a function, mention that at the top of the loop that performs the iteration.
- Use comments to give citations for specific formulas and methods.
- For short comments, place each comment as close as possible to the line(s) of code you are describing. This structure is often better than having one long comment to describe a block of code.
- Larger blocks of comments can be used to explain the purpose or design of a routine in broad terms. However, lengthy scientific explanations should be documented elsewhere (e.g., in a publication or other CAM design document) and cited in the code.
Furthermore, there are some practices to avoid:
- Do not ignore the comments when making changes to code! If there are comments around any code in which you are working, check to make sure the comments are still correct. If the comments and/or code are confusing, add clarifying remarks. Always leave comments which are correct and up to date.
- Don't create a change log in the comments to a routine. It's OK to credit major contributors to the code, but such logs are rarely kept up-to-date, so you don't want to rely on them. Furthermore, writing a log in comments is not necessary if you use a version control system to track and share versions of the code (we currently use Subversion).
- Don't write comments that simply repeat the code. If you have the line "x = x + 1", don't write a comment that just says "Add one to x."
- Don't write detailed comments about other parts of the model that are not relevant to the code you are commenting. If those parts of the system change, your comment will be out-of-date. However, if your code depends on the behavior of another part of the system, you may want to mention this in comments.
- If you copy code from one routine to another, read the comments on that code. As mentioned below, it is not a good idea to copy large blocks of code in the first place. But if you do, make sure that any comments on the code are still valid in the location you have copied them to.
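The commenting guidelines above might look like this in practice. This is only a sketch: the function invert_cubic, the equation it solves, and the tolerance values are made up for illustration, and r8 is defined locally as a stand-in for shr_kind_r8. Real code would also cite the source of any formula it uses.

```fortran
module comment_sketch
  ! Hypothetical example: names are illustrative, not existing CAM code.
  implicit none
  private
  public :: r8, invert_cubic

  ! Stand-in for shr_kind_r8 so the sketch compiles outside CAM.
  integer, parameter :: r8 = selected_real_kind(12)

contains

  ! Solve x**3 + x = y for x.
  function invert_cubic(y) result(x)
    real(r8), intent(in) :: y   ! target value of x**3 + x [unitless]
    real(r8) :: x               ! solution [unitless]

    ! Convergence tolerance on the relative size of a Newton update.
    real(r8), parameter :: tol = 1.e-12_r8
    ! Upper bound on iterations, so that the loop always terminates.
    integer, parameter :: max_iter = 100

    real(r8) :: dx   ! Newton update to x
    integer :: iter

    ! Starting from x = y, the iterates approach the root from one side,
    ! because x**3 + x is increasing and does not change curvature
    ! between the root and y.
    x = y

    ! Newton's method: x <- x - f(x)/f'(x), with f(x) = x**3 + x - y.
    do iter = 1, max_iter
       dx = (x**3 + x - y) / (3._r8*x**2 + 1._r8)
       x = x - dx
       ! Relative test, so that the tolerance also works for large x.
       if (abs(dx) <= tol * max(abs(x), 1._r8)) exit
    end do
  end function invert_cubic

end module comment_sketch
```

Note that none of the comments restate a line of code; each one records a purpose, an assumption, or the algorithm in use.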
2. Avoid outdated or non-standard Fortran language features.
- All modules must use "implicit none" at the very top (after any "use" statements). If you don't do this, Fortran allows you to use variables without declaring them, which can lead to hard-to-find errors.
- "Fixed form" source was the format for Fortran 77 (and older) source code; it allowed Fortran source to be input using punched cards. Fortran 90 introduced the more modern "free form" source code format, making fixed form source code obsolete. For all new modules, use free form source code.
- "goto" statements should be replaced with other statements, such as "if", "do", "cycle", "exit", "return", or subroutine calls.
- "equivalence" statements give multiple names to the same location in memory. This can very easily cause confusion, so these statements should not be used.
  - If you don't really need the equivalence statement, simply remove it and the extra variables it mentions, so that you always refer to a value by just one name.
  - If you need to look at the same data two different ways, try separating the different uses of the data into two separate routines, each using a different format.
  - If you want to refer to a variable by multiple names, or to give a name to a specific element or section of an array, try using the Fortran 2003 "associate" statement.
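- "data" statements can usually be removed; instead of specifying a value using a data statement, simply set the variable when you declare it (or, for a value that never changes, declare it as a "parameter"). Note that a local variable initialized in its declaration implicitly has the "save" attribute, just as with a data statement, so it is initialized only once rather than on every call.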
- Common blocks have been replaced with modules in Fortran 90 code. All new routines should be declared in modules; this provides them with an explicit interface that compilers can check. In rare cases where a non-module routine must be added (e.g. due to dependency issues), only use it with an explicit interface block.
- "Statement functions" are functions defined in one line without being explicitly declared as functions. The Fortran standard has declared statement functions obsolescent. If you need to declare a function inside another function or subroutine, you can put a "contains" statement in the outer function, and then define the inner function after the "contains". This is called an internal function. You can also define an internal subroutine this way.
- Most CAM code is processed with the Fortran preprocessor ("fpp"). This is a non-standard, but common, extension to the language, which is similar to the C preprocessor. fpp is used primarily for conditional compilation (using statements like "#ifdef WACCM_CHEM"). Avoid using this feature as much as possible. We recommend this because:
  - Different compilers don't always implement fpp in the same way, making it a bit harder to write portable code.
  - Too much conditional compilation makes it harder to analyze and debug code, both for humans and for automated tools.
  - Setting an option with conditional compilation means that the executable must be re-compiled to change that option. If you can instead implement the option using the namelist, users may not have to rebuild the model as often.
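Several of the replacements listed above can be seen together in one small module. The sketch below is hypothetical (first_negative, layer_mean, and average are illustrative names, not CAM routines, and r8 is a local stand-in for shr_kind_r8). It uses "implicit none", an "exit" statement in place of a "goto", an "associate" construct in place of "equivalence", and an internal function in place of a statement function.

```fortran
module modern_fortran_sketch
  ! Hypothetical example: names are illustrative, not existing CAM code.
  implicit none   ! every variable must be declared explicitly
  private
  public :: r8, first_negative, layer_mean

  ! Stand-in for shr_kind_r8 so the sketch compiles outside CAM.
  integer, parameter :: r8 = selected_real_kind(12)

contains

  ! Return the index of the first negative element of x, or 0 if none.
  function first_negative(x) result(idx)
    real(r8), intent(in) :: x(:)
    integer :: idx
    integer :: i

    idx = 0
    do i = 1, size(x)
       if (x(i) < 0._r8) then
          idx = i
          exit   ! leave the loop early; no "goto" is needed
       end if
    end do
  end function first_negative

  ! Average each pair of adjacent interface values onto midpoints.
  ! Assumes size(x_int) == size(x_mid) + 1.
  subroutine layer_mean(x_int, x_mid)
    real(r8), intent(in)  :: x_int(:)   ! values at interfaces
    real(r8), intent(out) :: x_mid(:)   ! values at midpoints
    integer :: k

    do k = 1, size(x_mid)
       ! "associate" names two array elements without "equivalence".
       associate (top => x_int(k), bottom => x_int(k+1))
         x_mid(k) = average(top, bottom)
       end associate
    end do

  contains

    ! Internal function, used where older code might have used a
    ! statement function.
    pure function average(a, b) result(avg)
      real(r8), intent(in) :: a, b
      real(r8) :: avg
      avg = 0.5_r8 * (a + b)
    end function average

  end subroutine layer_mean

end module modern_fortran_sketch
```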
3. Avoid duplicating code; instead, try to reuse code in multiple places.
Reusing code often makes the code more readable, can save time, and reduces the probability of an error.
- Instead of copying/pasting a code block, try moving the code into a new function, and call the function where it's needed.
- Instead of declaring multiple sets of related variables, consider creating a derived type that packages the data together.
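One possible way to combine both suggestions is sketched below: a derived type packages related variables, and a single function replaces a formula that might otherwise be pasted into several routines. The type air_state and function density are hypothetical names. The local rair simply repeats the value 287 used in the example earlier on this page; in CAM itself the physconst value should be used instead. r8 is a local stand-in for shr_kind_r8.

```fortran
module air_state_sketch
  ! Hypothetical example: names are illustrative, not existing CAM code.
  implicit none
  private
  public :: r8, air_state, density

  ! Stand-in for shr_kind_r8 so the sketch compiles outside CAM.
  integer, parameter :: r8 = selected_real_kind(12)

  ! Gas constant for dry air [J/kg/K].
  ! Defined here only to keep the sketch self-contained; in CAM,
  ! use rair from physconst instead.
  real(r8), parameter :: rair = 287._r8

  ! Related variables that would otherwise be passed around separately.
  type :: air_state
     real(r8) :: t   ! temperature [K]
     real(r8) :: p   ! pressure [Pa]
  end type air_state

contains

  ! Dry air density from the ideal gas law [kg/m^3].
  ! "elemental" lets the same function accept a scalar or an array.
  elemental function density(state) result(rho)
    type(air_state), intent(in) :: state
    real(r8) :: rho

    rho = state%p / (rair * state%t)
  end function density

end module air_state_sketch
```

A caller can then write rho = density(state) for one state, or apply the same function to a whole array of states, rather than repeating the formula.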
4. Avoid certain practices that are error prone.
- Avoid writing to global data (data in a common block or module), except during initialization. Writing to global data can cause confusion, and is usually not thread-safe.
- Avoid pointers when possible. If you need to allocate memory, it is better to use allocatable variables than pointers. If you need to alias a variable with a new name, try making a copy or using an "associate" block.
- Avoid giving different variables similar names, since they may be confused with each other.
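As a small illustration of preferring allocatable variables to pointers, the hypothetical routine below returns a result in an allocatable array. The name running_sum is made up for this sketch, and r8 is a local stand-in for shr_kind_r8. Unlike a pointer, an allocatable array cannot alias other data, and the compiler deallocates it automatically when appropriate.

```fortran
module allocatable_sketch
  ! Hypothetical example: names are illustrative, not existing CAM code.
  implicit none
  private
  public :: r8, running_sum

  ! Stand-in for shr_kind_r8 so the sketch compiles outside CAM.
  integer, parameter :: r8 = selected_real_kind(12)

contains

  ! Compute the cumulative sum of x.
  subroutine running_sum(x, total)
    real(r8), intent(in) :: x(:)
    ! An allocatable intent(out) argument is deallocated automatically
    ! on entry if the caller passed in an allocated array.
    real(r8), allocatable, intent(out) :: total(:)
    integer :: i

    allocate(total(size(x)))
    if (size(x) == 0) return

    total(1) = x(1)
    do i = 2, size(x)
       total(i) = total(i-1) + x(i)
    end do
  end subroutine running_sum

end module allocatable_sketch
```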
5. Try to organize code into many routines, each with a specific purpose and few arguments.
Avoid using one long function to do very many things.
- With more specific routines, it is easier to tell what each routine does just by looking at it.
- With shorter routines, each variable is in scope for a shorter time, so you can more easily see how it is being used.
- Routines that are limited to one purpose are easier to reuse. Reuse makes it faster to write code in the long term, and causes fewer errors than if you copy and paste code, or rewrite it multiple times.
6. Use an editor with features that help you write Fortran source code.
There is a wide range of editors out there, but even very basic editors, such as vim, can be a big help if Fortran features are enabled. The most common choice for Fortran programmers is probably Emacs with f90-mode.
- Syntax highlighting can help to point out when you've made a mistake.
- Automatic indentation can also help to point out mistakes, and it keeps the code clear.
- Many editors can show which column number the cursor is in. This is helpful for fixed-form source code, though free-form is preferred for new modules (see above). Even in free-form source code, lines may be no longer than 132 characters, and we recommend keeping them closer to 80-90 characters. A statement that does not fit on one line can be continued on the next line by ending the line with "&".
7. When modifying an existing module, follow the conventions it uses.
- Preserve existing naming conventions in a module.
- Follow the existing style with respect to whitespace and capitalization.
- Even if you feel that your own style is better, it can be distracting to have two different styles in the same module.
- If an existing module doesn't follow a consistent pattern, you are encouraged to bring it into line with a pattern with which you are familiar.
8. Be careful with floating point arithmetic.
- Watch out for divide-by-zero and overflow errors.
- It's usually a mistake to test floating-point numbers for equality, so avoid using "==", "/=", ".eq.", and ".ne." to compare reals. Instead, check to see if the two reals are very close (within some tolerance you pick).
- Keep in mind that two expressions can be mathematically equivalent, but not equivalent with limited precision representations of numbers. (x+y)+z is not always the same as x+(y+z).
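A tolerance-based comparison could be written along the following lines. This is a sketch under stated assumptions, not a CAM utility: the function name nearly_equal and its arguments are hypothetical, r8 is a local stand-in for shr_kind_r8, and suitable tolerances depend on the calculation at hand.

```fortran
module float_compare_sketch
  ! Hypothetical example: names are illustrative, not existing CAM code.
  implicit none
  private
  public :: r8, nearly_equal

  ! Stand-in for shr_kind_r8 so the sketch compiles outside CAM.
  integer, parameter :: r8 = selected_real_kind(12)

contains

  ! True if a and b differ by no more than the larger of an absolute
  ! tolerance and a relative tolerance scaled by the larger magnitude.
  ! Assumes both tolerances are non-negative and a and b are finite.
  pure function nearly_equal(a, b, rel_tol, abs_tol) result(is_close)
    real(r8), intent(in) :: a, b
    real(r8), intent(in) :: rel_tol   ! relative tolerance [unitless]
    real(r8), intent(in) :: abs_tol   ! absolute tolerance [units of a, b]
    logical :: is_close

    ! The absolute tolerance matters when a and b are both near zero,
    ! where a purely relative bound becomes vanishingly small.
    is_close = abs(a - b) <= max(abs_tol, rel_tol * max(abs(a), abs(b)))
  end function nearly_equal

end module float_compare_sketch
```

For example, (0.1_r8 + 0.2_r8) + 0.3_r8 and 0.1_r8 + (0.2_r8 + 0.3_r8) may differ in their last bits, so a test with "==" can fail even though nearly_equal with a small relative tolerance reports them as equal.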
9. Make use of attributes that limit the ways that data can be used.
- Use "private" at the top of modules to hide most information, and then declare only certain parts of a module "public". This prevents external code from using data outside of the interface you provide.
- Fortran 2003 also provides the "protected" keyword. A public, protected variable is read-only; any module can read its value, but it can only be changed within the module where it is declared.
- Use the "intent" attribute on all arguments to functions and subroutines. This makes it possible for a compiler to check that you are using inputs and outputs in the right way (and it can help with optimization).
- Use "parameter" for constants that should never be changed during a run (e.g. the value of pi).
- Use "pure" for functions or subroutines that have no side effects. Because a pure procedure cannot contain "print" statements or other I/O on external files, using "pure" can make code harder to debug, so you may want to do this only for small utility functions.
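The attributes described above can be combined in a single module, roughly as follows. The module and its contents (step_size, set_step_size, days_elapsed) are hypothetical and chosen only for illustration; r8 is a local stand-in for shr_kind_r8.

```fortran
module time_step_sketch
  ! Hypothetical example: names are illustrative, not existing CAM code.
  implicit none
  private   ! hide everything unless it is listed as public below
  public :: r8, step_size, set_step_size, days_elapsed

  ! Stand-in for shr_kind_r8 so the sketch compiles outside CAM.
  integer, parameter :: r8 = selected_real_kind(12)

  ! Never changes during a run, so it is a parameter. Not listed as
  ! public, so it is invisible outside this module.
  real(r8), parameter :: seconds_per_day = 86400._r8   ! [s]

  ! Length of one model time step [s]. Public and protected: any
  ! module can read it, but only this module can change it.
  real(r8), protected :: step_size = 0._r8

contains

  ! Set the time step; intended to be called once, at initialization.
  subroutine set_step_size(new_step_size)
    real(r8), intent(in) :: new_step_size   ! [s]

    step_size = new_step_size
  end subroutine set_step_size

  ! Days of model time covered by nstep steps.
  ! Pure: it reads module data but modifies nothing.
  pure function days_elapsed(nstep) result(days)
    integer, intent(in) :: nstep   ! number of steps taken
    real(r8) :: days               ! elapsed time [days]

    days = real(nstep, r8) * step_size / seconds_per_day
  end function days_elapsed

end module time_step_sketch
```

With this layout, an assignment to step_size in any other module is rejected at compile time, as is any attempt to modify nstep inside days_elapsed.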
10. Collaborate.
- Try to find other people to review your code and see if it makes sense. The other person may save you time by finding errors or design flaws that you missed.
- Take notice of good or bad practices in other people's code, when you read or modify it yourself.
- Take notice of tools that other programmers and scientists use. It can be hard to get to know all the Fortran intrinsics, utility code in CAM, testing tools, debugging features, and simplified models. Don't be discouraged; these tools can save a lot of effort in the long run.
CLM Coding Conventions
The CLM Coding Conventions contain some generally useful advice, and may be of interest to CAM coders looking for more advice (or examples), or developers working on both models. There are a few differences worth emphasizing.
Arguably, CAM has had more problems with misleading or out-of-date comments than with a lack of comments. Therefore we do not require comments in all of the same situations as CLM does. However, we still encourage users to write code that's as clear as possible in these cases. For instance, when you use several "if" blocks, you don't necessarily need to comment the "end if" statement to specify which "if" it matches. However, it's preferable to avoid very long or deeply nested conditionals, so that it's clear which "end if" matches a given "if" anyway.
Similarly, CLM requires lower bounds to be specified for array arguments, and this is related to the way that threading is handled. In most CAM modules, all arrays have a lower bound of 1, and threading is handled at higher levels of the physics. Since in most cases the lower bounds of arrays are all 1, we do not require "1" to be specified.
CAM does not require specific numbers of spaces for indentation; if editing an existing module, simply follow the same style as used in that module. However, tabs should be avoided; CAM developers use many different editors, which treat tabs in very different ways. Furthermore, tabs make it difficult to determine column number, in those rare cases where it matters in Fortran.