Files as Persistent Storage and Data Exchange A file is a named region on a physical disk where information is stored. Programs need files to exchange data between computers and to keep data after a program ends, since a process's memory is freed when it terminates. Files also serve as input: a program can read the data it needs from them.
Standardized Types: MIME's Type/Subtype Model Files can be categorized by standards, not only by informal labels like images or audio. MIME describes a file by a type and a subtype, such as text/plain. A tool like file can therefore recognize C source code as a textual type (typically reported as text/x-c) distinct from ordinary plain text, which is labeled text/plain.
Readable Text vs Program‑Specific Binary Text files are readable in any text editor. Binary files store data in program‑specific formats and are mostly unreadable in a text editor. Images and audio files are typical binaries that require the right programs to interpret their contents.
From Idea to cat: Parsing Arguments and Validating Input A cat‑like program in C includes standard headers and reads command‑line arguments via argc and argv. The executable name is argv[0], so a filename is expected at argv[1] to know which file to print. On a wrong argument count, print an error with fprintf and terminate with exit; unlike return, exit ends the whole program with a status code and can be called from any function.
stdout vs stderr and Shell Redirection Standard input provides data to programs, while standard output and standard error are distinct channels for normal results and error messages. Linux redirection lets you separate them using operators like > for output and 2> for errors. File‑oriented C functions can target these streams, so one fprintf can write either to a file or to stderr.
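A minimal sketch of the two output channels (the program name demo in the usage note is an illustrative choice):

    #include <stdio.h>

    int main(void)
    {
        fprintf(stdout, "normal result\n");   /* standard output: normal results */
        fprintf(stderr, "error message\n");   /* standard error: error messages */
        return 0;
    }

Compiled to ./demo, running ./demo > out.txt 2> err.txt sends each line to its own file.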
Opening a File with fopen and Understanding Modes fopen returns a FILE pointer representing a stream to the file’s data. Pass the path from argv[1] and a mode string: "r" reads from the beginning, "w" truncates or creates for writing, and "a" appends at the end. Other variants allow combined reading and writing; on failure fopen returns NULL.
Detecting Open Errors and Exiting Cleanly Always check whether the FILE pointer is NULL after fopen to catch issues like a missing file. If opening fails, print a clear message to stderr and exit with a failure code. If it succeeds, the cursor starts at the beginning and the stream is ready for processing.
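Putting the argument check and the open check together, a sketch of the program's opening steps (the messages and exit codes are illustrative choices):

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {                      /* argv[0] is the program name */
            fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
            exit(1);                          /* exit can end the program from anywhere */
        }

        FILE *fp = fopen(argv[1], "r");       /* "r": read from the beginning */
        if (fp == NULL) {                     /* NULL signals the open failed */
            fprintf(stderr, "Cannot open %s\n", argv[1]);
            exit(1);
        }

        /* the stream is ready for processing here */
        fclose(fp);
        return 0;
    }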
Character‑by‑Character Reading with fgetc and putchar fgetc reads one character from the stream and advances the cursor position. It returns an integer code for the character, which can be sent to the screen with putchar. A single call prints one character; printing the whole file requires iteration.
EOF Is a Sentinel Value, Not a Printable Character When no more data remains, fgetc returns EOF (typically −1). EOF is not a real file byte, so trying to print it produces a placeholder like a question mark. An endless loop that keeps printing after the end will repeat that placeholder indefinitely.
A Safe Loop to Read Until the End Read the first character before entering the loop and continue while it differs from EOF. Inside, print the character and fetch the next one to let EOF stop the iteration. This pattern prints the entire file correctly and never attempts to render EOF.
Remember to fclose: Releasing the Stream Closing with fclose frees resources and finalizes buffered operations. Relying on implicit closure at program termination is unsafe in larger scenarios. Pair every fopen with a matching fclose to avoid resource leaks.
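The complete pattern, reading before the test so EOF is never printed, with the matching fclose (a sketch assuming fp was opened as above):

    int c = fgetc(fp);    /* c is an int so EOF fits alongside every byte value */
    while (c != EOF) {
        putchar(c);       /* print the current character */
        c = fgetc(fp);    /* fetch the next one; EOF ends the loop */
    }
    fclose(fp);           /* flush buffers and release the stream */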
A Stress Test: Creating 10,000 Files Generating names and opening thousands of files for writing exposes the cost of forgetting fclose. The run ends with a segmentation fault, only the files up to a certain index exist, and their contents are empty: the program crashes before the buffered data is flushed to disk.
System Limits: ulimit and Open File Descriptors Operating systems cap the number of files a process can have open at once; ulimit -n commonly shows 1024. Once that ceiling is hit, further fopen calls fail, and the test program crashes. Without closing, the descriptor pool is exhausted long before 10,000 files are reached.
Fixing the Test by Closing Inside the Loop Write the message, then call fclose in every iteration to keep the number of open files low. The pointer is reused each time, so releasing it prevents descriptor exhaustion. With this fix, all 10,000 files are created and contain the intended text.
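A sketch of the corrected stress test; the name pattern file%d.txt and the message are illustrative, the essential line is the fclose inside the loop:

    #include <stdio.h>

    int main(void)
    {
        char name[32];
        for (int i = 0; i < 10000; i++) {
            snprintf(name, sizeof name, "file%d.txt", i);
            FILE *fp = fopen(name, "w");
            if (fp == NULL) {             /* without the fclose below, this trips near the ulimit */
                fprintf(stderr, "open failed at %d\n", i);
                return 1;
            }
            fprintf(fp, "hello\n");
            fclose(fp);                   /* keeps at most one file open at a time */
        }
        return 0;
    }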
First N Characters: Implementing head To mimic head, read at most ten characters with a for loop. Break early if fgetc returns EOF to avoid printing anything beyond the available content. The output matches the system head for the first ten characters.
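The head loop as described, a sketch assuming fp is an open stream:

    for (int i = 0; i < 10; i++) {
        int c = fgetc(fp);
        if (c == EOF)      /* file shorter than ten characters: stop early */
            break;
        putchar(c);
    }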
The Hidden Newline at File End A Linux text file typically ends with a newline whose ASCII code is 10. This line‑feed is invisible in plain viewing but counts as a character when measuring by bytes. Knowing this explains seemingly missing or extra characters in simple tests.
Different Line Ending Conventions Across Systems Linux uses LF (code 10), while Windows uses CR followed by LF (codes 13 and 10). Windows files on Linux can look double‑spaced, and Linux files on Windows can appear as a single long line. Some editors allow disabling the automatic end‑of‑line at file end if needed.
Last N Characters: Implementing tail with fseek To print the final ten characters, position the cursor near the end using fseek. Seek from SEEK_END with a negative offset, stepping back by eleven to include the trailing newline before reading forward. If the file is shorter than the offset, fseek fails rather than moving before the beginning, so check its return value.
Offsets and Anchors: Mastering fseek Parameters fseek takes a long offset and a whence anchor: SEEK_SET for the start, SEEK_CUR for the current position, and SEEK_END for the end. Positive offsets move forward from the anchor and negative offsets move backward. With this control and the earlier reading loop, a cat‑like pass prints precisely the desired tail.
Precise Seeking with fseek With fseek, the cursor moves 11 characters left relative to the file's end to mimic tail. The function is declared in stdio.h, so the program compiles cleanly and runs on the test file. The output matches the tail command, confirming correct cursor control. Head and tail behavior are thus achieved by seeking to precise positions.
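A sketch of the tail logic, assuming fp is an open stream; the rewind fallback for short files is one possible choice, not prescribed by the text:

    if (fseek(fp, -11L, SEEK_END) != 0)   /* fails if the file holds fewer than 11 bytes */
        rewind(fp);                       /* fall back to printing the whole file */
    int c = fgetc(fp);
    while (c != EOF) {
        putchar(c);
        c = fgetc(fp);
    }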
File Size via fseek and ftell Seek to the end of the file and call ftell to get the byte offset from the start. Store the result in a variable and print it with printf. The reported sizes match Linux tools: cat.c shows 465 bytes, hello.txt shows 12. The values include the final newline character present in these text files.
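The measurement in two calls, assuming fp is an open stream:

    fseek(fp, 0L, SEEK_END);      /* move the cursor to the very end */
    long size = ftell(fp);        /* offset from the start = size in bytes */
    printf("%ld bytes\n", size);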
Newline and EOF Explained The newline at the end of a text file is a normal byte (ASCII 10) and counts toward the size. EOF is not part of the file; it is just a sentinel value signaling that no more data remains. Editors often append a final newline; in Vim this can be disabled with the noeol setting. Understanding this explains apparent off-by-one results when counting characters.
Safe Two-File Setup for Copying Open the source path for reading and the destination path for writing using fopen. Verify argc equals three and check both FILE pointers for NULL to handle errors. Print which file failed to open, then terminate with an error status if needed. With streams ready, the program can proceed to copy content.
Char-by-Char Copy with fgetc/fputc Use fgetc to read one character at a time until EOF. For each character, call fputc to write it to the destination stream. Tests copying hello.txt to hello2.txt and cat.c to cat2.c produce identical content. diff shows no differences and file sizes remain the same.
I/O Counters Expose Inefficiency Introduce input and output counters around each read and write call. Outputs equal the number of characters copied, while inputs are one higher due to the final EOF read. The totals reveal hundreds of I/O operations for larger files. This motivates a more efficient, batched approach.
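A sketch of the counted per-character copy, assuming src and dst were opened with "r" and "w" as described:

    long inputs = 0, outputs = 0;
    int c;
    while ((c = fgetc(src)) != EOF) {
        inputs++;                 /* one read per character */
        fputc(c, dst);
        outputs++;                /* one write per character */
    }
    inputs++;                     /* the final read that returned EOF */
    printf("inputs: %ld, outputs: %ld\n", inputs, outputs);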
fscanf/fprintf Token Copy Breaks Formatting Replace character I/O with fscanf("%s") into a fixed buffer and fprintf to write tokens. "%s" writes past the end of a too-small buffer (a buffer overflow), which caused a malfunction until the size was increased to 20. Although the operation count dropped, spaces vanished and lines merged because "%s" skips and discards whitespace. Even inserting spaces on output didn't restore the structure, so this method fails for exact copying.
Line-Preserving Copy with fgets/fputs Switch to fgets to read entire lines, which includes the newline and preserves formatting. Write each line with fputs, adjusting the end-condition because fgets returns NULL at EOF. The copy now matches the source exactly under diff. The number of I/O operations is significantly lower than the per-character version.
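The line-based loop, assuming src and dst are open; the buffer size of 100 is an illustrative choice:

    char line[100];
    while (fgets(line, sizeof line, src) != NULL) {  /* NULL marks end of file */
        fputs(line, dst);                            /* newline included, layout preserved */
    }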
Buffer Size Trade-offs in Line I/O If the line buffer is too small, fgets splits long lines across multiple reads without overflowing. Correctness remains, but the program performs more input and output operations. Oversized buffers avoid extra calls but may leave unused capacity. Efficiency improves when the buffer is filled on most iterations.
High-Throughput Block I/O with fread/fwrite Adopt fread and fwrite to move blocks defined by element size and count. Provide the buffer pointer, size of each element (a byte for text), number of elements, and the stream. Think of a crate holding many bottles: count is bottles, size is each bottle’s volume. This design maximizes useful work per call and minimizes overhead.
Implementing the Blocked Copier Implement a loop that reads into a 20-byte buffer with fread and writes the returned count with fwrite. Update the input and output counters around each call to measure efficiency gains. The program compiles and runs with only a few dozen I/O operations. diff confirms the destination is identical to the source.
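A sketch of the blocked copier with the 20-byte buffer and the counters, again assuming src and dst are open:

    char buf[20];
    size_t n;
    long inputs = 0, outputs = 0;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {  /* n = bytes actually read */
        inputs++;
        fwrite(buf, 1, n, dst);                         /* write exactly those n bytes */
        outputs++;
    }
    printf("inputs: %ld, outputs: %ld\n", inputs, outputs);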
Binary I/O Uses the Same Primitives These same block functions are used for binary files, not only for text copying. Binary files follow the layout the program defines rather than human-readable encoding. Applying fread and fwrite enables precise storage and retrieval of raw values. The next steps illustrate this with a single integer.
Writing an int to data.bin Open a file named data.bin in "wb" mode to write binary data. Assign 12345 to an int and call fwrite(&value, sizeof(int), 1, fp) to store it. This writes sizeof(int) bytes, four on typical systems, whereas text would require five characters and often a newline. Close the stream to finish the write.
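The write as described, a minimal sketch:

    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("data.bin", "wb");   /* "wb": write in binary mode */
        if (fp == NULL)
            return 1;
        int value = 12345;
        fwrite(&value, sizeof(int), 1, fp);   /* one element of sizeof(int) bytes */
        fclose(fp);                           /* flush and release the stream */
        return 0;
    }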
Why Editors Misread Ad-Hoc Binary Files Tools like file or MIME detectors may call the result text/plain, and an editor may show "90" or other gibberish: 12345 is 0x3039, so on a little-endian machine the first two bytes are 0x39 and 0x30, the ASCII codes for '9' and '0', followed by two zero bytes. Ad-hoc binary files lack the signatures and metadata that standard formats use for identification. The operating system cannot know the intended interpretation from the bytes alone. Only a reader that mirrors the writer's logic can decode the content properly.
Reading the int Back Reliably Open data.bin in "rb" mode and declare an int to receive the value. If fread returns 1 item, print the number; otherwise report an error. Running the reader prints 12345, validating the write/read pair. Editors couldn’t produce this result because they don’t apply the same structure.
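The matching reader; the item count returned by fread, not the variable's content, decides success:

    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("data.bin", "rb");   /* "rb": read in binary mode */
        if (fp == NULL)
            return 1;
        int value;
        if (fread(&value, sizeof(int), 1, fp) == 1)   /* one item read: success */
            printf("%d\n", value);                    /* prints 12345 */
        else
            fprintf(stderr, "read failed\n");
        fclose(fp);
        return 0;
    }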
Binary Size vs Textual Digits Checking the file shows a size of four bytes, versus six for textual storage of 12345 followed by a newline. Choosing long would take eight bytes; short would take two, since 12345 fits its range. The binary footprint depends on the data type rather than the digit count. Type selection affects both space usage and compatibility.
Robust Argument and Stream Error Handling The utility validates argument count and names the path when fopen fails. Clear error messages and early exits prevent cascading failures. Both streams are closed with fclose to finish cleanly. These patterns help diagnose issues when automating copies.
Interpreting EOF and Return Values Per-character loops incur one extra input call that yields EOF, explaining the higher input count. fgets signals end via NULL, fscanf via EOF, and fread by returning zero items. A zero item count indicates no data was read, typically at end-of-file. Reading logic should rely on these returns rather than assumptions.
Choosing the Right Tool and Final Cautions For exact copies, avoid scanf/printf tokenization; prefer line-based or block-based I/O. Small buffers remain correct yet increase operations, while block I/O reduces calls sharply. When reading binary data, never infer success from the variable’s value—zero may be legitimate—check the function’s return. With careful seeking, sizing, copying, and binary reads/writes, file tasks become both accurate and efficient.