Comparing Golang and understanding Java Value Types

I start to talk about

Comparing Golang and understanding Java Value Types

The talk compares the memory model of the Go programming language to the memory model of Java. This comparison will help Java developers understand the planned Java 10 feature: Value Types. The talk will describe how these are implemented in Go, and why they are so much needed for the Java language. At the end of the presentation, the audience will also understand why Value Types cannot be extended, are immutable, and are always passed by value.
This talk is very valuable for the audience because it is about the Go language as well as a feature of an upcoming Java version that is not available yet. Both are gems for the audience.

right now.

You can look at the slides at https://github.com/verhas/compare-go-java-devdays2018-peter-verhas

Prevent Hacking with Modules in Java 9

I start to talk about

Prevent Hacking with Modules in Java 9

Before Java 9 there was a lot of room to do tricky things, mainly using reflection. Some of these possibilities were even considered security holes. With the advent of Java 9, the module system closes these secret doors in the Java runtime library and also allows library developers to do the same for their libraries.
The presentation will demonstrate some shocking and funny examples of what you could do using Java 8 and then tries to do the same, obviously failing, using Java 9.

right now.

You can look at the slides at https://verhas.github.io/preventHack-J9-devdays2018-peter-verhas/#/

Generating Source Code, a Compromise

Source Code Generation is not Good

The most important statement in this topic, before we even start to discuss anything else, is that source code generation is a suboptimal solution. It may be needed and it may be a viable solution, but whenever source code is generated it could have been done in some better way. It is just that the environment, the available tools, or the developers are not fit for the purpose. Let me give some examples.

When you program Java you use Eclipse, IntelliJ or NetBeans. Each of these IDEs is capable of generating hashCode(). What is wrong with it? The language could provide a declarative description of how to compute the function. The hash code depends on the hash code of the fields and the calculation is fairly standard. Why can’t we just define which fields should be taken into account and the language would implicitly provide us with the method? In this case, the language is insufficient for the purpose. I do not say that Java should provide such a feature. Maybe it should, maybe it should not.
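To show how mechanical this code is, here is a sketch of the kind of methods an IDE typically generates. The Employee class and its fields are made up for the illustration; the only library call is the standard java.util.Objects helper. Everything in the two methods follows from the list of fields, which is exactly why a declarative description could replace them.

```java
import java.util.Objects;

// Hypothetical class, invented only to illustrate the generated code.
final class Employee {
    private final String name;
    private final String office;
    private final int salary;

    Employee(String name, String office, int salary) {
        this.name = name;
        this.office = office;
        this.salary = salary;
    }

    // Typical generated equals: compares the selected fields one by one.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Employee)) {
            return false;
        }
        Employee other = (Employee) o;
        return salary == other.salary
                && Objects.equals(name, other.name)
                && Objects.equals(office, other.office);
    }

    // Typical generated hashCode: combines the hash codes of the same fields.
    @Override
    public int hashCode() {
        return Objects.hash(name, office, salary);
    }
}
```

If a new field is added to the class, both methods have to be regenerated by hand, which is the weak point of edit-time generation discussed later.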

With setters and getters the issue is even more prominent. Java needs them and we have to generate them whenever there is a need. Other languages, like C#, Swift or even Groovy, support the feature on the language level.

Another example from my practice is when I needed several business object classes converted to Map<String,String> with a special format. I created some utility classes that listed the fields using reflection and performed the conversion. This solution, however, was rejected during code review. The code was too complex, and the teams that would later be responsible for the maintenance might not be able to cope with it. I could have said that they should hire cleverer people, but that costs more money and they wanted code that is cheap to maintain. The solution was to write extremely similar code for each and every business object class. It could have been generated if there had been a tool that could do that, but such a tool would have had to be part of the build process, which again increases the maintenance cost. In this case, the human environment was insufficient.
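For readers who have not seen such a utility, the following is a minimal sketch of the reflection-based approach. It is not the code from the project: the class names FieldMapper and Office are hypothetical, and the "special format" is replaced here by a plain String.valueOf call.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldMapper {
    // Converts the declared, non-static fields of an object to a name -> text map.
    static Map<String, String> toMap(Object bean) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Field field : bean.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue; // skip constants and compiler-generated fields
            }
            try {
                field.setAccessible(true); // the fields are private
                // Assumption: plain text conversion stands in for the real format.
                result.put(field.getName(), String.valueOf(field.get(bean)));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return result;
    }
}

// Hypothetical business object used to try the mapper.
final class Office {
    private final String city;
    private final int floor;

    Office(String city, int floor) {
        this.city = city;
        this.floor = floor;
    }
}
```

One generic method covers every business object class, which is what makes it attractive, and also what makes it harder to read than the repetitive hand-written alternative.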

Please do not start a flame war on this part of the article. This example is partially made up for NDA reasons, and after all it is not the major topic of the article.

Navigare Necesse Est

The above examples clearly show that source code generation is a must. We may not like it, but it is a must. The next question is when to generate code: in which phase of the development process?

It is fairly obvious that source code can only be generated before the compilation phase. You can generate source code after the compilation phase, but that is like calling a doctor after the patient is dead: no use. We can generate code during the build process, just before the compilation phase or as part of the editing process. Both have advantages and disadvantages.

Editing Phase Source Code Generation

When you generate code while you edit, the code generation does not need to be part of the build process. This means that the rebuild of the code is simpler, there are fewer potential deviations from the standard build process and thus you are more likely to be able to do it when you work in a restricted enterprise environment. An example is when you use IntelliJ to generate hashCode(). The generated method is available immediately in the editing environment, and functions like auto-complete will take the generated code into account.

The disadvantage is that the process is triggered manually. The more manual the process is, the more room there is for human error. You create a new field and you forget to update the hashCode() in the class. The generated code also gets into the source code repository, which may not be optimal. The source code repository is for source code, and generated source code is not really source code, is it?

Build Process Source Code Generation

When you generate the source code during the build process, the code generation tool will certainly rely on the latest version of the source code. In our example no field will be left out of the hashCode() method.

The disadvantage is that the build process is more complex. Your favorite code generation tool may not be available or allowed in the environment you work in. The tools that can be hooked into the build process usually generate whole files. It is not likely that you will generate a hashCode() method into the middle of a class using a tool that runs on the build server in batch mode. Also, you will not have the generated code in your IDE and you may lose some of the code editing support.

Build time source code generation tools are usually also environment specific. You may have a tool that works for Java but does not work for Rust or Python projects.

There is no clear “one is better than the other” decision. Sometimes build time source code generation is better; other purposes are better served by edit time source code generation. I created tools like Fluflu, mentioned in my article “Named parameters in Java”, and the Scriapt Java annotation processing tool, described in the article “Don’t write boilerplate, use scriapt”. These tools are Java specific and run at build time. They are annotation processors that hook into the Java compilation process, and thus, interestingly, the continuous builds of the IDEs also handle them.

Source Code Generation In-line

This time I want to write about Pyama, a tool written in Python that can be used to generate code not only for Java but also for Go, Rust, Markdown or just anything else. It is an editing phase tool and it was designed with editing in mind. The major idea was to automate the part of the editing process that can be automated.

My Demanding Need

The demanding need arose while I was editing the new edition of my book Java 9 Programming by Example, published by Packt. The first edition of the book was edited in MS Word and I had to copy and paste the source code samples from the IDE. However, book and code development is not a linear process. Sometimes the code was edited and modified after it was copied. It was a huge amount of work to revisit each code sample in the book to see if the latest version was included in the document. I wanted something else, something more automatic. Luckily the second edition, which will address Java 11, is edited in a different format that I can convert from Markdown. I edit the text in Markdown and I needed a tool that copies the code samples into the text.

The first idea was to create a tool that converts a .md.pre file, which contains markdown and special directives controlling the source code inclusion, into an .md file containing the code snippets. Such a solution, however, would not allow me to see the fully rendered document in a Markdown WYSIWYG editor. IntelliJ lets me edit the markdown text on the left side of the screen and see the rendered result on the right side, which is a great help when I forget to close a backtick. Thus I decided to create a tool that can copy the snippets into my edited text file. It is also very handy that IntelliJ keeps the file saved almost all the time and reloads it when it is modified on the disk. Therefore I can edit the file in the editor and I can safely edit the file with any external tool. Developing this tool was also a nice Python learning project.

I also wanted to create something more general than just fetching snippets from code files and inserting them into markdown documents. The outcome was a framework that, by now, has several extensions. One handles snippets and markdown, others generate Java code (setters, getters, equals, hashCode, constructors, builder methods), handle text macros, execute Python scripts in any code files and so on. These extensions are samples and you can create other extensions with a few lines of Python code. As far as book writing and Markdown are concerned, Pyama proved to be an extremely valuable tool.

Pyama Architecture

When generating code into already existing source files, it is evident that the unit of editing should be something more granular than a file. We should not overwrite a whole file with something new. The tool has to distinguish between the lines that need to be altered, or rather that are allowed to be altered, and those that must not be touched. Pyama introduces the notion of a segment when processing files. The tool splits up the source files it works with into segments. Segments contain lines of the text files. Thus a Pyama project works with files, each file contains segments and each segment contains lines. The segments of a file make up the whole file. In other words, there are no lines outside of segments.

Pyama reads the contents of the files into memory and then invokes the configured handlers (Python objects) to do whatever they should with the individual segments. When invoked, a handler works with a single segment. It can collect information from it, it can build up data structures to use later and it can read and modify the lines that are in the segment. This way the code of a handler is extremely simple, because it does nothing but process a list of strings and it does not need to care about anything else.

To decide where a segment starts and ends, Pyama asks the handler objects for regular expressions that identify the lines starting and ending segments. Different handlers may work with different segments and they may have different start and end patterns.

The segments in all files are processed a few times invoking the handlers in several passes. For example, the snippet reader may collect the code snippets from the configured source files into a snippet store where each snippet is identified with a name. In the next pass, the snippet writer handler looks at segments that start with a line referencing a named snippet and it replaces the lines of the segment with the current version of the collected snippet.

The snippet reader says that each line that contains START SNIPPET starts a new segment and such a segment lasts till a line containing END SNIPPET or till the end of the file. Then the code

// START SNIPPET main_java
     System.out.println("Hello, world!");
// END SNIPPET

will collect a snippet that contains the code sample. The snippet writer manages segments that start with a line that contains USE SNIPPET and the name of the snippet and end with a line containing END SNIPPET. If there is a line in a file that the snippet writer processes that reads

USE SNIPPET main_java
     System.out.println("Hello, outdated string world!");
END SNIPPET

it will replace it with

USE SNIPPET main_java
     System.out.println("Hello, world!");
END SNIPPET

The lines with the USE SNIPPET and END SNIPPET remain in the code, but in most formats it is possible to hide them in a comment that the consumer of the file (an HTML renderer or the Java compiler) will ignore.
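The two passes described above can be sketched in a few lines. The sketch below is written in Java to stay with the language of the other samples; it is not Pyama's code, which is Python and works with handler objects and regular expressions. The class name SnippetSketch and the simple substring matching are my simplifications for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SnippetSketch {
    // Pass 1: collect the lines between "START SNIPPET name" and "END SNIPPET".
    static Map<String, List<String>> collect(List<String> lines) {
        Map<String, List<String>> store = new HashMap<>();
        List<String> current = null;
        for (String line : lines) {
            int at = line.indexOf("START SNIPPET ");
            if (at >= 0) {
                String name = line.substring(at + "START SNIPPET ".length()).trim();
                current = new ArrayList<>();
                store.put(name, current);
            } else if (line.contains("END SNIPPET")) {
                current = null;
            } else if (current != null) {
                current.add(line);
            }
        }
        return store;
    }

    // Pass 2: replace the body of every "USE SNIPPET name" ... "END SNIPPET" segment.
    static List<String> apply(List<String> lines, Map<String, List<String>> store) {
        List<String> out = new ArrayList<>();
        boolean skipping = false;
        for (String line : lines) {
            int at = line.indexOf("USE SNIPPET ");
            if (at >= 0) {
                out.add(line); // the marker line itself stays in the file
                String name = line.substring(at + "USE SNIPPET ".length()).trim();
                List<String> body = store.get(name);
                if (body != null) {
                    out.addAll(body);
                    skipping = true; // drop the outdated lines of the segment
                }
            } else if (line.contains("END SNIPPET")) {
                skipping = false;
                out.add(line);
            } else if (!skipping) {
                out.add(line);
            }
        }
        return out;
    }
}
```

A segment that refers to an unknown snippet name is left untouched in this sketch, so a typo in the name does not delete text.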

This is only the tip of the iceberg of this code generation, text processing tool. There are handlers that can number the snippet lines, trim the code, skip certain lines that may not be interesting for the printout, apply regular expression search and replace, or even execute small Python scripts that can create the segment text.

For example the following code

/* PYTHON SNIPPET xxx
fields = ["String name", "String office", "BigDecimal salary"]
print("    public void setParameters(",end="")
print(", ".join(fields), end="")
print("){")
for field in fields:
    field_name = field.split(" ")[1]
    print("        this." + field_name + " = " + field_name + ";")
print("        }")

print("""
    public Map getMap(){
        Map retval = new HashMap();\
""")
for field in fields:
    field_name = field.split(" ")[1]
    print("        retval.put(\""+field_name+"\", this."+field_name+");")
print("        return retval;\n        }")

END SNIPPET*/

public class SimpleBusinessObject {
    //USE SNIPPET ./xxx
    public void setParameters(String name, String office, BigDecimal salary){
        this.name = name;
        this.office = office;
        this.salary = salary;
        }

    public Map getMap(){
        Map retval = new HashMap();
        retval.put("name", this.name);
        retval.put("office", this.office);
        retval.put("salary", this.salary);
        return retval;
        }
    //END SNIPPET
}

can easily be changed to contain another field, just by adding the type and the name of the field to the array named fields. In real life examples the source printing code would be in some external file and imported, and the generated code would probably also be more complex than this sample. This code, however, illustrates that with minimal Python knowledge such manual tasks can be automated.

Please feel free to try and use pyama available from GitHub.

Java getting back to the browser?

Betteridge’s law of headlines applies.

Lead-in

This article talks about WebAssembly and can be read to get the first glimpse of it. At the same time, I articulate my opinion and doubts. The summary is that WebAssembly is an interesting approach and we will see what it will become.

Java in the Browser, the past

There was a time when we could run Java applets in the browsers. There were a lot of problems with it, although the idea was not total nonsense. Nobody could tell at the time that the future of browser programmability would not be Java. Today we know that JavaScript was the winner, and the applet API is deprecated as of Java 9 and is going to be removed from later Java versions. This, however, does not mean that JavaScript is without issues or that it is the only and best possible solution for the purpose that a person can imagine.

JavaScript has language problems; there are a lot of WTFs included in the language. The largest shortcoming, in my opinion, is that it is a single language. Developers are different and like different languages. Projects are different and are best solved by different programming languages. Even Java would not be so immensely successful without the JVM infrastructure supported by so many different languages. There are a lot of languages that run on the JVM, even such a crap as ScriptBasic.

Now you can say that the same is true for the JavaScript infrastructure. There are other languages that are compiled to JavaScript. For example, there is TypeScript or there is even Java with the GWT toolkit. JavaScript is a target language, especially with asm.js. But still, it is a high-level, object-oriented, memory-managed language. It is nothing like machine code.

Compiling to JavaScript invokes the compiler once, then the JavaScript syntax analyzer, internal bytecode and then the JIT compiler. Isn’t it a bit too many compilers till we get to the bits that are fed into the CPU? Why should we download the textual format JavaScript to the browser and compile it into bytecode each time a page is opened? The textual format may be larger, though compression technologies are fairly advanced, and the compilation runs millions of times on the client computer emitting a lot of carbon into the air, where we already have enough, no need for more.

(Derail: Somebody told me that he has an advanced compression algorithm that can compress any file into one bit. There is no issue with the compression. Decompression is problematic though.)

WebAssembly

Why can’t we have some bytecode based virtual machine in the browser? Something like what the JVM once was for applets. This is what the WebAssembly people were thinking in 2015. They created WebAssembly.

WebAssembly is a standard program format to be executed in the browser nearly as fast as native code. The original idea was to “complement JavaScript to speed up performance-critical parts of web applications and later on to enable web development in other languages than JavaScript.” (Wikipedia)

Today the interpreter runs in Firefox, Chromium, Google Chrome, Microsoft Edge and in Safari. You can download a binary program to the browser and you can invoke it from JavaScript. There is also some tooling that supports developing programs in “assembly” and also in higher-level languages.

Structure

A WebAssembly binary contains blocks. Each block describes some characteristics of the code. I would say that most of the blocks are definition and structure tables and there is one that is the code itself. There is a block that lists the functions that the code exports, and which can be invoked from JavaScript. Also, there is a block that lists the functions that the code wants to invoke from the JavaScript code.
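To make the layout tangible, here is a small sketch that walks through a binary and lists its blocks, which the specification calls sections. It relies on three facts of the first version of the binary format: the file starts with the four magic bytes "\0asm" and a four-byte version, every section starts with a one-byte id, and the id is followed by the section length encoded as an unsigned LEB128 number. The class name WasmSections is made up, and the sketch does no validation beyond the magic bytes.

```java
import java.util.ArrayList;
import java.util.List;

final class WasmSections {
    // Section ids 0..11 as defined by the first version of the binary format.
    private static final String[] NAMES = {
        "custom", "type", "import", "function", "table", "memory",
        "global", "export", "start", "element", "code", "data"
    };

    // Returns the names of the sections in the order they appear in the module.
    static List<String> list(byte[] module) {
        // Every module starts with the magic bytes "\0asm" followed by a 4-byte version.
        if (module.length < 8 || module[0] != 0 || module[1] != 'a'
                || module[2] != 's' || module[3] != 'm') {
            throw new IllegalArgumentException("not a WebAssembly binary");
        }
        List<String> names = new ArrayList<>();
        int pos = 8;
        while (pos < module.length) {
            int id = module[pos++] & 0xFF;
            // Unsigned LEB128: 7 payload bits per byte, low bits first,
            // the top bit of each byte says whether another byte follows.
            long size = 0;
            int shift = 0;
            int b;
            do {
                b = module[pos++] & 0xFF;
                size |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            names.add(id < NAMES.length ? NAMES[id] : "unknown(" + id + ")");
            pos += (int) size; // skip the body of the section
        }
        return names;
    }
}
```

The export and import sections in the list are the two blocks mentioned above that connect the module to JavaScript.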

The assembly code is really assembly. When I started to play with it I had some nostalgic feeling. Working with these hex codes is similar to programming the Sinclair ZX80 in Z80 assembly when we had to convert the code manually to hex on paper and then we had to “POKE” the codes from BASIC to the memory. (If you understand what I am talking about you are seasoned. I wanted to write ‘old’ but my editor told me that is rude. I am just kidding. I have no editor.)

I will not list all the features of the language. If you are interested, visit the WebAssembly page. There is readable documentation about the binary format.

There are, however, some interesting features that I want to talk about to later express my opinions.

No Objects

The WebAssembly VM is not an object-oriented VM. It does not know objects, classes or any similar high-level structures. It really looks like some machine language. It has some primitive types, like i32, i64, f32, f64 and that is it. The compiler that compiles a high-level language has to use these.

No GC

The memory management is also up to the application. It is assembly. There is no garbage collector. The code works on a (virtually) contiguous memory segment that can grow on request (in the MVP it cannot shrink) and it is totally up to the application to decide which code fragment uses which memory address.
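What "up to the application" means can be shown with the simplest possible allocator. The sketch below simulates the linear memory with a Java byte array and hands out addresses by bumping a pointer. The class LinearMemory and its methods are invented for the illustration; the two facts taken from WebAssembly are the 64 KiB page size and the little-endian byte order.

```java
import java.util.Arrays;

final class LinearMemory {
    static final int PAGE_SIZE = 64 * 1024; // WebAssembly memory grows in 64 KiB pages

    private byte[] bytes = new byte[PAGE_SIZE]; // stands in for the linear memory
    private int next = 0;                       // first address that is still free

    // Reserves 'size' bytes and returns their start address; nothing is ever freed.
    int allocate(int size) {
        int address = next;
        while (address + size > bytes.length) {
            // Corresponds to asking the VM for one more page.
            bytes = Arrays.copyOf(bytes, bytes.length + PAGE_SIZE);
        }
        next = address + size;
        return address;
    }

    // Stores a 32-bit value in little-endian order, as an i32 store would.
    void storeInt(int address, int value) {
        for (int i = 0; i < 4; i++) {
            bytes[address + i] = (byte) (value >>> (8 * i));
        }
    }

    // Reads the four bytes back into a 32-bit value.
    int loadInt(int address) {
        int value = 0;
        for (int i = 0; i < 4; i++) {
            value |= (bytes[address + i] & 0xFF) << (8 * i);
        }
        return value;
    }

    int pages() {
        return bytes.length / PAGE_SIZE;
    }
}
```

A real language runtime compiled to WebAssembly has to ship something much smarter than this, including its own way to release memory, which is part of the reason the binaries of managed languages get large.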

Two Stacks

There are two stacks the VM works with. One is the operand stack for arithmetic operations. The other one is the call stack. There are functions that can call each other and return to the caller. The call sequence is stored in a stack. This is a very usual approach. The only shortcoming is that there is no possibility to mark the call stack and purge it when an exception happens. The only possibility to handle a try/catch programming structure is to generate code before and after function calls that checks for exception conditions, and if the exception is not caught on the caller function level then the code has to return to the higher level caller. This way the exception handling walks through the call stack with the extra generated code around each function call. This slows down not only the exception handling but also the function calls.
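The following sketch shows, in Java instead of WebAssembly, what such generated checks look like. It does not reproduce the output of any real compiler. The source statements in the comments are hypothetical, and a static field plays the role of the "exception is pending" flag.

```java
final class GeneratedExceptionSketch {
    static final int DIVISION_BY_ZERO = 1;

    // Stands in for a global "pending exception" slot; 0 means no exception.
    static int pendingException = 0;

    // Source: int divide(a, b) { if (b == 0) throw DivisionByZero; return a / b; }
    static int divide(int a, int b) {
        if (b == 0) {
            pendingException = DIVISION_BY_ZERO; // the "throw"
            return 0; // dummy value, the caller has to check the flag
        }
        return a / b;
    }

    // Source: int average(sum, count) { return divide(sum, count); }
    // There is no try/catch in the source, yet a check is generated after the call.
    static int average(int sum, int count) {
        int result = divide(sum, count);
        if (pendingException != 0) {
            return 0; // not caught here: return immediately to our own caller
        }
        return result;
    }

    // Source: try { return average(sum, count); } catch (DivisionByZero e) { return -1; }
    static int safeAverage(int sum, int count) {
        int result = average(sum, count);
        if (pendingException == DIVISION_BY_ZERO) {
            pendingException = 0; // caught: clear the flag and run the handler
            return -1;
        }
        return result;
    }
}
```

Note that average() pays for the check on every call, even when nothing is ever thrown, which is the slowdown mentioned above.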

Single Thread

There is no threading in WebAssembly.

Support, Tooling

The fact that most of the browsers support WebAssembly is one half of the bread. There have to be developer tools supporting the concept to have code that can be executed.

There is an LLVM backed compiler solution so technically any language that is compiled to LLVM should be compilable to WebAssembly and run in the browser. There is a C compiler in the tooling and you can also compile Rust to WebAssembly. There is also a textual format in case you want to program directly at the assembly level.

Security

Security is at least questionable. First of all, WebAssembly is binary, therefore it is not possible, or at least complex to look at the code and analyze the code. The download of the code does not require channel encryption (TLS) therefore it is vulnerable to MITM attack. Similarly, WebAssembly does not support code signature that would assert that the code was not tampered with since being generated in the (hopefully protected) development environment.

WebAssembly runs in a sandbox, just like JavaScript or like Flash was running. Fairly questionable architecture from the security point of view.

You can read more on the security questions in this article.

Roadmap

WebAssembly was developed for two years to reach a Minimal Viable Product (MVP) that can be used as a PoC. There are features, like garbage collection, multi-thread support, exception handling support, SIMD type instructions and DOM access support directly from WebAssembly, which are to be developed after the MVP.

Present and Future

I can say after playing with WebAssembly for about a weekend that it is an interesting and nice toy. In its current state, it is a toy, nothing more. Without the features planned after MVP, I see only one viable use case: WebAssembly is the perfect tool to deploy malicious mining code on the client machines. In addition to that, any implementation flaw in the engine is a security risk. Note that these security risks come from a browser functionality that gives no value to the average user. You can disable WebAssembly in some of the browsers. It is a little worrisome that it is enabled by default, although it is needed only by early adopters for PoC and not for commercial projects. If I were paranoid I would say that the browser vendors, like Google, have a hidden agenda with the WebAssembly engine in the browser.

I am afraid that we see no security issues currently with WebAssembly only because the technology is new and IT felons have not yet learned the tools. I am almost certain that security holes are lurking in the current code waiting to be exploited. Disable WebAssembly in your browser till you want to use it. Perhaps in a few years (or decades).

The original aim was to amend JavaScript. With the features after MVP, I strongly believe that WebAssembly will rather aim to replace JavaScript than amend it. There will be a time when we will be able to write applications to run in the browser in Golang, Swift, Java, C, Rust or whatever language we want to. So looking at the question in the title “will Java get back to the browser?” the answer is definitely NO. But some kind of VM technology, JIT, bytecode definitely will sometime in the future.

But not yet.

Comparing files in Java

I am creating a series of video tutorials for PACKT about network programming in Java. There is a whole section about Java NIO. One sample program copies a file via a raw socket connection from a client to a server. The client reads the file from the disk, and the server saves the bytes to disk as they arrive. Because this is a demo, the server and the client are running on the same machine and the file is copied from one directory to the exact same directory but under a different name. The proof of the pudding is in the eating: the files have to be compared.

The file I wanted to copy was created to contain random bytes. Transferring only text information can sometimes leave some tricky bug lurking in the code. The random file was created using this simple Java class:

package packt.java9.network.niodemo;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;

public class SampleMaker {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[1024 * 1024 * 10];
        try (FileOutputStream fos = new FileOutputStream("sample.txt")) {
            Random random = new Random();
            for (int i = 0; i < 16; i++) {
                random.nextBytes(buffer);
                fos.write(buffer);
            }
        }
    }
}

Using IntelliJ comparing files is fairly easy, but since the files are binary and large this approach is not really optimal. I decided to write a short program that will not only signal that the files are different but also where the difference is. The code is extremely simple:

package packt.java9.network.niodemo;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class SampleCompare {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        // try-with-resources closes both streams even if reading fails
        try (BufferedInputStream fis1 = new BufferedInputStream(new FileInputStream("sample.txt"));
             BufferedInputStream fis2 = new BufferedInputStream(new FileInputStream("sample-copy.txt"))) {
            long pos = 0; // zero-based position of the bytes currently compared
            int b1 = fis1.read();
            int b2 = fis2.read();
            // advance while both files still have bytes and the bytes agree
            while (b1 != -1 && b1 == b2) {
                pos++;
                b1 = fis1.read();
                b2 = fis2.read();
            }
            if (b1 == b2) {
                // both streams reached their end at the same position
                System.out.println("Files are identical, you can delete one of them.");
            } else if (b1 == -1 || b2 == -1) {
                System.out.println("Files have different length");
            } else {
                System.out.println("Files differ at position " + pos);
            }
        }
        long end = System.nanoTime();
        System.out.print("Execution time: " + (end - start) / 1000000 + "ms");
    }
}

The running time comparing the two 160MB files is around 6 seconds on my SSD equipped MacBook and it does not improve significantly if I specify a large, say 10MB, buffer as the second argument to the constructor of BufferedInputStream. (On the other hand, if we do not use the BufferedInputStream then the time is approximately ten times more.) This is acceptable, but if I simply issue a diff sample.txt sample-copy.txt from the command line, then the response is significantly faster than 6 seconds. It can be many things, like Java startup time, or code interpretation at the start of the while loop, till the JIT compiler thinks it is time to start to work. My hunch is, however, that the code spends most of the time reading the file into the memory. Reading the bytes to the buffer is a complex process. It involves the operating system, the device drivers and the JVM implementation. They move bytes from one place to the other and finally we only compare the bytes, nothing else. It can be done in a simpler way. We can ask the operating system to do it for us and skip most of the Java runtime activities, file buffers, and other glitter.

We can ask the operating system to read the file to memory and then just fetch the bytes one by one from where they are. We do not need a buffer, which belongs to a Java object and consumes heap space. We can use memory mapped files. After all, memory mapped files use Java NIO and that is exactly the topic of the part of the tutorial videos that are currently in the making.

Memory mapped files are read into the memory by the operating system and the bytes are available to the Java program. The memory is allocated by the operating system and it does not consume the heap memory. If the Java code modifies the content of the mapped memory then the operating system writes the change to the disk in an optimized way, when it thinks it is due. This, however, does not mean that the data is lost if the JVM crashes. When the Java code modifies the memory of the memory mapped file, it modifies memory that belongs to the operating system and that remains available and valid after the JVM has stopped. There is no guarantee and 100% protection against power outage and hardware crash, but that is very low level. If anyone is afraid of those then the protection should be on the hardware level, where Java has nothing to do anyway. With memory mapped files we can be sure that the data is saved to the disk with a certain, very high probability that can only be increased by failure tolerant hardware, clusters, uninterruptible power supplies and so on. These are not Java. If you really have to do something from Java to have the data written to disk then you can call the MappedByteBuffer.force() method that asks the operating system to write the changes to disk. Calling this too often and unnecessarily may hinder the performance though. (Simply because it writes the data to disk and returns only when the operating system says that the data was written.)
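A minimal sketch of writing through a mapping and calling force() is below. The class ForceDemo, the temporary file and the 16 byte size are arbitrary choices for the demonstration; the calls themselves (FileChannel.map with MapMode.READ_WRITE, MappedByteBuffer.put and force) are the standard NIO API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

final class ForceDemo {
    // Writes one byte through a read-write mapping, forces it to disk
    // and returns the content of the file read back the ordinary way.
    static byte[] writeThroughMapping(byte value) {
        try {
            Path file = Files.createTempFile("force-demo", ".bin");
            file.toFile().deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
                 FileChannel channel = raf.getChannel()) {
                raf.setLength(16L); // make the file as large as the mapped region
                MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0L, 16L);
                map.put(0, value); // modifies memory owned by the operating system
                map.force();       // returns only after the change was written out
            }
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without the force() call the program would usually behave the same way; the difference is only in when the bytes are guaranteed to have reached the storage device.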

Reading and writing data using memory mapped files is usually much faster in case of large files. To have the appropriate performance the machine should have significant memory, otherwise, only part of the file is kept in memory and then the page faults increase. One of the good things is that if the same file is mapped into the memory by two or more different processes then the same memory area is used. That way processes can even communicate with each other.

The comparing application using memory mapped files is the following:

package packt.java9.network.niodemo;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class MapCompare {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        // try-with-resources closes the files (and their channels) on every exit path
        try (RandomAccessFile f1 = new RandomAccessFile("sample.txt", "r");
             RandomAccessFile f2 = new RandomAccessFile("sample-copy.txt", "r")) {
            FileChannel ch1 = f1.getChannel();
            FileChannel ch2 = f2.getChannel();
            if (ch1.size() != ch2.size()) {
                System.out.println("Files have different length");
                return;
            }
            long size = ch1.size();
            // map the whole content of both files read-only
            ByteBuffer m1 = ch1.map(FileChannel.MapMode.READ_ONLY, 0L, size);
            ByteBuffer m2 = ch2.map(FileChannel.MapMode.READ_ONLY, 0L, size);
            for (int pos = 0; pos < size; pos++) {
                if (m1.get(pos) != m2.get(pos)) {
                    System.out.println("Files differ at position " + pos);
                    return;
                }
            }
        }
        System.out.println("Files are identical, you can delete one of them.");
        long end = System.nanoTime();
        System.out.print("Execution time: " + (end - start) / 1000000 + "ms");
    }
}

To memory map the files we have to open them first using the RandomAccessFile class and ask for the channel from that object. The channel can be used to create a MappedByteBuffer, which is the representation of the memory area where the file content is loaded. The method map in the example maps the file in read-only mode, from the start of the file to the end of the file. We try to map the whole file. This works only if the file is not larger than 2GB. The start position is a long but the size of the area to be mapped is limited to Integer.MAX_VALUE bytes.

Generally this is it… Oh yes, the running time comparing the 160MB random content files is around 1sec.
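If the files are larger than that, they can be mapped window by window, because only the size of a single mapping is limited and not the start position. The following is a sketch of this idea and not part of the tutorial code. The class ChunkedMapCompare, the chunkSize parameter and the tempFile helper are made up for the demonstration, and it opens the channels with FileChannel.open instead of RandomAccessFile just to keep the sample short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedMapCompare {
    // Returns the position of the first differing byte, -1 if the files are identical,
    // or the length of the shorter file if one file is a prefix of the other.
    static long firstDifference(Path a, Path b, int chunkSize) {
        try (FileChannel ch1 = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel ch2 = FileChannel.open(b, StandardOpenOption.READ)) {
            long common = Math.min(ch1.size(), ch2.size());
            for (long offset = 0; offset < common; offset += chunkSize) {
                // The last window may be shorter than chunkSize.
                long length = Math.min(chunkSize, common - offset);
                ByteBuffer m1 = ch1.map(FileChannel.MapMode.READ_ONLY, offset, length);
                ByteBuffer m2 = ch2.map(FileChannel.MapMode.READ_ONLY, offset, length);
                for (int i = 0; i < length; i++) {
                    if (m1.get(i) != m2.get(i)) {
                        return offset + i;
                    }
                }
            }
            return ch1.size() == ch2.size() ? -1 : common;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for trying the sketch: writes the bytes into a temporary file.
    static Path tempFile(byte[] content) {
        try {
            Path file = Files.createTempFile("chunked-compare", ".bin");
            file.toFile().deleteOnExit();
            return Files.write(file, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For real files a chunk size of a few hundred megabytes is a reasonable starting point; the tiny values are useful only to exercise the window logic on small test files.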

UPDATE:

https://twitter.com/snazy pointed out that the part of the code

        for (int pos = 0; pos < size; pos++) {
            if (m1.get(pos) != m2.get(pos)) {
                System.out.println("Files differ at position " + pos);
                return;
            }
        }

can be replaced using the built-in ByteBuffer::mismatch method. The code is simpler, it does exactly what the example code is aiming for and it is probably faster.
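A sketch of how the comparison looks with mismatch follows. The method is available from Java 11; it returns the index of the first differing byte, or -1 when the two buffers have the same content. The class MismatchCompare and the tempFile helper are hypothetical names used only to make the sample self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MismatchCompare {
    // Returns the index of the first differing byte or -1 if the contents are equal.
    // Requires Java 11 or later and files smaller than 2GB.
    static int firstMismatch(Path a, Path b) {
        try (FileChannel ch1 = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel ch2 = FileChannel.open(b, StandardOpenOption.READ)) {
            ByteBuffer m1 = ch1.map(FileChannel.MapMode.READ_ONLY, 0L, ch1.size());
            ByteBuffer m2 = ch2.map(FileChannel.MapMode.READ_ONLY, 0L, ch2.size());
            return m1.mismatch(m2); // replaces the hand-written loop
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for trying the sketch: writes the bytes into a temporary file.
    static Path tempFile(byte[] content) {
        try {
            Path file = Files.createTempFile("mismatch-compare", ".bin");
            file.toFile().deleteOnExit();
            return Files.write(file, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

When the files have different lengths and the shorter one is a prefix of the longer one, mismatch returns the length of the shorter buffer, so the separate size check of the original program becomes optional.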