Monthly Archives: November 2017

Split a File as Stream

Last week I discussed how the new (@since 1.8) method splitAsStream in the class Pattern works on the character sequence: it reads only as much as the stream needs and does not run ahead with the pattern matching, creating all the possible elements up front and returning them as a stream. This behavior is the true nature of streams, and it is the way it has to be to support high-performance applications.

In this article, as I promised last week, I will show a practical application of splitAsStream where it really makes sense to process the stream and not just split the whole string into an array and work on that.

The application, as you may have guessed from the title of the article, is splitting up a file along some tokens. A file can be represented as a CharSequence, however long (or short) it is, as long as it is not longer than 2GB. The limit comes from the fact that the length of a CharSequence is an int value, which is 32-bit in Java. File length is a long, which is 64-bit. Since reading from a file is much slower than reading from a string that is already in memory, it makes sense to use the laziness of stream handling. All we need is a character sequence implementation that is backed by a file. If we have that, we can write a program like the following:

    public static void main(String[] args) throws FileNotFoundException {
        Pattern p = Pattern.compile("[,\\.\\-;]");
        final CharSequence splitIt = 
            new FileAsCharSequence(
                   new File("path_to_source\\SplitFileAsStream.java"));
        p.splitAsStream(splitIt).forEach(System.out::println);
    }

This code does not read any part of the file that is not needed yet, assuming that the implementation of FileAsCharSequence does not read the file greedily. An implementation of the class FileAsCharSequence can be:

package com.epam.training.regex;

import java.io.*;

public class FileAsCharSequence implements CharSequence {
    private final int length;
    private final StringBuilder buffer = new StringBuilder();
    private final InputStream input;

    public FileAsCharSequence(File file) throws FileNotFoundException {
        if (file.length() > (long) Integer.MAX_VALUE) {
            throw new IllegalArgumentException("File is too long to handle as character sequence");
        }
        this.length = (int) file.length();
        this.input = new FileInputStream(file);
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        ensureFilled(index + 1);
        return buffer.charAt(index);
    }


    @Override
    public CharSequence subSequence(int start, int end) {
        // 'end' is exclusive, so only the first 'end' characters are needed
        ensureFilled(end);
        return buffer.subSequence(start, end);
    }

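    // Reads from the file until the buffer holds at least 'count' characters.
    // Note: this assumes one byte per character (e.g. ASCII content); with
    // multi-byte UTF-8 sequences byte and character indexes would diverge.
    private void ensureFilled(int count) {
        if (buffer.length() < count) {
            buffer.ensureCapacity(count);
            final byte[] bytes = new byte[count - buffer.length()];
            try {
                int offset = 0;
                // read() may deliver fewer bytes than requested, so loop until done
                while (offset < bytes.length) {
                    final int n = input.read(bytes, offset, bytes.length - offset);
                    if (n < 0) {
                        throw new IllegalArgumentException("File ended unexpectedly");
                    }
                    offset += n;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            buffer.append(new String(bytes, java.nio.charset.StandardCharsets.UTF_8));
        }
    }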
}

This implementation reads only as many bytes from the file as are needed for the latest call to charAt or subSequence.

If you are interested, you can improve this code to keep in memory only the bytes that are really needed and delete the bytes that were already returned to the stream. A good hint from the previous article for knowing which bytes are not needed is that splitAsStream never touches any character with an index smaller than the first (start) argument of the last call to subSequence. However, if you implement the code so that it throws characters away and fails when anyone wants to access a character that was already thrown away, then it will not truly implement the CharSequence interface. It may still work well with splitAsStream, as long as the implementation of splitAsStream does not change and start needing some already passed characters. (Well, I am not sure, but that may also happen if we use some complex regular expression as a splitting pattern.)
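The improvement described above could be sketched roughly as follows. This is only a minimal illustration, not a drop-in replacement for FileAsCharSequence: the class name WindowedCharSequence and the helpers windowSize and splitAll are made up for this example, it reads from a Reader instead of a FileInputStream (so that characters, not bytes, are counted), and it assumes the caller already knows the length in characters. It discards everything before the start of the last subSequence call, so it deliberately bends the CharSequence contract in the way discussed above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WindowedCharSequence implements CharSequence {
    private final Reader reader;
    private final int length;                    // length in characters, supplied by the caller
    private final StringBuilder window = new StringBuilder();
    private int windowStart = 0;                 // absolute index of window.charAt(0)

    WindowedCharSequence(Reader reader, int length) {
        this.reader = reader;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        fillTo(index + 1);
        return window.charAt(toWindow(index));
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        fillTo(end);
        final String result = window.substring(toWindow(start), toWindow(end));
        // Drop everything before 'start'; those characters can no longer be read.
        window.delete(0, toWindow(start));
        windowStart = start;
        return result;
    }

    // Number of characters currently held in memory (for experiments only).
    int windowSize() {
        return window.length();
    }

    // Translates an absolute index to an index inside the window.
    private int toWindow(int index) {
        if (index < windowStart) {
            throw new IndexOutOfBoundsException("Character " + index + " was already discarded");
        }
        return index - windowStart;
    }

    // Reads characters until the first 'count' characters of the input were seen.
    private void fillTo(int count) {
        try {
            while (windowStart + window.length() < count) {
                final int c = reader.read();
                if (c < 0) {
                    throw new IllegalStateException("Input ended unexpectedly");
                }
                window.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: splits an in-memory text through the windowed sequence.
    static List<String> splitAll(String text) {
        final Pattern p = Pattern.compile("[,\\.\\-;]");
        final CharSequence input = new WindowedCharSequence(new StringReader(text), text.length());
        return p.splitAsStream(input).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        splitAll("one.two-three,four;five;").forEach(System.out::println);
    }
}
```

For a real file the StringReader would be replaced by a reader opened on the file, and the character count would have to come from somewhere; with a single-byte encoding the file length would do, otherwise that remains an open question in this sketch.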

Happy coding!

Split as stream

I am preparing a regular expression tutorial update for the company I work for. The original tutorial was created in 2012 and Java has changed a wee bit since then. There are new Java language releases, and though regular expression handling is still not perfect in Java (n.b. it still uses a non-deterministic FSA), there are some new features. I wrote about some of those in a previous post focusing on the new Java 9 methods. This time, however, I have to look at all the features that are new since 2012.

splitAsStream since 1.8

This way I found splitAsStream in the java.util.regex.Pattern class. It is almost the same as the method split except that what we get back is not an array of String objects but a stream. The simplest implementation would be something like

// Naive version: split everything eagerly, then wrap the array in a stream
public Stream<String> splitAsStream(final CharSequence input) {
    return Arrays.stream(split(input));
}

I have seen many such implementations where a library tried to keep pace with the new winds and support streams. Nothing is simpler than converting the array or the list available from some already existing functionality to a stream.

The solution, however, is sub-par, losing the essence of streams: doing only as much work as needed. And this “doing only as much work as needed” should happen while the stream is processed, not while the developer converts the array- or collection-returning method to a stream-returning one. Streams deliver the results in a lean way, just in time. You see how many expressions we have for being lazy.

The JDK implementation leverages the performance advantages of streams. If you look at the source code you can see immediately that the implementation is slightly more complex than the aforementioned simple solution. Lacking the time to study the implementation, and perhaps lacking the interest, I used another approach to demonstrate that the implementation respects stream laziness.

The argument to the method is a CharSequence and not a String. CharSequence is an interface implemented by String, but we can also implement it ourselves. To get a feeling for how lazy the stream implementation is in this case, I created an implementation of CharSequence that prints debug output for each method call.

class MyCharSequence implements CharSequence {

    private String me;

    MyCharSequence(String me) {
        this.me = me;
    }

    @Override
    public int length() {
        System.out.println("MCS.length()=" + me.length());
        return me.length();
    }

    @Override
    public char charAt(int index) {
        System.out.println("MCS.charAt(" + index + ")=" + me.charAt(index));
        return me.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        System.out.println("MCS.subSequence(" + start + "," + end + ")="
                                              + me.subSequence(start, end));
        return me.subSequence(start, end);
    }
}

Having this class at hand, I could execute the following simple main method:

public static void main(String[] args) {
    Pattern p = Pattern.compile("[,\\.\\-;]");
    final CharSequence splitIt =
              new MyCharSequence("one.two-three,four;five;");
    p.splitAsStream(splitIt).forEach(System.out::println);
}

The output shows that the implementation is really lazy:

MCS.length()=24
MCS.length()=24
MCS.length()=24
MCS.charAt(0)=o
MCS.charAt(1)=n
MCS.charAt(2)=e
MCS.charAt(3)=.
MCS.subSequence(0,3)=one
one
MCS.length()=24
MCS.charAt(4)=t
MCS.charAt(5)=w
MCS.charAt(6)=o
MCS.charAt(7)=-
MCS.subSequence(4,7)=two
two
MCS.length()=24
MCS.charAt(8)=t
MCS.charAt(9)=h
MCS.charAt(10)=r
MCS.charAt(11)=e
MCS.charAt(12)=e
MCS.charAt(13)=,
MCS.subSequence(8,13)=three
three
MCS.length()=24
MCS.charAt(14)=f
MCS.charAt(15)=o
MCS.charAt(16)=u
MCS.charAt(17)=r
MCS.charAt(18)=;
MCS.subSequence(14,18)=four
four
MCS.length()=24
MCS.charAt(19)=f
MCS.charAt(20)=i
MCS.charAt(21)=v
MCS.charAt(22)=e
MCS.charAt(23)=;
MCS.subSequence(19,23)=five
five
MCS.length()=24

The implementation goes ahead and, when it finds the first element for the stream, it returns it. We can process the string “one”, and it processes further characters only when we come back for further elements. Why does it have to call the method length three times at the start? I have no idea. Perhaps it wants to be very sure that the length of the sequence has not magically changed.
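To make the difference measurable rather than reading it off a log, one could compare the lazy method with the naive array-based conversion when only the first element is requested. The following is a small sketch under my own assumptions: the names LazinessCheck, TrackingCharSequence, highestIndexLazy and highestIndexEager are invented for this example and are not part of any library. The wrapper only remembers the highest character index that was touched.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LazinessCheck {

    // Wraps a String and remembers the highest character index that was touched.
    static class TrackingCharSequence implements CharSequence {
        private final String text;
        int highestIndex = -1;

        TrackingCharSequence(String text) {
            this.text = text;
        }

        @Override
        public int length() {
            return text.length();
        }

        @Override
        public char charAt(int index) {
            highestIndex = Math.max(highestIndex, index);
            return text.charAt(index);
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // 'end' is exclusive, so the last character touched is end - 1
            highestIndex = Math.max(highestIndex, end - 1);
            return text.subSequence(start, end);
        }
    }

    // Asks the lazy stream for the first element only.
    static int highestIndexLazy(Pattern p, String text) {
        final TrackingCharSequence seq = new TrackingCharSequence(text);
        p.splitAsStream(seq).findFirst();
        return seq.highestIndex;
    }

    // Splits into an array first, then asks for the first element.
    static int highestIndexEager(Pattern p, String text) {
        final TrackingCharSequence seq = new TrackingCharSequence(text);
        Arrays.stream(p.split(seq)).findFirst();
        return seq.highestIndex;
    }

    public static void main(String[] args) {
        final Pattern p = Pattern.compile("[,\\.\\-;]");
        final String text = "one.two-three,four;five;";
        System.out.println("lazy:  " + highestIndexLazy(p, text));
        System.out.println("eager: " + highestIndexEager(p, text));
    }
}
```

If the JDK behaves as the log above suggests, the lazy variant should stop around the first delimiter, while the array-based variant has to walk the whole sequence before the stream even exists.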

Moral

This is a good example of how a library has to be extended to support streams. It is not a problem if the first version just converts the collection or array to a stream, but if analysis shows that the performance gain pays back the investment, then real stream laziness should be implemented.
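What "real stream laziness" can look like in a library is sketched below. This is not the JDK's code, only a simplified illustration of the general technique: wrap the work in an Iterator that computes one element per request and build the stream on top of it. The names LazySplit and lazySplit are hypothetical, the sketch keeps empty strings (including a trailing one) instead of dropping them as split does, and it assumes that the pattern never matches the empty string.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class LazySplit {

    // Returns a stream that searches for the next delimiter only on demand.
    static Stream<String> lazySplit(Pattern pattern, CharSequence input) {
        final Matcher matcher = pattern.matcher(input);
        final Iterator<String> iterator = new Iterator<String>() {
            private int current = 0;      // start of the element not yet returned
            private boolean done = false; // true after the last element was computed
            private String next;          // look-ahead element, null if not computed yet

            @Override
            public boolean hasNext() {
                if (next != null) {
                    return true;
                }
                if (done) {
                    return false;
                }
                if (matcher.find()) {
                    // Element ends where the delimiter starts
                    next = input.subSequence(current, matcher.start()).toString();
                    current = matcher.end();
                } else {
                    // No more delimiters: the rest of the input is the last element
                    next = input.subSequence(current, input.length()).toString();
                    done = true;
                }
                return true;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                final String result = next;
                next = null;
                return result;
            }
        };
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED | Spliterator.NONNULL),
            false);
    }

    public static void main(String[] args) {
        lazySplit(Pattern.compile("[,;]"), "a,b;c").forEach(System.out::println);
    }
}
```

The point of the design is that nothing happens in lazySplit itself beyond creating the matcher; every call to find is triggered by the stream asking for one more element.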

Side note

An implementation of CharSequence can be mutable, but the processing requires that it remains constant; otherwise the result is undefined. I can confirm that.

Next week I will show a possible use of splitAsStream that relies on the fact that it does not read further in the character sequence than needed.