Volatile and the false sharing problem

 False sharing, as the name suggests, is sharing that is "not actually sharing". So what is "sharing"? When multiple CPUs access the same memory region at the same time, that is "sharing": it causes conflicts and requires a coherence protocol to coordinate the accesses. The smallest memory unit at which "sharing" occurs is one cache line. Therefore, when two or more CPUs access the same cache-line-sized region of memory, a conflict occurs, which is called "sharing". However, this also covers the "false sharing" case of "not actually sharing". For example, two processors each need to access one word, but those two words happen to live in the same cache-line-sized region. At the application-logic level, the two processors do not share memory, because they access different content (different words). Yet because of the cache line's existence and granularity, when the two CPUs access those two different words they must access the same cache line, producing de facto "sharing". Obviously, this kind of "false sharing" imposed by the cache line size limit is not what we want, and it wastes system resources.

  The cache system stores data in units of cache lines. A cache line is a power-of-two number of consecutive bytes, typically 32-256 bytes; the most common cache line size is 64 bytes. When multiple threads modify variables that are logically independent of each other, if those variables share the same cache line, they inadvertently hurt each other's performance: that is false sharing. Write contention on cache lines is the most important factor limiting the scalability of parallel threads running on SMP systems. Some describe false sharing as a silent performance killer, because it is hard to tell from the code alone whether it is happening.

  For throughput to scale linearly with the number of threads, we must ensure that no two threads write to the same variable or to the same cache line. Two threads writing to the same variable can be spotted in the code. To determine whether independent variables share a cache line, you need to understand the memory layout, or use a tool that can tell you. Intel VTune is one such analysis tool.

  Figure 1 illustrates the false sharing problem. A thread running on core 1 wants to update variable X, while a thread on core 2 wants to update variable Y. Unfortunately, both variables sit in the same cache line. Each thread must compete for ownership of the cache line in order to update its variable. If core 1 gains ownership, the cache subsystem invalidates the corresponding cache line in core 2; when core 2 gains ownership and performs its update, core 1 invalidates its copy in turn. This ping-pongs back and forth through the L3 cache, severely hurting performance. The problem is exacerbated if the competing cores sit in different sockets, adding cross-socket traffic.
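This ping-pong effect can be sketched with plain JDK classes. The sketch below is illustrative (the class name, element counts, and iteration count are my own, not from the article): it times two threads writing either to adjacent elements of an AtomicLongArray, which are 8 bytes apart and almost certainly share one 64-byte cache line, or to elements 128 bytes apart, which do not. AtomicLongArray is used because its volatile writes cannot be optimized away into registers.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class AdjacentSlots {

    // Elements of an AtomicLongArray are stored contiguously, so elements
    // 0 and 1 (8 bytes apart) land in the same 64-byte cache line, while
    // elements 0 and 16 (128 bytes apart) land in different lines.
    static final AtomicLongArray slots = new AtomicLongArray(17);
    static final long WRITES = 20_000_000L;

    static long time(int a, int b) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (long i = 0; i < WRITES; i++) slots.set(a, i); });
        Thread t2 = new Thread(() -> { for (long i = 0; i < WRITES; i++) slots.set(b, i); });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("same cache line : " + time(0, 1) + " ns");
        System.out.println("separate lines  : " + time(0, 16) + " ns");
    }
}
```

On a multi-core machine, the "same cache line" run is typically noticeably slower; treat the absolute numbers as indicative only, since they depend on the CPU and core count.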

  Java memory layout. Most projects run on HotSpot JVMs. In HotSpot, objects have a header that is two words long (8 bytes on a 32-bit JVM, where a word is 4 bytes). The first word is the Mark Word, which packs the object's identity hash code together with GC age bits and lock-state flags (such as the state of the lock, or whether the object is being used as a lock); the second word is a reference to the object's class. An array object needs one extra word to store the array length. The starting address of every object is aligned to 8 bytes to improve performance. For packing efficiency, the declared order of an object's fields is reordered into the following order based on byte size:

  • double (8 bytes) and long (8 bytes)
  • int (4 bytes) and float (4 bytes)
  • short (2 bytes) and char (2 bytes): a char is 2 bytes in Java, since Java uses Unicode (UTF-16) and represents a character with 16 bits
  • boolean (1 byte) and byte (1 byte)
  • reference (4/8 bytes)
  • <Subclass field repeats the above order>
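The byte sizes in the list above can be verified directly from the JDK's own constants. A minimal check (the class name is just for illustration):

```java
public class PrimitiveSizes {

    public static void main(String[] args) {
        // The *.BYTES constants (available since Java 8) give each
        // primitive type's size in bytes.
        System.out.println("long=" + Long.BYTES + ", double=" + Double.BYTES);    // 8, 8
        System.out.println("int=" + Integer.BYTES + ", float=" + Float.BYTES);    // 4, 4
        System.out.println("short=" + Short.BYTES + ", char=" + Character.BYTES); // 2, 2
        System.out.println("byte=" + Byte.BYTES); // 1; boolean's size is VM-dependent
    }
}
```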

Knowing this, you can pad out a cache line by placing 7 longs between any two fields you want to keep apart. Different JDK versions call for different solutions to false sharing.

  In a JDK 1.6 environment, the solution to false sharing is cache line padding: make the memory occupied by an object exactly 64 bytes, or an integer multiple of 64 bytes, which ensures that no cache line holds more than one such object.


package basic;

public class TestFlash implements Runnable {

    public final static int  NUM_THREADS = 4;                   // adjust to the number of cores
    public final static long ITERATIONS  = 500L * 1000L * 1000L;
    private final int        arrayIndex;

    /**
     * To demonstrate its performance impact, we start several threads, each updating its own independent counter. The counters are of type volatile long, so other threads can see their progress.
     */
    public final static class VolatileLong {

        /* For a variable declared volatile, each time a thread uses it, the JVM guarantees that the value loaded from main memory into the thread's working memory is the latest one */
        public volatile long value = 0L;

        /* cache line padding */
        /* 37370571461 ns: runtime without cache line padding */
        /* 16174480826 ns: runtime with cache line padding, less than half the time */
        public long          p1, p2, p3, p4, p5, p6, p7;
    }

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static {
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new VolatileLong();
        }
    }

    public TestFlash(final int arrayIndex){
        this.arrayIndex = arrayIndex;
    }

    /**
     * We are not sure where in memory these VolatileLongs will be laid out. They are separate objects. But experience tells us that objects allocated at the same time tend to be lumped together.
     */
    public static void main(final String[] args) throws Exception {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new TestFlash(i));
        }

        for (Thread t : threads) {
            t.start();
        }

        for (Thread t : threads) {
            t.join();
        }
    }

    /*
     * To demonstrate its performance impact, we start several threads, each updating its own independent counter. Counters are of type volatile long, so other threads can see their progress
     */
    @Override
    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }
}


VolatileLong pads itself with the unused fields p1 through p7. The value field (8 bytes) and the seven padding longs (56 bytes) together occupy 64 bytes; with the 8-byte object header on top, the object spans more than a full 64-byte cache line, so no two objects' value fields can ever land in the same cache line. But this approach only works on Java 6 and earlier.

  In a JDK 1.7 environment, the unused fields are optimized away by Java 7, so cache line padding becomes more troublesome: inheritance must be used to keep the padding from being eliminated. Placing the padding fields in a base class avoids the optimization (which seems fragile: if Java 7's memory-layout optimizer can strip unused fields, it could in principle see through this detour as well).


package basic;

public class TestFlashONJDK7 implements Runnable {

    public static int             NUM_THREADS = 4;
    public final static long      ITERATIONS  = 500L * 1000L * 1000L;
    private final int             arrayIndex;
    private static VolatileLong[] longs;

    public TestFlashONJDK7(final int arrayIndex){
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        Thread.sleep(10000);
        System.out.println("starting....");
        if (args.length == 1) {
            NUM_THREADS = Integer.parseInt(args[0]);
        }

        longs = new VolatileLong[NUM_THREADS];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new VolatileLong();
        }
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new TestFlashONJDK7(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    @Override
    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }
}

class VolatileLong extends VolatileLongPadding {

    public volatile long value = 0L;
}

class VolatileLongPadding {

    public volatile long p1, p2, p3, p4, p5, p6, p7;
}


In a JDK 1.8 environment, cache line padding is finally supported natively. Java 8 adds the @Contended annotation (sun.misc.Contended); marking a class or field with it makes the JVM pad it onto its own cache line automatically. The example above can be changed to:


package basic;

import sun.misc.Contended;

public class TestFlashONJDK8 implements Runnable {

    public static int             NUM_THREADS = 4;
    public final static long      ITERATIONS  = 500L * 1000L * 1000L;
    private final int             arrayIndex;
    private static VolatileLong[] longs;

    public TestFlashONJDK8(final int arrayIndex){
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        Thread.sleep(10000);
        System.out.println("starting....");
        if (args.length == 1) {
            NUM_THREADS = Integer.parseInt(args[0]);
        }

        longs = new VolatileLong[NUM_THREADS];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new VolatileLong();
        }
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new TestFlashONJDK8(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    @Override
    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }
}
 
@Contended
class VolatileLong { 

  public volatile long value = 0L; 
}


When running, the JVM argument -XX:-RestrictContended must be added for @Contended to take effect on application classes (by default the annotation is honored only inside the JDK itself). Many articles leave this flag out, and then the annotation silently does nothing.

Supplement:

1 byte = 8 bits

Notes on volatile


package basic;

public class TestVolatile {

    public static int count = 0;

    /* Even with volatile, the result still falls short of the expected 10000 */
    // public volatile static int count = 0;

    public static void increase() {
        try {
            // Delay 10ms to make the result obvious
            Thread.sleep(10);
            count++;
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10000; i++) {
            new Thread(new Runnable() {

                @Override
                public void run() {
                    TestVolatile.increase();
                }
            }).start();
        }
        // The 10000 threads are never joined, so give them time to finish;
        // even then, lost updates typically leave the count below 10000
        try {
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("Expected result: 10000");
        System.out.println("Actual result: " + TestVolatile.count);
    }
}


Use of the volatile keyword: for a variable declared volatile, each time a thread uses it, it reads the latest value written to the variable. But since operations such as count++ are not atomic, the JVM only guarantees that the value loaded from main memory into the thread's working memory is up to date; it does not make the whole read-modify-write sequence atomic.
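A minimal sketch of the visibility guarantee that volatile does provide (the class name and timings are my own, for illustration): a reader thread spins on a volatile flag, and the writer's update is guaranteed to become visible, so the loop terminates. Without volatile, the JIT may hoist the read out of the loop and the reader could spin forever.

```java
public class VisibilityDemo {

    // volatile guarantees the reader sees the writer's update promptly;
    // it does NOT make compound operations like count++ atomic
    static volatile boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // spin until the write to 'stop' becomes visible
            }
            System.out.println("reader observed stop");
        });
        reader.start();
        Thread.sleep(100);
        stop = true;      // volatile write: made visible to the reader
        reader.join();    // terminates because the reader sees stop == true
    }
}
```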

The article on Java garbage collection describes how the JVM allocates memory at runtime. One of those memory areas is the JVM stack: each thread has its own stack while running, and that stack holds the thread's variable values. When a thread accesses an object's value, it first locates the variable in heap memory through the object reference, then loads the heap value into thread-local working memory, creating a copy of the variable. From then on the thread works only with that copy, not directly with the object's value in the heap. At some point after the modification (before the thread exits), the copy is automatically written back to the object variable in the heap, and the value in the heap changes. The interaction proceeds as follows:

  • read and load: copy the variable from main memory into the current working memory
  • use and assign: execute code and change the shared variable's value (use and assign may occur multiple times)
  • store and write: flush the working-memory data back into main memory

But these operations are not atomic. That is, after read and load, if the count variable in main memory is modified, the value already loaded into the thread's working memory does not change, so the computed result differs from what we expect. For a volatile variable, the JVM guarantees only that the value loaded from main memory into working memory is the latest. For example, suppose thread 1 and thread 2 both perform read and load and both observe count = 5 in main memory, so both load the latest value. After thread 1 modifies its copy, it writes the result back and count in main memory becomes 6. But since thread 2 has already performed its read and load, after its own operation it also writes count = 6 back to main memory. The result is that even with the volatile keyword, a race between the two threads remains.

For a volatile variable, the JVM can only guarantee that the value loaded from main memory into the thread's working memory is the latest.
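To actually get an atomic counter, use java.util.concurrent.atomic instead of relying on volatile. A minimal sketch (the class name and thread/iteration counts are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterDemo {

    static final AtomicInteger count = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[100];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    count.incrementAndGet(); // atomic read-modify-write via CAS
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // wait for all updates before reading the result
        }
        System.out.println("count = " + count.get()); // always prints count = 100000
    }
}
```

Unlike count++ on a volatile int, incrementAndGet retries its compare-and-swap until the update succeeds, so no increment is ever lost.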

Reference blog:

[1] http://www.cnblogs.com/Binhua-Liu/p/5620339.html