Converting a Go string to a string suitable for Java uses a specialized function, UTF16Encode, that can encode the string directly to a malloc'ed buffer. That way, only two copies are made when strings are passed from Go to Java: one for the UTF-8 to UTF-16 encoding and one for the creation of the Java String.

This CL implements the same optimization in the other direction, with a UTF-16 to UTF-8 decoder implemented in C. Unfortunately, while calling into a Go decoder also saves the extra copy, the Cgo overhead makes the calls much slower for short strings.

To alleviate the risk of introducing decoding bugs, I've added the tests from the encoding/utf16 package to SeqTest.

As a side effect, both Java and ObjC now always copy strings, regardless of the argument mode. The cpy argument can therefore be removed from the string conversion functions. Furthermore, the modeRetained and modeReturned modes can be collapsed into just one.

While we're here, delete a leftover function from seq/strings.go that wasn't removed when the old seq buffers went away.
Benchmarks, as compared with benchstat over 5 runs:

name                          old time/op  new time/op  delta
JavaStringShort               11.4µs ±13%  11.6µs ± 4%     ~     (p=0.859 n=10+5)
JavaStringShortDirect         19.5µs ± 9%  20.3µs ± 2%   +3.68%  (p=0.019 n=9+5)
JavaStringLong                 103µs ± 8%    24µs ± 4%  -77.13%  (p=0.001 n=9+5)
JavaStringLongDirect           113µs ± 9%    32µs ± 7%  -71.63%  (p=0.001 n=9+5)
JavaStringShortUnicode        11.1µs ±16%  10.7µs ± 5%     ~     (p=0.190 n=9+5)
JavaStringShortUnicodeDirect  19.6µs ± 7%  20.2µs ± 1%   +2.78%  (p=0.029 n=9+5)
JavaStringLongUnicode         97.1µs ± 9%  28.0µs ± 5%  -71.17%  (p=0.001 n=9+5)
JavaStringLongUnicodeDirect    105µs ±10%    34µs ± 5%  -67.23%  (p=0.002 n=8+5)
JavaStringRetShort            14.2µs ± 2%  13.9µs ± 1%   -2.15%  (p=0.006 n=8+5)
JavaStringRetShortDirect      20.8µs ± 2%  20.4µs ± 2%     ~     (p=0.065 n=8+5)
JavaStringRetLong             42.2µs ± 9%  42.4µs ± 3%     ~     (p=0.190 n=9+5)
JavaStringRetLongDirect       51.2µs ±21%  50.8µs ± 8%     ~     (p=0.518 n=9+5)
GoStringShort                 23.4µs ± 7%  22.5µs ± 3%   -3.55%  (p=0.019 n=9+5)
GoStringLong                  51.9µs ± 9%  53.1µs ± 3%     ~     (p=0.240 n=9+5)
GoStringShortUnicode          24.2µs ± 6%  22.8µs ± 1%   -5.54%  (p=0.002 n=9+5)
GoStringLongUnicode           58.6µs ± 8%  57.6µs ± 3%     ~     (p=0.518 n=9+5)
GoStringRetShort              27.6µs ± 1%  23.2µs ± 2%  -15.87%  (p=0.003 n=7+5)
GoStringRetLong                129µs ±12%    33µs ± 2%  -74.03%  (p=0.001 n=10+5)

Change-Id: Icb9481981493ffca8defed9fb80a9433d6048937
Reviewed-on: https://go-review.googlesource.com/20250
Reviewed-by: David Crawshaw <crawshaw@golang.org>
50 lines
1.2 KiB
Go
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package seq

import "unicode/utf16"

// Based heavily on package unicode/utf16 from the Go standard library.

const (
	replacementChar = '\uFFFD'     // Unicode replacement character
	maxRune         = '\U0010FFFF' // Maximum valid Unicode code point.
)

const (
	// 0xd800-0xdc00 encodes the high 10 bits of a pair.
	// 0xdc00-0xe000 encodes the low 10 bits of a pair.
	// the value is those 20 bits plus 0x10000.
	surr1 = 0xd800
	surr2 = 0xdc00
	surr3 = 0xe000

	surrSelf = 0x10000
)

// UTF16Encode utf16 encodes s into chars. It returns the resulting
// length in units of uint16. It is assumed that the chars slice
// has enough room for the encoded string.
func UTF16Encode(s string, chars []uint16) int {
	n := 0
	for _, v := range s {
		switch {
		case v < 0, surr1 <= v && v < surr3, v > maxRune:
			v = replacementChar
			fallthrough
		case v < surrSelf:
			chars[n] = uint16(v)
			n += 1
		default:
			// surrogate pair, two uint16 values
			r1, r2 := utf16.EncodeRune(v)
			chars[n] = uint16(r1)
			chars[n+1] = uint16(r2)
			n += 2
		}
	}
	return n
}
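UTF16Encode assumes the caller has already sized chars, so the following is a minimal sketch of one way a caller might do that. It is not taken from the CL: utf16Encode is a local copy of the function above so that the example stands alone, and encodeString is a hypothetical helper. The sizing relies on the fact that a string of len(s) bytes never encodes to more than len(s) UTF-16 units: one-, two- and three-byte sequences produce one unit, four-byte sequences produce two, and each invalid byte produces a single replacement character.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

const (
	replacementChar = '\uFFFD'     // Unicode replacement character
	maxRune         = '\U0010FFFF' // Maximum valid Unicode code point.
	surr1           = 0xd800       // start of the surrogate range
	surr3           = 0xe000       // first code point past the surrogate range
	surrSelf        = 0x10000      // first code point that needs a surrogate pair
)

// utf16Encode is a local copy of UTF16Encode from the file above,
// repeated here only so that the sketch is self-contained.
func utf16Encode(s string, chars []uint16) int {
	n := 0
	for _, v := range s {
		switch {
		case v < 0, surr1 <= v && v < surr3, v > maxRune:
			v = replacementChar
			fallthrough
		case v < surrSelf:
			chars[n] = uint16(v)
			n += 1
		default:
			// surrogate pair, two uint16 values
			r1, r2 := utf16.EncodeRune(v)
			chars[n] = uint16(r1)
			chars[n+1] = uint16(r2)
			n += 2
		}
	}
	return n
}

// encodeString is a hypothetical helper: it allocates the worst-case
// buffer of len(s) units and trims it to the length actually written.
func encodeString(s string) []uint16 {
	buf := make([]uint16, len(s))
	n := utf16Encode(s, buf)
	return buf[:n]
}

func main() {
	for _, s := range []string{"hello", "héllo", "a\U0001F600b", "\xff"} {
		fmt.Printf("%q -> %04x\n", s, encodeString(s))
	}
}
```

In the bind code the destination is a malloc'ed buffer rather than a Go slice, but the same upper bound would apply to its size.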