Commit Graph

87 Commits

Author SHA1 Message Date
Dhia Ayachi d24156db14 generate a single debug file for a long duration capture (#10279)
* debug: remove the CLI check for debug_enabled

The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.

Also:
- fix the API client to return errors on non-200 status codes for debug
  endpoints
- improve the failure messages when pprof data can not be collected

Co-Authored-By: Dhia Ayachi <dhia@hashicorp.com>

* remove parallel test runs

parallel runs create a race condition that fail the debug tests

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d03346

* Refactor captureDynamic to extract capture logic for each item in a different func

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d03346

* Refactor captureDynamic to extract capture logic for each item in a different func

* extract wait group outside the go routine to avoid a race condition

* capture pprof in a separate go routine

* perform a single capture for pprof data for the whole duration

* add missing vendor dependency

* add a change log and fix documentation to reflect the change

* create function for timestamp dir creation and simplify error handling

* use error groups and ticker to simplify interval capture loop

* Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval

* refactor Logs capture routine and add log capture specific test

* improve error reporting when log test fail

* change test duration to 1s

* make time parsing in log line more robust

* refactor log time format in a const

* test on log line empty the earliest possible and return

Co-authored-by: Freddy <freddygv@users.noreply.github.com>

* rename function to captureShortLived

* more specific changelog

Co-authored-by: Paul Banks <banks@banksco.de>

* update documentation to reflect current implementation

* add test for behavior when invalid param is passed to the command

* fix argument line in test

* a more detailed description of the new behaviour

Co-authored-by: Paul Banks <banks@banksco.de>

* print success right after the capture is done

* remove an unnecessary error check

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

* upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Co-authored-by: Paul Banks <banks@banksco.de>
2021-06-07 17:12:54 +00:00
Matt Keeler ada4d21285
Bump raft-autopilot to the latest version (#10310) 2021-05-27 13:23:18 -04:00
R.B. Boyer 0e7ab74f17
[1.9.x] mod: bump to github.com/hashicorp/mdns v1.0.4 (#10019)
backport of #10018 to 1.9.x
2021-04-14 14:42:10 -05:00
Matt Keeler ab1e689c4a Upgrade raft-autopilot and wait for autopilot it to stop when revoking leadership (#9644)
Fixes: 9626
2021-01-27 16:15:37 +00:00
Kit Patella d28d86a56f Merge pull request #9510 from pierresouchay/prometheus_metrics_help_duplicate_fix
[bugfix] Prometheus metrics without warnings
2021-01-06 18:53:33 +00:00
Matt Keeler 8539565046 Merge pull request #9103 from hashicorp/feature/autopilot-mod
Switch to using the external autopilot module
2020-11-09 16:30:48 +00:00
Mike Morris 4f1d2a1c56 chore: upgrade to gopsutil/v3 (#9118)
* deps: update golang.org/x/sys

* deps: update imports to gopsutil/v3

* chore: make update-vendor
2020-11-07 01:49:01 +00:00
Kit Patella 5d9240d6ff Merge pull request #9088 from hashicorp/mkcp/telemetry/add-key-metrics-definitions
Add prometheus definitions for key metrics.
2020-11-06 18:45:49 +00:00
Kyle Havlovitz 4abe96aa74 vendor: Update github.com/hashicorp/yamux 2020-10-09 05:05:46 -07:00
Kyle Havlovitz dd6ed08924 vendor: Update github.com/hashicorp/mdns 2020-10-09 04:43:27 -07:00
Kyle Havlovitz 1cc012b202 vendor: Update github.com/hashicorp/hil 2020-10-09 04:43:27 -07:00
Kyle Havlovitz b95ab0d33c vendor: Update github.com/hashicorp/go-version 2020-10-09 04:43:27 -07:00
Kyle Havlovitz f389f1184d vendor: Update github.com/hashicorp/go-memdb 2020-10-09 04:43:27 -07:00
Kyle Havlovitz 40481e2b8f vendor: Update github.com/hashicorp/go-checkpoint 2020-10-09 04:43:27 -07:00
Mike Morris 708957a982
chore: update raft to v1.2.0 (#8822) 2020-10-08 15:07:10 -04:00
Matt Keeler 38f5ddce2a
Add per-agent reconnect timeouts (#8781)
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.

This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
2020-10-08 15:02:19 -04:00
Mike Morris 1ebc2fb006
chore(deps): update gopsutil to v2.20.9 (#8843)
* core(deps): bump golang.org/x/sys

To resolve /go/pkg/mod/github.com/shirou/gopsutil@v2.20.9+incompatible/host/host_bsd.go:20:13: undefined: unix.SysctlTimeval

* chore(deps): make update-vendor
2020-10-07 12:57:18 -04:00
Daniel Nephin 627449a870 Vendor gofuzz and google/go-cmp 2020-09-28 18:28:37 -04:00
Kyle Havlovitz b1b21139ca Merge branch 'master' into vault-ca-renew-token 2020-09-15 14:39:04 -07:00
Kyle Havlovitz 1cd7c43544 Update vault CA for latest api client 2020-09-15 13:33:55 -07:00
Kyle Havlovitz 74dc50a771 vendor: Update vault api package 2020-09-15 12:45:29 -07:00
Daniel Nephin 0c87cf468c Update go-metrics dependencies, to use metrics.Default() 2020-09-14 19:05:22 -04:00
Hans Hasselberg a932aafc91
add primary keys to list keyring (#8522)
During gossip encryption key rotation it would be nice to be able to see if all nodes are using the same key. This PR adds another field to the json response from `GET v1/operator/keyring` which lists the primary keys in use per dc. That way an operator can tell when a key was successfully setup as primary key.

Based on https://github.com/hashicorp/serf/pull/611 to add primary key to list keyring output:

```json
[
  {
    "WAN": true,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 6,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "NumNodes": 6
  },
  {
    "WAN": false,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 8,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  },
  {
    "WAN": false,
    "Datacenter": "dc1",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 3,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  }
]
```

I intentionally did not change the CLI output because I didn't find a good way of displaying this information. There are a couple of options that we could implement later:
* add a flag to show the primary keys
* add a flag to show json output

Fixes #3393.
2020-08-18 09:50:24 +02:00
s-christoff 102b7e55da
Update Go-Metrics 0.3.4 (#8478) 2020-08-11 11:17:43 -05:00
Kyle Havlovitz f4efd53d57 vendor: Update github.com/armon/go-metrics to v0.3.3 2020-07-23 11:37:33 -07:00
Matt Keeler a6a1a0e3d6
Update mapstructure to v1.3.3 (#8361)
This was done in preparation for another PR where I was running into https://github.com/mitchellh/mapstructure/issues/202 and implemented a fix for the library.
2020-07-22 15:13:21 -04:00
R.B. Boyer e853368c23
gossip: Avoid issue where two unique leave events for the same node could lead to infinite rebroadcast storms (#8343)
bump serf to v0.9.3 to include fix for https://github.com/hashicorp/serf/pull/606
2020-07-21 15:48:10 -05:00
Pierre Souchay 20d1ea7d2d
Upgrade go-connlimit to v0.3.0 / return http 429 on too many connections (#8221)
Fixes #7527

I want to highlight this and explain what I think the implications are and make sure we are aware:

* `HTTPConnStateFunc` closes the connection when it is beyond the limit. `Close` does not block.
* `HTTPConnStateFuncWithDefault429Handler(10 * time.Millisecond)` blocks until the following is done (worst case):
  1) `conn.SetDeadline(10*time.Millisecond)` so that
  2) `conn.Write(429error)` is guaranteed to timeout after 10ms, so that the http 429 can be written and 
  3) `conn.Close` can happen

The implication of this change is that accepting any new connection is worst case delayed by 10ms. But only after a client reached the limit already.
2020-07-03 09:25:07 +02:00
Hans Hasselberg 95c027a3ea
Update gopsutil (#8208)
https://github.com/shirou/gopsutil/pull/895 is merged and fixes our
problem. Time to update. Since there is no new version just yet,
updating to the sha.
2020-07-01 14:47:56 +02:00
Matt Keeler e9835610f3
Add a test for go routine leaks
This is in its own separate package so that it will be a separate test binary that runs thus isolating the go runtime from other tests and allowing accurate go routine leak checking.

This test would ideally use goleak.VerifyTestMain but that will fail 100% of the time due to some architectural things (blocking queries and net/rpc uncancellability).

This test is not comprehensive. We should enable/exercise more features and more cluster configurations. However its a start.
2020-06-24 17:09:50 -04:00
R.B. Boyer c63c994b04
connect: upgrade github.com/envoyproxy/go-control-plane to v0.9.5 (#8165) 2020-06-23 15:19:56 -05:00
Paul Banks f6ac08be04 state: track changes so that they may be used to produce change events 2020-06-16 13:04:29 -04:00
Daniel Nephin 98d271bee9 Update google.golang.org/api and stretchr/testify
To match the versions used in enterprise, should slightly reduce the
chances of getting a merge conflict when using `go.mod`.
2020-06-09 16:03:05 -04:00
Daniel Nephin db74f09b6b Update protobuf and golang.org/x/... vendor
Partially extracted from #7547

Updates protobuf to the most recent in the 1.3.x series, and updates
golang.org/x/sys to a7d97aace0b0 because of https://github.com/shirou/gopsutil/issues/853
prevents updating to a more recent version.

This breaking change in x/sys also prevents us from getting a newer
version of x/net. In the future, if gopsutil is not patched,  we may want to run a fork version of
gopsutil so that we can update both x/net and x/sys.
2020-06-09 14:46:41 -04:00
R.B. Boyer 1efafd7523
acl: add auth method for JWTs (#7846) 2020-05-11 20:59:29 -05:00
Mike Morris 291e6af33a
vendor: revert golang.org/x/sys bump to avoid FreeBSD regression (#7780) 2020-05-05 09:26:17 +02:00
Hans Hasselberg c4093c87cc
agent: don't let left nodes hold onto their node-id (#7747) 2020-05-04 18:39:08 +02:00
Matt Keeler daec810e34
Merge pull request #7714 from hashicorp/oss-sync/msp-agent-token 2020-05-04 11:33:50 -04:00
Matt Keeler 55050beedb
Update go-discover dependency (#7731) 2020-05-04 10:59:48 -04:00
Matt Keeler 8c545b5206
Update mapstructure to v1.2.3
This release contains a fix to prevent duplicate keys in the Metadata after decoding where the output value contains pointer fields.
2020-04-28 09:33:16 -04:00
R.B. Boyer b989967791
cli: ensure that 'snapshot save' is fsync safe and also only writes to the requested file on success (#7698) 2020-04-24 17:34:47 -05:00
R.B. Boyer 5f1518c37c
cli: fix usage of gzip.Reader to better detect corrupt snapshots during save/restore (#7697) 2020-04-24 17:18:56 -05:00
Daniel Nephin 50d73c2674 Update github.com/joyent/triton-go to latest
There was an RSA private key used for testing included in the old
version. This commit updates it to a version that does not include the
key so that the key is not detected by tools which scan the Consul
binary for private keys.

Commands run:

go get github.com/joyent/triton-go@6801d15b779f042cfd821c8a41ef80fc33af9d47
make update-vendor
2020-04-16 12:34:29 -04:00
Daniel Nephin 5023a3b178 cli: send requested help text to stdout
This behaviour matches the GNU CLI standard:
http://www.gnu.org/prep/standards/html_node/_002d_002dhelp.html
2020-03-26 15:27:34 -04:00
R.B. Boyer 6adad71125
wan federation via mesh gateways (#6884)
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:

There are several distinct chunks of code that are affected:

* new flags and config options for the server

* retry join WAN is slightly different

* retry join code is shared to discover primary mesh gateways from secondary datacenters

* because retry join logic runs in the *agent* and the results of that
  operation for primary mesh gateways are needed in the *server* there are
  some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
  at multiple layers of abstraction just to pass the data down to the right
  layer.

* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers

* the function signature for RPC dialing picked up a new required field (the
  node name of the destination)

* several new RPCs for manipulating a FederationState object:
  `FederationState:{Apply,Get,List,ListMeshGateways}`

* 3 read-only internal APIs for debugging use to invoke those RPCs from curl

* raft and fsm changes to persist these FederationStates

* replication for FederationStates as they are canonically stored in the
  Primary and replicated to the Secondaries.

* a special derivative of anti-entropy that runs in secondaries to snapshot
  their local mesh gateway `CheckServiceNodes` and sync them into their upstream
  FederationState in the primary (this works in conjunction with the
  replication to distribute addresses for all mesh gateways in all DCs to all
  other DCs)

* a "gateway locator" convenience object to make use of this data to choose
  the addresses of gateways to use for any given RPC or gossip operation to a
  remote DC. This gets data from the "retry join" logic in the agent and also
  directly calls into the FSM.

* RPC (`:8300`) on the server sniffs the first byte of a new connection to
  determine if it's actually doing native TLS. If so it checks the ALPN header
  for protocol determination (just like how the existing system uses the
  type-byte marker).

* 2 new kinds of protocols are exclusively decoded via this native TLS
  mechanism: one for ferrying "packet" operations (udp-like) from the gossip
  layer and one for "stream" operations (tcp-like). The packet operations
  re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
  overhead.

* the server instances specially wrap the `memberlist.NetTransport` when running
  with gateway federation enabled (in a `wanfed.Transport`). The general gist is
  that if it tries to dial a node in the SAME datacenter (deduced by looking
  at the suffix of the node name) there is no change. If dialing a DIFFERENT
  datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
  gateways to eventually end up in a server's :8300 port.

* a new flag when launching a mesh gateway via `consul connect envoy` to
  indicate that the servers are to be exposed. This sets a special service
  meta when registering the gateway into the catalog.

* `proxycfg/xds` notice this metadata blob to activate additional watches for
  the FederationState objects as well as the location of all of the consul
  servers in that datacenter.

* `xds:` if the extra metadata is in place additional clusters are defined in a
  DC to bulk sink all traffic to another DC's gateways. For the current
  datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
  balances all servers as well as one mini-cluster per node
  (`<node>.server.<dc>.consul`)

* the `consul tls cert create` command got a new flag (`-node`) to help create
  an additional SAN in certs that can be used with this flavor of federation.
2020-03-09 15:59:02 -05:00
Matt Keeler 80ed304e04
Run make update-vendor and fixup various go.sum files
go mod tidy removes these lines because we have a replace directive
2020-02-11 09:20:49 -05:00
Matt Keeler 3c6d9516bc
Bump `api` and `sdk` module versions
api -> v1.4.0
sdk -> v0.4.0
2020-02-10 20:08:47 -05:00
R.B. Boyer 73ba5d9990
make the TestRPC_RPCMaxConnsPerClient test less flaky (#7255) 2020-02-10 15:13:53 -06:00
Matt Keeler 58befa754b
Update to miekg/dns v1.1.26 (#7252) 2020-02-10 15:14:27 -05:00
Hans Hasselberg 649ffcb66f
memberlist: vendor v0.1.6 to pull in new state: stateLeft (#7184) 2020-02-03 11:02:13 +01:00