Commit Graph

301 Commits

Author SHA1 Message Date
William Chargin 56ddb5cf9b
Load commit parent hashes into memory (#177)
Test Plan:
Snapshot updated with `./src/plugins/git/loadRepositoryTest.sh -u`; unit
tests suffice.

wchargin-branch: load-commit-parents
2018-04-30 18:01:16 -07:00
William Chargin d5f468ca68
Fix Git plugin `NodePayload` definition (#176)
Summary:
Flow didn’t catch this because all the payloads are `{}` anyway.

Test Plan:
Note that every node and edge payload is now listed exactly once in the
correct spot for each of `{Node,Edge}{Type,Payload}`.

wchargin-branch: git-nodepayload
2018-04-30 17:08:49 -07:00
Dandelion Mané 0609201af4
Remove Graph.{in,out}Edges (#174)
This commit removes `Graph.inEdges` and `Graph.outEdges`. As a
replacement to these two functions, `graph.neighborhood` now takes an
optional `direction` flag, which can be set to `"IN" | "OUT" | "ANY"`.

This reduces the surface area of the Graph API, and means that the same
pattern can be used when requesting in or out neighbors as is used when
requesting all neighbors.

This change generates significant churn in the test files, and in some
cases the tests are less elegant and show their history: they were
written for the type signature of `{in,out}Edges`, which returned a
plain array of edges, and they now receive an array of neighbors. I
think this is acceptable, and it's not worth re-writing the tests.

In many cases, replacing existing calls to `{in,out}Edges` in our actual
codebase resulted in cleaner code, as `neighborhood` successfully
abstracts over the common patterns that users of `{in,out}Edges` were
implementing.

As a fly-by refactor, I also changed the `neighborAddress` part of the
`neighborhood` return value to `neighbor`. It's a little less
descriptive, but it's more concise, and flow is there to help ensure
it's used correctly.
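
A minimal sketch of the resulting call shape, based only on the
description above (the exact signature is unverified):

```js
// `direction` defaults to "ANY" when omitted, per this description.
const inbound = graph.neighborhood(issueAddress, {direction: "IN"});
inbound.forEach(({edge, neighbor}) => {
  // `neighbor` is the adjacent node's address (renamed from `neighborAddress`).
  console.log(edge, neighbor);
});
```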

Test plan: Note that CI passes. Inspect the test changes, and verify
that they are appropriate transformations for consuming the new API.
2018-04-30 15:32:28 -07:00
William Chargin 5af5748ed7
Convert in-memory Git repos to cred graphs (#169)
Test Plan:
This snapshot test is too unwieldy to actually read—it’s 1000 lines of
opaque SHAs and thrice-stringified JSON objects—so it should be
interpreted as a regression test only. The programmatic tests should
suffice.

wchargin-branch: wip-git-create-graph
2018-04-30 15:23:37 -07:00
William Chargin f3a440244e
Fix all lint errors, adding a lint CI step (#175)
Test Plan:
Run `yarn lint` and `yarn travis` and observe success. Add something
that triggers a lint warning, like `const zzz = 3;`; re-run and observe
failures.

wchargin-branch: lint
2018-04-30 14:52:28 -07:00
Dandelion Mané 22ca77ed05
Add safe type coercion for GitHub api (#173)
In general, methods in the porcelain GitHub api may return multiple
types; e.g. a reference could be to an Issue, PullRequest, Comment,
Author (or more). To make working with the api more convenient while
maintaining safety, this commit adds a static `asType` method to each
Entity class, which confirms that type coercion is safe, and errors if
not.

This commit also adds `issueOrPRByNumber`, a convenience method, to
api.test.js.
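
A hedged sketch of the coercion pattern this describes (the class and
method names come from the summary; the entity-returning call and
`title()` accessor are assumptions):

```js
// `asType` confirms that the coercion is safe, and errors if not.
const entity = repository.issueOrPRByNumber(172); // Issue or PullRequest
const issue = Issue.asType(entity); // throws if `entity` is not an Issue
issue.title();
```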

Test plan: Check the API usage and verify that it is reasonable.
2018-04-30 10:07:23 -07:00
Dandelion Mané d878be6550
Update the GitHub example repo data (#172)
Commit generated by running src/plugins/github/fetchGithubRepoTest.sh -u
2018-04-29 22:13:00 -07:00
Dandelion Mané 7158deaad3
Add a porcelain api for Github data (#170)
Interacting with raw contribution graphs is cumbersome. We'll need
more fluent and convenient ways to retrieve data from them; we can do
this by creating porcelain APIs that wrap the underlying graph.

This commit adds a simple porcelain API for the GitHub data. It creates
the following classes:

* `api.Repository`
* `api.Issue`
* `api.PullRequest`
* `api.Comment`
* `api.Author`

The classes all wrap a graph and a nodeAddress. They provide read-only
functions for retrieving data from the graph; that data might come
straight from the node payload, or might require some graph traversal
under the hood.

The choice to have the wrapper hold onto the Address rather than the
node itself was deliberate; in the future, the graph may contain nodes
that are not synchronously reachable, so this approach allows us to
create wrappers for nodes we can't synchronously reach. When this comes
up in practice, we can then add async methods to the wrapper.
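
A minimal sketch of this wrapper pattern (the graph-plus-address design
is from the summary; field and method names are assumptions):

```js
class Author {
  constructor(graph, nodeAddress) {
    this.graph = graph; // the underlying contribution graph
    this.nodeAddress = nodeAddress; // address of the wrapped node
  }
  login() {
    // Read-only accessor; reads the node payload under the hood.
    return this.graph.node(this.nodeAddress).payload.login;
  }
}
```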

Note that some data already included in our graph, such as
PullRequestReviews and PullRequestReviewComments, were deliberately
excluded, so as to allow the core ideas to be reviewed without
unnecessary clutter.

Test plan:
Check that the unit tests appropriately test the behavior, and that the
API seems pleasant to use.
2018-04-27 21:45:30 -07:00
William Chargin 1c28c75e39
Check in example repo’s in-memory representation (#166)
Summary:
Two reasons for this. First, we want tests to be able to operate on this
data without having to generate repositories via `git(1)`. (Doing that
is slow, requires a Git installation, makes it less clear that the
tests are correctly isolated, and provides more surface area for
something to go wrong.) Second, in general plugins will need a canonical
source of test data, so setting/continuing this precedent is a good
thing.

Test Plan:
Observe that the old Jest snapshot must be equivalent to the new JSON
one, because the test criterion in `loadRepository.test.js` changed and
the test still passes. Then, run `loadRepositoryTest.sh` and note that
it passes; change the `example-git.json` file and note that the test
fails when re-run; then, run the test with `--updateSnapshot` and watch
it magically revert your changes.

wchargin-branch: check-in-git-repo
2018-04-27 20:51:54 -07:00
William Chargin 301e542ee1
Switch in-memory Git types from Maps to objects (#165)
Summary:
I’d like to use `Map`s whenever the keys are homogeneous (i.e.,
dictionaries, not structs). But this has proven infeasible. The primary
issue at this point is that `JSON.stringify(anyMap)` is `"{}"`—not
entirely unreasonable given that maps can have non-string keys, but
frustrating enough to not use them.
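
The serialization gap is easy to demonstrate:

```js
JSON.stringify(new Map([["key", 1]])); // → "{}" — Map contents are dropped
JSON.stringify({key: 1}); // → '{"key":1}'
```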

Test Plan:
Jest appears to order the snapshot keys differently for `Map`s and
objects (the former by insertion order and the latter alphabetical),
which makes the snapshot change harder to read. I verified that the
general structure is okay, and hand-verified some of the individual
changes. Noting that the number of lines added and deleted in the
snapshot is a good sanity check.

wchargin-branch: map-to-object
2018-04-27 20:46:55 -07:00
Dandelion Mané ec3d084ffc
Graph: type filter for `nodes()` and `edges()` (#168)
When requesting nodes and edges from the graph, it is convenient to
filter them by their type.

In the future, we should add plugin filtering as well, as we
expect type names to collide across plugins.

We may also want to consider keeping a cache of nodes and edges by type
to speed up these calls, if they become performance bottlenecks. (The
implementation in this commit naively iterates over every node/edge.)
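
A sketch of the call shape (the option name `type` and the type strings
are assumptions from this description):

```js
// Without a filter, all nodes/edges are returned.
const allNodes = graph.nodes();
// With a filter, only entities of the given type (naive full iteration).
const issues = graph.nodes({type: "ISSUE"});
const references = graph.edges({type: "REFERENCES"});
```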

Test plan:
Verify that the unit tests are appropriate.
2018-04-27 20:10:51 -07:00
Dandelion Mané 3d79f7680e
Collapse the 3 author types into 1 (#164)
Currently, we store GitHub Users, Organizations, and Bots as separate
nodetypes in the graph. This is inconvenient, as we don't care very much
what type of entity authored a node.

This commit collapses those three categories into one nodetype. The
extra information has migrated to the node payload, so it is still
possible to discover this information if it's important.

Test plan: There is some amount of snapshot churn because the author
node types and payloads have changed. Verify that the snapshot changes
are appropriate, and that CI passes.
2018-04-27 15:52:55 -07:00
Dandelion Mané dd48084810
Remove `get` prefix from getters in graph.js (#163)
This commit renames the following graph functions:

* `get{Node,Edge}{,s}` -> `{node,edge}{,s}`
* `get{In,Out}Edges` -> `{in,out}Edges`
* `getNeighborhood` -> `neighborhood`

The rename was effected across the repo by running:

```
$ find src -name "*.js" -exec sed -i 's/getNeighborhood/neighborhood/g' {} +
```

modified appropriately for each substitution.

Test plan:
Inspect the code to make sure nothing was erroneously renamed. Check that
CI passes.
2018-04-27 15:10:12 -07:00
Dandelion Mané 678924087a
Replace `Graph.{getAdjacentEdges,getNeighborhood}` (#162)
`Graph.getAdjacentEdges` had a serious defect: for the adjacent edges,
it's hard to tell which of the {src,dst} is the neighboring node address
and which is the node we called `getAdjacentEdges` on.

This commit fixes that limitation by replacing `getAdjacentEdges` with
`getNeighborhood`, with a return signature of
`{edge: Edge<EP>, neighborAddress: Address}[]`

Some yak shaving was required: we needed a version of `expectSameSorted`
and, by extension, `sortedByAddress` that takes an accessor to an
Addressable, so that we could test that the neighborhoods returned were
correct. To satisfy flow, I created `expectSameSortedAccessorized` and
`sortedByAddressAccessorized`. Cumbersome, but it worked. ¯\_(ツ)_/¯
2018-04-27 15:05:36 -07:00
Dandelion Mané 28e686c369
Remove `address.sortedByAddress` (#161)
Previously, the address module exported `sortedByAddress`, a utility
function that sorts an array of `Addressable`s. This function was only
used in test code.

This commit replaces it with generic usage of `lodash.sortBy`. This
reduces the API surface area of the module, and removes test-only code
from the exported api.

New dependency added: `lodash.sortby`
https://www.npmjs.com/package/lodash.sortby
2018-04-27 14:29:49 -07:00
William Chargin 1550e6d05e
Add one-way GitHub sync for Git example repos (#160)
Test Plan:
Run the script with `--dry-run`, which currently prints
```shell
$ src/plugins/git/demoData/synchronizeToGithub.sh -n
yarn run v1.5.1
[build output truncated]
Build completed; results in 'bin'.
Done in 3.30s.

Synchronizing: example-git
warning: no common commits
To github.com:sourcecred/example-git.git
 + 3507b7c...3715ddf HEAD -> master (forced update)

Synchronizing: example-git-submodule
Everything up-to-date

Done.
```
This reflects the correct state of affairs, because #158 changed the
example repository. Note that the `3715ddf` SHA in the output of the
above script matches the SHA in the `exampleRepo.test.js.snap` snapshot.

wchargin-branch: sync-git-example-repos
2018-04-27 14:03:54 -07:00
William Chargin f4de3e2067
Standardize environment passed to Git (#159)
Summary:
When we shell out to `git`, we don’t want the end user’s environment
variables and Git configuration to influence the results. This commit
standardizes those inputs. Standardizing the environment has the side
benefit that the `GIT_DIR` environment variable is not set, which means
that the test suite will work properly when run from the `exec` step of
a Git rebase.
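
A sketch of the idea (the exact variable set is an assumption;
`GIT_CONFIG_NOSYSTEM` is a real Git knob):

```js
const {execFileSync} = require("child_process");

function localGit(repositoryPath, args) {
  const env = {
    GIT_CONFIG_NOSYSTEM: "1", // ignore the system /etc/gitconfig
    HOME: "", // hide the user's ~/.gitconfig
    XDG_CONFIG_HOME: "", // and their XDG-level config
    // Deliberately no GIT_DIR: inherit nothing from the caller's shell.
  };
  return execFileSync("git", ["-C", repositoryPath, ...args], {env}).toString();
}
```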

Test Plan:
Tests pass and snapshots are unchanged. Note that
```shell
$ git rebase HEAD --exec 'CI=1 yarn test'
```
works after this commit but not before it.

wchargin-branch: standardize-git-environment
2018-04-27 11:50:32 -07:00
William Chargin c7235f6e49
Remove superfluous commas from example repo README (#158)
Summary:
Using `array.join()` added commas at the start of some lines; I meant to
use `array.join("")`.
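
For reference:

```js
["a", "b", "c"].join();   // → "a,b,c" — the default separator is ","
["a", "b", "c"].join(""); // → "abc"
```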

(I’ve now inspected the full generated contents of both repos, and they
look good.)

Test Plan:
It is expected that these attributes of the snapshots should change.
There’s no need to carefully check the SHAs.

wchargin-branch: readme-change
2018-04-27 11:48:40 -07:00
Dandelion Mané 8e9ddbf9fc
Add `Graph.getAdjacentEdges` for in and out edges (#157)
Some consumers of the graph may prefer to treat it as an undirected
graph. For example, when finding the author of an issue, it is wholly
sufficient to find an edge with the `AUTHORS` type; the caller may
prefer not to be bothered with remembering which end of the `AUTHORS`
edge is considered the `src` and which is the `dst`.

The `getAdjacentEdges` call enables that, by combining the output of
`getInEdges` and `getOutEdges`.

Test plan:
The new tests are pretty comprehensive.
2018-04-26 20:27:46 -07:00
William Chargin aa071ceab3
Include a submodule in the main example repository (#156)
Summary:
The main example repository now covers the currently desired features:
it has blobs, subtrees, and submodules, and commits that change each of
these. (We don’t have merge commits yet—we can add those once we start
to care about them.)

Once this is merged, I will push the two repositories to GitHub.

Test Plan:
Verifying and understanding is easier than ever before. You can run the
following commands to create the repositories in question on your disk:
```shell
$ yarn backend
$ node bin/createExampleRepo.js /tmp/repo
$ node bin/createExampleRepo.js --submodule /tmp/subrepo
```

You can then explore these repositories at your leisure. For instance,
to check that the `loadRepository` snapshot has the right set of
commits, inspect the output of the following command:
```shell
$ git -C /tmp/repo log --format='%H %T'
```
Or, to check that a particular tree has the right contents, just run:
```shell
$ git -C /tmp/repo ls-tree TREE_SHA
```
Verifying the `exampleRepo` snapshot is similarly easy: just check that
the lists of commit SHAs in `/tmp/repo` and `/tmp/subrepo` are correct.

wchargin-branch: include-submodule
2018-04-26 20:11:44 -07:00
William Chargin d6e9b0a72b
Add a command-line script to create example repos (#155)
Summary:
We’ll use this to create the repositories on disk and then push them to
GitHub.

Test Plan:
Generate both kinds of repository, and check out the SHAs:
```shell
$ yarn backend
$ node bin/createExampleRepo.js /tmp/repo
$ node bin/createExampleRepo.js --submodule /tmp/repo-submodule
$ node bin/createExampleRepo.js --no-submodule /tmp/repo-no-submodule
$ # (first and third lines do the same thing)
$ git -C /tmp/repo rev-parse HEAD
677b340674bde17fdaac3b5f5eef929139ef2a52
$ git -C /tmp/repo-submodule rev-parse HEAD
29ef158bc982733e2ba429fcf73e2f7562244188
$ git -C /tmp/repo-no-submodule rev-parse HEAD
677b340674bde17fdaac3b5f5eef929139ef2a52
```
Then, note that these SHAs are expected per the snapshot file in
`exampleRepo.test.js.snap`.

wchargin-branch: create-example-repo-command
2018-04-26 19:53:46 -07:00
William Chargin ef7610a440
Only exclude top-level directories from Prettier (#154)
Summary:
In particular, we excluded `bin`, but this was catching non-root
directories named `bin`, too, and so files like
`src/plugins/github/bin/fetchAndPrintGithubRepo.js` were not subject to
prettification. Happily, those files are all pretty enough, anyway.

Test Plan:
Note that mangling the format of `fetchAndPrintGithubRepo.js` prior to
this commit would not cause `yarn check-pretty` to fail, nor would the
manglings be fixed by `yarn prettify`—but that both of these behaviors
are reversed after this commit.

wchargin-branch: prettier-exclude-root-only
2018-04-26 19:47:58 -07:00
William Chargin 28a118c814
Create an example submodule repository (#153)
Summary:
We want our main repository to include submodules so that we can test
submodule support. Here, we create a repository to be included as a
submodule.

wchargin-branch: example-submodule-repository
2018-04-26 19:43:01 -07:00
William Chargin 75fd068a35
Extract code to create the example repository (#152)
Test Plan:
Note that the snapshot change is simply a move: no SHAs were changed.

wchargin-branch: extract-example-repository-code
2018-04-26 19:38:29 -07:00
William Chargin 6f9941b526
Clean up temporary directories in tests (#151)
Summary:
The `loadRepository` test tries to clean up temporary directories, but
failed to do so because the directories were not empty. The cleanup hook
threw an error, but this error was silenced by Jest due to [a known
bug][1] that was fixed a few days ago. We can fix this by asking `tmp`
to clean up directories even if they are not empty, using the
`unsafeCleanup` option.

[1]: https://github.com/facebook/jest/issues/3266
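
The fix, sketched with `tmp`'s documented option:

```js
const tmp = require("tmp");
// Without this option, cleanup throws (silently, per the Jest bug above)
// when the directory is non-empty.
const dir = tmp.dirSync({unsafeCleanup: true});
// ... tests write files under dir.name ...
dir.removeCallback(); // now succeeds even if files remain
```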

Test Plan:
While running `watch -n 0.1 'ls /tmp | grep "tmp-.*" | wc -l'`, run
tests. Note that the number increases by five and then drops down again;
before this patch, it would increase by five and then stay there.

wchargin-branch: clean-up-tmpdirs
2018-04-26 19:34:23 -07:00
William Chargin 3679529bef
Move `localGit`/`GitDriver` into Git utils (#150)
Summary:
A few reasons for this:
 1. This _is_ a utility, so it makes sense semantically.
 2. This unifies the utilities API; clients like `loadRepository.test`
    don’t have to keep around both a `git` and a `gitUtils`.
 3. Most importantly, further scripts and tests shouldn’t depend on
    `loadRepository` just for `localGit`. Depending on `gitUtils` makes
    much more sense.

(Note that `makeUtils` is no longer dependency-injectable, but that’s
okay; I considered this and favored YAGNI on this one.)

Test Plan:
Existing unit tests pass.

wchargin-branch: move-localgit
2018-04-26 19:31:27 -07:00
William Chargin dad8777e6c
Add utility functions for working with Git repos (#149)
Summary:
Utilities like `deterministicCommit` provide valuable functionality that
we will want to use in other scripts and perhaps other test cases. It
makes sense to factor these out into utility functions.
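
A hedged sketch of what such a utility might look like (only the name
`deterministicCommit` is from the summary; the driver signature and the
pinned author/committer values are assumptions):

```js
// Makes commit SHAs reproducible by pinning everything Git hashes over.
function deterministicCommit(git, message) {
  const env = {
    GIT_AUTHOR_NAME: "Test Author",
    GIT_AUTHOR_EMAIL: "author@example.com",
    GIT_AUTHOR_DATE: "2001-02-03T04:05:06",
    GIT_COMMITTER_NAME: "Test Committer",
    GIT_COMMITTER_EMAIL: "committer@example.com",
    GIT_COMMITTER_DATE: "2001-02-03T04:05:06",
  };
  git(["commit", "--allow-empty", "-m", message], {env});
}
```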

Test Plan:
Existing tests pass.

wchargin-branch: git-utils
2018-04-26 19:17:16 -07:00
Dandelion Mané 087d8d561e
Enable type filtering on Graph.get{In,Out}Edges (#147)
This commit adds an optional `typeOptions` argument to Graph.getInEdges
and Graph.getOutEdges. The `typeOptions` allow filtering the returned
edges by the type of the edge, and the type of the node that the edge is
connected to. This makes it much easier to use these methods to find
connections that have a certain relationship, e.g. finding the author of
a commit or the comments on an issue.
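
A sketch of the call shape (the option names inside `typeOptions` are
assumptions from this description):

```js
// Find the author of a commit: restrict by edge type and by the type
// of the node on the other end of the edge.
const authorEdges = graph.getInEdges(commitAddress, {
  edgeType: "AUTHORS",
  nodeType: "AUTHOR",
});
```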

Test plan:
A new test suite was written that comprehensively tests this behavior,
both for getInEdges and getOutEdges.
2018-04-26 19:09:37 -07:00
Dandelion Mané c635034dab
Add richer types to graphDemoData (#146)
In preparation for using type info in the Graph apis, it is helpful to
have richer type info in the graph demo data.

Test plan: Check that the snapshot changes only consist of type changes,
and that CI passes.
2018-04-26 16:44:34 -07:00
Dandelion Mané 18ce9982d2
Refactor GH parser to expose a functional api (#145)
Our GitHub parser is implemented via a `GithubParser` class which builds
the GitHub graph. This is a convenient implementation, but an awkward
API. This commit refactors the module so that it exposes a clean `parse`
function, which ingests the GitHub JSON data and returns as completed
graph.

Test plan:
The unit tests have been re-written to use the new public API. All the
snapshots are unchanged, and flow passes. Additionally, I ran `yarn
start` and verified that the GithubGraphFetcher for the Artifact plugin
is still working.
2018-04-25 19:39:28 -07:00
William Chargin 5e4b7b1fcc
Extract repository types to a separate module (#141)
Summary:
We should be able to get the types without depending on the function to
load a Git repo from disk, and in particular without depending on
`child_process`.

Test Plan:
Flow and tests are sufficient.

wchargin-branch: extract-repository-types
2018-04-25 14:35:45 -07:00
Dandelion Mané ad56ba087c
Use urls as ids for GitHub nodes (#144)
There's some context at #127, in which I initially proposed this change.

In addition to the long-term benefits described in #127, there is a
short-term benefit which is that it makes snapshot tests easier to read,
because the GitHub ids are opaque and unreadable, while the GitHub urls
are relatively easy to parse.

This change results in significant snapshot churn.
2018-04-25 14:28:06 -07:00
Dandelion Mané 4f857a8bb1
Add return type info for fetchGithubRepo (#143)
Once the type was added, flow correctly discovered a bug in
GithubGraphFetcher.js, which resulted in broken graph fetching in the
ArtifactEditor. Oops! / Good work Flow!

I made `ensureNoMorePages` expect the result it is testing to be an
`any`, which is appropriate for how the function is written (i.e. it is
written in a way that is agnostic to the actual result).
2018-04-24 16:00:50 -07:00
Dandelion Mané 8fdfacd097
fetchGithubRepo: remove vestigial data field (#142)
The example-repo.json file is regenerated with large diffs due to the
change in indentation level throughout the file.

Test plan:
Sanity check the snapshot (close inspection is unnecessary due to the
simplicity of the code change). Check that CI passes.
2018-04-24 15:46:16 -07:00
Dandelion Mané 6a3e4d754c
Add flow types for GitHub graphql query response (#140)
This commit adds flow typing for the JSON result from hitting the GitHub
graphql api. We can't prove that the flow typing is correct, but since
the type definition is colocated with the corresponding fragment
definitions, we can hope that maintainers will maintain both together.

We update the parser to consume the new flow types. There are no flow
errors.
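
Illustrative of the colocated-type style (the field subset and exact
type name are assumptions):

```js
// Colocated with the GraphQL fragment that fetches these fields.
export type AuthorJSON = {|
  +__typename: "User" | "Bot" | "Organization",
  +id: string,
  +login: string,
  +url: string,
|};
```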

Test plan:
Inspect the flowtypes, verify that they correspond to the data in
example-repo.json, and that there are no flow errors.
2018-04-24 14:05:51 -07:00
William Chargin 418b745d7c
Load Git repositories into memory (#139)
Summary:
In this newly added module, we load the structural state of a git
repository into memory. We do not load into memory the contents of any
blobs, so this is not enough information to perform any analysis
requiring file diffing. However, it is sufficient to develop a notion of
“this file was changed in this commit”, by simply diffing the trees.
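
Conceptually, the structural state can come from Git plumbing alone (a
sketch, not the actual implementation; it assumes a `git(args)` helper
that returns stdout as a string):

```js
const commits = git(["rev-list", ref]).split("\n"); // every commit SHA
const rootTree = git(["rev-parse", `${commitSha}^{tree}`]).trim();
const entries = git(["ls-tree", treeSha]); // mode, type (blob/tree/commit), SHA, name
```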

Test Plan:
Unit tests added; `yarn test` suffices. Reading these snapshots is
pretty easy, even though they’re filled with hashes:
  - First, read over the commit specifications on lines 69–83 of
    `loadRepository.test.js`, so you know what to expect.
  - In the snapshot file, keep handy the time-ordered list of commit
    SHAs at the bottom of the file, so that you know which commit SHA is
    which.
  - To verify that the large snapshot is correct: for each commit, read
    the corresponding tree object and make sure that the structure is
    correct.
  - To verify the small snapshot, just check that it’s the correct
    subset of the large snapshot.
  - If you want to verify that the SHA for a blob is correct, open a
    terminal and run `git hash-object -t blob --stdin`; then, enter the
    content of the blob and press `<C-d>`. The result is the blob SHA.

To run a sanity-check on a large repository: apply the following patch:

<details>
<summary>Patch to print out statistics about loaded repository</summary>

```diff
diff --git a/config/paths.js b/config/paths.js
index d2f25fb..8fa2023 100644
--- a/config/paths.js
+++ b/config/paths.js
@@ -62,5 +62,6 @@ module.exports = {
     fetchAndPrintGithubRepo: resolveApp(
       "src/plugins/github/bin/fetchAndPrintGithubRepo.js"
     ),
+    loadRepository: resolveApp("src/plugins/git/loadRepository.js"),
   },
 };
diff --git a/src/plugins/git/loadRepository.js b/src/plugins/git/loadRepository.js
index a76b66c..9380941 100644
--- a/src/plugins/git/loadRepository.js
+++ b/src/plugins/git/loadRepository.js
@@ -106,3 +106,7 @@ function findTrees(git: GitDriver, rootTrees: Set<Hash>): Tree[] {
   }
   return result;
 }
+
+const result = loadRepository(...process.argv.slice(2));
+console.log("commits", result.commits.size);
+console.log("trees", result.trees.size);
```
</details>

Then, run `yarn backend` and put the following script in `test.sh`:

<details>
<summary>Contents for `test.sh`</summary>

```shell
#!/bin/bash
set -eu

repo="$1"
ref="$2"

via_node() {
    node bin/loadRepository.js "${repo}" "${ref}"
}

via_git() (
    cd "${repo}"
    printf 'commits '
    git rev-list "${ref}" | wc -l
    printf 'trees '
    git rev-list "${ref}" |
        while read -r commit; do
            git rev-parse "${commit}^{tree}"
            git ls-tree -rt "${commit}" \
                | grep ' tree ' \
                | cut -f 1 | cut -d ' ' -f 3
        done | sort | uniq | wc -l
)

echo
printf 'Running directly via git...\n'
time a="$(via_git)"

echo
printf 'Running Node script...\n'
time b="$(via_node)"

diff -u <(cat <<<"${a}") <(cat <<<"${b}")
```
</details>

Finally, run `./test.sh /path/to/some/repo origin/master`, and verify
that it exits successfully (zero diff). Here are some timing results on
SourceCred and TensorBoard:

  - SourceCred: 0.973s via Node, 0.327s via git.
  - TensorBoard: 30.836s via Node, 6.895s via git.

For TensorFlow, running via git takes 7m33.995s. Running via Node fails
with an out-of-memory error after 39 minutes, with 10GB RAM and 4GB
swap. See details below.

<details>
<summary>
Full timing details, commit SHAs, and OOM error message
</summary>

```
+ ./test.sh /home/wchargin/git/sourcecred 01634aabcc

Running directly via git...

real	0m0.327s
user	0m0.016s
sys	0m0.052s

Running Node script...

real	0m0.973s
user	0m0.268s
sys	0m0.176s
+ ./test.sh /home/wchargin/git/tensorboard 7aa1ab9d60671056b8811b7099eec08650f2e4fd

Running directly via git...

real	0m6.895s
user	0m0.600s
sys	0m0.832s

Running Node script...

real	0m30.836s
user	0m3.216s
sys	0m10.588s
+ ./test.sh /home/wchargin/git/tensorflow 968addadfd4e4f5688eedc31f92a9066329ff6a7

Running directly via git...

real	7m33.995s
user	5m21.124s
sys	1m5.476s

Running Node script...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x121a2cc [node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::NewFixedArray(int, v8::internal::PretenureFlag) [node]
 6: v8::internal::DeoptimizationInputData::New(v8::internal::Isolate*, int, v8::internal::PretenureFlag) [node]
 7: v8::internal::compiler::CodeGenerator::PopulateDeoptimizationData(v8::internal::Handle<v8::internal::Code>) [node]
 8: v8::internal::compiler::CodeGenerator::FinalizeCode() [node]
 9: v8::internal::compiler::PipelineImpl::FinalizeCode() [node]
10: v8::internal::compiler::PipelineCompilationJob::FinalizeJobImpl() [node]
11: v8::internal::Compiler::FinalizeCompilationJob(v8::internal::CompilationJob*) [node]
12: v8::internal::OptimizingCompileDispatcher::InstallOptimizedFunctions() [node]
13: v8::internal::Runtime_TryInstallOptimizedCode(int, v8::internal::Object**, v8::internal::Isolate*) [node]
14: 0x12dc8b08463d
```
</details>

wchargin-branch: load-git-repositories
2018-04-24 13:57:10 -07:00
Dandelion Mané 515c577825
Split github/githubPlugin.js into three files (#138)
github/githubPlugin.js was growing ungainly - it contained two major
pieces: all of the node and edge types, and the GitHub parser. As I
contemplated adding a third major new section of logic (an easy-to-use
api for traversing the GitHub graph, with first class support for
comments, authorship, etc), I found the prospect of adding even more
into that file quite unappealing. So, I have instead split it into three
files:

* github/pluginName.js: Exports the plugin name.
* github/types.js: Exports types of nodes, edges, and their payloads
* github/parser.js: Exports `GithubParser`

No logic has been changed whatsoever - this is purely a rename-refactor.

Test plan:
CI still passes. I manually verified that the Artifact Editor can still
load and display GitHub data.
2018-04-24 09:30:22 -07:00
Dandelion Mané 01634aabcc
Add url to author payloads (#137)
We have urls for all the author types, so for consistency across GitHub
payloads, I am adding urls to the author payloads.

Test plan:
Check that snapshot changes consist entirely of adding urls, and that
the urls are appropriate.
2018-04-23 18:30:19 -04:00
Dandelion Mané e191614dd4
Find GitHub username references (#136)
Add logic to findReferences for finding GitHub username references,
e.g. "Hello, @wchargin!". The API is unchanged.

Test plan:
There are new unit tests that verify this behavior works as expected.
2018-04-23 18:13:21 -04:00
Dandelion Mané d9d83c1157
Rename parseReferences.js to findReferences.js (#135)
This is consistent with the single export (findReferences.js now exports
`findReferences`)
2018-04-23 17:59:00 -04:00
Dandelion Mané 240f05899c
Simplify GH reference handling (#130)
Currently, every type of reference has its own type signature: numeric
references are returned as numbers, url references are a complicated
object containing url parts, and so forth.

Since ultimately the references are just strings, it makes more sense to
treat references as plain strings. This allows a much simpler
implementation of reference edge creation in the GitHub plugin. It also
results in a simpler API for the parseReferences file (it only exports a
single findReferences function).

Test plan:
Verify that the updated tests encode appropriate behavior.
2018-04-23 17:49:09 -04:00
William Chargin e0a5118f8d
Switch typed dispatch table to `empty` assertion (#133)
Summary:
This replaces the implementation of a static check from a somewhat
complicated use of higher-order types to a more simple empty-union
assertion, as suggested by jez in “Case Exhaustiveness in Flow”:
https://blog.jez.io/flow-exhaustiveness/

(I know; we’re not using Reason. One step at a time. :-) )

I adapted the implementation a bit because I prefer explicitly disabling
an ESLint warning over a no-op function call; it is not clear from the
latter that the purpose is to suppress a lint warning.
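
The pattern, as a sketch (type names borrowed from the test plan; the
wiring is assumed):

```js
type NodeType = "ISSUE" | "PULL_REQUEST";

function displayName(type: NodeType): string {
  switch (type) {
    case "ISSUE":
      return "issue";
    case "PULL_REQUEST":
      return "pull request";
    default:
      // eslint-disable-next-line no-unused-expressions
      (type: empty); // Flow errors here if any NodeType case is unhandled
      throw new Error(`Unexpected type: ${String(type)}`);
  }
}
```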

Test Plan:
In `githubPlugin.js`, add `| "ANOTHER"` to the `NodeType` type, and note
a compile-time Flow error on the appropriate line, with a very readable
error message. Note that all unit tests pass, and running the UI on
`sourcecred/sourcecred` yields correct titles for each node type present
(namely, all node types except for `ORGANIZATION` and `BOT`).

wchargin-branch: empty-union-assertion
2018-04-23 13:12:44 -07:00
Dandelion Mané 1e311c59f4
Add README and update flow for GitHub demo data (#129)
Keeping the GitHub demo data up-to-date is important, and there isn't
good documentation for how to do that.

This commit adds a short README.md for the demo data, and adds an update
flag to fetchGithubRepoTest.sh that can be used to easily update it.

Test plan:
Modify example-repo.json (e.g. by deleting it entirely). Run
fetchGithubRepoTest.sh -u and confirm that the data was regenerated
without change. Run fetchGithubRepoTest.sh and confirm the test passes.

Note: The end cursor is sensitive to the timezone, which seems to be
cached with the GitHub token. An erroneous switch to Israel timezone
made it into master; this commit reverts back to US/Pacific.
2018-04-23 14:21:01 -04:00
Dandelion Mané d025e78b1d
Get urls for all GitHub objects (#128)
This commit modifies our GitHub graphql query so that we request urls
for all objects (e.g. users, pull requests, pull request review
comments). Some change along these lines is necessary so that we can
correctly represent URL reference edges to e.g. issue comments. (It
might be possible to do without by reverse-engineering from the ids, but
we are resolved to treat ids as opaque).

Strictly speaking, we don't need to collect urls for users, issues, and
pull requests - they can be generated via a simple schema. However, for
consistency, I think it's better to just take URLs on everything.

Test plan: The example-repo.json has been regenerated. The diffs are as
expected.
2018-04-20 10:50:03 -04:00
Dandelion Mané 8992f2a9f5
Update the README (#124)
- Give a more detailed and up-to-date vision for SourceCred
- Update the discuss link to point to Spectrum rather than Slack
- Wrap text
2018-04-09 08:49:54 +03:00
William Chargin 9d1200275e
Implement exhaustive fetching for our GitHub query (#123)
Summary:
This commit completes the ad hoc pagination solution described in #117:
we implement pagination specifically for our current query against the
GitHub API. This is done in such a way that reasonable additions to the
query will not be hard to implement—for instance, if we want to fetch
a new kind of field, the marginal cost is at most a bit of extra
copy-and-paste and some modifications to tests. However, we should
certainly still plan to implement the fully automatic pagination system
described in #117.

Running on TensorBoard with the default page limits takes 30–33 seconds
over 7 queries, uses 103 GitHub API points (out of 5000 for any given
hour), and produces a JSON file that is 8.7MB (when pretty-printed).
This all seems quite reasonable to me.

Test Plan:
Extensive unit tests added. The snapshots are quite readable, by design.

For a real-world test, you can `yarn start` the artifact viewer and use
the GUI to fetch the data for tensorflow/tensorboard.

To demonstrate that the fetching process gives the same results
regardless of the page size, follow these steps:

 1. In `fetchGithubRepo.js`’s `postQuery` function, insert a new
    statement `console.error("Posting query...")` at the beginning. (It
    is important to print to stderr instead of stdout.)
 2. Run `yarn backend` and then invoke `fetchGithubRepo.js` on a repo
    large enough to require pagination, like SourceCred or TensorBoard.
    Pipe the result to `shasum -a 256` and note the SHA.
 3. In `github/graphql.js`, change the page size constants near the top
    of the file. Re-run step 2. The number of queries that are posted
    will vary significantly as the page size constants vary, but the
    checksum should remain the same.
 4. Repeat until satisfied. (I tried three sets of values: the standard
    values, the standard values but all divided by 10, and all 5s.)

wchargin-branch: ad-hoc-pagination
2018-04-06 07:29:55 -07:00
William Chargin e82b56e52c
Implement a function to merge query results (#122)
Summary:
Once we execute the root query, find continuations, embed the
continuations into queries, and execute the continuation query, we will
need to merge the continuations’ results back into the root results.
This commit adds a function `merge` that will be suitable for doing just
that.
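
The intended behavior, illustrated (the shapes are my own example, not
taken from the tests):

```js
// Root result: first page of issues, with a dangling continuation.
const root = {issues: {nodes: ["issue-1", "issue-2"], pageInfo: {hasNextPage: true}}};
// Continuation result: the next page, fetched separately.
const more = {issues: {nodes: ["issue-3"], pageInfo: {hasNextPage: false}}};
// After merging, the connection reads as if fetched in one shot:
// merged.issues.nodes → ["issue-1", "issue-2", "issue-3"]
```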

Test Plan:
New unit tests added, with 100% coverage. Run `yarn test`.

wchargin-branch: merge-query-results
2018-04-05 02:27:03 -07:00
William Chargin 751172ea77
Create pagination continuations for GitHub query (#121)
Summary:
Per #117, this is a first step toward writing a pagination API that
specifically targets our current GitHub query. For design details, see
new module docs on `src/plugins/github/graphql.js`.

This commit modifies the core GitHub query and thus the
`example-repo.json` snapshot: we now request `endCursor` fields for all
pagination info, and we request the `id` of the root `repository` field.
The former is obviously necessary. The latter is necessary for the
repository to be consistent with other nodes that offer connections as
fields: we require an ID on the node containing the connection so that
we can have random access to it in a continuation selector.

Test Plan:
Unit tests added. You can also try out the generated continuation
queries for yourself: apply the patch below, run `yarn backend`, and
then run the `fetchGithubRepo.js` script on `sourcecred/sourcecred`.
This will output a nicely formatted query that you can paste directly
into GitHub’s API explorer and execute. (Note that, because this patch
is not fully polished, the query must be run against a repository that
has a continuation for every node type: more pages of issues, PRs,
comments, reviews, and review comments. This is due to an
easy-but-annoying-to-fix bug in the patch, not in the code included in
this commit.)

<details>
<summary>Patch for generating a continuations query</summary>

```diff
diff --git a/src/plugins/github/fetchGithubRepo.js b/src/plugins/github/fetchGithubRepo.js
index 789a20e..418c736 100644
--- a/src/plugins/github/fetchGithubRepo.js
+++ b/src/plugins/github/fetchGithubRepo.js
@@ -6,8 +6,13 @@

 import fetch from "isomorphic-fetch";

-import {stringify, inlineLayout} from "../../graphql/queries";
-import {createQuery, createVariables} from "./graphql";
+import {stringify, inlineLayout, multilineLayout} from "../../graphql/queries";
+import {
+  continuationsFromQuery,
+  continuationQuery,
+  createQuery,
+  createVariables,
+} from "./graphql";

 /**
  * Scrape data from a GitHub repo using the GitHub API.
@@ -66,8 +71,13 @@ function postQuery(payload, token) {
       if (x.errors) {
         return Promise.reject(x);
       }
-      ensureNoMorePages(x);
-      return Promise.resolve(x);
+      console.log(
+        stringify.body(
+          continuationQuery(Array.from(continuationsFromQuery(x.data))),
+          multilineLayout("  ")
+        )
+      );
+      throw new Error("STOPSHIP");
     });
 }

diff --git a/src/plugins/github/graphql.js b/src/plugins/github/graphql.js
index 9ea2592..9ead42b 100644
--- a/src/plugins/github/graphql.js
+++ b/src/plugins/github/graphql.js
@@ -39,11 +39,11 @@ import {build} from "../../graphql/queries";
  *
  * [1]: https://developer.github.com/v4/guides/resource-limitations/#node-limit
  */
-export const PAGE_LIMIT = 100;
-const PAGE_SIZE_ISSUES = 100;
-const PAGE_SIZE_PRS = 100;
-const PAGE_SIZE_COMMENTS = 20;
-const PAGE_SIZE_REVIEWS = 10;
+export const PAGE_LIMIT = 10;
+const PAGE_SIZE_ISSUES = 10;
+const PAGE_SIZE_PRS = 10;
+const PAGE_SIZE_COMMENTS = 3;
+const PAGE_SIZE_REVIEWS = 1;
 const PAGE_SIZE_REVIEW_COMMENTS = 10;

 /**
@@ -340,6 +340,36 @@ function* continuationsFromReview(
   }
 }

+/**
+ * Combine continuations into a query.
+ */
+export function continuationQuery(
+  continuations: $ReadOnlyArray<Continuation>
+): Body {
+  const nonces: string[] = continuations.map((_, i) => `_n${String(i)}`);
+  const nonceToIndex = {};
+  nonces.forEach((n, i) => {
+    nonceToIndex[n] = i;
+  });
+  const b = build;
+  const query = b.query(
+    "Continuations",
+    [],
+    continuations.map((continuation, i) =>
+      b.alias(
+        nonces[i],
+        b.field(
+          "node",
+          {id: b.literal(continuation.enclosingNodeId)},
+          continuation.selections.slice()
+        )
+      )
+    )
+  );
+  const body = [query, ...createFragments()];
+  return body;
+}
+
 /**
  * These fragments are used to construct the root query, and also to
  * fetch more pages of specific entity types.
```
</details>

wchargin-branch: ad-hoc-pagination-continuations
2018-04-05 02:23:17 -07:00
William Chargin 7711f01b84
Extract paginatable fragments of GitHub query (#120)
Summary:
Any time that we pull fields off a connection object, we may need to
repeat the query for subsequent pages. Therefore, such fragments will be
shared across multiple queries, and also shared within a query if we
need to fetch—say—more issue comments on two or more distinct issues.
This is a perfect use case for fragments.

This commit refactors the GitHub query to be organized in terms of
fragments, without changing the format of the results.

(We also take this opportunity to factor the page limits into
constants.)

Test Plan:
After running `yarn backend`, the `fetchGithubRepoTest.sh` test passes.

wchargin-branch: extract-github-query-fragments
2018-04-05 02:19:29 -07:00
William Chargin fbb6ec28db
Extract GitHub GraphQL code to a module (#119)
Summary:
Per #117, we want to develop an ad hoc pagination API written
specifically against the query that we use to interact with GitHub.
The pagination logic should be separate from the logic to actually fetch
the repo, but should be colocated with the query itself, so this commit
extricates the query from `fetchGithubRepo.js` into a new module.

Test Plan:
Existing tests pass, including `fetchGithubRepoTest.sh`.

wchargin-branch: extract-github-graphql
2018-04-05 00:08:52 -07:00