SourceCred: a reputation protocol for open collaboration
William Chargin 418b745d7c
Load Git repositories into memory (#139)
Summary:
In this newly added module, we load the structural state of a Git
repository into memory. We do not load the contents of any blobs into
memory, so this is not enough information to perform any analysis that
requires diffing file contents. However, it is sufficient to develop a
notion of “this file was changed in this commit”, simply by diffing
the trees.
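
As a rough illustration of that last point (using invented, flattened
data shapes for the sketch, not this module's actual API, which keys
trees by hash), comparing two commits' path-to-blob maps is enough to
recover the set of changed files:

```js
// Illustration only: each "tree" here is flattened to a Map from file path to
// blob hash. A path counts as changed if its blob hash differs between the
// parent commit's tree and the child commit's tree, or if it appears in only
// one of them.
function changedPaths(parentTree, childTree) {
  const changed = new Set();
  for (const [path, hash] of childTree) {
    if (parentTree.get(path) !== hash) {
      changed.add(path); // added or modified
    }
  }
  for (const path of parentTree.keys()) {
    if (!childTree.has(path)) {
      changed.add(path); // deleted
    }
  }
  return changed;
}

const parent = new Map([["README.md", "hash-a"], ["src/app.js", "hash-b"]]);
const child = new Map([["README.md", "hash-a"], ["src/app.js", "hash-c"]]);
console.log(changedPaths(parent, child)); // => only "src/app.js"
```

Anything beyond that (e.g., what changed *inside* a file) would require
the blob contents, which this module deliberately does not load.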

Test Plan:
Unit tests added; `yarn test` suffices. Reading these snapshots is
pretty easy, even though they’re filled with hashes:
  - First, read over the commit specifications on lines 69–83 of
    `loadRepository.test.js`, so you know what to expect.
  - In the snapshot file, keep handy the time-ordered list of commit
    SHAs at the bottom of the file, so that you know which commit SHA is
    which.
  - To verify that the large snapshot is correct: for each commit, read
    the corresponding tree object and make sure that the structure is
    correct.
  - To verify the small snapshot, just check that it’s the correct
    subset of the large snapshot.
  - If you want to verify that the SHA for a blob is correct, open a
    terminal and run `git hash-object -t blob --stdin`; then, enter the
    content of the blob and press `<C-d>`. The result is the blob SHA.
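
If you prefer a non-interactive check: a Git blob's SHA-1 is the hash
of the header `blob <byte length>\0` followed by the blob's exact
contents, so a standalone Node snippet (not part of this module) gives
the same answer as `git hash-object`:

```js
const crypto = require("crypto");

// A blob's object ID is SHA-1 over "blob <byte length>\0" + contents.
function blobSha(contents) {
  const body = Buffer.from(contents);
  const header = Buffer.from(`blob ${body.length}\0`);
  return crypto
    .createHash("sha1")
    .update(Buffer.concat([header, body]))
    .digest("hex");
}

// Same result as `printf 'hello\n' | git hash-object -t blob --stdin`:
console.log(blobSha("hello\n")); // ce013625030ba8dba906f756967f9e9ca394464a
```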

To run a sanity check on a large repository, apply the following patch:

<details>
<summary>Patch to print out statistics about loaded repository</summary>

```diff
diff --git a/config/paths.js b/config/paths.js
index d2f25fb..8fa2023 100644
--- a/config/paths.js
+++ b/config/paths.js
@@ -62,5 +62,6 @@ module.exports = {
     fetchAndPrintGithubRepo: resolveApp(
       "src/plugins/github/bin/fetchAndPrintGithubRepo.js"
     ),
+    loadRepository: resolveApp("src/plugins/git/loadRepository.js"),
   },
 };
diff --git a/src/plugins/git/loadRepository.js b/src/plugins/git/loadRepository.js
index a76b66c..9380941 100644
--- a/src/plugins/git/loadRepository.js
+++ b/src/plugins/git/loadRepository.js
@@ -106,3 +106,7 @@ function findTrees(git: GitDriver, rootTrees: Set<Hash>): Tree[] {
   }
   return result;
 }
+
+const result = loadRepository(...process.argv.slice(2));
+console.log("commits", result.commits.size);
+console.log("trees", result.trees.size);
```
</details>

Then, run `yarn backend` and put the following script in `test.sh`:

<details>
<summary>Contents of `test.sh`</summary>

```shell
#!/bin/bash
set -eu

repo="$1"
ref="$2"

via_node() {
    node bin/loadRepository.js "${repo}" "${ref}"
}

via_git() (
    cd "${repo}"
    printf 'commits '
    git rev-list "${ref}" | wc -l
    printf 'trees '
    git rev-list "${ref}" |
        while read -r commit; do
            git rev-parse "${commit}^{tree}"
            git ls-tree -rt "${commit}" \
                | grep ' tree ' \
                | cut -f 1 | cut -d ' ' -f 3
        done | sort | uniq | wc -l
)

echo
printf 'Running directly via git...\n'
time a="$(via_git)"

echo
printf 'Running Node script...\n'
time b="$(via_node)"

diff -u <(cat <<<"${a}") <(cat <<<"${b}")
```
</details>

Finally, run `./test.sh /path/to/some/repo origin/master`, and verify
that it exits successfully (zero diff). Here are some timing results on
SourceCred and TensorBoard:

  - SourceCred: 0.973s via Node, 0.327s via git.
  - TensorBoard: 30.836s via Node, 6.895s via git.

For TensorFlow, running via git takes 7m33.995s. Running via Node
fails with an out-of-memory error after 39 minutes, on a machine with
10GB RAM and 4GB swap. See details below.

<details>
<summary>
Full timing details, commit SHAs, and OOM error message
</summary>

```
+ ./test.sh /home/wchargin/git/sourcecred 01634aabcc

Running directly via git...

real	0m0.327s
user	0m0.016s
sys	0m0.052s

Running Node script...

real	0m0.973s
user	0m0.268s
sys	0m0.176s
+ ./test.sh /home/wchargin/git/tensorboard 7aa1ab9d60671056b8811b7099eec08650f2e4fd

Running directly via git...

real	0m6.895s
user	0m0.600s
sys	0m0.832s

Running Node script...

real	0m30.836s
user	0m3.216s
sys	0m10.588s
+ ./test.sh /home/wchargin/git/tensorflow 968addadfd4e4f5688eedc31f92a9066329ff6a7

Running directly via git...

real	7m33.995s
user	5m21.124s
sys	1m5.476s

Running Node script...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x121a2cc [node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::NewFixedArray(int, v8::internal::PretenureFlag) [node]
 6: v8::internal::DeoptimizationInputData::New(v8::internal::Isolate*, int, v8::internal::PretenureFlag) [node]
 7: v8::internal::compiler::CodeGenerator::PopulateDeoptimizationData(v8::internal::Handle<v8::internal::Code>) [node]
 8: v8::internal::compiler::CodeGenerator::FinalizeCode() [node]
 9: v8::internal::compiler::PipelineImpl::FinalizeCode() [node]
10: v8::internal::compiler::PipelineCompilationJob::FinalizeJobImpl() [node]
11: v8::internal::Compiler::FinalizeCompilationJob(v8::internal::CompilationJob*) [node]
12: v8::internal::OptimizingCompileDispatcher::InstallOptimizedFunctions() [node]
13: v8::internal::Runtime_TryInstallOptimizedCode(int, v8::internal::Object**, v8::internal::Isolate*) [node]
14: 0x12dc8b08463d
```
</details>

wchargin-branch: load-git-repositories


README.md

# SourceCred


## Vision

Open source software is amazing, and so are the creators and contributors who share it. How amazing? It's difficult to tell, since we don't have good tools for recognizing those people. Many amazing open-source contributors labor in the shadows, going unappreciated for the work they do.

As the open economy develops, we need to go beyond commit streaks and follower counts. We need transparent, accurate, and fair tools for recognizing and rewarding open collaboration. SourceCred aims to do that.

SourceCred will enable projects to create and track "cred", which is a quantitative measure of how much value different contributors added to a project. We'll do this by providing a basic data structure—a cred graph—into which projects can add all kinds of information about the contributions that compose it. For example, a software project might include information about GitHub pull requests, function declarations and implementations, design documents, community support, documentation, and so forth. We'll also provide an algorithm (PageRank) which will ingest all of this information and produce a "cred attribution", which assigns a cred value to each contribution, and thus to the people who authored the contributions.
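
To make that concrete with a purely illustrative toy (invented contribution names, and a far simpler graph and scoring pass than SourceCred will actually use): contributions form a graph, and a PageRank-style pass turns graph structure into per-contribution scores.

```js
// Toy cred graph: an edge [src, dst] means "src builds on dst", so cred
// flows from src to dst. Scores come from a simplistic PageRank power
// iteration; this is not SourceCred's actual data format or algorithm.
const edges = [
  ["pull-request-1", "design-doc"],
  ["pull-request-2", "design-doc"],
  ["pull-request-2", "pull-request-1"],
  ["docs-update", "pull-request-1"],
];
const nodes = [...new Set(edges.flat())];

function pageRank(nodes, edges, damping = 0.85, iterations = 50) {
  let scores = new Map(nodes.map((n) => [n, 1 / nodes.length]));
  const outDegree = new Map(nodes.map((n) => [n, 0]));
  for (const [src] of edges) outDegree.set(src, outDegree.get(src) + 1);
  for (let i = 0; i < iterations; i++) {
    const next = new Map(nodes.map((n) => [n, (1 - damping) / nodes.length]));
    for (const [src, dst] of edges) {
      next.set(dst, next.get(dst) + (damping * scores.get(src)) / outDegree.get(src));
    }
    scores = next;
  }
  return scores;
}

console.log(pageRank(nodes, edges)); // "design-doc" ends up with the most cred
```

In the real system, the graph will carry many kinds of nodes and edges (pull requests, design docs, documentation, community support), and the community will be able to adjust how cred flows; the toy above only shows where the numbers come from.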

## Principles

SourceCred aims to be:

  1. Transparent

    If it's to be a legitimate and accepted way of tracking credit in projects, cred attribution can't be a black box. SourceCred will provide tools that make it easy to dive into the cred attribution, and see exactly why contributions were valued the way they were.

  2. Community-controlled

    At the end of the day, the community of collaborators in a project will know best which contributions were important and deserve the most cred. No algorithm will do that perfectly on its own. To that end, we'll empower the community to modify the cred attribution, by adding human knowledge into the cred graph.

  3. Forkable

    Disputes about cred attribution are inevitable. Maybe a project you care about has a selfish maintainer who wants all the cred for themself :(. Not to worry—all of the cred data will be stored with the project, so you are empowered to solve cred disputes by forking the project.

## Roadmap

SourceCred is currently in a very early stage. We are working full-time to develop an MVP, which will have the following basic features:

  • Create: The GitHub Plugin populates a project's GitHub data into a Contribution Graph. SourceCred uses this seed data to produce an initial, approximate cred attribution.

  • Read: The SourceCred Explorer enables users to examine the cred attribution, and all of the contributions in the graph. This reveals why the algorithm behaved the way that it did.

  • Update: The Artifact Plugin allows users to put their own knowledge into the system by adding new "Artifact Nodes" to the graph. An artifact node allows users to draw attention to contributions (or groups of contributions) that are particularly valuable. They can then merge this new information into the project repository, making it canonical.

## Community

Please consider joining our community.