This commit modifies `discourse/createGraph` so that it finds all of the
same-server Discourse references in Discourse posts, and creates
appropriately typed reference edges in response.
The unit tests have been updated with cases for both references that
should exist, and references that shouldn't (e.g. post index out of
bounds, or a reference to the wrong server).
Test plan: `yarn test --full` along with snapshot update.
This is progress towards [Discourse reference and mention detection][1].
[1]: https://discourse.sourcecred.io/t/discourse-reference-mention-detection/270
The `discourse/references` module now has a `linksToReferences` method
which extracts the parsed Discourse references from an array of
hyperlinks. The method is tested.
Test plan: Unit tests added; `yarn test` passes.
This is progress towards [Discourse reference and mention detection][1].
[1]: https://discourse.sourcecred.io/t/discourse-reference-mention-detection/270
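A rough sketch of this kind of extraction follows. It is not the module's actual code: the URL shapes (`/t/<slug>/<topicId>[/<postIndex>]` and `/u/<username>`) and the fields of the returned objects are assumptions made for illustration.

```javascript
// Hypothetical sketch: map hyperlinks to Discourse references.
// The URL shapes and the reference object fields are assumptions.
function linksToReferences(links) {
  const references = [];
  for (const link of links) {
    let url;
    try {
      url = new URL(link);
    } catch (e) {
      continue; // not an absolute URL; skip it
    }
    const serverUrl = url.origin;
    const parts = url.pathname.split("/").filter((p) => p.length > 0);
    if (parts[0] === "t" && parts.length >= 3 && /^\d+$/.test(parts[2])) {
      const topicId = Number(parts[2]);
      if (parts.length >= 4 && /^\d+$/.test(parts[3])) {
        const postIndex = Number(parts[3]);
        references.push({type: "POST", serverUrl, topicId, postIndex});
      } else {
        references.push({type: "TOPIC", serverUrl, topicId});
      }
    } else if (parts[0] === "u" && parts.length >= 2) {
      references.push({type: "USER", serverUrl, username: parts[1]});
    }
  }
  return references;
}
```

Keeping `serverUrl` on each reference lets a caller discard references that point at a different server.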
Summary:
The notes used to focus on the legacy implementation with a minor note
about the EAV implementation; this change flips that relationship.
Test Plan:
None.
wchargin-branch: mirror-eav-impl-notes
Summary:
This flips the switch for all production `Mirror` reads to use the
single `primitives` EAV table as their source of truth, rather than the
legacy type-specific primitives tables. For context and design
discussion, see issue #1313 and commits adjacent to this one.
Test Plan:
All relevant code paths are already tested (see test plans of commits
adjacent to this one). Running `yarn test --full` passes.
wchargin-branch: mirror-eav-flip
Summary:
This completes the end-to-end EAV mode pipeline, but does not yet set it
as default or use it in production.
A note about indentation: we take care to avoid reindenting the entire
block of `extract` test cases, which is over 900 lines long. As to the
implementation code, reindenting the legacy type-specific primitives
branch is not easily avoidable, but when we remove that branch we won’t
have to reindent the EAV mode branch: we can replace its `if` block with
two scope blocks (which is the right thing to do, anyway).
Test Plan:
We reuse existing tests, which suffice for full coverage in both
implementation branches. Note that these tests cover the case of object
types with no primitive fields (the `Feline` and `Socket` types), which
are more likely to fail in a broken EAV implementation than in a broken
type-specific primitives implementation due to deletion anomalies.
To check that all relevant calls to `mirror.extract(…)` have been
properly replaced with `extract(mirror, …)`, run
    yarn coverage -f graphql/mirror -t 'EAV primitives'
and note that the “else” path of the `if (fullOptions.useEavPrimitives)`
branch is not taken; then, run
    yarn coverage -f graphql/mirror -t 'legacy type-specific primitives'
and note that the “if” path of the same branch is not taken.
To check that the table hiding logic is working, invert the branch that
checks `if (fullOptions.useEavPrimitives)`, and note that every test
case using the table hiding logic fails (except for some of the error
handling test cases, which do not actually need to read primitive data).
Finally, `yarn test --full` passes after flipping the `useEavPrimitives`
default to `true`.
wchargin-branch: mirror-eav-extract
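To illustrate why object types with no primitive fields deserve special attention: in an entity-attribute-value layout, such an object contributes no rows at all, so a reader that derives the set of objects from the primitive rows alone would silently drop it. The sketch below shows one way to avoid that. The in-memory row shapes (`id`, `typename`, `value`) and the function name are assumptions, not the `Mirror` implementation; only `fieldname` comes from the actual schema.

```javascript
// Hypothetical sketch of reading from an entity-attribute-value layout.
// Row shapes and names are assumed for illustration.
function extractPrimitives(objects, primitives) {
  const result = new Map();
  // Seed from the list of objects, so that objects with no primitive
  // fields (and hence no EAV rows) are still present in the output.
  for (const {id, typename} of objects) {
    result.set(id, {id, typename, fields: {}});
  }
  for (const {id, fieldname, value} of primitives) {
    const entry = result.get(id);
    if (entry == null) {
      throw new Error(`primitive row for unknown object: ${id}`);
    }
    entry.fields[fieldname] = value;
  }
  return Array.from(result.values());
}
```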
This is a minor refactor to re-organize the createGraph function in the
Discourse plugin to use a class under the hood. Using a hidden class
makes sense because there is a fair bit of shared state that's needed
while creating the graph.
The proximate cause for this refactor is that adding reference edges will
bloat the `addPost` section of the function, which was already a little
too complex. Simply shoving in more complexity would make it unwieldy. So
I opted for this minor refactor. It's internal-only (no public APIs are
changed).
Test plan: `yarn test` passes. As noted, refactor is internal-only.
This is progress towards [Discourse reference and mention detection][1].
[1]: https://discourse.sourcecred.io/t/discourse-reference-mention-detection/270
This commit adds a `parseLinks` method to a new module,
`plugins/discourse/references`. `parseLinks` allows us to extract the
hyperlinks from `<a>` tags in "cooked" html.
I added `htmlparser2` as a dependency to parse the html. There were a
lot of options to choose from; I chose htmlparser2 because it has a lot
of usage, reasonable performance, and suits our needs. We use this
dependency in a lightweight and local way, so we can always change it
later if needed.
One thing which was a bit odd: I wasn't able to import it using
`import`, and needed a `require` statement instead.
Test plan: Unit tests added; `yarn test` passes.
This is progress towards [Discourse reference and mention detection][1].
[1]: https://discourse.sourcecred.io/t/discourse-reference-mention-detection/270
This modifies the Discourse fetcher and mirror so that we now keep post
contents around, thus enabling future reference detection (and other
things). The post contents are stored and provided as retrieved from the
API, which is in "cooked" HTML form.
Test plan: Unit tests and snapshots updated. Observe that the snapshots
now include Discourse post contents.
This is progress towards [Discourse reference and mention detection][1].
[1]: https://discourse.sourcecred.io/t/discourse-reference-mention-detection/270
We need one tiny change in test code, where Flow (correctly) detects an
error. I've added an error suppression comment because it is truly a
Flow error, but is appropriate as we are testing an error condition.
Test plan: `yarn test`
This commit modifies the TimelineExplorer so that the user can both see
the chosen alpha value, and change it. Alpha has a pretty profound
impact on the final scores, and I want to tweak it for CredSperiment
week two, so this is an important addition.
Test plan: Modify the alpha, re-run cred calculation, and observe that
the scores change. `yarn test` passes.
This commit integrates the identity plugin, which was created in #1384.
It does this by adding explicit identity fields to the project
configuration, which are then applied when loading the graph in
`api/load.js`.
The actual integration is quite straightforward.
Test plan: The underlying logic is thoroughly tested; I added one new
test case to verify that it is integrated properly. Since the project
compat has changed, I've updated all the snapshots. Prior to merging
this PR, I will produce one "integration test", using this code to do
identity resolution for a real project (i.e. on the SourceCred instance
itself).
This commit adds the new SourceCred identity plugin. As described in the
README.md file:
This folder contains the Identity plugin. Unlike most other plugins, the
Identity plugin does not add any new contributions to the graph. Instead, it
allows collapsing different user accounts together into a shared 'identity'
node.
To see why this is valuable, imagine that a contributor has an account on both
GitHub and Discourse (potentially with a different username on each service).
We would like to combine these two identities together, so that we can
represent that user's combined cred properly. The Identity plugin enables this.
Specifically, the instance maintainer can provide a (locally unique) username
for the user, along with a list of aliases the user is known by, e.g.
`github/username` and `discourse/other_username`. The aliases are simple string
representations, that are intended to be easy to maintain by hand in a
configuration file. Then, the identity plugin will provide a list of
`NodeContraction`s that can be used by `Graph.contractNodes` to combine the
user identities as described.
The plugin is broken up into a few submodules:
- `declaration.js` provides the PluginDeclaration. It has a single node
type (the identity node).
- `identity.js` declares the `Identity` type (a username and list of
aliases), allows constructing identity nodes, and does some validation
on the identity username.
- `alias.js` implements the logic for parsing aliases like
"github/decentralion" or "discourse/s_ben" into a node address.
- `nodeContractions.js` provides logic for turning a list of Identities
into a list of NodeContractions, suitable for use in
`Graph.contractNodes`.
The plugin is not yet integrated; that will come in a follow-on commit.
Test plan: Unit tests added; `yarn test` passes.
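As a minimal sketch of the alias idea, the following turns a `service/name` string into an array of node address parts. The address layouts used here are assumptions for illustration (the Discourse one in particular is invented), and this is not the code in `alias.js`.

```javascript
// Hypothetical sketch of alias parsing.
// The address layouts below are assumptions, not the plugin's real ones.
function parseAlias(alias) {
  const match = /^([a-z]+)\/(.+)$/.exec(alias);
  if (match == null) {
    throw new Error(`Unable to parse alias: ${alias}`);
  }
  const [, service, name] = match;
  switch (service) {
    case "github":
      return ["sourcecred", "github", "USERLIKE", "USER", name];
    case "discourse":
      return ["sourcecred", "discourse", "user", name];
    default:
      throw new Error(`Unknown alias service: ${service}`);
  }
}
```

Keeping the alias format this simple is what makes it practical to maintain by hand in a configuration file.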
Currently attempting to load the SourceCred discourse instance fails
with foreign key constraint errors.
Basically, we have a few weird situations:
- A post (which corresponds to the 'pseudo-topic' generated by creating
a new category) is picked up, but its topic is not detected, because
Discourse does not list these 'pseudo-topics' in the latest topic
endpoint. Attempting to add the post breaks the foreign key constraint.
- We have several likes which correspond to posts that don't exist.
Possibly they were deleted? I'm not sure.
Right now, the load process fails entirely when it hits these
exceptions, which is bad. It should print a warning instead, and
continue without the offending interactions. This commit effects that
change in behavior.
Test plan:
Before this commit, loading the SourceCred discourse with a clean cache
fails. After building with this commit, loading the SourceCred discourse
with a clean cache works and prints the following warnings:
```
$ node bin/sourcecred.js discourse https://discourse.sourcecred.io credbot
GO load-discourse.sourcecred.io
GO discourse
GO discourse/topics
DONE discourse/topics: 3m 53s
GO discourse/posts
Warning: Encountered error 'FOREIGN KEY constraint failed' while adding
post https://discourse.sourcecred.io/t/214/1.
DONE discourse/posts: 2m 38s
GO discourse/likes
DONE discourse/likes: 50s
DONE discourse: 7m 21s
GO compute-cred
DONE compute-cred: 547ms
DONE load-discourse.sourcecred.io: 7m 22s
```
Also, unit tests have been added that verify the specific behavior
changes.
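The warn-and-continue behavior can be sketched roughly as follows. The function and parameter names are hypothetical; `addPost` stands in for whatever performs the database insert and may throw on a constraint violation.

```javascript
// Hypothetical sketch: add each post, printing a warning instead of
// aborting the whole load when a single insert fails.
function addPostsTolerantly(posts, addPost, warn = console.warn) {
  let added = 0;
  for (const post of posts) {
    try {
      addPost(post);
      added++;
    } catch (e) {
      warn(
        `Warning: Encountered error '${e.message}' while adding post ${post.url}.`
      );
    }
  }
  return added;
}
```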
Fixes #1353
Tested manually by creating a docker image including the changes.
Running the dev-preview @passbolt command until completion.
(once hitting the github rate limit, once till #1354 happens)
No more problematic interactions show up during load.
This fixes a bug introduced in #1371, where selecting a type other than
"All users" and then trying to reselect "All users" would break the UI.
Test plan: Manual inspection; load an instance, try selecting a
different type, and then go back to "All users". It now works as
expected.
This adds a new command, `discourse`, which makes it convenient to load
Discourse servers as standalone SourceCred projects.
For example, you could load the official SourceCred discourse via the
following:
```sh
export SOURCECRED_DISCOURSE_KEY=....
yarn backend
node bin/sourcecred.js discourse https://discourse.sourcecred.io credbot
yarn start
```
I've updated the README with instructions for using the plugin.
Test plan: No automated testing because I see this tool as a temporary
placeholder until we get the SourceCred instances setup. I manually
tested the error cases (e.g. providing an invalid server url) as well as
success cases like the one above. I validated that the weights file
argument is being interpreted correctly (i.e. trying to load invalid
weights produces an expected error message, loading valid weights
results in those weights being present in the UI).
Allow getting the node address for a user, given the user's login. This
will be needed by the upcoming identity plugin.
If the login in question corresponds to a bot, then a bot address will
be returned. When we make the bot-set configurable (rather than
hardcoded), we'll need to change the signature of this function; I think
that's fine.
Test plan: Unit tests added. (Also, it's really simple.)
This commit adds Graph.contractNodes, which allows collapsing certain
nodes in the graph into each other. This will enable the creation of a
SourceCred "identity" plugin, allowing identity resolution between a
user's different accounts on different services.
Test plan: Thorough unit tests have been added. `yarn test` passes.
Thanks to @wchargin for [review feedback][1] which significantly
improved this API.
[1]: https://github.com/sourcecred/sourcecred/pull/1380#discussion_r324958055
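For intuition, here is a simplified sketch of node contraction on a plain `{nodes, edges}` structure. It is not the real `Graph.contractNodes` API: the data shapes and the `{old, replacement}` contraction format are assumptions made for illustration.

```javascript
// Simplified sketch of node contraction; shapes are assumptions.
// Each contraction replaces the `old` nodes with a single `replacement`
// node, and re-points every edge that touched an old node.
function contractNodes(graph, contractions) {
  const remap = new Map();
  const nodes = new Set(graph.nodes);
  for (const {old, replacement} of contractions) {
    for (const address of old) {
      remap.set(address, replacement);
      nodes.delete(address);
    }
    nodes.add(replacement);
  }
  const rewrite = (address) =>
    remap.has(address) ? remap.get(address) : address;
  const edges = graph.edges.map(({src, dst}) => ({
    src: rewrite(src),
    dst: rewrite(dst),
  }));
  return {nodes: Array.from(nodes), edges};
}
```

Returning a new structure rather than mutating the input keeps the original graph usable for comparison.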
Summary:
Flow provides a utility type for this purpose; there’s no need to
implement, document, and keep it in sync ourselves:
<https://flow.org/en/docs/types/utilities/#toc-shape>
Test Plan:
As written, `yarn flow` passes. Changing the definition of `params` on
line 77 of `load.test.js` to add a key `foo: "wat"` or to change the value
of `weights` to `{hmm: "hmm"}` yields appropriate type errors.
wchargin-branch: use-shape
Summary:
This commit modifies `_updateOwnData` to write to both the old
type-specific primitives tables as well as the new EAV table. This
establishes the invariant that a node with non-null `last_update` will
always have primitive data (if its object type has primitive fields).
Test Plan:
Existing tests expanded. Commenting out each of the `updateEavPrimitive`
calls (independently) causes a test to fail. Note that every test that
queries an internal `primitives_*` table to inspect the database state
has been expanded to make an equivalent query against the `primitives`
table as well.
wchargin-branch: mirror-eav-update
Summary:
This establishes the invariant that every object in the `objects` table
has all relevant rows in the `primitives` table, though those rows’
values are never yet set.
Test Plan:
Unit tests updated. Manually loading `sourcecred/example-github` and
running `.dump primitives` generates reasonable-looking output, with
lots of rows, including entries for nested fields and eggs. Verified
that the set of non-`id` columns on `Issue` equals the set of values for
the `fieldname` column of an `Issue` object, and likewise for `Commit`s,
thus covering each kind of field.
wchargin-branch: mirror-eav-init
Summary:
See #1313 for context. The plan is to set up dual-writes with `extract`
calls still reading from the old tables until the new ones are complete
and tested. The primary risk to production would be a fatal exception in
the new write paths, which seems like an acceptable risk.
Test Plan:
Unit tests pass.
wchargin-branch: mirror-eav-schema
Summary:
Prior to this commit, removing the `addLink.run({id, fieldname})` on
line 487 of `mirror.js` would cause test failures down the pipeline, but
not at the root cause. Such an error is now caught earlier.
Test Plan:
Comment out line 487 of `mirror.js` and observe that the newly added
test case fails, but the other `registerObject` test cases do not.
wchargin-branch: mirror-test-registerobject-nested
For phase one of the CredSperiment, I need a SourceCred instance which combines GitHub and Discourse servers. I'll also need to be able to give it very specific configuration to collapse certain user identities together.
Shortly after launching the CredSperiment, I plan to come back and totally re-write SourceCred's command line interface and site building system, in a way that will throw away most of the existing codebase.
As such, I found it expedient to add rather hacky and untested support for loading combined GitHub/Discourse instances, so I can land the promised features. This PR does so by:
- adding sourcecred gen-project for constructing project.json files
- adding sourcecred load --project for loading a project.json file
- ensuring that load provides the right plugins based on the project that's in scope
- updating build_static_site so that it can use the new --project flag
Test plan:
I have done some end-to-end testing, but the overall commit stack lacks automated testing. This is a deliberate tradeoff: I'm planning to re-write this section of the codebase, and the testing ergonomics are not great, so I'd rather accept some technical debt, especially since I plan to pay it off soon.
See the pull request on GitHub for the individual constituent commits.
As suggested by @Beanow in [a review comment][1], this commit factors
loading weights from disk into a cli/common utility method.
The actual method is really generic, and we have a number of similar
constructions across the codebase (grep for `JSON.parse` to find them).
I considered factoring out a generic utility for loading and
deserializing JSON data from disk in general, but it didn't seem
valuable enough at this time.
Test plan: Unit tests added, existing tests pass.
[1]: https://github.com/sourcecred/sourcecred/pull/1374#discussion_r323149740
At present, every place in the codebase that needs
TimelineCredParameters constructs them ad-hoc, meaning we don't have any
shared defaults across different consumers.
This commit adds a new type, `PartialTimelineCredParameters`, which
is basically `TimelineCredParameters` with every field marked optional.
Callers can then choose to override any fields where they want
non-default values. A new internal `partialParams` function promotes
these partial parameters to full parameters.
All the public interfaces for using params (namely,
`TimelineCred.compute` and `TimelineCred.reanalyze`) now accept optional
partial params. If the params are not specified, default values are
used; if partial params are provided, all the explicitly provided values
are used, and unspecified values are initialized to default values.
Test plan: A simple unit test was added to ensure that weights overrides
work as intended. `git grep "intervalDecay: "` reveals that there are no
other explicit parameter constructions in the codebase. All existing
unit tests pass.
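The promotion from partial to full parameters might look roughly like this. The default values and the exact set of fields are placeholders chosen for illustration, not the project's real defaults.

```javascript
// Hypothetical defaults; these values are placeholders.
const DEFAULT_PARAMS = Object.freeze({
  alpha: 0.05,
  intervalDecay: 0.5,
  weights: {},
});

// Promote partial params to full params: explicitly provided values win,
// and anything unspecified falls back to the default.
function partialParams(partial) {
  const params = {...DEFAULT_PARAMS};
  for (const [key, value] of Object.entries(partial || {})) {
    if (value !== undefined) {
      params[key] = value;
    }
  }
  return params;
}
```

Skipping `undefined` values means that a caller passing `{alpha: undefined}` still gets the default alpha.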
The `timelineCred.js` file is a bit of a beast. One way to start
slimming it down is to pull the parameters into their own file. This is
especially helpful as I'm planning a follow-on PR that will colocate the
default parameter values with their declaration.
The naming of everything in the `/timeline/` subdirectory is a bit
wonky: it reflects that at the time of creation, "Timeline" designated
an experimental version of SourceCred. Now, it is becoming canonical,
but the cumbersome naming persists. I haven't made any effort to tackle
the name debt here.
Test plan: `yarn test` passes; since this is merely a code
reorganization, this gives me great confidence that the change is
correct. I also added a few small tests to the new module. Although the
behavior in question is already tested, I think setting up test files
liberally is a good practice, as the existence of the test file invites
the creation of more tests.
Now that we're adding support for the Discourse plugin, we'll start
having >1 plugin present in the frontend again. As such, we should
provide clear grouping of types in the frontend so that it's possible to
distinguish between a GitHub user and a Discourse user. This commit does
just that, by resurrecting code that we used when the GitHub and Git
plugins co-existed in the frontend.
Test plan: Launch the frontend and observe that node types in the filter
selection dropdown are grouped by the name of their plugin. Also,
clicking on the name of a plugin should filter to all nodes from that
plugin.
Previously, the `sourcecred scores` command assumed that all users are
GitHub users, and assigned users an id based on their GitHub login.
Now, the command returns information on all users, regardless of which
plugin provided them. As such, we need to identify users differently.
Instead of a string id, they now have an array of address parts. That
array contains all of the parts of their corresponding node address.
For example, the GitHub user `@Beanow` would correspond to the address
array `["sourcecred", "github", "USERLIKE", "USER", "Beanow"]`
As a general convention, the first two components of any node's address
contain information about the plugin that owns that node. The first
component is the owner of the plugin, and the second is the name of the
plugin. Afterwards, the plugin may represent nodes in whatever manner it
sees fit.
Thanks to @Beanow and @vsoch for some feedback and discussion on this
design.
Test plan: Snapshots have been updated. `yarn test` passes.
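Under that convention, a consumer of the command's output could recover the owning plugin from the address parts. This is a small sketch; the helper name is hypothetical.

```javascript
// Hypothetical helper: read the plugin owner and plugin name from the
// first two components of a node's address parts.
function pluginOf(addressParts) {
  if (addressParts.length < 2) {
    throw new Error("address too short to identify a plugin");
  }
  const [owner, name] = addressParts;
  return {owner, name};
}
```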
Now instead of always defaulting to GitHub users, it shows all
user-typed nodes. This will make SourceCred work non-hackily when there
is e.g. just a Discourse plugin in scope.
I also fixed an issue where it was loading the GitHub declaration in a
hardcoded way, instead of properly getting it from the TimelineCred's
plugin array.
Test plan: Manual UI inspection.
This is a convenience method that extracts cred for all the user-typed
nodes. It's basically an abstraction over calling `credSortedNodes` with
the right set of prefixes.
I foresee using it in at least two places (score retrieval in the CLI and
score display in the frontend) so I decided to make it a method.
Test plan: A very simple unit test was added. (It's a very simple
wrapper function.)
This lets us filter by a group of prefixes simultaneously, which enables
e.g. seeing all user node types at once.
I also tweaked the API to make it a bit more convenient: you can now
pass no arguments and get all nodes in sorted order.
Test plan: Unit tests updated.
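A standalone sketch of the idea, assuming nodes are plain objects with an `address` array and a `cred` number (the real method lives on `TimelineCred` and its signature may differ):

```javascript
// Sketch: keep nodes whose address starts with any of the given prefixes,
// sorted by descending cred. With no prefixes, all nodes are returned.
function credSortedNodes(nodes, prefixes) {
  const hasPrefix = (address, prefix) =>
    prefix.length <= address.length &&
    prefix.every((part, i) => address[i] === part);
  const matches = (node) =>
    prefixes == null || prefixes.some((p) => hasPrefix(node.address, p));
  // `filter` returns a fresh array, so sorting does not mutate the input.
  return nodes.filter(matches).sort((a, b) => b.cred - a.cred);
}
```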
The PluginDeclaration has all of the information we need to configure
TimelineCred: it knows all the node and edge types, as well as which
node types are user (or scoring) node types.
Therefore, we can replace the ad-hoc config object with a simple array
of plugin declarations. Since the plugins will be saved as part of the
TimelineCred, it means the UI can be configured to only show information for
plugins that are actually in scope.
Test plan: `yarn test` passes, and the prototype still works. Snapshots
updated.
When a post or topic is deleted, Discourse fetch will give status 410.
As with 404 and 403, we should just ignore the post and move on.
I took the opportunity to slightly refactor the fetch error handling
while I was there.
Test plan: Previously, doing a load on the SourceCred discourse instance
would fail due to a deleted topic. Now, it doesn't.
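The status handling can be sketched as a small classifier. The function name and return values are hypothetical; only the set of ignored statuses comes from the description above.

```javascript
// Sketch: treat forbidden, not-found, and gone responses as missing data
// that should be skipped rather than as fatal errors.
const IGNORED_STATUSES = new Set([403, 404, 410]);

function classifyResponse(status) {
  if (status >= 200 && status < 300) {
    return "ok";
  }
  if (IGNORED_STATUSES.has(status)) {
    return "ignore";
  }
  return "error";
}
```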
This modifies the pluginDeclaration so that it can specify user node
types. This will allow us to replace the TimelineCredConfig type with a
plugin collection instead.
It's expected that the user types will also be present in the node
types, although this isn't validated anywhere at present.
Test plan: `yarn flow`.
This updates the cred computation logic so that we can have multiple
"scoring node types".
Context: Currently, we designate a single node type (GitHub users) as
the scoring node type, and normalize so that all users have 1000 score
in total.
This commit updates the pipeline to admit using more than one prefix for
scoring, meaning that we could have GitHub users, Discourse users, and
more, and still have all users sum to 1000 score.
We will still need to update the frontend so that it will have a user
pane which aggregates across all users.
Test plan: Unit tests updated. `yarn test` passes.
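The normalization can be sketched as below, using string addresses and string prefixes as a simplification. It assumes that every node's score is scaled by the same factor, chosen so that the scoring nodes sum to the target; the names are hypothetical.

```javascript
// Sketch: scale all scores so that nodes matching any scoring prefix
// sum to `target`. Addresses are simplified to plain strings here.
function normalizeScores(scores, scoringPrefixes, target = 1000) {
  const isScoring = (address) =>
    scoringPrefixes.some((prefix) => address.startsWith(prefix));
  let total = 0;
  for (const [address, score] of scores) {
    if (isScoring(address)) {
      total += score;
    }
  }
  if (total === 0) {
    return new Map(scores); // nothing to normalize against
  }
  const factor = target / total;
  return new Map(Array.from(scores, ([a, s]) => [a, s * factor]));
}
```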
Summary:
This adds `MDM6Qm90NDY0NDczMjE=` (`@allcontributors`) to the blacklist
to enable loading the `aragon/aragon` repository. See #1362 and #996 for
context.
Test Plan:
Running `node ./bin/sourcecred.js load aragon/aragon` on a clean cache
now completes successfully.
wchargin-branch: blacklist-allcontributors
This changes how TimelineCred filtering works. Instead of using the
filterTimelineCred module, which includes all nodes matching
filterPrefixes, we now take all nodes matching scorePrefixes and
additionally the top `k` nodes for every other type.
This ensures that we will have the top comments, pull requests, issues,
etc in the UI, without needing to take every single comment or PR or
issue.
Concurrently, the UI is updated so that every type is included in the
filter dropdown.
CHANGELOG has been updated, since this is user facing.
Test plan: `yarn test` passes, snapshots are updated, and I also tested
the UI manually.
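A rough sketch of the new filtering rule, with nodes simplified to `{address, type, cred}` objects and string prefixes (assumed shapes, not the actual TimelineCred code):

```javascript
// Sketch: keep every node matching a score prefix, plus the top `k`
// nodes (by cred) of each remaining type.
function filterNodes(nodes, scorePrefixes, k) {
  const isScoring = (node) =>
    scorePrefixes.some((prefix) => node.address.startsWith(prefix));
  const kept = nodes.filter(isScoring);
  const byType = new Map();
  for (const node of nodes) {
    if (isScoring(node)) {
      continue;
    }
    if (!byType.has(node.type)) {
      byType.set(node.type, []);
    }
    byType.get(node.type).push(node);
  }
  for (const group of byType.values()) {
    group.sort((a, b) => b.cred - a.cred);
    kept.push(...group.slice(0, k));
  }
  return kept;
}
```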
TimelineCred computation is implemented as follows:
- Compute Distribution
- Filter it down to specified node types
- Wrap the filtered results into a TimelineCred
I want to change how the filtering works. The new filtering logic will
depend on logic we've already implemented in TimelineCred; therefore
filtering should be done on the TimelineCred object and not separately.
Specifically, I want to be able to filter down to the highest-scored
nodes by type (dependent on the type).
As a first step, I've refactored the interface to TimelineCred so that
the filtering is an implementation detail, i.e. the TimelineCred
constructor doesn't expect objects defined in `filterTimelineCred`.
Test plan: `yarn test` passes after a snapshot update.
This modifies the TimelineCred serialization so that it includes the
CredConfig in the JSON. This means that it's easier to coordinate which
plugins and types are in scope, as the data itself can contain that
information.
Rather than define a new hand-rolled serializer, I just passed the
config directly through for stringification. Unit tests verify that this
still works (round-trip serialization is tested). As an added sanity
check, I generated a new small `cred.json`, and inspected the file via
`cat` to ensure that it's still legible text, and isn't interpreted as a
binary file due to the `NUL` bytes in node addresses.
Every client that previously depended on the `DEFAULT_CRED_CONFIG` now
properly gets its cred configuration from the JSON.
Test plan: Unit tests for serialization already exist. Generated a fresh
`cred.json` file and tested the frontend with it. Also,
`yarn test --full` passes.
Blacklist more problematic quasar interactions
Summary:
Context: <https://github.com/sourcecred/sourcecred/issues/1256#issuecomment-526252852>
Without also blacklisting the reaction, we hit an invariant violation in
the relational view (reactions are expected to have exactly one author).
Test Plan:
Running `node ./bin/sourcecred.js load quasarframework/quasar-cli` now
completes successfully (in about 2 minutes 40 seconds). It does emit a
warning:
```
Issue[MDU6SXNzdWUzNDg0NjUzNDg=].reactions: unexpected null value
```
…because one of the reactions was blacklisted. But the relational view
handles this correctly, it seems: timeline cred is still computed and
renders without obvious error.
wchargin-branch: blacklist-more-quasar
Summary:
The format of GitHub’s GraphQL object IDs is explicitly opaque, and so
we must not introspect them in any way that would influence our results.
But it seems reasonable to introspect these IDs solely for diagnostic
purposes, enabling us to proactively detect GitHub’s contract violations
while we still have useful information about the root cause.
This commit adds an optional `guessTypename` option to the Mirror
constructor, which accepts a function that attempts to guess an object’s
typename based on its ID. If the guess differs from what the server
claims, we continue on as before, but emit a console warning to help
diagnose the issue more quickly.
Resolves #1336. See that issue for details.
Test Plan:
Unit tests for `mirror.js` updated, retaining full coverage. To test
manually, revert #1335, then load `quasarframework/quasar-cli`. Note
that it emits the following warning before failing:
> Warning: when setting Reaction["MDg6UmVhY3Rpb24zNDUxNjA2MQ=="].user:
> object "MDEyOk9yZ2FuaXphdGlvbjQzMDkzODIw" looks like it should have
> type "Organization", but the server claims that it has type "User"
Unit tests for the GitHub typename guesser added as well.
Running `yarn test --full` passes.
wchargin-branch: mirror-guess-typenames
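For illustration, a guesser along these lines could look like the sketch below. It relies on the observation that the IDs quoted in the warning base64-decode to strings such as `012:Organization43093820`; since the ID format is explicitly opaque, this is a diagnostic heuristic only, and the function is not necessarily the one added in this commit.

```javascript
// Heuristic sketch (Node.js): guess a typename from an object ID by
// base64-decoding it and looking for "<digits>:<Typename><digits>".
// Returns null when the decoded ID does not have that shape.
function guessTypename(id) {
  const decoded = Buffer.from(id, "base64").toString("utf-8");
  const match = /^\d+:([A-Za-z]+)\d+$/.exec(decoded);
  return match == null ? null : match[1];
}
```

A caller would compare the guess against the typename the server reports and only warn on a mismatch, never change behavior based on it.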
Summary:
The current implementation of `NullUtil.filterList` uses an `any`-cast.
This is fine as long as the definition is actually typesafe; we should
take at least a little care to ensure that it is. This commit adds a
typesafe version, commented out but still typechecked, and refines the
type around the `any`-cast to make the cast slightly more robust.
Test Plan:
Note that changing `$ReadOnlyArray<?T>` to `$ReadOnlyArray<?T | number>`
in the declaration of `filterList` caused no Flow error prior to this
commit, but now causes one.
wchargin-branch: filter-list-typecheck
PR #1325 introduced a failing snapshot test, which was promptly caught
by @wchargin. This commit fixes it by running
`./scripts/update_snapshots.sh`. Also, I bumped the project JSON version
number, which also should have happened in #1325.
Test plan: `yarn test --full` passes.
This commit modifies `cli/load` to appropriately load a Discourse key
from the environment, if it is available.
The mechanics are basically the same as with the GitHub token.
Test plan: Unit tests added. `yarn test` passes.
This commit modifies the `Project` type so that it allows settings for a
Discourse server, and ensures that `api/load` will appropriately load
the server in question, and include it in the output graph.
Putting the full Discourse declaration directly into the Project type is
an unsustainable development practice—in general, adding plugins should
not require changing core data types. However, at the moment I'm punting
on polishing the plugin system, in favor of adding the Discourse plugin
quickly, so I just put it into Project alongside the repo ids.
In the future, I expect to refactor the plugins around a much cleaner
interface; it's just not a priority as yet. (Tracking: #1120.)
This commit also makes the GitHub token optional in `api/load`, since
now it's very plausible that a user will want to only load a Discourse
server, and therefore not require a GitHub token.
As of this commit, it's still impossible to load Discourse projects, as
the CLI always sets a null Discourse server; and in any case, the
frontend would not properly display the project in question, as any
Discourse types would get filtered out.
Test plan: Mocking unit tests have been added to `api/load.test.js` to
ensure that the Discourse graph is loaded and merged correctly.
This adds a new method called `filter` to the `NullUtil` module.
`filter` enables you to filter all the null-like values out of an array
in a convenient typesafe way. (It's really just a wrapper around
`Array.filter((x) => x != null)` with a type signature.)
Test plan: Unit tests added (for both functionality and type safety).
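Stripped of its Flow type signature, the wrapper amounts to something like this sketch:

```javascript
// Sketch: drop null and undefined entries, keeping other falsy values
// such as 0, "", and false (note the loose `!=` comparison).
function filter(xs) {
  return xs.filter((x) => x != null);
}
```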
This is the analogue to `github/loadGraph`, but for Discourse. It
basically pipes together the mechanisms for loading Discourse data and
creating a Discourse graph from them, resulting in a single endpoint for
consumption in the API.
In contrast to github, the method is called `loadDiscourse` and not
`loadGraph`, which seemed more appropriate to me. I haven't changed
the corresponding GitHub method's name. (I'm currently knowingly letting
conceptual debt accumulate around the plugin interface; I expect to do a
full refactor within the next few months.)
Test plan: This is the kind of "pipe together tested APIs involving IO"
code which I have decided not to write explicit tests for. However, it
is still protected by flow, and I have a branch (`discourse-plugin`)
which uses this code to do a full Discourse load.
This implements rate limiting to the Discourse fetch logic, so that we
can actually load nontrivial servers without getting a 429 failure.
We could have used retry; I thought it was more polite to actually limit
the rate at which we make requests. However, to avoid seeing 429s in
practice, I left a bit of a buffer: we make only 55 requests per minute,
although 60 would be allowed.
If we want to improve Discourse loading time, we could boost up to the
full 60 request/min, but add in retries. (Or we could switch to retries
entirely.)
Test plan: This logic is untested, however my full discourse-plugin
branch uses it to do full Discourse loads without issue.
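One way to express this kind of limit is a sliding-window scheduler, sketched below with an injectable clock so that it can be tested without waiting. This is an illustration under assumed names, not the limiter used in the fetcher.

```javascript
// Sketch of a sliding-window rate limiter. `nextSendTime(now)` returns the
// earliest time (in ms) at which the next request may be sent, such that
// any `maxRequests + 1` consecutive sends span at least `windowMs`.
function makeRateLimiter(maxRequests, windowMs) {
  const sendTimes = []; // most recent scheduled send times, oldest first
  return function nextSendTime(now) {
    let sendAt = now;
    if (sendTimes.length >= maxRequests) {
      // The send from `maxRequests` requests ago must be a full window old.
      sendAt = Math.max(now, sendTimes[0] + windowMs);
    }
    sendTimes.push(sendAt);
    if (sendTimes.length > maxRequests) {
      sendTimes.shift();
    }
    return sendAt;
  };
}
```

With the numbers above, a caller would create `makeRateLimiter(55, 60 * 1000)` and, before each fetch, wait `nextSendTime(Date.now()) - Date.now()` milliseconds.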
Summary:
In #1194, we upgraded Prettier from 1.13.4 to 1.18.2, but this upgrades
past <https://github.com/prettier/prettier/pull/5647>, which was first
released in Prettier 1.16.0. This commit fixes the uses of deprecated
code introduced as a result. It also upgrades the type definitions to
match, via `flow-typed install prettier@1.18.2`.
Addresses part of #1308.
Test Plan:
Prior to this commit, running `yarn unit` would print
```
console.warn node_modules/prettier/index.js:7934
{ parser: "babylon" } is deprecated; we now treat it as { parser: "babel" }.
```
in two test cases; it no longer prints any such warnings. Furthermore,
running `git grep 'parser.*babylon'` no longer finds any matches.
wchargin-branch: prettier-deprecations
Summary:
A `PluginDeclaration` must have a `nodePrefix` and an `edgePrefix`, but
the Discourse plugin declaration was missing these. This was not caught
by Flow because `deep-freeze` was introduced in #1249 without type
definitions; see #1308.
Test Plan:
Apply the following patch:
```diff
diff --git a/src/plugins/discourse/declaration.js b/src/plugins/discourse/declaration.js
index 246a0a28..36ae5f13 100644
--- a/src/plugins/discourse/declaration.js
+++ b/src/plugins/discourse/declaration.js
@@ -1,6 +1,6 @@
// @flow
-import deepFreeze from "deep-freeze";
+declare function deepFreeze<T>(x: T): T;
import type {PluginDeclaration} from "../../analysis/pluginDeclaration";
import type {NodeType, EdgeType} from "../../analysis/types";
import {NodeAddress, EdgeAddress} from "../../core/graph";
```
Note that, with this patch, `yarn flow` fails before this change but
passes after it. Running `yarn unit` still passes.
wchargin-branch: discourse-plugin-prefixes
Summary:
All links in SourceCred must use the `Link` component, providing either
an external URL `href={…}` or an internal route `to={…}`. Any uses of a
raw `<a>` element for internal routes will incur 404s when the
application is hosted on a non-root path, as is currently the case on
the production website.
The change to `FileUploader` is not strictly necessary, as the link has
no styled text and uses a `data:` URL, but there’s no reason not to.
Fixes #1304.
Test Plan:
Build the static site:
```
scripts/build_static_site.sh --target cred --project sourcecred/example-github
```
Then run `python3 -m http.server` from the repository root directory—not
the `cred/` subdirectory—and navigate to the timeline cred view:
<http://localhost:8000/cred/timeline/sourcecred/example-github/>
Observe that the “(legacy)” link now has the correct styling and
correctly navigates to the legacy mode page when clicked: prior to this
change, it would navigate to a URL without the proper `/cred/` path
prefix, yielding a 404. On the legacy page, verify that the “timeline
mode” link has the same properties.
Then, visit <http://localhost:8000/cred/test/FileUploader/> and verify
that the inspection test still passes.
Added a regression test to catch further such errors. Note that
reverting the code changes in this commit causes the test to fail, and
that running it with `--verbose` prints the problematic files.
wchargin-branch: fix-bad-routing-404s
Summary:
This is firing on a production page load of the “prototype” link from
the homepage, and does not seem to actually be an error condition.
Test Plan:
Run `yarn start`, navigate to `/timeline/sourcecred/example-github/`,
and observe that the console error has disappeared.
wchargin-branch: defaultloader-console-error
Summary:
When inserting a “like” action with `INSERT OR IGNORE` semantics, we
also learn whether the action had any effect. We can use this bit to
avoid a separate query checking whether the “like” already exists.
As mentioned here:
<https://github.com/sourcecred/sourcecred/pull/1298#discussion_r314994911>
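A minimal sketch of the idea, assuming a `db` object that follows the better-sqlite3 API (where `run` returns an info object with a `changes` count). The table and column names are hypothetical, and the table is assumed to have a uniqueness constraint so that `INSERT OR IGNORE` has something to ignore on.

```javascript
// Sketch only: the table and column names are hypothetical. `db` is
// assumed to follow the better-sqlite3 API, where `run` returns an info
// object whose `changes` field counts the rows the statement modified.
function makeAddLike(db) {
  const stmt = db.prepare(
    "INSERT OR IGNORE INTO likes (username, post_id, timestamp_ms) " +
      "VALUES (?, ?, ?)"
  );
  return function addLike(like) {
    const info = stmt.run(like.username, like.postId, like.timestampMs);
    // 0 changes: the row already existed and the insert was ignored.
    return {changed: info.changes > 0};
  };
}
```

No separate `SELECT` is needed to learn whether the like was new.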
Test Plan:
Running `yarn test` passes as is, and fails if you change `addLike` to
always return either `changed: true` or `changed: false`.
wchargin-branch: discourse-likes-one-query
Summary:
Calling `db.prepare(sql)` parses the text in `sql` and compiles it to a
prepared statement. This takes time, both for the parsing and allocation
itself and for the context switch from JavaScript to C (SQLite).
Prepared statements are designed to be invoked multiple times with
different bound values. This commit factors prepared statement creation
out of loops so that each call to `update` prepares only a constant
number of statements.
In doing so, we naturally factor out some light JS abstractions over the
raw SQL: `addTopic((topic: Topic))`, rather than `addTopicStmt.run(…)`.
In principle, these could be factored out of `update` entirely to
properties set on the class at initialization time, but, as described in
a comment on the GraphQL mirror, we defer this optimization for now as
it introduces additional complexity.
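The shape of the refactor can be sketched as follows. This is an illustration, not the mirror's code: `db` is assumed to follow the better-sqlite3 API, and the `topics` table and its columns are hypothetical.

```javascript
// Sketch only: `db` is assumed to follow the better-sqlite3 API
// (`prepare` compiles a statement, `run` executes it with bound values).
// The `topics` table and its columns are hypothetical.
function storeTopics(db, topics) {
  // Prepared once, outside the loop...
  const addTopicStmt = db.prepare(
    "INSERT OR REPLACE INTO topics (id, title) VALUES (?, ?)"
  );
  const addTopic = (topic) => addTopicStmt.run(topic.id, topic.title);
  // ...and run once per topic with different bound values.
  for (const topic of topics) {
    addTopic(topic);
  }
}
```

However many topics are stored, only one statement is prepared.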
Test Plan:
Running `yarn test --full` passes.
wchargin-branch: discourse-sql-cse
This commit changes the Discourse default weights, most significantly
by moving many weights (e.g. LIKES) that had a backward weight of 0 to
a small positive backward weight instead, like 1/16. In practice, this
mitigates an issue where users with few outbound edges act as "cred
sinks" because the cred gets stuck in a loop between the user and
content they've authored.
Test plan: In local experimentation, I've found the new weights produce
more reasonable-seeming cred attribution.
I've written the Discourse plugin with distinct edge types for post and
topic authorship; it allows us to have more precise control over how
cred flows (and mitigates the need for #968). However, I gave the two
types the same name, which is confusing in the weight config UI. Now
they are properly distinct.
Test plan: It's a simple string change. In (unpublished) commits with a
full Discourse integration, the new strings show up nicely in the UI.
The previous code incorrectly constructed a Discourse post url based on
the post's id, rather than its index within the containing topic. This
is now fixed.
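As a sketch of the distinction, assuming the common Discourse permalink shape `<server>/t/<topic id>/<post index>` (the field names on `post` are hypothetical):

```javascript
// Sketch only: assumes the permalink shape `<server>/t/<topic id>/<index>`,
// where the index is the post's 1-based position within its topic, not
// its global id. The field names are hypothetical.
function postUrl(serverUrl, post) {
  return `${serverUrl}/t/${post.topicId}/${post.indexWithinTopic}`;
}
```

Using `post.id` in the last position would only be right when the id happens to equal the index.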
Test plan: There isn't actually a snapshot diff, because the post with
id 2 is also the second post in its thread. I'm not too worried about
this, though: this kind of code changes infrequently, and it's pretty
obvious when it's wrong.
The Discourse mirror class now keeps an up-to-date record of all of the
likes within an instance. It does this by iterating over every user in
the history, and requesting their likes. If at any point we hit a like
we've already seen, we move on to the next user. In the future, we can
improve this so we only query users we haven't checked in a while, or
users who were recently active.
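The early-exit rule can be sketched like this. It is a simplification: `takeUnseenLikes` is a hypothetical helper, and it assumes each user's likes arrive newest-first.

```javascript
// Sketch only: `takeUnseenLikes` is a hypothetical helper. It assumes the
// server returns a user's likes newest-first, so the first like we have
// already stored means everything after it is stored too.
function takeUnseenLikes(likesNewestFirst, hasLike) {
  const fresh = [];
  for (const like of likesNewestFirst) {
    if (hasLike(like)) {
      break; // caught up with this user; move on to the next one
    }
    fresh.push(like);
  }
  return fresh;
}
```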
Test plan: Tests verify that we correctly store all the likes, including
after partial updates, and that we don't issue unnecessary queries.
This is a minor change to the Discourse mirror so that it supports a
query to get all users from the server. It will be convenient for a
follow-on change which makes `update` search for every user's likes.
I also modified createGraph so that it uses the new method, which
results in code that is cleaner and slightly more efficient.
Test plan: Unit tests updated.
For the Discourse plugin, we really want to be able to add a full record
of all of the users' liked posts as edges in the graph. It's a really
high-signal way to move cred that also gives individual users a lot of
agency and a way to engage.
However: we need an API to get this data. Initial searches of API docs
were unpromising: first, we would need to query potentially every post
to get its likes individually (makes it very expensive to find the likes
on old posts), and second, the likes did not come with timestamp
information. For a while, I thought we were at an impasse.
I then went fishing in the Discourse implementation for a solution (yay
open source!). Lots of the API is undocumented, since it's whatever
they happen to add to run Discourse. And it turns out there's a
`user_actions` API ([source]) which can provide all of a user's actions
in order, and having your content liked by someone else is considered an
action. Best of all, these actions come with timestamps.
The upshot is that instead of querying every post to get its likes, we
can query every user to get likes. Iterating over all users can still
be slow, but it's far better than iterating over all posts; plus we can
implement caching so that we only infrequently check in on inactive
users.
I've added a `likesByUser` method to the Discourse fetch interface that
provides this information. I've also added a snapshot test for it (and
updated all of the snapshots). I also rolled in a slight refactor to
error handling in the fetcher.
The mirror doesn't yet use this information (will come later).
[source]: 82e07cb0f4/app/controllers/user_actions_controller.rb (L3)
Test plan: `yarn test` passes. Snapshots look good.
This commit adds the logic needed for creating a contribution graph
based on the Discourse data. We first have a declaration with
specifications for the node and edge types in the plugin. We also have a
`createGraph` module which creates a conformant graph from the Mirror
data. The graph creation is thoroughly tested.
Test plan: Inspect unit tests, run `yarn test`. I also have (yet
unpublished) code which loads the graph into the UI, and it appears
fine.
This is a quick fixup so that the coming createGraph module can be
properly tested.
Shout out to @Beanow for anticipating this need in a [review comment].
[review comment]: https://github.com/sourcecred/sourcecred/pull/1266#discussion_r314305108
Test plan: trivial refactor, run `yarn test`
The mirror wraps a SQLite database which will store all of the data we
download from Discourse.
On a call to `update`, it downloads new data from the server and stores
it. Then, when it is asked for information like the topics and posts, it
can just pull from its local copy. This means that we don't need to
re-download the content every time we load a Discourse instance, which
makes the load more performant, more robust to network failures, etc.
Thanks to @wchargin, whose work on the GraphQL mirror for GitHub (#622)
inspired this mirror.
Test plan: I've written unit tests that use a mock fetcher to validate
the update logic. I've also used this to do a full load of the real
SourceCred Discourse instance, and to create a corresponding graph
(using subsequent commits).
Progress towards #865.
The `DiscourseFetcher` class abstracts over fetching from the Discourse
API, and post-processing and filtering the result into a form that's
convenient for us.
Testing is a bit tricky because the Discourse API keys are sensitive
(they are admin keys) and so I'm reluctant to commit them, even for our
test instance. As a workaround, I've added a shell script which
downloads some data from the SourceCred test instance, and saves it with
a filename which is an encoding of the actual endpoint. Then, in
testing, we can use a mocked fetch which actually hits the snapshots
directory, and thus validate the processing logic on "real" data from
the server. We also test that the fetch headers are set correctly, and
that we handle non-200 error codes appropriately.
Test plan: In addition to the included tests, I have an end-to-end test
which actually uses this fetcher to fully populate the mirror and then
generate a valid SourceCred graph.
This builds on API investigations
[here](https://github.com/sourcecred/sourcecred/issues/865#issuecomment-478026449),
and is general progress towards #865. Thanks to @erlend-sh, without whom
we wouldn't have a test instance.
Summary:
In ES6, the [`try` statement grammar][1] requires a catch parameter; the
parameter is only optional in the latest draft of ECMAScript, which is
of course not yet ratified as any actual standard.
Even though we don’t officially pledge to support Node 8, this is
currently the only breakage, and it’s easy enough to fix.
[1]: https://www.ecma-international.org/ecma-262/6.0/#sec-try-statement
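For example, a form that parses on Node 8 keeps the binding even when the error is unused (the function name is just for illustration):

```javascript
// Parses on Node 8: the catch clause binds a parameter even though the
// error is unused. Writing `catch {` without a binding is a syntax
// error on that runtime.
function parseJsonOrNull(text) {
  try {
    return JSON.parse(text);
  } catch (_unused) {
    return null;
  }
}
```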
Test Plan:
Running `yarn start` on Node v8.11.4 no longer raises a syntax error.
wchargin-branch: catch-parameter
Summary:
Introduced in #1277.
Test Plan:
Run `yarn start` and visit <http://localhost:8080/test/FileUploader/>.
Conduct the test plan as specified on that page.
wchargin-branch: fileuploader-target
Summary:
Backticks are discouraged relative to the `$(…)` form for command
substitution, because they are harder to read and do not nest without
exponential escaping:
```shell
$ foo=$(echo $(echo hi $(echo bye))) # clear
$ foo=`echo \`echo hi \\\`echo bye\\\`\`` # hmm
```
In this context, command substitution should not be used at all; `PWD`
is a special variable that always contains the current working
directory. The new version of the code will be correct even if the
current working directory ends with whitespace that would be stripped
off by the command substitution.
Test Plan:
Prepended an `echo` to the relevant line, and verified that the script
has the same output before and after this change.
wchargin-branch: fix-backticks
The code is mostly ported from the legacy app. However, we no longer
assume that we are showing every type for every plugin. Instead, the
types are manually selected. For now, we permit the GitHub user type,
and the GitHub repo type, as these are the two types that are included
in filtered timeline cred.
Test plan: Manual inspection is necessary, since this frontend is mostly
untested. I've done that inspection. Also, `yarn test` passes.
Minor change to the API for MapUtil.pushValue. Now it returns the
resultant array. I've found this convenient in at least one case, and
previously we weren't returning anything, so it's a cheap change.
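A minimal sketch of the new behavior (an illustration of the described API, not a copy of the module):

```javascript
// Sketch only: appends `value` to the array stored at `key`, creating the
// array if needed, and returns that array.
function pushValue(map, key, value) {
  let arr = map.get(key);
  if (arr == null) {
    arr = [];
    map.set(key, arr);
  }
  arr.push(value);
  return arr;
}
```

Returning the array lets a caller inspect or keep using it without a second `map.get`.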
Test plan: Unit test added.
I moved sourcecred/example-git{,hub} to the @sourcecred-test org.
This commit fixes the build given that move.
I've realized that in #1233 I inadvertently made some Git tests that
depend on a snapshot impossible to update. I'm going to compound that slight
technical debt by skipping the tests that depended on that snapshot. I
recognize and accept that I'll need to pay this down when I resuscitate
the git plugin.
Test plan: `yarn test --full`.
This commit removes the `pagerank` and `analyze` commands (both of which
never saw real usage), removes the outdated adapter-based `loadGraph`
method, and removes all traces of the analysis adapters.
It builds on work in #1233 and #1136.
Test plan: `yarn test --full` passes.
This commit swaps usage over to the new implementation of `cli/load`
(the one that wraps `api/load`) and makes changes throughout the project
to accommodate that we now track instances by Project rather than by
RepoId.
Test plan: Unit tests updated; run `yarn test --full`. Also, for safety:
actually load a project (by whole org, why not) and verify that the
frontend still works.
I'm re-organizing SC data to be oriented on the graph, rather than on
plugin-specific data structures. So there is no longer a need for the
git loading logic which orients around saving a repository.json file
that's been potentially merged across repos, or indeed the logic for
merging repositories at all. So I'm removing `git/loadGitData`,
`git/mergeRepository`, and relatives.
Test plan: `yarn test --full` passes.
There's no need for us to depend on `mkdirp`, because the `fs-extra`
module already has `fs.mkdirp` and `fs.mkdirpSync`. This commit removes
the dep from our `package.json`, and removes all explicit imports of it.
Test plan: `yarn test --full` passes. `git grep "import mkdirp"` has no
hits.
This adds a new module, `api/load`, which implements the logic that will
underlie the new `sourcecred load` command. The `api` package is a new
folder that will contain the logic that powers the CLI (but will be
callable directly as we improve SourceCred). As a heuristic, nontrivial
logic in `cli/` should be factored out to `api/`.
In the future, we will likely want to refactor these APIs to
make them more atomic/composable. `api/load` does "all the things" in
terms of loading data, computing cred, and writing it to disk. I'm going
with the simplest approach here (mirroring existing functionality) so
that we can merge #1233 and realize its many benefits more easily.
This work is factored out of #1233. Thanks to @Beanow for [review]
of the module, which resulted in several changes (e.g. organizing it
under api/, having the TaskReporter be dependency injected).
[review]: https://github.com/sourcecred/sourcecred/pull/1233#pullrequestreview-263633643
Test plan: `api/load` is tested (via mocking unit tests). Run `yarn test`
This commit refactors the `util/taskReporter` module so that
`TaskReporter` is an interface; the class previously called
`TaskReporter` is renamed to `LoggingTaskReporter`. We also export a
`TestTaskReporter` which implements the interface, and is very easy to
test.
The motivation: This will make it much easier to write tested code that
uses a `TaskReporter`, as now the test code can provide a
`TestTaskReporter` and check that all tasks get finished, that task
ids are as expected, etc.
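To give a sense of how this helps testing, here is a rough sketch. The method names `start`, `finish`, and `activeTasks` are assumptions for illustration and may not match the module's interface.

```javascript
// Sketch only: the method names `start`, `finish`, and `activeTasks` are
// assumptions for illustration, not necessarily the module's interface.
class TestTaskReporter {
  constructor() {
    this.entries = []; // every start/finish call, in order
    this.active = new Set();
  }
  start(taskId) {
    if (this.active.has(taskId)) {
      throw new Error(`task ${taskId} already active`);
    }
    this.active.add(taskId);
    this.entries.push({type: "START", taskId});
  }
  finish(taskId) {
    if (!this.active.has(taskId)) {
      throw new Error(`task ${taskId} not active`);
    }
    this.active.delete(taskId);
    this.entries.push({type: "FINISH", taskId});
  }
  activeTasks() {
    return Array.from(this.active);
  }
}

// Code under test receives the reporter as a dependency.
function loadEverything(reporter) {
  reporter.start("load");
  reporter.finish("load");
}
```

A test can then pass in a `TestTaskReporter` and assert on `entries` and `activeTasks()` afterwards.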
Test plan: The `TestTaskReporter` is tested. Run `yarn test`.
Throughout the codebase, we freeze objects when we want to ensure that
their properties are never altered -- e.g. because they are a plugin
declaration, or are being re-used for various test cases.
We generally use `Object.freeze`. This has the disadvantage that it does
not work recursively, so a frozen object's mutable fields and properties
can still be mutated. (E.g. if `const obj = Object.freeze({foo: []})`,
then `obj.foo.push(1)` will succeed in mutating the 'frozen' object).
Sometimes we anticipate this and explicitly freeze the sub-fields (which
is tedious); sometimes we forget (which invites errors). This change
simply replaces all instances of Object.freeze with [deep-freeze], so we
don't need to worry about the issue at all anymore.
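To make the difference concrete, here is a small sketch. The recursive helper is hand-rolled for illustration and merely stands in for the deep-freeze package.

```javascript
// Sketch only: a hand-rolled recursive freeze standing in for the
// deep-freeze package, to show the behavioral difference.
function deepFreezeSketch(obj) {
  Object.freeze(obj);
  for (const key of Object.getOwnPropertyNames(obj)) {
    const child = obj[key];
    const isObject = child !== null && typeof child === "object";
    // Skipping already-frozen children also guards against cycles.
    if (isObject && !Object.isFrozen(child)) {
      deepFreezeSketch(child);
    }
  }
  return obj;
}

const shallow = Object.freeze({foo: []});
shallow.foo.push(1); // succeeds: the inner array is not frozen

const deep = deepFreezeSketch({foo: []});
// deep.foo.push(1) would now throw a TypeError.
```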
Test plan: `yarn test` passes (after updating snapshots);
`git grep Object.freeze` returns no hits.
[deep-freeze]: https://www.npmjs.com/package/deep-freeze
This is a replacement for `github/loadGithubData` which returns a
combined Graph rather than a combined RelationalView. This provides a
major benefit, which is that we can use the (robust) Graph merge logic
rather than the (buggy) relational view merge.
Test plan: This function is untested. It basically pipelines a few APIs
together. I think that flow is basically sufficient to validate that it
works, and writing a unit test will be frustrating (mostly will involve
re-integrating the functionality via mocks). A future commit makes this
part of the pipeline that generates snapshot tests, so it is de facto
integration tested.
This module builds on the project logic added in #1238, and makes it
easy to create projects based on a simple string configuration.
Basically, the spec `foo/bar` creates a project containing just the repo
foo/bar, and the spec `@foo` creates a project containing all of the
repos from the user/organization named foo.
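As a rough sketch of the two spec forms: the returned shape below is hypothetical, and the real module must additionally ask GitHub for the repos belonging to a user or organization.

```javascript
// Sketch only: the returned shape is hypothetical, and the accepted
// character sets are a simplification of GitHub's naming rules.
function parseSpec(spec) {
  const repo = /^([A-Za-z0-9-]+)\/([A-Za-z0-9_.-]+)$/.exec(spec);
  if (repo != null) {
    return {type: "REPO", owner: repo[1], name: repo[2]};
  }
  const owner = /^@([A-Za-z0-9-]+)$/.exec(spec);
  if (owner != null) {
    return {type: "OWNER", owner: owner[1]};
  }
  throw new Error(`invalid spec: ${JSON.stringify(spec)}`);
}
```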
This is pulled out of #1233, but I've enhanced it to support
organizations out of the box.
The method is thoroughly tested.
Test plan: `yarn test --full` still passes. Also, I've ensured that the
async `_getProjectIds` is still usable in our webpack configs (via
modifying and testing the dependent commits).
This creates a new `Project` type which will replace `RepoId` as the
index type for saving and loading data.
The basic data type is added to `project.js`. Rather than having a
`RepoIdRegistry`, I intend to infer the registry at build time by
scanning for available projects saved in the sourcecred directory. I've
added the `project_io` module for this task. It has methods for setting
up a project subdirectory, and loading the `Project` info from that
subdirectory.
To ensure that project ids can be encoded even if they have symbols
like `/` and `@`, we base64 encode them.
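A sketch of the encoding step, using Node's built-in `"base64url"` Buffer encoding. Choosing the URL-safe variant is an assumption on my part here: plain base64 output can itself contain `/`, which would be awkward in a directory name. The function names are hypothetical.

```javascript
// Sketch only: uses Node's built-in "base64url" encoding (an assumption;
// plain base64 can itself emit "/", which is unsafe in a directory name).
function encodeProjectId(id) {
  return Buffer.from(id, "utf8").toString("base64url");
}

function decodeProjectId(encoded) {
  return Buffer.from(encoded, "base64url").toString("utf8");
}
```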
To ensure that project ids can be retrieved at build time, the
`getProjectIds` method is factored out into its own plain ECMAScript
module. For all non-build time needs, it is re-exported from
`project_io`.
Test plan: Unit tests added; run `yarn test`.
See #1243 for context. This is basically a more aggressive version of
pull #1230 -- instead of just running unit tests in isolation, we also
run flow in isolation, and kill the servers afterwards.
Test plan: See how this fares in CI :)
Includes a change to `cli/load` and `build_static_site.sh` to accept a `--weights WEIGHTS_FILE` argument.
This allows overriding the default weights at build-time using a `weights.json` that has the same format as previously generated in the frontend.
Test plan:
Adds a test case for propagating the optional parameter.
The file I/O of loading and parsing a weights.json file was tested manually, since analysis/weights' fromJSON() is tested elsewhere, as is passing weight parameters.
Prior to #1136, we needed an `ExplorerAdapter` abstraction to get node
description data to the frontend. Now that it's included in the graph,
we can throw away this abstraction, which is a big step towards plugin
simplification (work towards #1120).
Since it only affects a deprecated/legacy part of the code base, I
didn't put much effort into making the result super clean. I also
removed a few tests that became inconvenient.
Test plan: Verified that the legacy frontend still works. There's one
tiny regression, which is that the link color in the legacy frontend no
longer matches the rest of the UI, but that's actually consistent with
the timeline frontend, so no biggie.
`yarn test` passes.
The scores are lightly processed from their internal representation.
Example usage:
```
$ yarn backend;
$ node bin/sourcecred.js load sourcecred/sourcecred
$ node bin/sourcecred.js scores sourcecred/sourcecred > scores.json
```
The data structure is as follows:
```js
export type NodeOutput = {|
+id: string,
+totalCred: number,
+intervalCred: $ReadOnlyArray<number>,
|};
export type ScoreOutput = Compatible<{|
+users: $ReadOnlyArray<NodeOutput>,
+intervals: $ReadOnlyArray<Interval>,
|}>;
```
Test plan: I added sharness tests at `sharness/test_cli_scores.t`.
In the past, we've used javascript tests for CLI commands. However,
those are pretty time-consuming to write, and are less robust than
simply running the command from bash. Check the snapshot for a sense of
what the new data format looks like. Also, the snapshot updater now
updates this snapshot too.
Relevant for #1047.
Thanks to @Beanow for feedback on the output format and design.
Thanks to @wchargin for help in code review.
As of this commit, the main SourceCred prototypes page now links to
timeline cred, meaning that timeline cred is now live. I've added a
link from the legacy explorer to the timeline explorer (which already
has a link out to the legacy explorer).
Test plan: Careful inspection of the frontend by the committer.
Also, yarn test.
This is a bulk rename of all the old explorer code into
`explorer/legacy`. Now that the timeline explorer exists, I intend to
prioritize development on that going forwards. Once the timeline
explorer is as good as the old explorer at decomposing a node's sources
of cred, I will remove the legacy explorer entirely.
Test plan: `yarn test`
This commit adds a TimelineExplorer for visualizing timeline cred data.
The centerpiece is the TimelineCredChart, a d3-based line chart showing
how the top users' cred evolved over time. It has features like tooltips,
reasonable ticks on the x axis, a legend, and filtering out line
segments that stay on the x axis.
An inspection test is included, which you can check out here:
http://localhost:8080/test/TimelineCredView/
Also, you can run it for any loaded repository at:
http://localhost:8080/timeline/$repoOwner/$repoName
This commit also includes new dependencies:
- recharts (for the charts)
- react-markdown (for rendering the Markdown descriptions)
- remove-markdown (so the legend will be clean text)
- d3-time-format for date axis generation
- d3-scale and d3-scale-chromatic for color scales
Test plan: The frontend code is mostly untested, in keeping with my
observation that the costs of testing the old explorer were really high,
and the tests brought little benefit. However, I have manually tested it
thoroughly. Also, there is an inspection test for the TimelineCredView
(see above).
It's very simple: a method that creates a copy of a `Weights`.
While writing this, I realized I should probably refactor the weights
module so that it exports a class rather than a bunch of methods
operating on a data structure. It would just be a cleaner API. But I'm
leaving that for another day.
Test plan: Unit tests added.
As described in #987, we use a single TTL across GitHub types. Right
now, the TTL is set to 7 days. This means that it's possible to run
`sourcecred load`, but still be missing the last 7 days worth of issues.
Now that we're doing timeline cred (cf #1212), this is not acceptable.
As a workaround until we fix #987, I'm decreasing the TTL to 12 hours.
That's still long enough to make a good experience for someone who is
tweaking config and calling `sourcecred load` a lot, but ensures that
freshly-loaded results still have recent activity.
Test plan: `yarn test`
This adds a TimelineCred class which serves several functions:
- acts as a view on timeline cred data
- (lets you get highest scoring nodes, etc)
- has an interface for computing timeline cred
- lets you serialize cred along with the graph and parameter settings
that generated it in a single object
One upshot of this design is that now if we let the user provide weights
(or other config) on load time in the CLI, those weights will get
carried over to the frontend, since they are included along with the
cred results.
TimelineCred has 'Parameters' and 'Config'. The parameters are
user-specified and may change within a given instance. The config is
essentially codebase-level configuration around what types are used for
scoring, etc; I don't expect users to be changing this. To keep the
analysis module decoupled from the plugins module, I put a default
config in `src/plugins/defaultCredConfig`; I expect all users of
TimelineCred to use this config. (At least for a while!)
Test plan: I've added some tests to `TimelineCred`. Run `yarn test`. I
also have a yet-unmerged branch that builds a functioning cred display
UI using the `TimelineCred` class.
This adds the `filterTimelineCred` module, which dramatically reduces
the size of timeline cred by throwing away all nodes that are not a user
or repository. It also supports serialization / deserialization.
Test plan: unit tests included
This module takes the timeline distributions created by
`timelinePagerank`, and re-normalizes the scores into cred. For details
on the algorithm, read comments and docstrings in the module.
Test plan: Unit tests added.
As the name would suggest, this module allows computing timeline
PageRank on a graph. See documentation in the module for details on the
algorithm.
Test plan: The module has incomplete testing. The timelinePagerank
function calls out to iterators for getting the time-decayed node
weights and the time-decayed markov chain; these iterators are tested.
However, the wrapper logic that composes these pieces together,
calculates seed vectors, and runs PageRank is not tested. I felt it
would be a pain to test and settled for reviewing the code carefully,
and putting a cautionary note at the top of the function.
This commit adds new weight evaluators for nodes and edges. Unlike the
previous evaluator, edges and nodes are handled as separate concerns,
rather than composing the node weights into the edge weights. I think
this separation is cleaner.
Both evaluators use only the address, not the full (Node or Edge)
object. However, we may want to give the edge evaluator access to the
full Edge later, if we decide we want node-type-differentiated edge
weights (e.g. if a hasParent edge has a different weight depending on
whether it is connected to an Issue or a Repository).
weightsToEdgeEvaluator has been refactored to use the new evaluators,
and has been given a deprecation notice.
Test plan: `yarn test`
This commit adds an `interval` module which defines intervals (time
ranges), and methods for slicing up a graph into its constituent time
intervals. This is prerequisite work for #862.
I've added a dep on d3-array.
Test plan: Unit tests added; run `yarn test`
This commit disables the Git plugin by removing it from the default list
of plugins to load, or to display in the frontend.
Rationale: The git plugin doesn't currently add very much to cred
quality. Git commits have edges to their parent, which isn't a very
meaningful relationship for cred purposes. We'll want to re-enable the
Git plugin once we're ready to support e.g. file and directory level
cred tracking.
I've skipped a block of tests around the git analysisAdapter. (I intend
to deprecate the analysisAdapters, so skipping the tests seemed
preferable to updating them.) I also updated our sharness test for
catching test files without a proper describe block, so that it won't
error on skipped blocks.
Test plan: `yarn test --full` passes. Loading a new repository and
inspecting it in the frontend gives consistent results. There are no
references to Git plugin weights in the frontend, now that corresponding
nodes are not available.
The dependabot bot has an inconsistent typename in GitHub's database.
We'll blacklist it so we can continue loading `sourcecred/sourcecred`.
Test plan: `node bin/sourcecred.js load sourcecred/sourcecred` fails
before this commit, and succeeds after.
This commit resolves an inconsistency where we called edge weights
`toWeight` and `froWeight` in the core/attribution module, but
`forwards` and `backwards` in the analysis module.
I changed field names in the PagerankGraph JSON, so I bumped its compat.
Test plan: `yarn test --full` passes.
This commit just adds a test which verifies that when an
OrderedSparseMarkovChain is created by graphToMarkovChain, its nodeOrder
is the graph's canonical node order.
Test plan: `yarn test`
This means that we no longer need to expose methods for extracting the
order from serialized JSON. We can always count on iterating over the
nodes and edges in sorted order.
Test plan: `yarn test`; tests updated.
This commit updates eslint from v4 to v6. In doing so, I've moved off of
the create-react-app base eslint config. We were on an old version (v2)
and it doesn't make sense to update to v4, as in v4 create-react-app
uses typescript. Also, it didn't make sense to stay on
create-react-app's v2 config, because then it had unmet peer dependency
constraints on old versions of eslint.
Instead, I've moved us to use the default rules for eslint,
eslint-plugin-react, and eslint-plugin-flowtype.
I also made some changes to the codebase to satisfy the new lint rules
that came with this change.
Test plan: `yarn test` passes.
This necessitated a number of type fixes:
- Upgraded the express flow-typed file to latest
- Added manual flow error suppression to where the express flow-typed
file is still using a deprecated utility type
- Removed type polymorphism support on map.merge (see context here[1]).
We weren't using the polymorphism anywhere so I figured it was simplest
to just remove it.
- Improved typing around jest mocks throughout the codebase.
Test plan: `yarn test --full` passes.
[1]: https://github.com/flow-typed/flow-typed/issues/2991
This commit updates our prettier version from `1.13` to `1.18`. Looks
like software does get better over time! I like all of the changes.
Test plan: `yarn test` passes. I've manually inspected the diffs.
When we took a dep on better-sqlite3 in #836, we used a fork, because
better-sqlite3 did not yet support private in-memory databases via the
`:memory:` filepath. As of better-sqlite3 v5, this has been added to
mainline, so we no longer need the fork.
The v4->v5 transition involves some breaking changes. The only ones that
affected us were two field renames, from `lastInsertROWID` to
`lastInsertRowid`, and `returnsData` to `reader`.
Test plan:
After updating the field accesses, `yarn test --full` passes. For added
safety, I also blew away cache, loaded a nontrivial repository, and
verified that the full cred workflow still works.
cc @wchargin
This updates the graph `Node` type to include a string description.
The description should be a brief (ideally one-line) string giving
context on what the node is. All planned frontends will support
markdown, so linking to context (e.g. linking to the issue corresponding
to an ISSUE type node) is supported.
This commit updates the Git and GitHub plugins to use the new
description field.
Test plan: `yarn test --full` passes, and I've inspected snapshots and
made sure they look reasonable.
The GitHub plugin no longer adds a Node to the graph for Git commits.
Instead, it creates a dangling edge to it. This frees the GitHub plugin
from responsibility for setting the timestamp or other metadata for Git
nodes.
The Git plugin no longer adds a Commit Node to the graph immediately when
encountering a commit's parent hash. Instead, it creates an edge to the
parent, and then fills in the parent node once it is encountered in the
commit store.
Test plan: Load a real repository with merged pull requests
(e.g. sourcecred/research) into the explorer, and verify that GitHub
commit entities are still connected to Git commits, and that Git
commits are still connected to their parents.
This commit modifies the Graph class so that it permits dangling edges;
that is to say, edges whose src or dst are not present in the graph.
Dangling edges may be directly added to the graph, or existing edges may
become dangling if their src or dst is removed.
This change is prerequisite to #1136; if we require that nodes have
metadata, we should also make it possible to add edges to nodes that
don't yet exist, as the plugin creating an edge may not have access to
the full metadata needed to add the node.
To support this change, there is now an `isDanglingEdge` method on the
graph, which reports whether or not the edge is dangling. Also,
`Graph.edges` requires that the client make an explicit choice on
whether dangling edges are desired. This ensures that we do not
accidentally include dangling edges in a case where they are
inappropriate (e.g. creating a Markov chain) or accidentally discard
dangling edges when they are needed (e.g. when merging or serializing).
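To illustrate the semantics described above, here is a minimal sketch in plain JavaScript. The `SketchGraph` class, its string addresses, and the `showDangling` option name are assumptions made for this example; they are not taken from the real `Graph` implementation.

```javascript
// Minimal sketch of a graph that permits dangling edges.
// SketchGraph, its string addresses, and the `showDangling` option
// name are hypothetical; they are not the real Graph class.
class SketchGraph {
  constructor() {
    this._nodes = new Set(); // addresses of nodes present in the graph
    this._edges = new Map(); // edge address -> {address, src, dst}
  }
  addNode(address) {
    this._nodes.add(address);
    return this;
  }
  removeNode(address) {
    // Incident edges are kept; they simply become dangling.
    this._nodes.delete(address);
    return this;
  }
  addEdge(edge) {
    // No check that src and dst exist: dangling edges are allowed.
    this._edges.set(edge.address, edge);
    return this;
  }
  isDanglingEdge(edgeAddress) {
    const edge = this._edges.get(edgeAddress);
    if (edge === undefined) {
      return undefined; // no such edge
    }
    return !this._nodes.has(edge.src) || !this._nodes.has(edge.dst);
  }
  edges(options) {
    // The caller must choose explicitly; there is no default.
    if (options == null || typeof options.showDangling !== "boolean") {
      throw new Error("edges: showDangling must be specified");
    }
    const all = Array.from(this._edges.values());
    return options.showDangling
      ? all
      : all.filter((e) => !this.isDanglingEdge(e.address));
  }
}

const g = new SketchGraph()
  .addNode("a")
  .addNode("b")
  .addEdge({address: "a->b", src: "a", dst: "b"})
  .addEdge({address: "a->c", src: "a", dst: "c"}); // "c" is absent
```

In this sketch, a Markov chain builder would call `g.edges({showDangling: false})`, while a serializer would call `g.edges({showDangling: true})`.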
The Graph's invariant checker has been updated to reflect the new
semantics.
The Graph compat version has been bumped, since this is a break in
backwards compatibility.
Note that this commit does not change the behavior of any plugins; that
is to say, no plugins create dangling edges (yet).
Test plan: The advanced graph test case has been updated to include
dangling edges. The tests for Graph, PagerankGraph, and
GraphToMarkovChain have been updated. `yarn test --full` passes.
This commit modifies the base `Graph` class so that nodes are now
represented by `Node` objects rather than `NodeAddressT`. The intention
is to start adding additional fields (e.g. description and timestamp) to
nodes, although that is not included in this commit.
See #1136 for rationale.
Test plan: The graph is very well tested, and this commit adds
additional tests and invariant checking. Some additional test code
needed updating. `yarn test --full` passes, and the SourceCred UI works as
expected.
PagerankGraph's `node` and `edge` getters returned null for unavailable
entries, rather than undefined. This is inconsistent with general JS
behavior, and with the base Graph. I've now cleaned it up.
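As a small illustration of the convention, `Map.prototype.get` yields `undefined` for a missing key, so a getter that forwards to a map gets the consistent behavior for free. The `SketchStore` class below is hypothetical and is not the real `PagerankGraph`.

```javascript
// Sketch: return undefined, not null, for missing entries, matching
// Map.prototype.get. SketchStore is hypothetical, not PagerankGraph.
class SketchStore {
  constructor() {
    this._nodes = new Map(); // address -> {address, score}
  }
  addNode(address, score) {
    this._nodes.set(address, {address, score});
    return this;
  }
  node(address) {
    // Map.prototype.get already yields undefined when the key is absent.
    return this._nodes.get(address);
  }
}

const store = new SketchStore().addNode("a", 0.5);
const missing = store.node("nope");
```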
Test plan: unit tests updated; `yarn test` passes.
Every GitHub entity now has a `description` method which returns a short
markdown description. These will be added to the graph as part of #1136.
Test plan: Inspected snapshots, `yarn test --full`.
This will allow timeline cred (#862) to do a better job of flowing cred
across reaction edges. (Very old reactions should not be moving a lot of
present-day cred.)
Test plan: Inspected snapshot changes.
Every GitHub entity from `RelationalView` now has a `timestampMs`
method. This replaces the standalone `createdAt` method.
Test plan: Snapshots look good.
It's an extension of #1152 induced by #1175.
It's a very simple change; I just changed the schema, and
`scripts/update_snapshots.sh` took care of the rest.
Test plan: Inspected snapshots and generated flow types.
This pulls distribution related code out of `markovChain.js` into the new
`distribution.js` module, and from `graphToMarkovChain.js` into
`nodeDistribution.js`.
Since the `computeDelta` method is now exported, I've added some unit
tests.
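For readers unfamiliar with the helper, one plausible reading of `computeDelta` is the largest absolute per-entry difference between two distributions, a common convergence measure for iterative PageRank. The sketch below assumes that definition and a `Float64Array` representation; neither is confirmed by this commit message, and the function is named `computeDeltaSketch` to make that clear.

```javascript
// Sketch: maximum absolute per-entry difference between two
// distributions. The definition and the Float64Array representation
// are assumptions, not the module's confirmed behavior.
function computeDeltaSketch(pi0, pi1) {
  if (pi0.length !== pi1.length) {
    throw new Error("distributions must have the same length");
  }
  let maxDelta = 0;
  for (let i = 0; i < pi0.length; i++) {
    maxDelta = Math.max(maxDelta, Math.abs(pi0[i] - pi1[i]));
  }
  return maxDelta;
}

const before = new Float64Array([0.5, 0.25, 0.25]);
const after = new Float64Array([0.25, 0.5, 0.25]);
const delta = computeDeltaSketch(before, after); // 0.25
```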
Test plan: `yarn test` passes.
As #1136 will be moving timestamps into the graph, we no longer need the
`createdAt` method in the `AnalysisAdapter`. Actually, we no longer need
the adapter/loader distinction introduced in #1157. I haven't taken the
time to remove the `BackendAdapterLoader` concept because a) we may need
it later, and b) if we don't, I'll likely remove the `AnalysisAdapter`
concept entirely, in favor of having plugins directly save `graph.json`
files to a known location.
Test plan: `yarn test` passes.
I added `mentionsAuthorReference` edges based on an untested hypothesis
that they would be useful. Since then, I've seen no evidence that they
actually improve cred scores (their impact seems negligible), and they
add complexity.
In the future, "go-fishing" style heuristics like this should not merge
unless they are of clearly demonstrated value. Also, it would be better
to add stuff like this via a standalone plugin rather than in the core
GitHub logic.
Undoes #806.
Test plan: `yarn test`
Now that the graph is saved by default as a part of load, users who need
the graph can grab it directly from the `$SOURCECRED_DIRECTORY`. If we
really need a command line util for grabbing it, we should rewrite that
command to just grab the graph from that spot rather than re-computing
it.
Test plan: `yarn test`
As of the timeline cred work, I'm shifting emphasis away from raw
PageRank results, in favor of timeline pagerank results. As such,
there's no need to have load save the regular pagerank results on
creation.
As of #1136, there will be no need for timestampMap, as that data will
be present directly in the graph. As the timeline cred UI will depend on
the full graph for analysis, let's save the graph instead.
Test plan: `yarn test` and snapshot inspection.
This commit adds new helper methods for creating test nodes (`node` and
`partsNode`) and for creating test edges (`edge` and `partsEdge`) to
graphTestUtil.js.
This is very helpful in light of work related to #1136. I'm going to
change the concept of "node" from a raw address to an object, add fields
to that object, and add fields to the `Edge` type. If done naively, we
would need to change all the test code across the project for every one
of those changes.
By centralizing the creation of test nodes and edges behind the new
functions, we can update all the test code in a single place.
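A rough sketch of how such helpers centralize construction follows. The four function names come from this commit, but the object shapes and the "/"-joined string addresses are hypothetical stand-ins for the project's real address types.

```javascript
// Sketch of centralized test helpers. The object shapes and the
// "/"-joined string addresses are hypothetical stand-ins for the
// project's real node and edge address types.
function partsNode(parts) {
  // If nodes gain fields later, only this function needs to change.
  return {address: parts.join("/")};
}
function node(name) {
  return partsNode([name]);
}
function partsEdge(parts, src, dst) {
  // Likewise, new edge fields would be added here only.
  return {address: parts.join("/"), src: src.address, dst: dst.address};
}
function edge(name, src, dst) {
  return partsEdge([name], src, dst);
}

const src = node("src");
const dst = partsNode(["plugin", "dst"]);
const e = edge("depends", src, dst);
```

Test code that only ever calls `node` and `edge` does not need to change when the underlying representation does.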
This change is trivial from a conceptual perspective, and very
broad-reaching from a code-touching perspective. It should be easy to
review, because if tests pass then the change is probably working as
intended. :)
Test plan: `yarn test`
In #1132 and #1134, I started work on the Odyssey plugin. However,
before getting it to a state where it's usefully included in SourceCred,
I decided to pivot to focus on timeline cred first.
Now I'm merging significant refactors as a part of timeline cred
(#1136). As a side effect of this refactor, the Odyssey plugin should
undergo significant changes (OdysseyInstance is now basically redundant
with base Graph). Rather than incrementally update unused code, I elect
to remove the plugin. This code should be revived on a side branch, and
then merged into master once we have a fully functioning prototype.
Test plan: `yarn test` passes.
This commit refactors the Graph class so that rather than having
separate maps for inEdges and outEdges, there is a single incidentEdges
map, which contains objects with inEdges and outEdges.
This is motivated by a forthcoming big change as part of #1136; namely, to
allow storing dangling edges in the graph. Once we do so, we'll need a
consistent source of truth that enumerates all of the node addresses
which are accessible in the graph (either because they correspond to a
node in the graph, or because they are the src or dst of a dangling
edge). We could do this by adding another field to graph which tracks
this set, but by making this refactor, we can instead use the key set of
_incidentEdges as the source of truth for which node addresses are
present.
Besides being motivated by #1136, I think it's cleaner in general. Note
there are fewer ways for the graph to be inconsistent, as it's no longer
possible for inEdges and outEdges to have inconsistent sets of node
addresses.
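The shape of the refactored data structure can be sketched as follows. The `IncidenceSketch` class and its method names are hypothetical; the sketch also lets an edge create entries for absent endpoints, which anticipates the dangling-edge motivation rather than reflecting this commit's behavior.

```javascript
// Sketch: one map from node address to {inEdges, outEdges}, instead of
// two parallel maps. Class and method names are hypothetical.
class IncidenceSketch {
  constructor() {
    // address -> {inEdges: Edge[], outEdges: Edge[]}
    this._incidentEdges = new Map();
  }
  _touch(address) {
    let entry = this._incidentEdges.get(address);
    if (entry === undefined) {
      entry = {inEdges: [], outEdges: []};
      this._incidentEdges.set(address, entry);
    }
    return entry;
  }
  addNode(address) {
    this._touch(address);
    return this;
  }
  addEdge(edge) {
    // Both endpoints get an entry, so the two lists cannot disagree
    // about which addresses exist.
    this._touch(edge.src).outEdges.push(edge);
    this._touch(edge.dst).inEdges.push(edge);
    return this;
  }
  // The key set is the single source of truth for known addresses.
  knownAddresses() {
    return Array.from(this._incidentEdges.keys());
  }
}

const incidence = new IncidenceSketch()
  .addNode("a")
  .addEdge({address: "a->b", src: "a", dst: "b"});
```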
The most complicated piece of this change was updating the automatic
invariant checker. It was no longer possible to test 3.1 and 4.1
separately, so they needed to be merged into a new invariant. Rather
than re-enumerate the invariants, I called the new one the 'Temporary
Invariant', because it is going to disappear in a subsequent commit.
Test plan: `yarn test` passes. Since Graph has extremely thorough
testing, this gives me great confidence in this commit. Note that no
observable behavior has changed.
At present, the Git commit node type lives in a strange state of shared
responsibility between GitHub and Git. The Git plugin is nominally
responsible for it, but its render method tries to show a hyperlink to
GitHub -- which is awkward for many reasons, including that the same Git
commit could have multiple hyperlinks on GitHub.
This commit resolves that issue by separating the existing commit type
into two: the Git Commit type, which is owned by the Git plugin and
doesn't have hyperlinks or any fancy GitHub metadata, and the GitHub
Commit, which is owned by the GitHub plugin, corresponds to a unique
database id in GitHub, and has a corresponding GitHub url.
The two commits are connected by a CorrespondsToCommit edge type, which
links from the GitHub commit to the corresponding Git commit.
This is necessary for #1136: if we want to make descriptions a part
of the graph payload, we need descriptions to be unique for a given
address--and descriptions are only unique if we identify each GitHub
commit pointer as a separate address.
Test plan: The unit testing in this part of the codebase is light, so I
verified that the frontend works as expected for `sourcecred/sourcecred`
and `sourcecred/research`. The new node type and edge type appear
properly in the UI, the GitHub commits are connected to their Git
counterparts, etc.
A long time ago, we made graph views for git and github. These are
interfaces over the graph which allow retrieving nodes' relations, e.g.
finding the parent address of a commit just using the graph.
These are fairly complex, and have seen almost no use at all. The one
thing they are used for is implementing invariant checking. The
invariant checking is nice in principle, but since we only apply it to
the example data, it's of very limited value in practice.
Since I'm planning a significant Graph refactor (#1136), I'd rather
delete this code than continue to maintain it, since I think its
complexity/value ratio is unfavorable.
Test plan: `yarn test`
This is another minor silent test failure: the error message thrown by
loadIndividualPlugin when a GitHub token is not available is not quite
the error that was specified in the test.
There were two issues: we were testing for the wrong error message, and
the failure didn't fail the test. I fixed both: the message thrown now
matches the test, and the test now _returns_ the expectation that the
promise will reject and expects there to be one assertion.
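The reason returning the promise matters can be shown without any test framework. In the sketch below, `runTest` is a hypothetical, minimal stand-in for a test runner (it is not Jest's implementation), and `loadWithoutToken` and `expectRejects` are likewise invented for illustration.

```javascript
// Sketch of why a test must return (or await) its promise.
// `runTest` is a hypothetical, minimal stand-in for a test runner.
async function runTest(testFn) {
  try {
    await testFn(); // only waits if the test returns its promise
    return "passed";
  } catch (err) {
    return "failed";
  }
}

// Hypothetical function under test: rejects with a specific message.
function loadWithoutToken() {
  return Promise.reject(new Error("no GitHub token"));
}

// Resolves if `promise` rejects with `message`; rejects otherwise.
function expectRejects(promise, message) {
  return promise.then(
    () => {
      throw new Error("expected rejection");
    },
    (err) => {
      if (err.message !== message) {
        throw new Error(`wrong message: ${err.message}`);
      }
    }
  );
}

// Broken: the expectation is not returned, so the runner never sees
// the mismatch. (The rejection is swallowed here only so that the
// sketch itself does not crash newer Node versions.)
const silentTest = () => {
  expectRejects(loadWithoutToken(), "wrong message").catch(() => {});
};

// Fixed: the expectation is returned, so a mismatch fails the test.
const returnedTest = () => expectRejects(loadWithoutToken(), "wrong message");
```

Under this runner, `silentTest` is reported as passing even though it checks for the wrong string, while `returnedTest` is reported as failing.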
Test plan: `yarn unit cli/load` no longer shows any unhandled promise
rejection warnings. If the test is modified so that it checks for the
wrong string, it now properly fails rather than passing with an
unhandled rejection.
Prior to this commit, if you ran `yarn unit cli/load`,
you would see a lot of unhandled promise rejection warnings related to
the fact that load calls a `saveTimestamps` function which expects the
SourceCred directory to contain the results of really loading SourceCred
plugins. However, when testing, these functions have been mocked, and so
saveTimestamps rejects. This rejection was not caught, and polluted the
test output without actually failing the tests.
This commit updates the tests so that saveTimestamps can be mocked (via
dependency injection), so we can both verify that it is invoked
correctly and avoid polluting the test output with spurious warnings.
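The general shape of this kind of dependency injection can be sketched as follows. Only the name `saveTimestamps` comes from this commit; `makeLoad`, `loadPlugins`, `makeRecorder`, and the `{repo: ...}` argument are hypothetical and do not describe the real `load` command's signature.

```javascript
// Sketch of injecting a dependency so tests can replace it.
// `makeLoad`, `loadPlugins`, and the argument shape are hypothetical;
// only the name `saveTimestamps` comes from the commit message.
function makeLoad(loadPlugins, saveTimestamps) {
  return async function load(options) {
    await loadPlugins(options);
    await saveTimestamps(options);
  };
}

// In a test, pass recording fakes instead of the real implementations.
function makeRecorder() {
  const calls = [];
  const fn = async (...args) => {
    calls.push(args);
  };
  fn.calls = calls;
  return fn;
}

const fakeLoadPlugins = makeRecorder();
const fakeSaveTimestamps = makeRecorder();
const load = makeLoad(fakeLoadPlugins, fakeSaveTimestamps);
const done = load({repo: "example/repo"}); // resolves once both fakes ran
```

Because the fake never touches the filesystem, it cannot reject for lack of real plugin output, and the test can still assert on how it was called.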
Test plan:
- `yarn test --full` passes
- `yarn unit cli/load` produces far fewer
UnhandledPromiseRejectionWarnings (there is still one unrelated one)
- loading sourcecred/research still works (as a canary)
Note: the PR which introduced this issue is #1162.