Bibify: Building an open-source citation service

Vincent Wang
5 min read · May 19, 2020

DISCLAIMER: This article is shameless self promo. The links for the frontend and backend source code are here: https://gitlab.com/bibify. If you want to read about how we did it, go on.

We’ve all had to do it: a teacher assigns a research project and says that it needs a full bibliography, formatted in MLA or APA or whatever. Now, most of us don’t bother to remember how to cite a website in MLA (because how often do you actually need to do that in life?), so we usually turn to a citation generator.

These things usually suck.

Most of these citation generators are slow and laden with ads. The ones that aren’t usually don’t work properly. We put up with most of them because the alternative would be to do it by hand, and doing it by hand is painful. So what if we built a free and open-source citation generator?

Step 1 — the frontend

Our citation generator needs a frontend. Easy job. Grab some React, your favorite components library, and slap them together. Bing bang bong, add some axios for making HTTP requests, a querystring library, and you’re done!

The source for the frontend is here: https://gitlab.com/bibify/bibify.

Step 2 — the backend

The backend side is a bit harder. In order to match current citation generators in features, our app needs to be able to:

  • Generate accurate citations for all common citation styles
  • Get book and website info to auto-cite (because entering data by hand is lame)

Let’s break down these problems.

Fetching Website Info

Fetching website info is pretty easy. Grab a metadata scraping library and point it at the website you want to get the info of. That’s about it!
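
To make that a bit more concrete, here is a minimal sketch of what a metadata scraper does under the hood. This is not the library Bibify uses: extractMetadata and getMetaContent are hypothetical stand-ins, the regex parsing assumes double-quoted attributes with the name before the content, and a real scraper handles far more edge cases.

```javascript
// Hypothetical stand-in for a metadata scraping library: pulls a few
// common fields out of raw HTML using Open Graph tags and <title>.

// Find <meta property="key" content="..."> or <meta name="key" content="...">.
// Assumes double-quoted attributes, with property/name before content.
function getMetaContent(html, key) {
  const pattern = new RegExp(
    '<meta[^>]+(?:property|name)="' + key + '"[^>]*content="([^"]*)"',
    'i'
  );
  const match = html.match(pattern);
  return match ? match[1] : null;
}

function extractMetadata(html) {
  const titleTag = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return {
    // Prefer the Open Graph title, fall back to the <title> tag.
    title: getMetaContent(html, 'og:title') || (titleTag ? titleTag[1].trim() : null),
    siteName: getMetaContent(html, 'og:site_name'),
    author: getMetaContent(html, 'author'),
    published: getMetaContent(html, 'article:published_time'),
  };
}

// Made-up sample page, for illustration only.
const sampleHtml = `
<html><head>
  <title>Fallback Title</title>
  <meta property="og:title" content="A Page about Something">
  <meta property="og:site_name" content="Example Site">
  <meta name="author" content="Jane Doe">
  <meta property="article:published_time" content="2020-05-19">
</head><body></body></html>`;

console.log(extractMetadata(sampleHtml));
```

The fields that come back (title, site name, author, publish date) are the ones a citation needs for a website.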

Fetching Book Info

Fetching accurate book information for free is a bit harder than fetching websites; the book database that most people use is locked behind a subscription fee. However, the Google Books API is free, and there’s no limit! (Although they do request that you stay under 10,000 requests per day as a courtesy limit.) All we need to do is grab a good wrapper library, give it a search query, and we’re good to go.
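
As a rough sketch of what sits underneath such a wrapper: the Google Books API has a public volumes search endpoint, and each result carries its details under volumeInfo. The helper names below (buildBooksSearchUrl, volumeToCslItem) are made up for illustration and are not Bibify's actual code; the sample volume is invented too.

```javascript
// Build a search URL for the public Google Books volumes endpoint.
function buildBooksSearchUrl(query) {
  const params = new URLSearchParams({ q: query, maxResults: '5' });
  return 'https://www.googleapis.com/books/v1/volumes?' + params.toString();
}

// Convert one entry of the response's `items` array into a CSL-JSON item.
// Field names under `volumeInfo` follow the Google Books volume resource.
function volumeToCslItem(volume) {
  const info = volume.volumeInfo || {};
  const item = {
    id: volume.id,
    type: 'book',
    title: info.title,
    publisher: info.publisher,
    // Author names arrive as plain strings; keep them as literals here.
    author: (info.authors || []).map((name) => ({ literal: name })),
  };
  // publishedDate is a string such as "2001", "2001-05" or "2001-05-17".
  if (info.publishedDate) {
    const parts = info.publishedDate.split('-').map(Number);
    item.issued = { 'date-parts': [parts] };
  }
  return item;
}

// Invented sample result, shaped like one entry of `items`.
const sampleVolume = {
  id: 'abc123',
  volumeInfo: {
    title: 'A Book about Something',
    authors: ['First Last'],
    publisher: 'Random Publisher Inc.',
    publishedDate: '2001-05',
  },
};

console.log(buildBooksSearchUrl('a book about something'));
console.log(volumeToCslItem(sampleVolume));
```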

Generating accurate citations

Generating accurate citations for every common citation style is difficult, especially because doing it yourself would mean reading through every citation style guide’s rules on how to cite every media type. Luckily, we don’t have to! The CSL (Citation Style Language) project already contains 9000+ citation styles that we can use. Combine this with the citeproc-js processor, which takes these citation styles and spits out a citation, and we’re in business!

Except it’s not that simple.

Being able to access 9000 styles through citeproc-js is great, but getting it to work is a bit of a slog. In order to use the citeproc-js engine, you need to write your own sys object which provides the functions retrieveItem() and retrieveLocale(); while it isn’t really that hard to write these functions yourself, it is still a good amount of boring boilerplate. So, to solve this, we s̶t̶e̶a̶l̶ borrow this nice wrapper (as well as this helper script). Now, instead of writing our own sys object, we can just let sys = citeprocnode.simpleSys(). Much easier, isn’t it?

Once we have our sys object, it’s a simple matter of giving it an item and calling makeBibliography:

Just load in the CSL style file, the locale, and the bibliography item (formatted in CSL-JSON), and it spits out a bibliography!
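
Here is a minimal sketch of those pieces, using a hand-written sys object in place of the wrapper so the example has no dependencies. makeSys is a hypothetical helper, the locale string is a placeholder rather than a real locale file, and the engine calls are left as comments because they assume the citeproc npm package is installed and a style file has been read into styleXml.

```javascript
// A hand-rolled `sys` object of the kind citeproc-js expects.
// `locales` maps language tags to locale XML strings;
// `items` maps ids to CSL-JSON items.
function makeSys(locales, items) {
  return {
    // citeproc-js calls this with a language tag such as "en-US".
    retrieveLocale(lang) {
      return locales[lang];
    },
    // citeproc-js calls this with the id of an item to be cited.
    retrieveItem(id) {
      return items[id];
    },
  };
}

const sys = makeSys(
  // Placeholder: a real setup would load the contents of locales-en-US.xml.
  { 'en-US': '<locale xml:lang="en-US"></locale>' },
  {
    'random-id-aihwgew': {
      id: 'random-id-aihwgew',
      type: 'book',
      title: 'A Book about Something',
    },
  }
);

// With citeproc installed, the object would be handed to the engine like so:
//   const CSL = require('citeproc');
//   const engine = new CSL.Engine(sys, styleXml);
//   engine.updateItems(['random-id-aihwgew']);
//   const [params, entries] = engine.makeBibliography();
// `entries` is the list of formatted bibliography strings.
```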

Side Note: CSL-JSON

In order to generate a bibliography, you need to load your info into citeproc-js in CSL-JSON format. It varies based on type, but it more or less looks something like this:

{
  'id': 'random-id-aihwgew',
  'type': 'book', // any one of the CSL types listed here
  'title': 'A Book about Something',
  'publisher': 'Random Publisher Inc.',
  // other type-specific fields
  // note: the CSL-JSON key for the name list is 'author', not 'authors'
  'author': [
    { 'family': 'Last', 'given': 'First' }
  ]
}

The full docs for CSL-JSON are here.
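
Dates are the one part of CSL-JSON that tends to trip people up: they are nested arrays under a date-parts key rather than strings. The sketch below builds a webpage item with issued and accessed dates. toCslDate and makeWebpageItem are hypothetical helpers, and the sample values are invented.

```javascript
// Convert a JavaScript Date into the CSL-JSON date structure (UTC-based).
function toCslDate(date) {
  return {
    'date-parts': [[date.getUTCFullYear(), date.getUTCMonth() + 1, date.getUTCDate()]],
  };
}

// Hypothetical helper: build a CSL-JSON item for a web page.
function makeWebpageItem({ id, title, siteName, url, published, accessed }) {
  const item = { id, type: 'webpage', title, URL: url, accessed: toCslDate(accessed) };
  if (siteName) item['container-title'] = siteName;
  if (published) item.issued = toCslDate(published);
  return item;
}

const page = makeWebpageItem({
  id: 'random-id-webpage',
  title: 'A Page about Something',
  siteName: 'Example Site',
  url: 'https://example.com/something',
  published: new Date(Date.UTC(2020, 4, 19)), // months are zero-based in Date.UTC
  accessed: new Date(Date.UTC(2020, 4, 20)),
});

console.log(JSON.stringify(page, null, 2));
```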

Pitfalls

One thing to know about CSL is that it has two categories of styles: independent and dependent. A dependent style is basically an alias for an independent style. For example, Harvard Educational Review links to APA. Unfortunately, that link (plus some metadata) is the only information available in the dependent style:

An example of a dependent CSL style. Note that only the parent style and some metadata are available.

This means that trying to load this style directly into citeproc-js won’t work, because citeproc-js is expecting a full independent CSL style:

An example of an independent style. Note how there’s a lot more information on how to actually create the citation.

So, we need to grab the linked independent style from the dependent style and load that instead:

Here we use xpath to grab the path of the linked independent CSL style and load that instead.
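
A dependency-free sketch of the same idea is below. It uses regular expressions as a stand-in for the xpath lookup, so it is not Bibify's actual code. The function names are hypothetical, the sample style is a trimmed-down illustration, and the mapping from the parent URL to a local apa.csl file name is an assumption about how the style files are stored.

```javascript
// Return the URL of the independent parent style, or null if the style
// has no rel="independent-parent" link (i.e. it is already independent).
function getIndependentParentUrl(styleXml) {
  const linkTags = styleXml.match(/<link\b[^>]*>/g) || [];
  for (const tag of linkTags) {
    if (/\brel="independent-parent"/.test(tag)) {
      const href = tag.match(/\bhref="([^"]*)"/);
      return href ? href[1] : null;
    }
  }
  return null;
}

// Assumption: local style files are named after the last URL segment,
// e.g. ".../styles/apa" -> "apa.csl".
function parentStyleFileName(styleXml) {
  const url = getIndependentParentUrl(styleXml);
  return url ? url.split('/').pop() + '.csl' : null;
}

// Trimmed-down illustration of a dependent style.
const dependentStyle = `<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
  <info>
    <title>Harvard Educational Review</title>
    <id>http://www.zotero.org/styles/harvard-educational-review</id>
    <link href="http://www.zotero.org/styles/harvard-educational-review" rel="self"/>
    <link href="http://www.zotero.org/styles/apa" rel="independent-parent"/>
  </info>
</style>`;

console.log(parentStyleFileName(dependentStyle));
```

If the function returns null, the style is already independent and can be handed to citeproc-js as is.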

Now we’re in business!

That’s the main CSL issue we need to get out of the way. Here are some other minor inconveniences:

  • This is more of a side note and less of a pitfall, but it turns out that the “Harvard” style that most popular citation services provide doesn’t really exist in CSL. There’s a (deprecated) harvard1.csl reference Harvard style, which links to the Harvard Cite Them Right style. However, most universities actually have their own Harvard variant.
  • MLA 7 and 8 are both shortened to “MLA”, so when you display the short titles side by side (as bibify does), both of them show up as “MLA”. This is easily fixed by going into each CSL file (modern-language-association.csl and modern-language-association-7th-edition.csl) and changing the <title-short> content.
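
The short-title fix can also be scripted instead of done by hand. This is a small sketch working on the style XML as a string; setShortTitle is a hypothetical helper, and the new short title is just an example.

```javascript
// Replace the contents of the first <title-short> element in a CSL style.
function setShortTitle(styleXml, shortTitle) {
  return styleXml.replace(
    /<title-short>[^<]*<\/title-short>/,
    // A function replacement avoids special handling of "$" in the new title.
    () => '<title-short>' + shortTitle + '</title-short>'
  );
}

const before = '<info><title>Modern Language Association 7th edition</title><title-short>MLA</title-short></info>';
console.log(setShortTitle(before, 'MLA 7'));
```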

The Future

While Bibify currently has feature parity with other popular citation generators, there are still some things in the works:

  • Bibify currently scrapes websites for metadata in real time. While this approach generally works, it also means that slow websites will take longer to cite, and websites that are down won’t be able to be cited at all. To solve this, we’re planning on maintaining a cache of cited websites, as well as integrating with The Wayback Machine. (NOTE: As of update 2020.05.18, Bibify now caches website fetch results with superagent-cache. While this speeds things up, websites that are down still can’t be cited.)
  • Autocitation currently works only with books and websites. We can improve on this by adding autocitation for other media types.
  • Like most citation generators, Bibify struggles to handle autociting author names with more than two words (e.g. “Bartolome de las Casas”). Currently, Bibify simply treats author names with more than two words as a literal, meaning that the name is put in as is; this is generally not compliant with most citation styles.
  • The frontend UI can definitely be improved.
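
The author-name limitation above is easy to see in code. This sketch follows the behavior described (two-word names are split, longer names become a literal); parseAuthorName is a hypothetical function, and treating single-word names as literals is an assumption.

```javascript
// Split a display name into a CSL-JSON name object.
// Two words -> given + family; anything else is kept as a literal.
function parseAuthorName(name) {
  const trimmed = name.trim();
  const words = trimmed.split(/\s+/);
  if (words.length === 2) {
    return { given: words[0], family: words[1] };
  }
  // Assumption: single-word names are also treated as literals.
  return { literal: trimmed };
}

console.log(parseAuthorName('First Last'));
console.log(parseAuthorName('Bartolome de las Casas'));
```

A literal name is printed as is, so a style that wants "Casas, Bartolome de las" has nothing to reorder. That is why this case is not compliant with most styles.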

Come and contribute!

This open-source citation generator project lives at https://gitlab.com/bibify. Come on over and contribute! File an issue or two, maybe fix some bugs or add some new features. New additions are always welcome.
