Why and How Does Context Indexing Matter?

Context indexing is a new paradigm in information retrieval that integrates two traditional approaches: keyword (or subject) indexing and citation indexing. It therefore performs much better than either alone. It usually extracts the textual blocks in which keywords and citations appear in context, hence the name. Such extracts are not essential, however; they serve only economy or convenience. Full texts can be used for context indexing instead, as long as both keywords and citations appear in context, where the user can easily judge their meaning and relevance. Web pages are usually full-text, while CiteSeer uses text blocks or extracts.
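To make the idea concrete, here is a minimal sketch of extracting context blocks, not Park's or CiteSeer's actual method. It assumes that citations appear as bracketed numbers such as [3], that a "context" is a single sentence, and that a naive punctuation-based splitter is good enough; the function names and the sample text are invented for illustration.

```python
import re

# Assumption: citation markers are bracketed numbers such as [3].
CITATION = re.compile(r"\[(\d+)\]")

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def context_index(text, keyword):
    """Return (citation_number, context_sentence) pairs for every sentence
    that contains both the keyword and at least one citation marker."""
    entries = []
    for sentence in split_sentences(text):
        if keyword.lower() not in sentence.lower():
            continue  # keyword is absent, so this sentence is not a context for it
        for match in CITATION.finditer(sentence):
            entries.append((int(match.group(1)), sentence))
    return entries

# Invented sample text, for illustration only.
sample = (
    "Relevance is judged by the user [1]. "
    "Systems estimate similarity instead. "
    "Citation links guide the user toward relevance judgments [2][3]."
)

results = context_index(sample, "relevance")
for number, context in results:
    print(number, "->", context)
```

Each entry keeps the whole sentence, so a reader sees the keyword and the citation together and can judge relevance before following the citation.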

Until the early 1970s, however, information scientists paid little attention to the power of contexts in sense-making, in spite of the user’s Anomalous State of Knowledge (ASK). Refer to Nicholas Belkin’s ASK[1] and Brenda Dervin’s sense-making.[2] To resolve your ASK step by step, you need to travel from text to text, guided by keywords and citations, as long as they make sense in context. Psychology and linguistics, or psycholinguistics, do matter. So does cognitivism. And cognitive users’ interaction with the system also matters. Most useful in this case is context indexing, especially when automated with electronic media.

The principle of context indexing was first formulated in 1975 by K. Y. Park, who read for an MSc degree in information science at University College London (UCL) in 1973-1974 under the supervision of B. C. Brookes. He was awarded the degree in 1975 by the University of London. His thesis was titled “A Direct Approach to Information Retrieval.”[3] As the title suggests, he wished to help scholars search for information of themselves, by themselves, and for themselves, hence the user orientation. He also wished them to be better equipped than they were with the traditional manual way of searching for information on non-electronic media.

Why on earth do cognitivism and user orientation matter? To answer this, Park founded his principle of context indexing on C. K. Ogden and I. A. Richards’ contextual theory of meaning.[4] They begin with the semantic triangle composed of Word, Thought, and Thing or Referent, where Word does not necessarily refer to Referent. It is through Thought that Word comes to stand for Referent. Thought is so central that one cannot do without it. Hence cognitivism. By “context,” they mean not only textual but also psychological context. From the cognitivist perspective, Park differentiates users’ subjective relevance from computer systems’ objective subject similarity. Computer systems can only predict relevance with varying probability, based on subject similarity. Because of this, users would not delegate their relevance judgment to intermediaries such as reference librarians. They themselves judge the relevance of the given context with its keywords and citations. Finally, he ended his thesis with the following warning from S. I. Hayakawa[5]: “…the ignoring of contexts in any act of interpretation is at best a stupid practice. At its worst, it can be a vicious practice.”

Park was also inspired by H. G. Wells’ World Brain[6] and by J. D. Bernal’s The Social Function of Science.[7] Wells wrote: “The modern World Encyclopaedia should consist of relations, extracts, quotations, very carefully assembled with the approval of outstanding authorities in each subject, carefully collated and edited and critically presented. It would not be a miscellany, but a concentration, a clarification and a synthesis.” Bernal wrote: “In science men have learned consciously to subordinate themselves to a common purpose without losing the individuality of their achievements. Each one knows that his work depends on that of his predecessors and colleagues, and that it can only reach its fruition through the work of his successors.”

Incidentally, there was a connectionist movement in the UK in the 1970s. In 1973, UCL was connected with Stanford via ARPANET, the forerunner of the Internet. In 1974, Tony Buzan started the TV series Use Your Head,[8] in which he popularized the mind map, where various ideas are interconnected. He provided a practical method for visualizing associations among ideas. In 1978, meanwhile, the famous science historian James Burke started another TV series, Connections.[9] He demonstrated how historical innovations and discoveries were woven into intricate networks of influence. Both emphasized that knowledge is not isolated but interconnected, and thus popularized connectionist thinking for broad audiences. Sir Tim Berners-Lee, who developed the World Wide Web[10] (WWW) for CERN in 1989, may be the next famous connectionist. Park, with his context indexing, was among them. He was a connectionist-plus-contextualist, so to speak.

After the WWW appeared, three Americans emerged as fathers of hypertext, with claims that the web is their offspring. They are Vannevar Bush, Douglas Engelbart, and Ted Nelson. Bush wrote As We May Think,[11] where he proposed what may be called a thinking machine, the Memex. This is something like a personal computer with a lot of information resources stored and interconnected. The Memex was imaginary. Engelbart invented the mouse, which is very useful in web surfing. And Ted Nelson coined “hypertext.” Since the 1960s, he has carried on Project Xanadu, based on two-way links, with which he expected to make money by recognizing where each incoming link comes from. They may have influenced the development of the web. However, the principle of the WWW is firmly based on Park’s context indexing, with which users can surf the web using not only keywords but also citations in context, in the form of embedded hyperlinks.

Traditionally, by contrast, there was a strong system-oriented paradigm in information retrieval. It has been represented since the 1960s by Lockheed’s DIALOG and Salton’s SMART. Retrieval was framed as a technical problem of indexing, matching, and ranking. Systems were designed to optimize internal efficiency and accuracy, not user autonomy. This was the mainstream paradigm in information retrieval.

Lockheed’s DIALOG[12] was developed at Lockheed Palo Alto Research Laboratory under Roger K. Summit in the 1960s, later commercialized as DIALOG Information Services. It was a pioneering online retrieval system, widely used by libraries, government agencies, and corporations. Users had to learn rigid query syntax (Boolean operators, field codes) rather than interact naturally. Ordinary users rarely searched directly. Instead, trained librarians or search specialists mediated between the system and its users. Emphasis was placed on building vast collections of bibliographic databases and efficient system access, rather than tailoring results to individual user needs. Success was measured by system performance (e.g., precision/recall, response time) rather than by user satisfaction or contextual needs.
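As a rough illustration of what Boolean searching and system-oriented evaluation involve, here is a toy sketch. It is not DIALOG's actual syntax or implementation: the documents, the query, and the relevance judgments are all invented, and the Boolean operators are simply modeled as set operations over an inverted index.

```python
from collections import defaultdict

# Toy collection; document ids and texts are invented for illustration.
docs = {
    "d1": "online retrieval of bibliographic records",
    "d2": "citation analysis of journal articles",
    "d3": "online citation retrieval for libraries",
}

def build_inverted_index(collection):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

index = build_inverted_index(docs)

# Hypothetical query "online AND retrieval NOT citation" as set operations.
retrieved = (index["online"] & index["retrieval"]) - index["citation"]

# Assumed relevance judgments, as an intermediary or test collection might supply.
relevant = {"d1", "d3"}

precision, recall = precision_recall(retrieved, relevant)
print(sorted(retrieved), precision, recall)
```

The NOT clause here excludes d3 even though it is judged relevant, which is the kind of trade-off a rigid query syntax pushes onto the searcher.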

SMART[13] (System for the Mechanical Analysis and Retrieval of Text) was a long-term research project, generously funded by the US Government. It was developed by Gerard Salton and his team, first at Harvard and later at Cornell. The main focus was on indexing methods, weighting schemes, similarity functions, and evaluation measures. All of these were tested against fixed document collections and standard queries in experimental settings, rather than with live, diverse users. Effectiveness was judged by system-oriented metrics, not by interactive or user-oriented measures. Users were treated as sources of queries, not as complex actors with evolving information needs.
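For readers unfamiliar with weighting schemes and similarity functions, the following sketch shows one common textbook combination: tf-idf term weights compared by cosine similarity. It is an assumption-laden illustration, not SMART's actual code or its exact weighting formula; the texts are invented, and the query is simply treated as one more text in the collection.

```python
import math
from collections import Counter

def tf_idf_vectors(texts):
    """Weight each term by (term frequency) * log(N / document frequency)."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    # Document frequency: in how many texts does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented texts; the last one plays the role of the query.
texts = [
    "citation indexing of journals",
    "keyword indexing of books",
    "citation links between journals",
]

doc0, doc1, query = tf_idf_vectors(texts)
sim0 = cosine(query, doc0)
sim1 = cosine(query, doc1)
print(sim0, sim1)
```

The score measures only subject similarity between texts. Whether the higher-ranked document is actually relevant remains the user's judgment, which is the distinction Park draws.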

DIALOG and SMART represent the paradigm of keyword indexing, which contrasts with citation indexing. Eugene Garfield’s Institute for Scientific Information (ISI) introduced the Science Citation Index (SCI) in the early 1960s for papers published in academic journals. Later, it also launched the Social Sciences Citation Index (SSCI) and the Arts and Humanities Citation Index (AHCI). SCI, SSCI, AHCI, and the like represent the paradigm of citation indexing.

Nowadays, both keyword indexing and citation indexing are rarely used except for books. In contrast, context indexing has become the mainstream paradigm in information retrieval. NEC’s CiteSeer was the next application of context indexing after the WWW, which is essentially comprehensive. In fact, it was the closest to Park’s model. However, its developers did not call it context indexing but marketed it as “autonomous citation indexing,” although it also includes keyword indexing. The qualifier “autonomous” is trivial, although it helps differentiate it from the manual approach of SCI.

Meanwhile, if NEC had invented it, its successors should have cited NEC. Since CiteSeer did not cite Park, it might have been an independent reinvention. On the other hand, probability argues against so many applications after CiteSeer each being independently reinvented within a short period of time and within a single country, the USA. If NEC did indeed invent CiteSeer, its successors should have cited it. Why did they not? They would have done so had they been sure that NEC invented or reinvented it. But they must have known that NEC did not invent or reinvent it, but merely copied Park’s principle without proper attribution to him. This may explain why CiteSeer was transferred from NEC to Pennsylvania State University, which had made no contribution and so had no claim, together with Lee Giles, one of its developers at NEC. Such was exactly the case when the World Wide Web was transferred from CERN to MIT’s Laboratory for Computer Science, for no apparent reason, together with Sir Tim Berners-Lee, its developer at CERN. Those mysteries could be best explained by plagiarism.

Very strangely, at the moment, Wikipedia has no article on context indexing or the like, although it has become the mainstream paradigm in information retrieval and there are more than 10 applications of it, including NEC’s CiteSeer (1997),[14] Google Scholar (2004),[15] and Microsoft Academic (2016-2022).[16] This is really an unbelievable situation. Nevertheless, Wikipedia would not have an article on context indexing, because such an article could not be written without attributing it to its true inventor, K. Y. Park. Wikipedia might wish him to remain unnoticed and unknown, because many associated parties, including not only the numerous copiers but also the US Government, especially the DoD, would wish that. What a historic shame! Aren’t they fools? It is a mystery and a paradox that proper citations and attributions have been avoided in this field of context indexing, which vitally involves citations. QED. Nevertheless, science historians will make it clear someday, once all the parties involved are dead.
