Friday, 22 October 2004

Federated Search (The Buntine Oration - Reflection 1)

As I have noted before, I think Stephen and I have more similarities than differences, but on this issue I am offering an alternative viewpoint.

Because learning objects are invisible to Google [sic] operators of learning content management systems are intended to access these federated searches.

Repositories do not exist only for objects that are invisible to Google. Back in 1999, while serving EdNA, I looked at the collaboration issues of "subject gateways" (SGs hereafter), which are in many ways similar to today's repositories.

First of all, we must recognize the value provided by these SGs. Let me use two examples. Stephen's own Edu-RSS is widely subscribed, and so is his OLDaily. I am a keen reader of his writing. The greatest value of OLDaily, to me at least, is the "coloured" view through the eyes of a man whom I trust. I respect his selection, I like to read his comments, and I occasionally follow through to the items. Another example: some investors pay good money for the recommendations of certain advisors. The information on which such advisors base their advice should be public knowledge (otherwise, they are liable to a charge of insider trading). But the value to their subscribers is the unique analysis: the coloured view. Subject gateways serve a similar function. These SGs are run and owned by people who have immense interest and "authority" in the subject area, and users trust their recommendations. In the 21 October 2004 issue of OLDaily, Stephen pointed to an item, "Information Cascades in Online Learning", which describes the chaotic environment that information linking may create. SGs, in a way, can serve as a nice balance to this chaos.

The second thing I did in tackling this SG collaboration issue was to look at the sustainability of the SGs and to recognize both the competition and the collaboration requirements. On one hand, SG owners need to bid competitively for funding; on the other, there is obviously significant duplication in terms of infrastructure and scalability. Referring back to the data model I developed (see Meta Meta Meta Data Draft 0.1), the value the SG owners would protect in order to survive is "type 2" data. I did not see how they could give up type 2 data without killing themselves. The collaboration model, hence, had to protect the ownership of type 2 data. What we could do, at the time, was to come up with harmonized type 3 data so that searches could be run across these SGs without the SGs explicitly sharing their type 2 data. We called this "mega" search (instead of the then-current term "meta search"), and it is what is now known as "federated search".

Stephen also commented on the efficiency of federated search.
If this process seems odd and cumbersome, it is. In practice, the federated search over even a small number of repositories is significantly slower than Google.

This is the price we pay for the concept of "loose" connectedness. While Stephen is a great promoter of loosely coupled structures for learning services, I don't see any reason why search must be centrally controlled by Google. As network speed and the interfaces among SGs (or repositories) improve, performance will improve as well. Peer-to-peer networks are based on the same principle. As I understand it, the value of a network is a function of the number of nodes in the network. To really make federated search a viable strategy, repositories need to join together to form a value network comparable to Google's and continue to leverage the network effect to grow that value. If repositories remain closed, I agree with Stephen that doomsday is coming.

As I noted, we have more in common than we have differences. The last quote from Stephen will pull us back together.
What Google has, that a federated search system by definition cannot have, is what I call third party metadata and what Google calls PageRank.

Our difference boils down to terminology. To Stephen, repositories are just data stores, and hence federated search across multiple stores makes no sense. What I am trying to say here is that *some* repositories add significant value, and we should also support mechanisms that protect such value. Diversity is great!

2 comments:

Stephen Downes said...

The point of dispute is in this paragraph:

"On one hand, SG owners need to competitively bid for funding, but there are obviously significant duplication in terms of infra-structure and scalability. Referring back to the data model I developed (see Meta Meta Meta Data Draft 0.1, the value the SG owners would protect in order to maintain their survival is 'type 2' data. I did not see how they will give up type 2 data without killing themselves. The collaboration model, hence, has to protect the ownership of type 2 data. What we can do, at the time, was to come up with a harmonized type 3 data so that searches can be done to these SG without SG explicitly sharing their type 2 data."

The presumption is, producers of type 2 data (what I would call third party metadata) cannot share this data without, as Albert says, "killing themselves". I don't agree with this point; I think there are numerous business models which allow the sustainable sharing of type 2 data. It is, after all, just the content sharing model all over again, but with a new type of content.

The content sharing argument, which would be applied to resources such as news articles, academic essays, and music (to name a few), was that you cannot share this content sustainably; that unless access were restricted it would not be financially viable.

Numerous arguments have been offered against this position, but it seems evident that in a world with open source software, Creative Commons, the Public Library of Science and OAI (to name a few) we are finding that the sharing of content can be financially viable. What made Siskel and Ebert - the classic progenitors of Type 2 data - so popular is that they made their ratings widely and freely available.

I would add, moreover, that the economics of content will also apply to Type 2 data - that while, in the past, there was a substantial market for this content, there will be in the future no market worth preserving. If it's true that everyone is an author, and hence that the world will be flooded with content, it is even more true that everyone is a critic, and that we will be deluged with Type 2 data. Hiding and protecting an asset that has no intrinsic sale value makes no sense.

Yes, there is a value-add in the "colour" I and others add with our news feeds. But it is a value that cannot be realized in and of itself - why would people pay for, say, MERLOT's peer review evaluations when they can get equally valuable evaluations for free?

There is a suggestion in Ip's argument that the 'free' content will be realized in a third layer, Type 3 data, which involves the use of Type 2 data in order to provide aggregated feeds. That such a Type 3 data will begin to exist is certain. But the creator of type 3 resources will be no more likely to pay for Type 2 data than a consumer.

What Federated Searching does is to close off the possibility of free Type 2 data by making sure that this data can never be associated with the Type 1 data to which it refers. It is an attempt to create the monopoly before the marketplace exists.

But the enforcement of such a monopoly will be as problematic as the enforcement of any closed data regime. Follow the logic, and you see that in order to protect this monopoly, basic data - such as the location of a resource - must be treated as intellectual property, and the posting of a link to that resource, or metadata about that resource, depicted as piracy.

When reference becomes a privilege, rather than a right, we destroy the network.

Albert Ip said...

Thank you, Stephen, for responding and again for highlighting a blind spot which I had failed to update since writing in 1999. I now totally agree that there is no reason why type 2 and type 3 data should not be subject to the same economic forces as type 1 data. While there are many resources (type 1 data) locked in fee-paying repositories, there are also many resources released under more liberal IP licenses. Type 2 and type 3 data are no different. I am in favour of protecting such diversity as well. If any fee-paying service can survive economically, let it be, and good luck to it.

The subject gateways which I examined back in 1999 were ALL open access, i.e. their services were available free to the general public. These subject gateways were funded by Australian public money, and hence the work was about how to utilise that money efficiently, e.g. by eliminating unnecessary duplication. How far up the chain of command that work reached is unknown to me, and I don't know whether it had any impact on any decision! The idea was to provide a business model that let the SG owners survive and compete at the same time.

That said, I believe the point of misunderstanding is that I failed to recognise a characteristic of the repositories in Stephen's federated search. I believe Stephen is against* "closed" data repositories which are based on an industrial-age IP model. I agree that that business model is outdated and that change is needed. The 1999 work described several layers of collaboration. Enabling "mega" search was the second level of collaboration.

One of my focuses was the diversity of search technology. To repeat a point I have already made, I don't see any reason why search has to be centralised. Searching across repositories in real time will be possible in the future. At present, network latency may make such a search slow, but I believe it will improve so that several rounds of "mega" searches can be conducted within the range of human response time.
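
To illustrate why latency need not add up across repositories, here is a minimal sketch in Python of a search that fans out to several repositories at once. Everything in it is an assumption made for illustration: the repository names, their latencies, their records, and the function names `search_repository` and `mega_search` are hypothetical, and the network is simulated with `time.sleep`. Run this way, the total wait is close to that of the slowest repository rather than the sum of all of them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote repositories:
# name -> (simulated latency in seconds, records held by that repository).
REPOSITORIES = {
    "repo_a": (0.05, ["Intro to metadata", "Learning object design"]),
    "repo_b": (0.10, ["Metadata harvesting", "Peer review of resources"]),
    "repo_c": (0.02, ["Subject gateways and metadata"]),
}


def search_repository(name, query):
    """Simulate one remote search: wait for the network, then filter by keyword."""
    latency, records = REPOSITORIES[name]
    time.sleep(latency)
    return [record for record in records if query.lower() in record.lower()]


def mega_search(query):
    """Send the query to every repository concurrently and collect hits per repository."""
    with ThreadPoolExecutor(max_workers=len(REPOSITORIES)) as pool:
        futures = {
            name: pool.submit(search_repository, name, query)
            for name in REPOSITORIES
        }
        # Each repository keeps its own data; only the matching hits come back.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    for repo, hits in mega_search("metadata").items():
        print(repo, hits)
```

Each repository answers the query itself, so nothing in this sketch requires a repository to hand over its full data set to a central index.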

There exist many major repositories: EdNA, MERLOT, and so on. It would be nice if my search on a resource aggregated comments from EdNA, MERLOT and a number of my "trusted" authorities of opinion. I would also like to be able to specify which opinion authorities I prefer and hence put a higher weight on their results. Someone said the Internet economy is an economy based on the scarcity of "attention". Mega search may help us make better use of this scarce resource by helping us focus on things we trust.
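
The weighting idea can be sketched in a few lines of Python. This is only one possible reading, under stated assumptions: opinions arrive as numeric ratings, each searcher supplies their own trust weights, and authorities the searcher has not listed are ignored. The function name `rank_by_trusted_opinion`, the authority names and all the numbers are hypothetical.

```python
from collections import defaultdict


def rank_by_trusted_opinion(opinions, trust_weights):
    """Rank resources by the trust-weighted average of their ratings.

    opinions: iterable of (resource, authority, rating) tuples.
    trust_weights: dict mapping authority -> weight chosen by the searcher.
    Authorities missing from trust_weights (or weighted <= 0) are ignored.
    """
    weighted_totals = defaultdict(float)
    weight_sums = defaultdict(float)
    for resource, authority, rating in opinions:
        weight = trust_weights.get(authority, 0.0)
        if weight <= 0:
            continue
        weighted_totals[resource] += weight * rating
        weight_sums[resource] += weight
    scores = {
        resource: weighted_totals[resource] / weight_sums[resource]
        for resource in weighted_totals
    }
    # Highest trust-weighted score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical ratings gathered from several opinion sources.
    opinions = [
        ("resource-1", "gateway_a", 4.0),
        ("resource-1", "gateway_b", 2.0),
        ("resource-2", "gateway_b", 5.0),
        ("resource-2", "unknown_blog", 1.0),
    ]
    # This searcher trusts gateway_a three times as much as gateway_b.
    my_trust = {"gateway_a": 3.0, "gateway_b": 1.0}
    print(rank_by_trusted_opinion(opinions, my_trust))
```

Because the weights belong to the searcher rather than to the search service, two people running the same query can legitimately see different rankings.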

*This may be a strong word that I have put into Stephen's mouth. I just lack a better one. :)

Note: In this comment I have started to distinguish between two terms: "federated search" is reserved for searching across closed repositories, and "mega search" for searching across repositories without regard to whether they are closed or not. Mega search is more a technology than a business model.