[OSI] Beyond Simple Search

Sorry, this last set of notes on a breakout session at the open source conference somehow ended up stuck on my laptop. Moderator Eric Haseltine started us out with the premise: “Search is great, but sometimes it sucks.” As always, this is a highly filtered, non-transcript-level set of reflections.

* John Howard, Deputy Associate Director of Enterprise Solutions, Office of the Chief Information Officer, Office of the Director of National Intelligence
* Dr. James Mulvenon, Director, Center for Intelligence Research and Analysis, Defense Group Inc.
* Francis Kubala, Scientist, BBN Technologies
* Jason Hines, Principal Search Engineer, Google
* Moderator: Dr. Eric Haseltine, Associate Director of National Intelligence for Science and Technology, Office of the Director of National Intelligence

Howard begins by talking about projects within the community to provide search that is sensitive to the user’s attributes, particularly access permissions. Beyond this, should everything be discoverable? To what extent do you actually reach in and discover things? Is it OK if I find your half-finished document? Can you “half-discover” something, e.g., reveal the owner of a document but not its text, and tell the searcher to seek out the author?
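[To make the “half-discover” idea concrete for myself, here is a minimal sketch in Python. Nothing here was shown in the session: the `Document` and `Hit` types, the integer clearance levels, and the `discoverable` opt-out flag are all my own assumptions. A searcher with enough clearance gets the text; anyone else gets only the owner's name, unless the owner has opted out of discovery.]

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    doc_id: str
    owner: str
    text: str
    required_level: int        # hypothetical clearance level needed to read the text
    discoverable: bool = True  # hypothetical flag: owner may opt out of discovery


@dataclass(frozen=True)
class Hit:
    doc_id: str
    owner: str
    text: Optional[str]  # None means "half-discovered": go ask the owner


def search(docs: List[Document], query: str, user_level: int) -> List[Hit]:
    """Naive substring search that respects the searcher's access level."""
    needle = query.lower()
    hits = []
    for doc in docs:
        if needle not in doc.text.lower():
            continue
        if user_level >= doc.required_level:
            # Full discovery: the searcher may read the text.
            hits.append(Hit(doc.doc_id, doc.owner, doc.text))
        elif doc.discoverable:
            # Half discovery: reveal who to talk to, not what was written.
            hits.append(Hit(doc.doc_id, doc.owner, None))
        # Otherwise the document stays invisible to this searcher.
    return hits


if __name__ == "__main__":
    corpus = [
        Document("d1", "alice", "Draft notes on port logistics", required_level=1),
        Document("d2", "bob", "Logistics assessment, unfinished", required_level=3),
        Document("d3", "carol", "Private logistics scratchpad", 3, discoverable=False),
    ]
    for hit in search(corpus, "logistics", user_level=2):
        print(hit.doc_id, hit.owner, hit.text or "(contact owner)")
```

[Note that even a half-discovered hit leaks something: the searcher learns that the owner has a document matching the query. Whether that is acceptable is exactly the policy question Howard was raising.]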

Kubala works on getting data from speech, retrieving information from a stream. There are lots of problems with this. One is simply that the text comes out as a true wall of text, with no boundaries. The first thing that can be done is separating out the speakers. You can also do some highlighting of named entities. Adding structure to the transcription is a non-trivial piece of the puzzle: speech-to-text is not enough on its own.

Add translation to this and you get real-time translation of broadcast materials. This allows for federated search across multiple languages. You can have a persistent watch list that catches keywords (in all languages) for incoming streams. [How cool would it be for this to be an open search engine! Doubt there is enough advertising to support it.] Works really well for news broadcasts, showing the stream with transcriptions in the original language as well as machine-translated text, all clickable for random access to the original video stream. But current tech does not work quite as well for “found speech” (e.g., YouTube). Currently exploring example-based queries, and iterated queries that work with the searcher toward a convergent set of responses. “One of the futures for beyond simple searching is to get beyond the ad hoc query.”
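[A rough sketch of what a persistent, multilingual watch list might look like, in Python. This is my own illustration, not BBN's system: the `Segment` type, the `WATCH_LIST` structure, and the stream names are hypothetical, and I assume the speech-to-text stage has already produced time-stamped, language-tagged transcript segments. Each hit carries the stream and the offset, which is what would make the transcript clickable back into the video.]

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Segment:
    stream: str           # hypothetical stream identifier
    start_seconds: float  # offset into the original audio/video
    language: str         # language tag assigned upstream
    text: str             # transcript in the original language


# Hypothetical watch list: one concept, with surface forms per language.
WATCH_LIST: Dict[str, Dict[str, List[str]]] = {
    "election": {
        "en": ["election"],
        "es": ["elección", "elecciones"],
        "fr": ["élection"],
    },
}


def match_segment(segment: Segment, watch_list=WATCH_LIST) -> List[str]:
    """Return the watched concepts mentioned in one transcript segment."""
    text = segment.text.casefold()
    matched = []
    for concept, forms_by_language in watch_list.items():
        forms = forms_by_language.get(segment.language, [])
        if any(form.casefold() in text for form in forms):
            matched.append(concept)
    return matched


def watch(segments: Iterable[Segment], watch_list=WATCH_LIST) -> Iterator[Tuple[str, str, float]]:
    """Yield (concept, stream, offset) for every hit, as segments arrive."""
    for segment in segments:
        for concept in match_segment(segment, watch_list):
            yield concept, segment.stream, segment.start_seconds


if __name__ == "__main__":
    incoming = [
        Segment("news-es", 12.0, "es", "Las elecciones serán en mayo"),
        Segment("news-en", 40.5, "en", "Weather for the weekend"),
        Segment("news-fr", 73.2, "fr", "Une élection anticipée est possible"),
    ]
    for concept, stream, offset in watch(incoming):
        print(f"{concept}: {stream} at {offset}s")
```

[Plain substring matching is the crudest possible choice here; a real system would presumably need morphology, transliteration, and tolerance for recognition errors, none of which this sketch attempts.]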

Mulvenon works mainly on what appears on the Chinese web. He notes that the biggest way that they use open source is as a way to cue targeted collection of classified materials (an oft-heard refrain). Chinese is “China’s first layer of encryption.” Finding linguists and getting them cleared is extremely difficult: 3.5 to 4 years to get even naturalized Chinese and Taiwanese cleared. (This issue of the time it takes to clear personnel came up in several of the sessions.) That said, it is a great source; the number of Chinese-language pages will soon outstrip English-language pages. Must do searches “from within” China. Google.ca is not the best search engine. Baidu is also not particularly great, despite the “Chinese-centric” advertising. Lots of blogging content as well, which is both difficult to track and sometimes valuable.

They move beyond search to look for text in the under- or un-linked dark web, as well as seeking out structural information (other hosts on shared subnets, open ports, etc.) to help leverage more complete collections. They have also been working with open source geospatial data to combine with their data and do geolocation of various materials in China.

Even military targets are fairly transparent: “It’s not transparent in English, as if they’re supposed to make it easy for us.” Some items are pretty opaque, but there are some areas (logistics, etc.) that end up being made available through the mountain of online and off-line posting that goes on in China. Finally, a lot of it is beyond the ability to machine translate, layered beneath cultural allusion that simply won’t be picked up.

Hines reiterated Google’s mission: “organizing the world’s information.” Google Enterprise is the fastest-growing unit of Google, with 100% growth each year. They move products to enterprise when they are mature and ready to be secured for enterprise deployment. “Universal search” is the aim: collapsing vertical search and machine translation to do one search on the Google search box that draws in lots of different materials. He didn’t talk in much detail about the kinds of projects they are currently working on.

Q & A

Q: What’s wrong with search tools?

Too much “flotsam and jetsam.”

Q: What would be the ideal user experience?

“Smarter search would be helpful.” Wenlin, with a customized dictionary, is vital to Mulvenon’s work. Need to be able to tag content. “Every time I start to do search, it’s like I’m starting from scratch.”

What is Google doing about this? You should use Google Toolbar for, e.g., on-the-fly translation, as well as personalized search. [But I think there is a disconnect there; clearly Toolbar is too superficial for the kind of work we’re talking about.]

Kubala: Deep semantics still a deep problem. On the shallow end, there is duplicate and near-duplicate handling.
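[For the “shallow end,” one common way to flag near-duplicates is to compare overlapping word shingles with Jaccard similarity. Kubala did not say which method they use, so treat this Python sketch as my own illustration; the shingle size `k` and the `threshold` are arbitrary assumptions, and the all-pairs comparison only makes sense for small collections.]

```python
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping k-word shingles in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs: Dict[str, str], threshold: float = 0.5, k: int = 3) -> List[Tuple[str, str]]:
    """Return pairs of document ids whose shingle sets overlap enough."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if jaccard(sets[left], sets[right]) >= threshold:
                pairs.append((left, right))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "completely different text about search engines",
    }
    print(near_duplicates(docs))
```

[Exact duplicates score 1.0 and unrelated texts score 0.0; everything interesting sits in between, which is where choosing the threshold becomes a judgment call.]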

Q: Is search the right paradigm?

Haseltine: “We can only ask what we know to ask.” [In other words, we want to find, not search.]

Hines makes the comment, and this comes up a number of times, that there is a need to re-introduce the social to the search process. Sometimes it’s more important to know who has the tacit expertise, rather than where it is on the web.

Q: What is Google doing beyond not being evil? (Doesn’t the acquisition of DoubleClick’s records provide an extremely valuable & dangerous source of traffic data?)

Hines: “We take user privacy very seriously. We wouldn’t do anything that would challenge our credibility to our users.” He goes on to note that there’s nothing to stop you from leaving Google at any time. However, he doesn’t touch the issue of whether Google will then let you take your data with you (and not have it remain in their control), the way Ask.com has recently allowed users to request non-collection.

Q: What about Wales positioning himself as a way of taking on the search paradigm?

Hines suggests that others have tried something similar, but perhaps without the weight that Wales can put behind it. It will be interesting to see what he makes of it.

Q: What have you learned about the behavior of humans at Google?

Hines: People want simple search. They don’t want to move beyond it, and when Google experiments with ways of providing alternatives, users don’t like it. They want a line to type stuff into, and an answer, and they want that to work with some consistency. They don’t want to know what’s happening under the hood.
