I was invited to give a presentation at a workshop at the UDS-SJTU Joint Research Lab for Language Technology, a joint research lab of Saarland University, Germany and Shanghai Jiao Tong University, China. I gave a brief overview of how we have been building Totuba’s research workspace on top of existing services and data. It is an interesting time to be an AI researcher: thanks to the Linked Open Data initiative, huge amounts of interlinked, machine-processable data are available on the Web. Similarly, Web services exist that enable sophisticated processing of text, for instance OpenCalais, a service that extracts the topics a text is about. In plain English, this means that we were able to reuse these tools and data for Totuba and could quickly build a prototype that would have been prohibitively expensive only a few years ago.
In my talk I stressed that a landscape that encourages reuse creates advantages for both research and commercial applications.
Here are the slides:
What follows is a brief overview of some of the other talks.
Prof. Hans Uszkoreit presented Hybrid Machine Translation, which combines the two leading paradigms of machine translation: statistical machine translation and rule-based machine translation. Both approaches have their advantages: statistical systems are ahead in closed domains, while in an open domain, rule-based systems do better. The main idea of hybrid machine translation is to replace phrases in the rule-based translation with phrases from the statistical machine translation.
Prof. Uszkoreit also presented Project EuroMatrix, a competition for machine translation between European languages.
Feiyu Xu explained how to use “seeds” to extract information from a text, e.g., the seed (ElBaradei, Nobel Prize, peace, 2005) can help find similar information. Feiyu showed how important it is to select the right seed and how negative seeds (e.g., (nominated, Nobel Prize)) can improve precision, at the cost of recall.
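To make the precision/recall trade-off of negative seeds concrete, here is a deliberately tiny sketch. It is not the system presented in the talk: the sentences, the "Person A" to "Person E" names, and the simple substring matching are all made up for illustration, and a real extractor would learn patterns from the seeds rather than match strings.

```python
# Hypothetical toy data: (person, sentence, did the person actually win?)
SENTENCES = [
    ("Person A", "Person A won the Nobel Prize for peace in 2005.", True),
    ("Person B", "Person B received the Nobel Prize.", True),
    ("Person C", "Person C was nominated for the Nobel Prize.", False),
    ("Person D", "Person D, nominated twice before, finally won the Nobel Prize.", True),
    ("Person E", "Person E gave a lecture on physics.", False),
]


def extract(sentences, negative_terms=()):
    """Return persons whose sentence mentions the prize,
    skipping sentences that contain any negative seed term."""
    hits = []
    for person, text, _ in sentences:
        if "Nobel Prize" not in text:
            continue
        if any(term in text for term in negative_terms):
            continue
        hits.append(person)
    return hits


def precision_recall(hits, sentences):
    """Compare extracted persons against the true winners."""
    gold = {person for person, _, is_winner in sentences if is_winner}
    correct = sum(1 for person in hits if person in gold)
    precision = correct / len(hits) if hits else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(precision_recall(extract(SENTENCES), SENTENCES))
    print(precision_recall(extract(SENTENCES, ("nominated",)), SENTENCES))
```

In this toy setup the negative term removes the false hit for Person C, which raises precision, but it also discards the true winner Person D, whose sentence happens to mention an earlier nomination, which lowers recall.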
Xiwen Cheng presented the EU project RASCALLI and gave a demo of their gossip agent.
Jun Liu gave an overview of the Chinese Opinion Analysis Evaluation (COAE 2008), organized by the Chinese Information Processing Society of China. This could have been a very interesting starting point to learn more about this topic, but I was unable to find any Web page for it, just one paper. Additionally, the results are completely anonymous, so you don’t even know who performed at what level. Not really useful. Related competitions for English are TREC and NTCIR MOAT (Multilingual Opinion Analysis Task).
Hongyan Song discussed the problem that evaluating opinion mining requires an annotated opinion corpus, which is labor-intensive to produce. He showed how active learning can speed up the annotation. In his approach, the active learning algorithm queries the user for labels in the training data, which suits situations where unlabeled data is abundant but labeling it is expensive. The basic idea is to take those instances that the classifier is most unsure about and query the user about them. He uses HowNet, a Chinese common-sense knowledge base similar in purpose to WordNet, for this.
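The idea of querying the user about the instances the classifier is least sure of is usually called uncertainty sampling. Below is a minimal sketch of it, not the method from the talk: the word-count classifier, the function names, and the example texts are all assumptions chosen to keep the example self-contained, and HowNet is not used.

```python
import math


def train(labeled):
    """Count how often each word occurs in positive (1) and negative (0) texts."""
    counts = {}
    for text, label in labeled:
        for word in text.lower().split():
            pos, neg = counts.get(word, (0, 0))
            counts[word] = (pos + 1, neg) if label == 1 else (pos, neg + 1)
    return counts


def prob_positive(counts, text):
    """Sum smoothed log-odds per word and squash to (0, 1); 0.5 means undecided."""
    score = 0.0
    for word in text.lower().split():
        pos, neg = counts.get(word, (0, 0))
        score += math.log((pos + 1) / (neg + 1))
    return 1.0 / (1.0 + math.exp(-score))


def most_uncertain(counts, pool):
    """Pick the unlabeled text whose predicted probability is closest to 0.5."""
    return min(pool, key=lambda text: abs(prob_positive(counts, text) - 0.5))


def active_learning(labeled, pool, oracle, rounds):
    """Repeatedly retrain, ask the oracle (the annotator) about the most
    uncertain text, and move that text from the pool to the labeled set."""
    labeled, pool = list(labeled), list(pool)
    queried = []
    for _ in range(min(rounds, len(pool))):
        counts = train(labeled)
        query = most_uncertain(counts, pool)
        pool.remove(query)
        labeled.append((query, oracle(query)))
        queried.append(query)
    return labeled, queried


if __name__ == "__main__":
    seed = [("great phone", 1), ("terrible phone", 0)]
    unlabeled = ["great great screen", "terrible terrible battery", "average camera"]
    # The oracle stands in for a human annotator; here it always answers 1.
    _, asked = active_learning(seed, unlabeled, oracle=lambda text: 1, rounds=1)
    print(asked)
```

With these toy inputs the first query goes to the text made up entirely of unseen words, since the classifier has no evidence either way for it. The annotator's time is thus spent where a label is most informative, rather than on texts the model already classifies confidently.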
Xiaojun Zhang presented an iterative reinforcement approach for attribution-sentiment pair extraction. His approach starts with attribution/sentiment seeds and then retrieves further candidate attribution/sentiment pairs from the training data. He too uses HowNet to compute the similarity to the seeds.
I’m not well-versed in machine translation, so this workshop helped me get an idea of what has become possible today and gave me an introduction to how it is done. At Totuba we are looking at many ways to automatically extract information about courses from the Web. However, we face many challenges, as there is a lack of standardisation and of open source libraries that we can draw data from. Prof. Uszkoreit suggested investigating more deeply how people encode addresses, pricing information, etc., not only in the language course domain but in other domains too, to enlarge the number of examples we can use for training. We welcomed that feedback and have already included it in our ongoing investigations. Prof. Uszkoreit cited the work of Frank Puppe on TextMarker, a system for learning meta knowledge for rule-based knowledge extraction.
To sum up, a very interesting workshop and one that shows how research can be facilitated and enabled by existing tools, libraries and data.



