Freeing Scientists with Big Data, Making Academic Work Simpler - Alibaba Cloud Developer Community

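Graduation season brings not only parties, group photos, beer, and tears, but also the thesis that drives everyone crazy. For this editor, the literature review was the most terrifying part of writing a thesis. Many readers have probably also had to find, read one by one, and synthesize hundreds or even thousands of papers in a field. The guest entrepreneur in this issue of the Origin (原点) column went through the same thing: while completing his doctoral dissertation, he came up with the idea of using computers and algorithms to do this tedious work for scholars. It took five years to go from the idea to a first working version, and scholars in biomedicine can now largely use the tool to trace a field's research history and its key papers.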

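Sam Molyneux named the site Sciencescape, hoping to create a user experience that breaks away from traditional academic search sites. He describes the site as "twitter-like", and wants both its look and its feel to give readers an entirely new academic experience.
Today, Big Data Digest's Origin column is pleased to share an interview with Sciencescape founder Sam Molyneux, taking you through this medical researcher's entrepreneurial journey and how he is earning his first pot of gold while continuing his academic work.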


English reading practice starts now. Are you ready?

Q & A

1. Can you briefly introduce Sciencescape? What are the factors or motivations that drove you to set up Sciencescape?

Briefly, we set out to build Sciencescape, which is a place on the web that connects you to breaking research as it happens around the world, and also gives you some powerful tools to explore the entire history of biomedicine.
The motivations for building Sciencescape were a number of experiences that I had as a cancer genomics researcher, through a PhD that I was working on a number of years ago. Essentially, they came from the challenges around how difficult it is to browse and explore the history of any field of research you are working with: identifying landmark papers and tracing the connections among the papers through history. Second to that was how challenging, or nearly impossible, it is to actually stream and follow the research in a personalized manner. Based on that, we looked at the tools that were on the market at the time, which were essentially just search engines with email alerts and RSS feeds that you could link to from journal tables of contents.
We envisioned a platform that you could walk into and follow anything in your world of research, whether it is people, genes, diseases, proteins, products, places, affiliations, etc.: any entity, among millions of potential entities in the world of research, that relates to you. In a "twitter-like" experience, research would stream to you through the web and through mobile, and there would be some powerful interfaces for exploring history. That is what we envisioned, and 5 years later, we have finally made some progress on it.

- Technical & Product-Related Questions

2. Based on your short intro video and online information, we've learned that Sciencescape leverages Natural Language Processing and Content Identification techniques. How do you apply those techniques on your platform to provide real-time customized recommendations for users?

That is a really good description of what we work on at Sciencescape. Beyond the platform and the product user experience, the core of our work is a knowledge graph, which we have been building for almost half a decade now. Essentially, a knowledge graph is a network of linked entities around which you organize content and which you relate to each other, and we apply that concept to the world of biomedical literature. We set out on this project to be able to identify, in an automatic fashion using machine learning, all of the entities that are related to any paper. Those entities come essentially from informatics classes, some of which I just mentioned. To do that, you have to apply a family of natural language processing approaches, with different algorithms for genes versus diseases versus organisms versus products, etc., and a number of other machine learning approaches, such as classification algorithms for disambiguating authors or affiliations. We apply many different machine learning approaches to the universal challenge of tagging papers correctly with entities and concepts.
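To make the idea of a paper-to-entity knowledge graph concrete, here is a minimal Python sketch. It is not Sciencescape's method: the interview describes machine-learning and NLP taggers, whereas this toy stands in a hand-written lexicon and exact word matching. The names `LEXICON`, `tag_entities`, `build_graph`, and the two sample abstracts are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> (entity class, canonical name).
# A real system would use trained taggers per entity class instead.
LEXICON = {
    "tp53": ("gene", "TP53"),
    "p53": ("gene", "TP53"),
    "brca1": ("gene", "BRCA1"),
    "breast cancer": ("disease", "Breast cancer"),
    "glioblastoma": ("disease", "Glioblastoma"),
}


def tag_entities(text):
    """Return the set of (class, name) entities whose surface form occurs in text."""
    lowered = text.lower()
    found = set()
    for surface, entity in LEXICON.items():
        # Word boundaries avoid matching "p53" inside "tp53".
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(entity)
    return found


def build_graph(papers):
    """Link papers to entities in both directions (a tiny bipartite graph)."""
    paper_to_entities = {}
    entity_to_papers = defaultdict(set)
    for paper_id, abstract in papers.items():
        entities = tag_entities(abstract)
        paper_to_entities[paper_id] = entities
        for entity in entities:
            entity_to_papers[entity].add(paper_id)
    return paper_to_entities, entity_to_papers


# Made-up abstracts for illustration only.
PAPERS = {
    "paper-1": "Mutations in TP53 are frequent in glioblastoma.",
    "paper-2": "BRCA1 and p53 status in breast cancer cohorts.",
}

paper_to_entities, entity_to_papers = build_graph(PAPERS)
print(sorted(entity_to_papers[("gene", "TP53")]))
```

Because both "TP53" and "p53" resolve to the same canonical entity, following that entity surfaces both papers, which is the kind of "follow anything" link the answer describes.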

3. What kinds of metrics are used to value a paper? The number of citations? What else? How do you evaluate the importance of a paper?

A number of years ago, we looked at the metrics field and concluded that there is a lot of work to do with citation networks alone, a lot of work that is not fully productized. We sought to identify a universal metric that would allow us to organize papers relative to each other, and organize entities relative to each other. We landed on a variant of the PageRank algorithm, which is well known as the algorithm that Google started on. The variant we identified is called Eigenfactor, which is very similar to PageRank; we just applied it to the citation network. If you calculate it on a very large citation network (today I believe we have the 4th largest in the world), you end up with robust metrics at the article level to use in information retrieval, search rankings, and other listings of papers. But those metrics also propagate to the level of the entity, so you can calculate an institute-level Eigenfactor, an author-level Eigenfactor, or a gene-level Eigenfactor. So we ended up with a flexible, powerful, and robust metric, which propagates through our system and drives many of the ranked lists in these experiences. The short answer is that we use Eigenfactor.

That is only where we started. We wanted to identify a base metric to build our initial system on, and where we are going is of course a diverse set of metrics, possibly hundreds of them, pulled from across the web as well as from internal network metrics, to improve the experience in a personalized way for users and to allow papers and entities to be ranked explicitly against them. We have actually made a lot of progress on that as well.
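The base metric in this answer can be sketched in Python as follows. This is plain PageRank by power iteration on a toy citation network, used here as a stand-in for the Eigenfactor variant, whose details differ. The damping factor of 0.85, the sample papers and authors, and the choice to form an author-level score by summing article scores are all assumptions for illustration, not details from the interview.

```python
from collections import defaultdict


def pagerank(citations, damping=0.85, iterations=100):
    """Plain PageRank; `citations` maps each paper to the papers it cites."""
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)
    score = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every paper receives the same small "teleport" share.
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            cited = citations.get(p, [])
            if cited:
                # A paper splits its score evenly among the papers it cites.
                share = damping * score[p] / len(cited)
                for q in cited:
                    new[q] += share
            else:
                # A paper citing nothing spreads its score over all papers.
                share = damping * score[p] / n
                for q in nodes:
                    new[q] += share
        score = new
    return score


# Hypothetical toy data: paper -> papers it cites, and paper -> authors.
CITATIONS = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A"]}
AUTHORS = {"A": ["Lee"], "B": ["Lee", "Kim"], "C": ["Kim"], "D": ["Park"]}

scores = pagerank(CITATIONS)

# Assumed propagation rule: an entity's score is the sum of its papers' scores.
author_scores = defaultdict(float)
for paper, names in AUTHORS.items():
    for name in names:
        author_scores[name] += scores[paper]

for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, round(scores[paper], 3))
```

The same aggregation step could be keyed by institute or gene instead of author, which is one simple way to read the "propagates to the level of the entity" remark.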

4. Some of the algorithms you just mentioned, such as PageRank, are widely applied by other companies in the industry as well. Compared to the big article search engines that are your competitors, such as Google Scholar, what are your competitive advantages?

First, we do not spend a lot of time thinking about our competitors; we spend a lot of time thinking about how to build a different, transformed experience for streaming and exploring papers through history. I don't think there are any products, platforms, or services out there that fill that need, which is why we are passionate about it and why we work so intensively on it. We think of it as a unifying platform, not necessarily a disruptive one, because that function is not well filled today. As for search, it is an effective way of accessing literature. We think there is great search out there; Google Scholar, for example, is the most extensive search on citations. But search is a solved problem for papers you know exist, if you can describe the paper well. If you have a fragment of the title, part of the abstract or full text, or a couple of the authors, it is trivial to search for that paper. You can use PubMed and Google Scholar to do this. There is great enterprise search as well. We don't work on or think about search too much, although we intend to have good search on our platform. We think a lot more about streaming and discovery, and we think the best opportunity really is to improve the scientific readership experience globally.

5. From some of your other interviews, we learned that your company also has partnerships with individuals and organizations, such as publishers. Compared to the product you mentioned before, this is a totally different business model. Can you talk a little more about that?

We have been building our partnerships with publishers for quite a while. Essentially, we work with publishers to help increase the discoverability of articles on a full-text basis. We think publishers are really important in terms of the on-time delivery of the weekly or monthly articles that researchers are looking for. In the current marketplace, that is essentially not happening. Researchers are underserved and publishers are underserved. So over the last couple of years we built a program for publishers. We accept their content and mine it in a very deep way to increase discoverability, applying the family of machine learning algorithms to each one of the articles. We are open to the publishing industry and value the relationships in which we work together.

- Operation and Strategy Questions

6. Has the company generated revenue so far? How do you monetize your business model?

We are at the Series A stage. We have had some fantastic investors; we have a large group of Canadian investors, with whom we built the company during the Series A stage. For the next stage, we have venture capital firms in Canada, Hong Kong, and the US. We have 30 people in the company and we are located in a beautiful start-up base in downtown Toronto. Our company is dominated by data science and engineering, and we have just started to build up our product team in terms of product management, marketing, and design. Our company has gone through a tremendous transformation internally. A lot of that work will start to expand externally over the next 6 months.

7. What is the biggest challenge your company has come across so far? How did you deal with it?

Our challenges are pretty similar to those of any other start-up company. Fundraising is always a challenge. We are in a very important but specialized market. We are not providing a consumer-facing application in the sense of something offered to the broad consumer market. We have a consumer-facing application with respect to researchers and readers of research literature. So there are not 1 billion readers of scientific literature; it is just a small number. But the people who are involved in reading the literature are tremendously important in terms of what they contribute to society and the budgets they command in research dollars, and in terms of driving medical and other societal progress. Because of the specialized nature of the market, it is sometimes difficult to persuade people to invest in the company. I think that by this time we have achieved a lot and have overcome many of those communication limitations.

8. Big Data Digest has over 150 thousand followers who are interested in or currently working in various big data fields. Is China going to be one of your future markets? How can we help you make an impact using our connections and resources?

We see a tremendous opportunity to help increase the discoverability of research for Chinese researchers, a market that is growing really fast. One of the companies we have a partnership with focuses on some of these emerging research markets, and some of their data show that the growth of these markets is just explosive. So China is definitely a big market for us, and India as well. If there are any opportunities to help us grow our user base through your networks, we will be very grateful to work with you.

We have an Institutional Ambassador Program: researchers who are interested can join our company and work from abroad to promote their research through our product. We can provide contact information for readers who are interested in this program.

- Other questions related to market future, startups management and entrepreneurship

9. As a researcher, how do you envision the future of cancer therapy?

I think the program of personalized medicine and cancer genomics is a long-term program. It is going to take many years to elucidate the long tail of changes in each cancer genome. Each tumor is very different. The cells within a tumor show a lot of differences. Heterogeneous change is the most constant thing in the cancer genome. But looking at this another way, despite the large number of mutations in different cancer genes, all the changes resolve into a relatively small number of pathways. If we have enough pathway-targeting drugs and can use them in a combinatorial fashion, we can systematically and effectively suppress cancer across multiple pathways. Of course we also need to consider and take care of cancer evolution, but I think this is a productive approach. The recent cancer immunotherapy results are also staggering. Considering the nature of cancer and of cancer genome evolution, it is definitely better to have multiple lines of therapy. If we can combine them, we can probably suppress cancer in the long run.

10. Do you have any plans to expand your business to clinical informatics? Or other related medical informatics?

I think that is interesting. We have a lot of people with backgrounds in bioinformatics and in clinical and medical informatics as well. We are passionate about those approaches. But as a start-up company, we must focus on one main problem and one main market, and get really good at something within that space. We think the data we are yielding in the process of mining scientific articles to create our knowledge graph can probably be used in precision medicine, and probably in bioinformatics as another data source to analyze. We will probably focus on providing a transformed data set that can be used by any academic. For next year, we are working on the ability for academics to use our data and build on it.

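Although Sam considers Sciencescape to be at an early stage, with much potential still to be tapped, it is not hard to hear in his words the grand blueprint he has drawn for it. He hopes that users searching the literature can wander through the world of biomedical knowledge with Sciencescape and reach papers from the field's earliest days to the present more efficiently.
However, neither the user experience of the Sciencescape interface nor the current number of papers is fully developed yet. Sciencescape still has a long way to go before it becomes an academic literature search tool that rivals Google Scholar while having its own distinctive advantages.
Facing the enormous potential of the Chinese market, Sam clearly expressed his desire to engage and cooperate. As for whether Sciencescape can accurately handle translation into non-English languages the way Wikipedia does, Sam only expressed a wish and has not put this work on the near-term agenda.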

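For a start-up like Sciencescape, being able to stand out as something new and different is perhaps a very good beginning, but how far and how wide it can go will depend on its grasp of the market and its exploration of its core business.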

Original publication date: 2015-07-04

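This article comes from Big Data Digest (大数据文摘), a partner of the Yunqi Community. For related information, follow the "BigDataDigest" WeChat official account.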
