Google's 43 Rules of Machine Learning for Industrial Practice


1. Don’t be afraid to launch a product without machine learning.

Machine learning is not a silver bullet, and a product does not necessarily need it. The point is to solve the problem; don't do machine learning just for the sake of machine learning.

2. First, design and implement metrics.

First, define and implement evaluation metrics so you can measure results, compare alternatives, and iterate. Machine learning that never talks about metrics is just hand-waving (a minimal logging sketch follows below).
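
A minimal Python sketch of what "implement metrics first" can mean in practice: append every offline evaluation to a log so different models and launches can be compared later. The function name, file path, and the click-through-rate metric here are illustrative assumptions, not something the original rules prescribe.

import json, time

def log_eval(model_name, metric_name, value, log_path="eval_metrics.jsonl"):
    # Append one evaluation record so runs can be compared over time.
    record = {"ts": time.time(), "model": model_name,
              "metric": metric_name, "value": value}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# e.g. record the click-through rate of a heuristic baseline and a first model
log_eval("heuristic_v0", "ctr", 0.031)
log_eval("model_v1", "ctr", 0.034)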

3. Choose machine learning over a complex heuristic.

When the problem already calls for a complex heuristic, prefer machine learning: a learned model is easier to maintain and improve than an ever-growing pile of hand-tuned rules.

4. Keep the first model simple and get the infrastructure right.

Get the machine learning infrastructure right first so it is easy to extend, and pick a model that is simple, easy to implement, and practical to run.

5. Test the infrastructure independently from the machine learning.

Test the infrastructure on its own: exercise the data and serving paths without the learned model, as in the stub sketch below.
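
A sketch of testing the plumbing without the learning, assuming a toy serving function: the model is replaced by a stub that returns a constant score, so any failure is in the infrastructure rather than the model.

def serve(request, model):
    # Toy serving path: build features, score them, return a response.
    features = {"query_length": len(request["query"])}
    return {"query": request["query"], "score": model(features)}

def stub_model(features):
    # Stand-in for the learned model: always the same score.
    return 0.5

# Infrastructure test: exercises the serving path with no learned model at all.
response = serve({"query": "cheap flights"}, stub_model)
assert set(response) == {"query", "score"}
assert 0.0 <= response["score"] <= 1.0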

6. Be careful about dropped data when copying pipelines.

Copy-and-paste is a trap for the lazy: when you copy a pipeline, watch out for data being silently dropped.

7. Turn heuristics into features, or handle them externally.

Design features from your existing heuristics, or keep handling them outside the model.

8. Know the freshness requirements of your system.

Remember to refresh the model: its quality degrades over time if it is never retrained.

9. Detect problems before exporting models.

10. Watch for silent failures.

11. Give feature columns owners and documentation.

Documentation! Documentation! Documentation!

12. Don’t overthink which objective you choose to directly optimize.

Focus on the core business objective.

13. Choose a simple, observable and attributable metric for your first objective.

Make the objective quantifiable.

14. Starting with an interpretable model makes debugging easier.

15. Separate Spam Filtering and Quality Ranking in a Policy Layer.

16. Plan to launch and iterate.

Iterate on feature selection; nothing works on the first try, so keep experimenting.

17. Start with directly observed and reported features as opposed to learned features.

18. Explore with features of content that generalize across contexts.

19. Use very specific features when you can.

20. Combine and modify existing features to create new features in human-understandable ways.

21. The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.

22. Clean up features you are no longer using.

23. You are not a typical end user.

24. Measure the delta between models.

25. When choosing models, utilitarian performance trumps predictive power.

26. Look for patterns in the measured errors, and create new features.

27. Try to quantify observed undesirable behavior.

28. Be aware that identical short-term behavior does not imply identical long-term behavior.

29. The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
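
A hedged sketch of the logging idea in rule 29: at serving time, write the exact feature values that were scored to a log; at training time, read those logged features back instead of recomputing them, so training sees what serving saw. The file format and names are assumptions for illustration.

import json

def serve_and_log(features, model, feature_log="serving_features.jsonl"):
    # Score one request and persist the exact features that were used.
    score = model(features)
    with open(feature_log, "a") as f:
        f.write(json.dumps({"features": features, "score": score}) + "\n")
    return score

def load_training_examples(feature_log="serving_features.jsonl"):
    # Training reads the logged serving-time features, not a re-derived copy.
    with open(feature_log) as f:
        return [json.loads(line) for line in f]

serve_and_log({"query_length": 13, "user_country": "US"}, lambda feats: 0.7)
examples = load_training_examples()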

30. Importance weight sampled data, don't arbitrarily drop it!
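
The arithmetic behind rule 30, as a sketch: if an example is kept in the sample with probability p, give it weight 1/p so the downsampled data still represents the original distribution. Downsampling negatives to 30% is an assumed scenario, not from the original text.

import random

def maybe_keep(example, keep_prob):
    # Keep with probability keep_prob; kept examples carry weight 1/keep_prob.
    if random.random() < keep_prob:
        return dict(example, weight=1.0 / keep_prob)
    return None  # dropped

raw = [{"label": 0}, {"label": 0}, {"label": 1}]
sampled = []
for ex in raw:
    keep_prob = 0.3 if ex["label"] == 0 else 1.0  # downsample negatives only
    kept = maybe_keep(ex, keep_prob)
    if kept is not None:
        sampled.append(kept)
# Each kept negative carries weight 1/0.3 = 3.33..., each positive weight 1.0.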

31. Beware that if you join data from a table at training and serving time, the data in the table may change.

32. Re-use code between your training pipeline and your serving pipeline whenever possible.

33. If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
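
A small sketch of the time-based split in rule 33, with made-up dates and fields: train only on data up to the cutoff day and evaluate on everything after it, because that mirrors how the model will actually be used.

from datetime import date

examples = [
    {"day": date(2024, 1, 3), "x": 0.2, "label": 0},
    {"day": date(2024, 1, 5), "x": 0.7, "label": 1},
    {"day": date(2024, 1, 6), "x": 0.9, "label": 1},
    {"day": date(2024, 1, 8), "x": 0.1, "label": 0},
]

cutoff = date(2024, 1, 5)
train = [ex for ex in examples if ex["day"] <= cutoff]  # data up to Jan 5
test = [ex for ex in examples if ex["day"] > cutoff]    # Jan 6 and after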

34. In binary classification for filtering (such as spam detection or determining interesting e-mails), make small short-term sacrifices in performance for very clean data.

35. Beware of the inherent skew in ranking problems.

36. Avoid feedback loops with positional features.
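
One common way to act on rule 36 (an assumption here, not a quote from the rules): train with the position at which an item was actually shown as a feature, and set that feature to a single default value at serving time, so the model cannot simply chase its own past ranking.

def training_features(item, shown_position):
    # Training: include the position at which the item was actually shown.
    return {"item_quality": item["quality"], "position": shown_position}

def serving_features(item):
    # Serving: the position is not known yet, so every item gets the same default.
    return {"item_quality": item["quality"], "position": 1}

train_row = training_features({"quality": 0.8}, shown_position=3)
serve_row = serving_features({"quality": 0.8})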

37. Measure Training/Serving Skew.
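
A crude illustration of one way to start measuring training/serving skew (only one of several possible checks, and entirely an assumption here): compare summary statistics of model scores computed offline with those observed in serving logs; a large gap is a warning sign.

import statistics

def mean_score_gap(offline_scores, serving_scores):
    # Compare mean model scores offline vs. in the serving logs.
    return abs(statistics.mean(offline_scores) - statistics.mean(serving_scores))

gap = mean_score_gap([0.31, 0.35, 0.30], [0.42, 0.44, 0.40])
print("mean-score gap:", round(gap, 3))  # ~0.10 here; a large gap signals skew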

38. Don’t waste time on new features if unaligned objectives have become the issue.

39. Launch decisions are a proxy for long-term product goals.

40. Keep ensembles simple.

41. When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.

42. Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.

43. Your friends tend to be the same across different products. Your interests tend not to be.









