Applications of Classic Algorithms in Several Open-Source Projects
Published: 2019-06-19

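PS: Many students and software engineers wonder what practical value the algorithms they once studied actually have. This StackExchange answer lists uses of classic algorithms in several open-source projects. The author catalogs code in Chromium and the Linux kernel, ranging from the most basic hash tables to string matching and cryptographic algorithms. Reading open-source code is a good way to learn how algorithms are implemented.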

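[Editor's note] The original content of this article comes from , under the  license;

A question recently posted on StackExchange asked for concrete examples of algorithms in use in current software and hardware, as evidence of why algorithms matter. The asker also set a few requirements for the answers:

  1. The software or hardware using the algorithm should be in wide use;
  2. The example should be specific, with a reference to the exact system and algorithm;
  3. The algorithm or data structure should be taught in a typical undergraduate or Ph.D. course;

The following reply was accepted as the best answer: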

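Basic Data Structures and Algorithms in the Linux Kernel

  1. , , and 
  2. , with code comments that tell you things you can't find in the textbooks:

    A relatively simple B+Tree implementation. I have written it as a learning exercise to understand how B+Trees work. Turned out to be useful as well.

    ...

    A trick was used that is not commonly found in textbooks. The lowest values are to the right, not to the left. All used slots within a node are on the left, all unused slots contain NUL values. Most operations simply loop once over all slots and terminate on the first NUL.

  3.  used for , , etc.;

  4.  used for scheduling, virtual memory management, tracking file descriptors and directory entries, etc.;
  5. , used for memory management, NFS-related lookups, and networking-related functionality;

    A common use of the radix tree is to store pointers to struct pages;

  6. , literally a textbook implementation, used in ;

    Simple insertion-only static-sized priority heap containing pointers, based on CLR (Introduction to Algorithms), chapter 7

  7. Hash functions, with a reference to Knuth and to a paper:

    Knuth recommends primes in approximately golden ratio to the maximum integer representable by a machine word for multiplicative hashing. Chuck Lever verified the effectiveness of this technique;

    These primes are chosen to be bit-sparse, that is, operations on them can use shifts and additions instead of multiplications on machines where multiplication is slow;

  8. Some parts of the code, such as , implement their own hash function

  9. , used to implement , , etc.;
  10. , used for dealing with flags, interrupts, etc.; their properties are described in Knuth Vol. 4;
  11.  and 
  12. Binary search is used for , , etc.;
  13.  and its variants are used in ;

    Performs a modified depth-first walk of the namespace tree, starting (and ending) at the node specified by start_handle. The callback function is called whenever a node that matches the type parameter is found. If the callback function returns a non-zero value, the search is terminated immediately and this value is returned to the caller;

  14.  is used to check the correctness of locking at runtime;
  15.  on linked lists is used for , , etc.;
  16. Surprisingly, bubble sort is implemented too, in a ;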
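  17. Implements a linear-time string-matching algorithm due to Knuth, Morris, and Pratt [1]. Their algorithm avoids the explicit computation of the transition function DELTA altogether. Its matching time is O(n), where n is the length of the text, using just an auxiliary function PI[1..m], where m is the length of the pattern, precomputed from the pattern in time O(m). The array PI allows the transition function DELTA to be computed efficiently "on the fly" as needed. Roughly speaking, for any state "q" = 0,1,...,m and any character "a" in SIGMA, the value PI["q"] contains the information that is independent of "a" and is needed to compute DELTA("q", "a"). Since the array PI has only m entries, whereas DELTA has O(m|SIGMA|) entries, we save a factor of |SIGMA| in the preprocessing time by computing PI rather than DELTA.

    [1] Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms, 2nd Edition, MIT Press

    [2] See finite automata theory

  18. Boyer-Moore pattern matching, with references and recommendations for when to prefer the alternative;

    Implements the Boyer-Moore string matching algorithm:

    [1] A Fast String Searching Algorithm, R.S. Boyer and Moore. Communications of the Association for Computing Machinery, 20(10), 1977, pp. 762-772.

    [2] Handbook of Exact String Matching Algorithms, Thierry Lecroq, 2004

    Note: Since Boyer-Moore (BM) searches for matches from right to left, it is still possible that a match is spread over multiple blocks; in that case this algorithm won't find any match.

    If you want to ensure that this never happens, use the Knuth-Pratt-Morris (KMP) implementation instead. In short, choose the proper string search algorithm depending on your setting.

    Say you're using the textsearch infrastructure for filtering, NIDS, or any similar security-focused purpose, then go KMP. Otherwise, if you really care about performance, say you're classifying packets to apply Quality of Service (QoS) policies, and you don't mind possible matches spread over multiple fragments, then go BM.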
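
The multiplicative hashing described in item 7 above can be sketched in a few lines. This is only an illustration, not the kernel's code: the constant below is floor(2^32 / golden ratio), chosen for the example, whereas the kernel comment describes bit-sparse primes near that ratio. The names `GOLDEN_RATIO_32` and `hash_32` are hypothetical.

```python
# Illustrative constant: floor(2**32 / golden ratio). This is an assumption
# for the sketch, not the bit-sparse prime the kernel comment refers to.
GOLDEN_RATIO_32 = 0x9E3779B9


def hash_32(value: int, bits: int) -> int:
    """Multiplicative hash of `value` into a table of 2**bits buckets."""
    if not 0 < bits <= 32:
        raise ValueError("bits must be in 1..32")
    # Multiply and keep only the low 32 bits, as a 32-bit machine word would.
    product = (value * GOLDEN_RATIO_32) & 0xFFFFFFFF
    # The high-order bits of the product are the best mixed, so use those.
    return product >> (32 - bits)


if __name__ == "__main__":
    for key in range(8):
        print(key, hash_32(key, 4))
```

Taking the top bits rather than the bottom bits is the point of the technique: consecutive keys land in well-separated buckets.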

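Data Structures and Algorithms in the Chromium Web Browser

  1. The tree is parameterized by an allocation policy, which is used for allocating lists in the C free store or the zone; see zone.h

  2.  are used in a demo

The code also includes some third-party algorithms and data structures, for example:

  1.  used for compression
  2.  implemented by Apple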

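Programming Language Libraries

  1. , which includes lists, heaps, stacks, vectors, and 
  2.  is very extensive and covers much more
  3. , which includes algorithms such as Boyer-Moore and Knuth-Morris-Pratt string matching;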

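Allocation and Scheduling Algorithms

  1. Least Recently Used can be implemented in multiple ways; in the Linux kernel it is based on ;
  2. Other possibilities are First In First Out, Least Frequently Used, and Round Robin;
  3. Variants of FIFO were used heavily by the VAX/VMS system;
  4.  by  is used for page frame replacement in Linux;
  5. The Intel i860 processor used a random replacement policy;
  6.  is used in some IBM storage controllers and, because of , saw only limited use in PostgreSQL;
  7. The , discussed by Knuth in TAOCP Vol. 1, is used in the Linux kernel; FreeBSD and  both use the jemalloc concurrent allocator;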

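Core Utilities in *nix Systems

  1. grep and awk both implement the Thompson-McNaughton-Yamada construction of NFAs from regular expressions
  2. tsort implements topological sort
  3. fgrep implements ;
  4. GNU grep, according to its author Mike Haertel, ;
  5. crypt(1) on Unix implemented a variant of the encryption algorithm in the Enigma machine;
  6.  implemented by Doug McIlroy, based on a prototype co-written with James Hunt, performs better than the standard dynamic programming algorithm used to compute Levenshtein distances; the Linux version computes the shortest edit distance;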
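
For reference, the "standard dynamic programming algorithm" for Levenshtein distance mentioned in the last item looks roughly like the sketch below. It is the textbook baseline that the item compares against, not the algorithm used by the utility in question; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using the classic row-by-row DP."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

This runs in O(len(a) * len(b)) time regardless of how similar the inputs are, which is why faster approaches are attractive for comparing large, mostly identical files.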

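Cryptographic Algorithms

  1. , specifically the Tiger Tree Hash variant, used in peer-to-peer applications such as  and ;
  2.  is used to provide checksums for software packages and for integrity checks on *nix systems (); it is also supported on Windows and OS X;
  3.  implements many cryptographic algorithms, including AES, Blowfish, DES, SHA-1, SHA-2, and RSA;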

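Compilers

  1.  is implemented by yacc and bison
  2. Dominator algorithms are used in optimizing compilers based on SSA form;
  3. lex and flex compile regular expressions into NFAs;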

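Compression and Image Processing

  1. Lempel-Ziv algorithms for the GIF image format are implemented in image manipulation programs, ranging from the *nix utility convert to complex programs;

  2. Run-length encoding is used to generate PCX files (used by the Paintbrush program), compressed BMP files, and TIFF files;

  3. Wavelet compression is the basis of JPEG 2000, so every digital camera that produces JPEG 2000 files implements this algorithm;

  4. Reed-Solomon error correction is used in , CD drives, and barcode readers, and was combined with convolution for image transmission from Voyager;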

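Conflict Driven Clause Learning

Since 2000, the running time of SAT (Boolean satisfiability) solvers on industrial benchmarks has decreased nearly exponentially every year. A very important part of this development is the Conflict Driven Clause Learning algorithm, which combines the Boolean Constraint Propagation algorithm from the original paper of Davis, Logemann and Loveland with techniques from constraint programming and artificial intelligence research. For specific industrial modelling, SAT is considered an easy problem (). To me, this is one of the greatest success stories of recent times because it combines algorithmic advances, clever engineering ideas, experimental feedback, and a concerted communal effort to solve the problem. The algorithm is taught in many universities, but typically in a logic or formal methods class.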


Algorithms that are the main driver behind a system are, in my opinion, easier to find in non-algorithms courses for the same reason theorems with immediate applications are easier to find in applied mathematics rather than pure mathematics courses. It is rare for a practical problem to have the exact structure of the abstract problem in a lecture. To be argumentative, I see no reason why fashionable algorithms course material such as Strassen's multiplication, the AKS primality test, or the Moser-Tardos algorithm is relevant for low-level practical problems of implementing a video database, an optimizing compiler, an operating system, a network congestion control system or any other system. The value of these courses is learning that there are intricate ways to exploit the structure of a problem to find efficient solutions. Advanced algorithms is also where one meets simple algorithms whose analysis is non-trivial. For this reason, I would not dismiss simple randomized algorithms or PageRank.

I think you can choose any large piece of software and find basic and advanced algorithms implemented in it. As a case study, I've done this for the Linux kernel, and shown a few examples from Chromium.

Basic Data Structures and Algorithms in the Linux kernel

Links are to the .

  1. , , .
  2.  with comments telling you what you can't find in the textbooks.

    A relatively simple B+Tree implementation. I have written it as a learning exercise to understand how B+Trees work. Turned out to be useful as well.

    ...

    A trick was used that is not commonly found in textbooks. The lowest values are to the right, not to the left. All used slots within a node are on the left, all unused slots contain NUL values. Most operations simply loop once over all slots and terminate on the first NUL.

  3.  used for , , etc.

  4.  are used for scheduling, virtual memory management, to track file descriptors and directory entries, etc.
  5. , are used for , NFS related lookups and networking related functionality.

    A common use of the radix tree is to store pointers to struct pages;

  6. , which is literally a textbook implementation, used in the .

    Simple insertion-only static-sized priority heap containing pointers, based on CLR, chapter 7

  7. , with a reference to Knuth and to a paper.

    Knuth recommends primes in approximately golden ratio to the maximum integer representable by a machine word for multiplicative hashing. Chuck Lever verified the effectiveness of this technique:

    These primes are chosen to be bit-sparse, that is operations on them can use shifts and additions instead of multiplications for machines where multiplications are slow.

  8. Some parts of the code, such as , implement their own hash function.

    hash function using a Rotating Hash algorithm

    Knuth, D. The Art of Computer Programming, Volume 3: Sorting and Searching, Chapter 6.4. Addison Wesley, 1973

  9.  used to implement ,  etc.
  10. , which are used for dealing with flags, interrupts, etc. and are featured in Knuth Vol. 4.

  11.  and 

  12.  is used for , , etc.

  13.  and variant used in .

    Performs a modified depth-first walk of the namespace tree, starting (and ending) at the node specified by start_handle. The callback function is called whenever a node that matches the type parameter is found. If the callback function returns a non-zero value, the search is terminated immediately and this value is returned to the caller.

  14.  is used to check correctness of locking at runtime.

  15.  on linked lists is used for , , etc.

  16.  is amazingly implemented too, in a driver library.

  17. ,

    Implements a linear-time string-matching algorithm due to Knuth, Morris, and Pratt [1]. Their algorithm avoids the explicit computation of the transition function DELTA altogether. Its matching time is O(n), for n being length(text), using just an auxiliary function PI[1..m], for m being length(pattern), precomputed from the pattern in time O(m). The array PI allows the transition function DELTA to be computed efficiently "on the fly" as needed. Roughly speaking, for any state "q" = 0,1,...,m and any character "a" in SIGMA, the value PI["q"] contains the information that is independent of "a" and is needed to compute DELTA("q", "a") . Since the array PI has only m entries, whereas DELTA has O(m|SIGMA|) entries, we save a factor of |SIGMA| in the preprocessing time by computing PI rather than DELTA.

    [1] Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms, 2nd Edition, MIT Press

    [2] See finite automata theory

  18.  with references and recommendations for when to prefer the alternative.

    Implements Boyer-Moore string matching algorithm:

    [1] A Fast String Searching Algorithm, R.S. Boyer and Moore. Communications of the Association for Computing Machinery, 20(10), 1977, pp. 762-772.

    [2] Handbook of Exact String Matching Algorithms, Thierry Lecroq, 2004 

    Note: Since Boyer-Moore (BM) performs searches for matchings from right to left, it's still possible that a matching could be spread over multiple blocks, in that case this algorithm won't find any coincidence.

    If you're willing to ensure that such thing won't ever happen, use the Knuth-Pratt-Morris (KMP) implementation instead. In conclusion, choose the proper string search algorithm depending on your setting.

    Say you're using the textsearch infrastructure for filtering, NIDS or any similar security focused purpose, then go KMP. Otherwise, if you really care about performance, say you're classifying packets to apply Quality of Service (QoS) policies, and you don't mind about possible matchings spread over multiple fragments, then go BM.
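
The PI array described in item 17 can be made concrete with a short sketch. This is the textbook formulation using 0-based indices, not the kernel's implementation; the names `prefix_function` and `kmp_search` are chosen for this example.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of it. Computed in O(m)."""
    pi = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # fall back to the next shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text, in O(n + m)."""
    if not pattern:
        return []
    pi = prefix_function(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and pattern[k] != ch:
            k = pi[k - 1]
        if pattern[k] == ch:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = pi[k - 1]  # keep going to find overlapping matches
    return matches


if __name__ == "__main__":
    print(prefix_function("ababaca"))
    print(kmp_search("abababca", "aba"))
```

Because the search consumes the text strictly left to right and keeps only the counter `k` between characters, it can resume across block boundaries, which is the property the quoted note relies on when recommending KMP for fragmented input.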

Data Structures and Algorithms in the Chromium Web Browser

Links are to the . I'm only going to list a few. I would suggest using the search feature to look up your favourite algorithm or data structure.

  1. .

    The tree is also parameterized by an allocation policy (Allocator). The policy is used for allocating lists in the C free store or the zone; see zone.h.

  2.  are used in a demo.
  3. .

There are also such data structures and algorithms in the third-party code included in the Chromium code.

  1. Conclusion of Julian Walker

    Red black trees are interesting beasts. They're believed to be simpler than AVL trees (their direct competitor), and at first glance this seems to be the case because insertion is a breeze. However, when one begins to play with the deletion algorithm, red black trees become very tricky. However, the counterweight to this added complexity is that both insertion and deletion can be implemented using a single pass, top-down algorithm. Such is not the case with AVL trees, where only the insertion algorithm can be written top-down. Deletion from an AVL tree requires a bottom-up algorithm.

    ...

    Red black trees are popular, as most data structures with a whimsical name. For example, in Java and C++, the library map structures are typically implemented with a red black tree. Red black trees are also comparable in speed to AVL trees. While the balance is not quite as good, the work it takes to maintain balance is usually better in a red black tree. There are a few misconceptions floating around, but for the most part the hype about red black trees is accurate.

  2.  is used for compression.
  3. .
  4.  implemented by Apple Inc.
  5. .

Programming Language Libraries

I think they are worth considering. The programming language designers thought it was worth the time and effort of some engineers to implement these data structures and algorithms so others would not have to. The existence of libraries is part of the reason we can find basic data structures reimplemented in software that is written in C but less so for Java applications.

  1. The  includes lists, stacks, queues, maps, vectors, and algorithms for .
  2.  is very extensive and covers much more.
  3. The  includes algorithms like Boyer-Moore and Knuth-Morris-Pratt string matching algorithms.

Allocation and Scheduling Algorithms

I find these interesting because even though they are called heuristics, the policy you use dictates the type of algorithm and data structure that are required, so one needs to know about stacks and queues.

  1. Least Recently Used can be implemented in multiple ways. A  in the Linux kernel.
  2. Other possibilities are First In First Out, Least Frequently Used, and Round Robin.
  3. A variant of FIFO was used by the VAX/VMS system.
  4.  by  is used for page frame replacement in Linux.
  5. The Intel i860 processor used a random replacement policy.
  6.  is used in some IBM storage controllers, and was used in PostgreSQL though .
  7. The , which is discussed by Knuth in TAOCP Vol. 1, is used in the Linux kernel, and the jemalloc concurrent allocator is used by FreeBSD and in .
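
As an illustration of how the replacement policy dictates the data structure, here is a minimal sketch of Least Recently Used eviction built on an ordered map. It is one of the "multiple ways" mentioned in item 1, not the Linux kernel's approach; the class name `LRUCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


if __name__ == "__main__":
    cache = LRUCache(2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")     # "a" is now the most recently used
    cache.put("c", 3)  # evicts "b"
    print(cache.get("b"), cache.get("a"), cache.get("c"))
```

Swapping the policy to FIFO would only require dropping the two `move_to_end` calls, which shows how closely the policy and the queue-like structure are tied together.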

Core utils in *nix systems

  1. grep and awk both implement the Thompson-McNaughton-Yamada construction of NFAs from regular expressions, which apparently .
  2. tsort implements topological sort.
  3. fgrep implements the 
  4. GNU grep,  according to the author Mike Haertel.
  5. crypt(1) on Unix implemented a variant of the encryption algorithm in the Enigma machine.
  6.  implemented by Doug McIlroy, based on a prototype co-written with James Hunt, performs better than the standard dynamic programming algorithm used to compute Levenshtein distances. The  computes the shortest edit distance.
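
Topological sort, which item 2 attributes to tsort, can be sketched with Kahn's algorithm. This is one common approach and not necessarily the one tsort uses internally; the input format (a list of "before, after" pairs) and the name `topological_sort` are assumptions made for this example.

```python
from collections import defaultdict, deque


def topological_sort(edges):
    """Order nodes so that every (before, after) pair is respected.

    Raises ValueError if the pairs contain a cycle.
    """
    successors = defaultdict(list)
    indegree = {}
    for before, after in edges:
        successors[before].append(after)
        indegree.setdefault(before, 0)
        indegree[after] = indegree.get(after, 0) + 1

    # Nodes with no unmet prerequisites are ready to be emitted.
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("input contains a cycle")
    return order


if __name__ == "__main__":
    print(topological_sort([("a", "b"), ("b", "c"), ("a", "c")]))
```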

Cryptographic Algorithms

This could be a very long list. Cryptographic algorithms are implemented in all software that can perform secure communications or transactions.

  1. , specifically the Tiger Tree Hash variant, were used in peer-to-peer applications such as  and .
  2.  is used to provide a checksum for software packages and is used for integrity checks on *nix systems () and is also supported on Windows and OS X.
  3.  implements many cryptographic algorithms including AES, Blowfish, DES, SHA-1, SHA-2, RSA, etc.

Compilers

  1.  is implemented by yacc and bison.
  2. Dominator algorithms are used in most optimizing compilers based on SSA form.
  3. lex and flex compile regular expressions into NFAs.
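
To make the dominator item concrete, here is a minimal sketch of the simple iterative data-flow formulation: a node d dominates n if every path from the entry to n passes through d. The control-flow graph format (a dict mapping each block to its successors) and the function name `dominators` are assumptions for this example, and the sketch assumes every node is reachable from the entry. Faster algorithms such as Lengauer-Tarjan exist for large graphs.

```python
def dominators(cfg, entry):
    """Return {node: set of nodes that dominate it} for a control-flow graph."""
    nodes = set(cfg)
    for succs in cfg.values():
        nodes.update(succs)

    # Build the predecessor map from the successor lists.
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


if __name__ == "__main__":
    diamond = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
    print(sorted(dominators(diamond, "entry")["exit"]))
```

In the diamond example neither branch dominates the exit block, which is exactly the situation where SSA construction needs to place a phi function.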

Compression and Image Processing

  1.  algorithms for the GIF image format are implemented in image manipulation programs, starting from the *nix utility convert to complex programs.
  2. Run length encoding is used to generate PCX files (used by the original Paintbrush program), compressed BMP files and TIFF files.
  3. Wavelet compression is the basis for JPEG 2000 so all digital cameras that produce JPEG 2000 files will be implementing this algorithm.
  4. Reed-Solomon error correction is implemented in , CD drives, barcode readers and was combined with convolution for image transmission from Voyager.
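
Run-length encoding from item 2 is simple enough to show in full. The sketch below encodes bytes as generic (count, value) pairs; the actual PCX, BMP and TIFF formats each define their own byte layout for runs, which is not reproduced here. The function names are chosen for this example.

```python
from itertools import groupby


def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse each run of identical bytes into a (count, value) pair."""
    return [(len(list(group)), value) for value, group in groupby(data)]


def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in runs)


if __name__ == "__main__":
    encoded = rle_encode(b"aaabcc")
    print(encoded)
    print(rle_decode(encoded))
```

The scheme pays off on images with large flat areas and can expand data that has few repeated bytes.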

Conflict Driven Clause Learning

Since the year 2000, the running time of SAT solvers on industrial benchmarks (usually from the hardware industry, though other sources are used too) has decreased nearly exponentially every year. A very important part of this development is the Conflict Driven Clause Learning algorithm that combines the Boolean Constraint Propagation algorithm in the original paper of Davis, Logemann and Loveland with the technique of clause learning that originated in constraint programming and artificial intelligence research. For specific, industrial modelling, SAT is considered an easy problem (). To me, this is one of the greatest success stories in recent times because it combines algorithmic advances spread over several years, clever engineering ideas, experimental evaluation, and a concerted communal effort to solve the problem. The  is a good read. This algorithm is taught in many universities (I have attended four where it was the case) but typically in a logic or formal methods class.
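
The Boolean Constraint Propagation step mentioned above (also called unit propagation) can be sketched as follows. This covers only the propagation half; clause learning, branching and backtracking are omitted, and real solvers use far more efficient data structures. The clause format (lists of non-zero integers, negative meaning negated, as in DIMACS files) and the name `unit_propagate` are assumptions for this example.

```python
def unit_propagate(clauses, assignment=None):
    """Repeatedly assign variables forced by unit clauses.

    Returns the extended assignment {variable: bool}, or None if some
    clause becomes false under the assignment (a conflict).
    """
    assignment = dict(assignment or {})
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var in assignment:
                    if assignment[var] == (lit > 0):
                        satisfied = True
                        break
                else:
                    unassigned.append(lit)
            if satisfied:
                continue
            if not unassigned:
                return None  # every literal is false: conflict
            if len(unassigned) == 1:
                # Only one literal can still satisfy the clause, so force it.
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment


if __name__ == "__main__":
    # x1, (not x1 or x2), (not x2 or x3) forces all three variables to true.
    print(unit_propagate([[1], [-1, 2], [-2, 3]]))
```

In a full CDCL solver, the `None` case is where conflict analysis starts and a new clause is learned.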

Applications of SAT solvers are numerous. IBM, Intel and many other companies have their own SAT solver implementations. The  in OpenSUSE also uses a SAT solver.

Reposted from: https://www.cnblogs.com/leonxyzh/p/7289047.html
