Boost String Processing: Removing Spaces from Strings_rsimager

http://blog.sina.com.cn/u/5136430813

Boost String Processing: Removing Spaces from Strings

(2018-06-06 11:30:13)

Tags: string operations

Category: Programming

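1. Overview
Work has become busy again, leaving far less time for studying Boost on the side, and the many compile errors I ran into while using Boost made writing these articles unpredictable. Still, I have been looking forward to this part: string handling is what comes up most often in everyday code and is one of Boost's signature strengths, so it gets a thorough treatment here. In standard C++, strings are handled by the class std::string, which offers many string operations, including functions that search for a given character or substring. Although std::string has more than a hundred member functions, which makes it one of the most bloated classes in standard C++, it still falls short of what many developers need in day-to-day work. For example, Java provides a function that converts a string to upper case, while std::string has no equivalent. The Boost C++ libraries try to close this gap.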
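2. Locales
Before getting to the main topic, we need to look at locales, because many of the functions covered in this chapter take an additional locale parameter. In standard C++ a locale encapsulates culture-specific conventions, including the currency symbol, date and time formats, the character that separates the integer part from the fractional part (the radix character), and the separator used in numbers with more than three digits (the thousands separator).

For string processing, a locale describes the ordering of characters and the special characters of a particular culture. Whether an alphabet contains umlauts, and where they sit in the alphabet, depends on the language. If a function converts a string to upper case, the steps it takes depend on the locale. In German, the letter 'ä' obviously converts to 'Ä', but that is not necessarily true in other languages.

Locales can be ignored when working with std::string, because none of its member functions depends on a particular language. To use the Boost C++ libraries covered in this chapter, however, knowledge of locales is indispensable. The C++ standard defines the class std::locale in the header <locale>. Every C++ program automatically has one instance of this class, the global locale, which cannot be accessed directly. To inspect it, default-construct an object of type std::locale; it is initialized as a copy of the global locale, as follows: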

    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale loc;  // copy of the current global locale
        std::cout << loc.name() << std::endl;
    }

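This program writes C to standard output. That is the name of the classic locale, which holds the conventions used by default in programs written in C. It is also the default global locale of every C++ application. Its conventions largely follow American English usage: the radix character is a period and month names in dates are written in English (the C locale itself defines no currency symbol). The global locale can be changed with the static member function global() of std::locale.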

    #include <locale>
    #include <iostream>

    int main()
    {
        // "German" is a Visual C++ language string, not a portable name
        std::locale::global(std::locale("German"));
        std::locale loc;
        std::cout << loc.name() << std::endl;
    }

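The static function global() takes an object of type std::locale as its only argument. Another constructor of std::locale takes a string of type const char* and creates a locale object for a particular culture. Apart from the C locale, which is named "C", locale names are not standardized, so which names are accepted depends on the C++ standard library in use. According to the language strings documentation of Visual Studio 2008, the language string "German" selects the German culture.

The program above prints German_Germany.1252. Specifying the language string "German" selects German as the primary language and Germany as the sublanguage, together with code page 1252. To pick a different sublanguage, such as Swiss German, use a different language string.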

    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German_Switzerland"));
        std::locale loc;
        std::cout << loc.name() << std::endl;
    }

The program now prints German_Switzerland.1252.
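
Because locale names other than "C" are not standardized, a name such as "German" that Visual C++ accepts may be rejected by another standard library. The following is a minimal sketch of one way to cope with that, assuming that falling back to the classic locale is acceptable. The helper name locale_or_classic is hypothetical and not part of any library.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Tries to construct the named locale. The std::locale constructor throws
// std::runtime_error when the standard library does not know the name.
// On failure this hypothetical helper falls back to the classic "C" locale.
std::locale locale_or_classic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

A caller could then write std::locale::global(locale_or_classic("German")) and still get a working program on platforms that do not know that name.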

With a basic understanding of locales and of how to change the global one, the following example shows how a locale affects string operations.

    #include <locale>
    #include <iostream>
    #include <cstring>

    int main()
    {
        // compared under the default "C" locale
        std::cout << std::strcoll("ä", "z") << std::endl;
        std::locale::global(std::locale("German"));
        // compared again under the German locale
        std::cout << std::strcoll("ä", "z") << std::endl;
    }

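This example uses std::strcoll(), declared in <cstring>, which compares two strings according to the collation order of the current locale. In other words, it reports which of the two strings comes first in dictionary order by returning a negative value, zero, or a positive value. (Annoyingly, the VC editor would not let me type ä and silently turned it into '?'.) Running the program prints 1 and -1. The arguments are identical, yet the results differ. The reason is simple: the first call to std::strcoll() uses the global C locale, while the second call runs after the global locale has been switched to German. The output shows that 'ä' and 'z' are ordered differently in the two locales.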
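
The comparison above depends on the global locale. As a hedged alternative, the standard std::collate facet can compare two strings under a specific locale object without touching the global one. The helper name collate_compare is an assumption made for illustration.

```cpp
#include <locale>
#include <string>

// Compares two strings with the collation rules of `loc`, leaving the
// global locale untouched. Returns a negative value, zero, or a positive
// value, like std::strcoll().
int collate_compare(const std::string& a, const std::string& b,
                    const std::locale& loc)
{
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

Passing std::locale::classic() gives the plain C ordering; passing a named locale gives that culture's ordering, provided the platform knows the name.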

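Many C functions and the C++ streams depend on the locale. The member functions of std::string work independently of it, but the functions described in the following sections do not, so locales will come up repeatedly in this chapter.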

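3. The String Algorithms Library Boost.StringAlgorithms

The Boost C++ string algorithms library provides many functions for manipulating strings. They work on std::string, std::wstring, or any other instantiation of the class template std::basic_string. To use them, include the header boost/algorithm/string.hpp. Many functions in this library accept an object of type std::locale as an optional extra argument; when it is omitted, the global locale is used. Consider first this German example: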

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>
    #include <clocale>

    int main()
    {
        std::setlocale(LC_ALL, "German");  // C locale, used when printing
        std::string s = "Boris Schäling";
        // uses the global C++ locale, which is still "C"
        std::cout << boost::algorithm::to_upper_copy(s) << std::endl;
        // uses the German locale passed explicitly
        std::cout << boost::algorithm::to_upper_copy(s, std::locale("German")) << std::endl;
    }

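The function to_upper_copy() converts a string to upper case and returns the converted string. In the code above, the first call uses the global locale, while the second explicitly passes a German locale. The second conversion is clearly the correct one, because the lower-case letter 'ä' has an upper-case counterpart 'Ä'. In the C locale 'ä' is an unknown character and is left unconverted. To get the right result, either pass the appropriate locale explicitly or change the global locale before calling boost::algorithm::to_upper_copy(). Note that the program calls std::setlocale(), declared in <clocale>, to set the locale for the C functions, because std::cout relies on C functions to display text on screen. Only once the correct locale is set are umlauts such as 'ä' and 'Ä' displayed properly. The call to setlocale() can also be replaced with std::locale::global(), which sets the global C++ locale and, for a named locale, the C locale as well; in that case the first to_upper_copy() call would also run under the German locale.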
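Boost.StringAlgorithms also provides several functions that delete individual characters from a string, with explicit control over where and how the deletion happens. For example, boost::algorithm::erase_all_copy() removes every occurrence of a given character from the whole string, while boost::algorithm::erase_first_copy() removes only its first occurrence. To shorten a string by a number of characters at the beginning or at the end, use boost::algorithm::erase_head_copy() and boost::algorithm::erase_tail_copy().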

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        // the range delimits the first occurrence of "Boris" inside s
        boost::iterator_range<std::string::iterator> r = boost::algorithm::find_first(s, "Boris");
        std::cout << r << std::endl;
        // no match: the returned range is empty
        r = boost::algorithm::find_first(s, "xyz");
        std::cout << r << std::endl;
    }

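The functions boost::algorithm::find_first(), boost::algorithm::find_last(), boost::algorithm::find_nth(), boost::algorithm::find_head() and boost::algorithm::find_tail() search a string for a substring.

The program above also uses boost::iterator_range, which is the return type of all these functions. The class comes from Boost.Range, which builds the concept of a range on top of iterators: an iterator_range holds a pair of iterators that delimit the match. Because operator<< is overloaded for boost::iterator_range, the result of a search can be written directly to a standard output stream. The program prints Boris for the first result and an empty string for the second.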

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>
    #include <vector>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::vector<std::string> v;
        v.push_back("Boris");
        v.push_back("Schäling");
        // concatenate the elements, separated by a single space
        std::cout << boost::algorithm::join(v, " ") << std::endl;
    }

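The function boost::algorithm::join() takes a container of strings as its first argument and concatenates them, separated by the string given as the second argument. This example therefore prints Boris Schäling.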

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        std::cout << boost::algorithm::replace_first_copy(s, "B", "D") << std::endl;
        std::cout << boost::algorithm::replace_nth_copy(s, "B", 0, "D") << std::endl;
        std::cout << boost::algorithm::replace_last_copy(s, "B", "D") << std::endl;
        std::cout << boost::algorithm::replace_all_copy(s, "B", "D") << std::endl;
        // replace the first 5 and the last 8 characters respectively
        std::cout << boost::algorithm::replace_head_copy(s, 5, "Doris") << std::endl;
        std::cout << boost::algorithm::replace_tail_copy(s, 8, "Becker") << std::endl;
    }

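Besides functions that find substrings or delete characters, Boost.StringAlgorithms provides functions that replace a substring with another string, including boost::algorithm::replace_first_copy(), boost::algorithm::replace_nth_copy(), boost::algorithm::replace_last_copy(), boost::algorithm::replace_all_copy(), boost::algorithm::replace_head_copy() and boost::algorithm::replace_tail_copy(). They are used in much the same way as the find and erase functions, except that they take the replacement string as an additional argument.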

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "\t Boris Schäling \t";
        // the dots make the remaining whitespace visible
        std::cout << "." << boost::algorithm::trim_left_copy(s) << "." << std::endl;
        std::cout << "." << boost::algorithm::trim_right_copy(s) << "." << std::endl;
        std::cout << "." << boost::algorithm::trim_copy(s) << "." << std::endl;
    }

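The trim functions boost::algorithm::trim_left_copy(), boost::algorithm::trim_right_copy() and boost::algorithm::trim_copy() remove whitespace from the beginning of a string, from its end, or from both. Which characters count as whitespace depends on the global locale.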

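Functions in Boost.StringAlgorithms can accept an additional predicate that decides which characters of the string they act on. The predicate versions of the trim functions are named boost::algorithm::trim_left_copy_if(), boost::algorithm::trim_right_copy_if() and boost::algorithm::trim_copy_if().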

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "--Boris Schäling--";
        // trim hyphens instead of whitespace
        std::cout << "." << boost::algorithm::trim_left_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl;
        std::cout << "." << boost::algorithm::trim_right_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl;
        std::cout << "." << boost::algorithm::trim_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl;
    }

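The program above calls the helper function boost::algorithm::is_any_of(), which creates a predicate that checks whether the character passed to it occurs in a given string. With boost::algorithm::is_any_of("-"), as in the example, the character to trim is the hyphen. Boost.StringAlgorithms also provides many helper functions that return commonly used predicates.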

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "123456789Boris Schäling123456789";
        // trim digits instead of whitespace
        std::cout << "." << boost::algorithm::trim_left_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl;
        std::cout << "." << boost::algorithm::trim_right_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl;
        std::cout << "." << boost::algorithm::trim_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl;
    }

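The predicate returned by boost::algorithm::is_digit() yields true when a character is a digit. The helper functions that check whether a character is upper case or lower case are boost::algorithm::is_upper() and boost::algorithm::is_lower(). All of these functions use the global locale by default unless another locale is passed as an argument.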

Besides predicates that test individual characters, Boost.StringAlgorithms also offers functions that examine whole strings.

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        // each call returns a bool, printed as 1 or 0
        std::cout << boost::algorithm::starts_with(s, "Boris") << std::endl;
        std::cout << boost::algorithm::ends_with(s, "Schäling") << std::endl;
        std::cout << boost::algorithm::contains(s, "is") << std::endl;
        std::cout << boost::algorithm::lexicographical_compare(s, "Boris") << std::endl;
    }

The functions boost::algorithm::starts_with(), boost::algorithm::ends_with(), boost::algorithm::contains() and boost::algorithm::lexicographical_compare() each compare two strings.

Next comes a function for splitting strings.

    #include <boost/algorithm/string.hpp>
    #include <locale>
    #include <iostream>
    #include <vector>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        std::vector<std::string> v;
        // split at every whitespace character; v receives the pieces
        boost::algorithm::split(v, s, boost::algorithm::is_space());
        std::cout << v.size() << std::endl;
    }

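Given a delimiter, boost::algorithm::split() breaks a string into a container of strings. It takes a predicate as its third argument, which decides where the string is split. This example uses the helper function boost::algorithm::is_space() to create a predicate that splits the string at every whitespace character.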

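Many functions in this section also have case-insensitive versions. Their names generally match the original ones except for a leading 'i'. For example, the counterpart of boost::algorithm::erase_all_copy() is boost::algorithm::ierase_all_copy().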

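Finally, it is worth noting that many functions in Boost.StringAlgorithms support regular expressions. The following program uses boost::algorithm::find_regex() to search with a regular expression.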

    #include <boost/algorithm/string.hpp>
    #include <boost/algorithm/string/regex.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        // one word character, one whitespace character, one word character
        boost::iterator_range<std::string::iterator> r = boost::algorithm::find_regex(s, boost::regex("\\w\\s\\w"));
        std::cout << r << std::endl;
    }

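To work with regular expressions, this program uses boost::regex from the Boost C++ libraries, which is introduced in the next section.
4. The Regular Expression Library Boost.Regex
Boost.Regex, the regular expression library of the Boost C++ libraries, brings regular expressions to C++. Regular expressions greatly ease the search for strings that follow a particular pattern and are a powerful feature in many languages. Before C++11, C++ offered this capability only through libraries such as Boost; the design of Boost.Regex was later adopted into the standard library as std::regex in the header <regex>.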

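The two most important classes in Boost.Regex are boost::regex and boost::smatch, both defined in boost/regex.hpp. The former defines a regular expression; the latter stores search results.

The following introduces the three functions that Boost.Regex provides for working with regular expressions.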

    #include <boost/regex.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        // two words separated by one whitespace character
        boost::regex expr("\\w+\\s\\w+");
        std::cout << boost::regex_match(s, expr) << std::endl;
    }

The function boost::regex_match() compares a string with a regular expression. It returns true only when the entire string matches the expression.

The function boost::regex_search() searches a string for a regular expression.

    #include <boost/regex.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        boost::regex expr("(\\w+)\\s(\\w+)");
        boost::smatch what;
        if (boost::regex_search(s, what, expr))
        {
            std::cout << what[0] << std::endl;                    // whole match
            std::cout << what[1] << " " << what[2] << std::endl;  // the two groups
        }
    }

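The function boost::regex_search() accepts a reference to an object of type boost::smatch in which it stores the results. The results are organized by the groups of the regular expression, which is why this example yields two strings, one per group.

The result class boost::smatch is in fact a container of elements of type boost::sub_match, with an interface similar to that of std::vector. For example, elements can be accessed with operator[]().

The class boost::sub_match, in turn, stores iterators that mark the position of the text matched by a group. Because it derives from std::pair, the iterators delimiting the substring are available as first and second. If, as in the example above, the substring only needs to be written to a standard output stream, the overloaded operator<< does that directly and the iterators need not be touched.

Note that the results are stored as iterators and boost::sub_match does not copy the text, so they are valid only as long as the string the iterators refer to still exists.

Also note that the first element of the boost::smatch container refers to the entire text matched by the regular expression; the substring matched by the first group is at index 1.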

The third function provided by Boost.Regex is boost::regex_replace().

    #include <boost/regex.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = " Boris Schäling ";
        boost::regex expr("\\s");  // any single whitespace character
        std::string fmt("_");      // replacement text
        std::cout << boost::regex_replace(s, expr, fmt) << std::endl;
    }

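Besides the string to search and the regular expression, boost::regex_replace() takes a format argument that determines how the matched substrings, or the groups within them, are replaced. If the regular expression contains no groups, each matching substring is replaced with the format string in turn. The program above therefore prints _Boris_Schäling_.

boost::regex_replace() always searches the entire string for the regular expression, so this program replaces all three spaces with underscores.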

    #include <boost/regex.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        boost::regex expr("(\\w+)\\s(\\w+)");
        std::string fmt("\\2 \\1");  // second group, space, first group
        std::cout << boost::regex_replace(s, expr, fmt) << std::endl;
    }

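The format argument can refer to the substrings captured by the groups of the regular expression. This example uses that technique to swap the first and last name, so the result is Schäling Boris.

Note that there are different standards for regular expressions and for format strings. Each of the three functions accepts an additional argument that selects a particular standard. The same argument can also specify whether special characters in the format are interpreted, or whether the format replaces the whole string matched by the regular expression.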

    #include <boost/regex.hpp>
    #include <locale>
    #include <iostream>

    int main()
    {
        std::locale::global(std::locale("German"));
        std::string s = "Boris Schäling";
        boost::regex expr("(\\w+)\\s(\\w+)");
        std::string fmt("\\2 \\1");
        // format_literal: insert fmt verbatim, without expanding \1 and \2
        std::cout << boost::regex_replace(s, expr, fmt, boost::regex_constants::format_literal) << std::endl;
    }

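This program passes the flag boost::regex_constants::format_literal as the fourth argument to boost::regex_replace(), which suppresses the handling of special characters in the format argument. Because the whole string matches the regular expression, it is replaced by the format string as is, and the output is \2 \1.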

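As noted at the end of the previous section, regular expressions can be combined with Boost.StringAlgorithms, which builds on Boost.Regex to offer functions such as boost::algorithm::find_regex(), boost::algorithm::replace_regex(), boost::algorithm::erase_regex() and boost::algorithm::split_regex(). Since std::regex in C++11 was modeled on Boost.Regex, learning to use regular expressions fluently on their own, without Boost.StringAlgorithms, is a sensible investment.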

Reposted from: https://www.douban.com/note/194712641/
