
csft-3.1 search problem

 
hgm0910
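Member
#1 | Posted: 2010 05 13 10:41 | Edited by: hgm0910
I found an installation walkthrough for csft-3.1 online and followed it step by step.
The only difference: there was no csfs.conf in /etc/csft, so I copied the sphinx.conf from the current directory and renamed it csfs.conf.
I also changed the MySQL encoding.
To be safe, I dropped the database and recreated it, adding "character set utf-8" to the create statement.
English searches return results, but Chinese searches do not.
Why is that?

Here is the installation walkthrough I found online:

Installation steps: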
wget http://www.coreseek.cn/uploads/csft/3.1/CentOS5/mmseg-3.1-1.i386.rpm
wget http://www.coreseek.cn/uploads/csft/3.1/CentOS5/csft-3.1-1.1.i386.rpm
Installing csft-3.1-1.1.i386.rpm reports a missing dependency on the PostgreSQL shared library, which has to be installed first:
yum install -y postgresql-libs.i386
rpm -Uvh csft-3.1-1.1.i386.rpm mmseg-3.1-1.i386.rpm
(You also need to download the mmseg source package.)
Import /etc/csft/example.sql, which the install creates, into the database:
mysql < example.sql
wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz
tar zxvf mmseg-3.1.tar.gz
cd mmseg-3.1/data
mmseg -u unigram.txt
Rename the generated unigram.txt.uni and copy it into place:
mv unigram.txt.uni /var/data/dict/uni.lib
cd /etc/csft
vi csfs.conf
Edit the index definition and add:
charset_type        = zh_cn.utf-8
charset_dictpath    = /var/data/dict
Save, then build the index:
# csft-indexer --all (builds every defined index)
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com
using config file './csft.conf'...
indexing index 'test1'...
iniparser: cannot open /var/data/dict/mmseg.ini
1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        1,
pt:1, 1;        3, 6,
pt:3, 24;       pt:6, 0;        3, 6,
pt:3, 47279;    pt:6, 0;        3, 6,
pt:3, 61;       pt:6, 0;        3,
pt:3, 13411;    pt:3, 30471;    pt:6, 1;        pt:3, 30471;    pt:3, 24;       pt:6, 1;        pt:3, 24;       pt:6, 0;        1,
3, 6,
pt:1, 1;        pt:3, 24;       pt:6, 1;        pt:3, 24;       pt:6, 0;        3,
pt:3, 14538;    pt:3, 298;      3, 6,
pt:3, 24;       pt:6, 0;        3, 6,
pt:3, 24;       pt:3, 154;      pt:1, 1;        pt:6, 0;        pt:1, 1;        pt:3, 13411;    pt:1, 1;        pt:3, 13411;    pt:3, 13411;    pt:3, 30471;    3,
pt:3, 990;      pt:3, 1448;     pt:6, 1;        pt:3, 1448;     pt:6, 0;        collected 8 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 8 docs, 269 bytes
total 0.030 sec, 8825.17 bytes/sec, 262.46 docs/sec
Then test Chinese word segmentation:
# csft-search 测试
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com
using config file '/etc/csft/csft.conf'...
iniparser: cannot open /var/data/dict/mmseg.ini
3, 6,
pt:3, 24;       pt:3, 154;      pt:6, 0;        pt:1, 1;        index 'test1': query '测试 ': returned 4 matches of 4 total in 0.014 sec
displaying matches:
1. document=7, weight=2, group_id=3, date_added=Tue Jan 12 15:37:42 2010
id=7
group_id=3
group_id2=11
date_added=2010-01-12 15:37:42
title=12 测试
content=大册,测试
2. document=5, weight=1, group_id=3, date_added=Tue Jan 12 13:38:04 2010
id=5
group_id=3
group_id2=9
date_added=2010-01-12 13:38:04
title=测试
content=一些
3. document=6, weight=1, group_id=3, date_added=Tue Jan 12 15:26:01 2010
id=6
group_id=3
group_id2=10
date_added=2010-01-12 15:26:01
title=标题
content=我的测试
4. document=8, weight=1, group_id=3, date_added=Tue Jan 12 15:42:25 2010
id=8
group_id=3
group_id2=12
date_added=2010-01-12 15:42:25
title=测试 我的
content=先吃饭
words:
1. '测试': 4 documents, 5 hits
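
The indexer and search output above both print "iniparser: cannot open /var/data/dict/mmseg.ini", while the walkthrough only places uni.lib in /var/data/dict. A minimal Python sketch for checking which expected files are present in the dictionary directory is given here. The helper name missing_dict_files is hypothetical, and the file list is an assumption taken from the walkthrough (uni.lib) and the warning (mmseg.ini); the thread does not establish whether mmseg.ini is actually required.

```python
import os


def missing_dict_files(dict_dir, required=("uni.lib", "mmseg.ini")):
    """Return the names in `required` that are not regular files in dict_dir.

    The default file list is an assumption based on this thread, not on
    any official csft documentation.
    """
    return [name for name in required
            if not os.path.isfile(os.path.join(dict_dir, name))]


if __name__ == "__main__":
    # /var/data/dict is the charset_dictpath used in the walkthrough.
    missing = missing_dict_files("/var/data/dict")
    if missing:
        print("missing from dictionary directory:", ", ".join(missing))
    else:
        print("all expected dictionary files are present")
```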
hgm0910
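Member
#2 | Posted: 2010 05 13 11:42 | Edited by: hgm0910
Solved...
The sql_query_pre    = SET NAMES utf8 line in the csfs.conf config file was still commented out.
But now there is a new problem:
Chinese text in the search results is displayed as question marks.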


[root@localhost csft]# csft-search 测试
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com
using config file '/etc/csft/csft.conf'...
3, 6,
pt:3, 24;       pt:3, 154;      pt:6, 0;        pt:1, 1;        index 'test1': query '测试 ': returned 3 matches of 3 total in 0.007 sec

displaying matches:
1. document=5, weight=2, group_id=1, date_added=Thu May 13 10:21:28 2010
        id=5
        group_id=1
        group_id2=5
        date_added=2010-05-13 10:21:28
        title=???
        content=????????????????????
2. document=6, weight=2, group_id=1, date_added=Thu May 13 10:21:28 2010
        id=6
        group_id=1
        group_id2=6
        date_added=2010-05-13 10:21:28
        title=???
        content=?????????
3. document=8, weight=1, group_id=2, date_added=Thu May 13 10:21:28 2010
        id=8
        group_id=2
        group_id2=8
        date_added=2010-05-13 10:21:28
        title=????
        content=??????

words:
1. '测试': 3 documents, 5 hits


Terminal encoding: utf8

mysql:
mysql> show variables like 'character_set_%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation_%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)


csft.conf:
source src1
{
    type = mysql

    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    sql_port = 3306

    sql_query_pre = SET NAMES utf8

    sql_query = \
        SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
        FROM documents

    sql_attr_uint            = group_id

    sql_attr_timestamp        = date_added

    sql_ranged_throttle    = 0

    sql_query_info        = SELECT * FROM documents WHERE id=$id
}

source src1throttled : src1
{
    sql_ranged_throttle = 100
}

index test1
{
    source = src1
    path = /var/data/test1
    docinfo = extern
    mlock = 0
    morphology = none
    min_word_len = 1

    charset_type = zh_cn.utf-8
    charset_dictpath = /var/data/dict
    html_strip = 0
}

indexer
{
    mem_limit = 32M
}

searchd
{
    log = /var/log/searchd.log
    query_log = /var/log/query.log
    read_timeout = 5
    client_timeout = 300
    max_children = 30
    pid_file = /var/log/searchd.pid
    max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 0
    unlink_old = 1
    mva_updates_pool = 1M
    max_packet_size = 8M
    max_filters     = 256
    max_filter_values = 4096
}
HonestQiao
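Member
#3 | Posted: 2010 05 13 16:11
Check the encoding of the original data in the database, the encoding set in the conf file, the encoding requested by sql_query_pre, and the encoding of the command-line terminal itself. All of them must agree.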
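
The consistency check described here can be sketched in Python. This is only an illustration: the helper names normalize_charset and find_mismatches are hypothetical, the values are typed in by hand from what the thread reports rather than read from MySQL or csft, and the normalization rule (drop any locale prefix, ignore case, hyphens, and underscores) is an assumption about which spellings should count as the same encoding.

```python
def normalize_charset(name):
    """Reduce spellings such as 'zh_CN.UTF-8', 'zh_cn.utf-8', and 'utf8'
    to one comparable form ('utf8')."""
    # Keep only the part after the last dot, which drops a locale prefix.
    codeset = name.strip().rsplit(".", 1)[-1]
    return codeset.lower().replace("-", "").replace("_", "")


def find_mismatches(settings, expected="utf8"):
    """Return the labels whose normalized charset differs from `expected`."""
    target = normalize_charset(expected)
    return [label for label, value in settings.items()
            if normalize_charset(value) != target]


if __name__ == "__main__":
    # Values copied by hand from the posts in this thread.
    settings = {
        "mysql character_set_database": "utf8",
        "csft.conf sql_query_pre (SET NAMES)": "utf8",
        "csft.conf charset_type": "zh_cn.utf-8",
        "terminal": "utf8",
    }
    print("mismatched settings:", find_mismatches(settings))
```

An empty list means every listed setting names the same encoding; it says nothing about settings that were left out of the dictionary.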
hgm0910
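Member
#4 | Posted: 2010 05 14 11:50
I already laid all of this out clearly in post #2.
Everything is configured.
I also dropped the old database and recreated it with create database test character set utf-8;
But Chinese text in the terminal still shows up as question marks.
Querying the table directly in MySQL displays the content correctly.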
HonestQiao
Member
#5 | Posted: 2010 05 14 18:49
Then what is the encoding of your MySQL client?
hgm0910
Member
#6 | Posted: 2010 05 17 11:09
Here is my machine's configuration.
Is there anything else that needs to be configured?

mysql:
mysql> show variables like 'character_set_%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation_%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

csft.conf:
sql_query_pre = SET NAMES utf8
charset_type = zh_cn.utf-8

Terminal: utf8

System locale:
LANG=zh_CN.UTF-8
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=
HonestQiao
Member
#7 | Posted: 2010 05 17 16:22
sql_query_pre = SET NAMES utf8

If your database is in the correct encoding, then the line above is required.
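
To illustrate why a mismatched connection charset produces question marks, here is a small Python analogy. It is not MySQL or csft code. It assumes that a connection without SET NAMES utf8 falls back to a single-byte charset such as latin1 (the thread does not show the actual default), and the helper name to_charset_lossy is hypothetical. It shows only the general effect: characters the target charset cannot represent are each replaced by '?'.

```python
def to_charset_lossy(text, charset):
    """Convert text to `charset`, replacing every character the charset
    cannot represent with '?', then decode it back for display."""
    return text.encode(charset, errors="replace").decode(charset)


if __name__ == "__main__":
    # latin-1 cannot represent Chinese characters, so each one is lost.
    print(to_charset_lossy("12 测试", "latin-1"))
    # utf-8 can represent them, so the text round-trips unchanged.
    print(to_charset_lossy("12 测试", "utf-8"))
```

Once characters have been replaced this way, the original text cannot be recovered from the output, which is why the charset has to be correct on the connection itself.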
 