MySQL Manual | 6.8.3 Предстоящие доработки по полнотекстовому поиску

MySQL Manual

6 Справочник по языку MySQL
- 6.8 Полнотекстовый поиск в MySQL
  - 6.8.1 Ограничения для полнотекстового поиска
  - 6.8.2 Тонкая настройка полнотекстового поиска в MySQL
  - 6.8.3 Предстоящие доработки по полнотекстовому поиску

Previous / Next / Up / Table of Contents

6.8.3 Предстоящие доработки по полнотекстовому поиску

Сделать все операции с индексом FULLTEXT более быстрыми.
Операторы схожести
Поддержка для слов, тождественных индексам, - чтобы словами могли быть любые строки, которые пользователь пожелает трактовать как слова, например "C++", "AS/400", "TCP/IP" и т.д.
Поддержка полнотекстового поиска в таблицах типа MERGE.
Поддержка многобайтовых наборов символов.
Сделать список стоп-слов (``stopword'') зависящим от языка данных.
Стемминг (в зависимости от языка данных, конечно).
Обобщенный синтаксический пре-анализатор с определяемым пользователем функциониями (UDF).
Сделать данную модель поиска более гибкой (путем добавления ряда регулируемых параметров к FULLTEXT в CREATE/ALTER TABLE).

User Comments

Posted by Nicolas Ross on Friday May 17 2002, @6:24am

[Delete] [Edit]

I agree 100% I am doing a DB with full text and
users aren't able to find part of words. As an
exmple :

L'instustrie du ...

(This is french btw) and i search against
industrie, then this word is not found. Neitheir
plurals like industries wich is not found. It
would be verry appriciated to have partial word
search enable...

Posted by [email protected] on Friday May 17 2002, @6:24am

[Delete] [Edit]

I wanted to find the COUNT of the maximum number
of models associated with a vendor in one of my
tables (ie, I had a table consisting of model
id's assigned by me, model numbers, and the
manufacturer ids (foreign keys from the
manufacturer table)). Straight SQL isn't very
good at analytical statements, so I executed a
bit of a kludge. I executed the following
query: "SELECT COUNT(manufacturer) AS x FROM
model GROUP BY manufacturer ORDER BY x DESC". I
then took the first row of the result in order to
find the maximum COUNT of machines associated
with a manufacturer.

Posted by narendra phadke on Friday May 17 2002, @6:24am

[Delete] [Edit]

Narendra Phadke <[email protected]>
I am supposed to store Japanese data in MySQL.
I think I need the Japanese character set to do
that.

It seems the C:\mysql\share\charsets folder
contains character sets for all languages except
Japanese!!

Where can I find the conf. files for Japanese
Language?
Do I have to create it? if yes how can I create
it?

Posted by David Altherr on Wednesday December 18 2002, @5:27pm

[Delete] [Edit]

tibi, the problem is that aliases declared in the
SELECT expression cannot be referenced in the
WHERE clause of the SQL; instead they must be
referenced in the HAVING clause. However, note
that this is only an exception and you should
always place your conditional statements in the
WHERE clause otherwise as MySQL optimizes WHERE
clauses and not HAVING clauses. The fix to your
specific example is as follows:

<code>
SELECT MATCH (tekst) AGAINST ('progressive') AS x
FROM info HAVING x >0.5;
</code>

Posted by [email protected] on Friday May 17 2002, @6:24am

[Delete] [Edit]

This page seems a bit self-contradictory:

"If a thread obtains a READ lock on a table, that thread...can only read from the table" says that the locking thread can't update a READ locked table, whereas "no other thread can update a READ-locked table" implies it can. Experiments show that the first bit is true: no thread at all can update a READ locked table.

There appear to be no provisions for locking a table such that only one thread can update while any thread can read. I would have found this useful.

Posted by Paul King on Friday May 17 2002, @6:24am

[Delete] [Edit]

Why does UNLOCK have to unlock all tables locked from within a thread? Why can you not choose to UNLOCK a single table?

I am having problems with this because ultimately you have to UNLOCK them all and re-LOCK all through your scripts. I can't justify locking big tables for long periods of time just because an UNLOCK will destroy all my other locks.

It wouldn't be so bad if READ locks actually did what they seem to be supposed to do but as they don't I hold up a growing queue of requests to a large table that should be unlocked ASAP.

This is made worse by the fact that you have to LOCK all tables that a thread uses until an UNLOCK is issued.

If anybody has any work-arounds please let me know.

Posted by [name withheld] on Friday May 17 2002, @6:24am

[Delete] [Edit]

in 3.23.42-log, you cannot match against a merged
table. Eg:
SELECT * FROM table_a WHERE MATCH( description )
AGAINST ('blah');SELECT * FROM table_b WHERE
MATCH( description ) AGAINST ('blah');
SELECT * FROM table_c WHERE MATCH( description )
AGAINST ('blah');SELECT * FROM table_abc_m WHERE
MATCH( description ) AGAINST ('blah'); <== ERROR
1030: Got error -1 from table handler

Posted by John Lucas on Friday May 17 2002, @6:24am

[Delete] [Edit]

We desperately need the enhancements to the FULL
TEXT search. Particlarly it would ideal to have
the following facilities and for the full text
results to be scored based these. 1. Word
Proximity 2. stemming of words. In our
collection words in a search phrase that are
close together should be scored higher than
another document, even if another document has
lots of occurances of just 1 or 2 words in the
phrase but they are not close together. The
proximity of words should be specifiable as
either an operator in a particular search or
defined in the environment as a default.

Posted by Vaclav Haisman on Friday May 17 2002, @6:24am

[Delete] [Edit]

There should be explained what each transaction
isolation level mean.

Posted by Lindsay Pallickal on Friday May 17 2002, @6:24am

[Delete] [Edit]

I agree with Mr. Barszczewski in that we need to
be able to select word delimiters. I am working
on a system where users search through a table of
file names and often things like
brackets/underscores are used in place of spaces.
I can't imagine it would be hard implement the
ability to let the user specify additional
delimeters. Please try and include this in the
next release. For now I am trying to work around
the limitation by creating two synced copies of
the table I need to search - one with the
original filenames and a mirrored copy of that
table where any delimeter I specify is replaced
by a space. The mirrored table is used for the
search and the unique id's of records found in
that table are used to locate the actual filename
in the original table.

Posted by Mattias on Friday May 17 2002, @6:24am

[Delete] [Edit]

There is a description of SQL Server Transaction
Isolation Levels which I believe fairly well
reflects MySQL behavious aswell. URL:
http://www.swynk.com/friends/achigrik/TIL.asp

Posted by Joe Colburn on Friday May 17 2002, @6:24am

[Delete] [Edit]

I've tried this with the MATCH... AS SCORE
method, and while it returns results, they are
exact matches. For example, looking for "book"
finds me any text with the word "book" in it, but
not text with "books" or "notebook" as the
example indicates. Perhaps this is a bug?

Posted by Monte Ohrt on Friday May 17 2002, @6:24am

[Delete] [Edit]

The partial word match is a good idea, but only if
proper stemming is done. You want to make sure
that the matching words have the same morpheme
(same basic meaning).

Example:
if you search for the word "runs", it should also
match "run", "running", "runner" since all of
these have the same morpheme of "run". However,
you would NOT want matches such as "rune", "runt",
"rung", "runic" since these words do NOT hold the
same meaning, and would be worthless in your
search.

Another example: a search for "sock" should return
"sock" and "socks", but should NOT return "socket"

Of course, another _great_ addition to the search
engine would be thesaurus matches, so a search for
"doctor" could return "physician", but only if
asked to do so in the query like "@doctor" or
something like that.

Posted by [email protected] on Friday May 17 2002, @6:24am

[Delete] [Edit]

With the MATCH AGAINST syntax, how can one
search for let's say: "file.gif"?
It seems to only accept alphanumeric characters
only... the period (.) is dropped...
Let me know!
[email protected]
Thanks

Posted by Erlend Stromsvik on Friday May 17 2002, @6:24am

[Delete] [Edit]

Joe:
apple* will match ``apple'', ``apples'',
``applesauce'', and ``applet''

[email protected]:
Read section 6.8 to see what characters are
indexed. I guess the '.' is dropped. And
therfore 'gif' is dropped too (since only words
with 3+ chars are indexed)

Posted by Mike Brittain on Friday May 17 2002, @6:24am

[Delete] [Edit]

Search results are ordered by descending
relevance (as noted above). If you include an
ORDER BY declaration in your query, the relevance
scoring is apparently used by MySQL as a
secondary ordering declaration.

"ORDER BY title ASC"

is essentially the same as saying

"ORDER BY title ASC, relevance DESC"

(I wasn't sure if relevance scoring remained
intact if you were to order off another column,
but this does seem to be the case from the
testing that I did.)

Posted by [email protected] on Monday July 22 2002, @12:13pm

[Delete] [Edit]

The parser needs to be a little more complex. I have
a database where users need to search on words
that have characters such as quotes (inches) and
slashes (i.e. 3/4). In a LIKE scenario I can use the
escape sequence character "\", but in full text mode
this does not work. I have a TEXT field that I can
not index with anything other than a FULL TEXT
index. So searching on something like 3/4" does not
work at all as it parses this into "3" and "4" as
words, dropping the other characters.

Posted by Brian Cunningham on Friday May 17 2002, @6:24am

[Delete] [Edit]

In response to Monte Ohrt above: I would like to
be able to choose how a query string is treated.
That is, I would maybe like to see a (optional)
function related to search that would take the
field and the search method as a parameter. So I
could the override the default method with
something like SELECT MATCH (tekst) AGAINST
INDEXTYPE ('run', stemmed) AS x FROM info. Or,
perhaps, add this functionality to MATCH itself.
Similarly I'd like to be able to specify the
smallest word to index in my CREATE statement. I
personally use partial word searches when I don't
know how to spell something. A dumb example: Is
it "socket" or "sockit"? My search for sock
should be able to search using "sock" and return
more than just "sock" and "socks". Then in an app
I might code it so the user can choose to
override the default search method using radio
button selections.

Posted by Ian Griffiths on Wednesday December 18 2002, @5:27pm

[Delete] [Edit]

It'd be a boon to be able to embolden matches
inside the MATCH(). The next best thing is
having another routine to split the search term
once more and loop through, regexp'ing for it and
replacing with \\1. I'm not sure if you'd
want to produce nonstandard output like that but
like I say, it'd be very useful and save
processor time in more code.

Posted by [name withheld] on Friday May 17 2002, @6:24am

[Delete] [Edit]

Note that you can only set ft_min_word_length in
versions 4 and above. For those of us with 3.23.x
you have to modify myisam/ftdefs.h to

#define MIN_WORD_LEN n (n being the minimum to
index)

And then recompile!

Posted by [email protected] on Friday July 26 2002, @5:50am

[Delete] [Edit]

Help!!!
Is there any way to use full-text on such tables
which encode with Two-Bytes Charactersets such as
GBK/GB2312/Big5 ?
I run mysql on Win2k Server,and I have try any
thing ,but when I use MATCH(field) AGAINST
('SOMETHING') while "SOMETHING" is encode with
GBK/gb2312,I always get empty result,How can I ?
HELP !!!

Posted by frederic vandenplas on Saturday June 15 2002, @11:27am

[Delete] [Edit]

This select returns all product_names containing
plas, now i m figuring out how to pass % trough
the queury string
SELECT * FROM Products WHERE Product_name LIKE '%
plas%' ORDER BY product_id ASC

Posted by NTABUYE BUTERA Paul on Monday August 19 2002, @8:48am

[Delete] [Edit]

As an extention of the fulltext capability we
should have a "file path" type of column in MySQL.
Such a column will hold the path a text file on
which one should perform fulltext indexation at
requested indexation time. Thus, we avoid the
extensive amount of time lost while loading data
in a table already containing the FULLTEXT option.

Posted by [email protected] on Tuesday October 29 2002, @6:29am

[Delete] [Edit]

Firstly, thanks to the mySQL team for creating
Fulltext - it's immaculately cool stuff.

I'd like to see a few features:
1. A list of the "stopwords" - the words that match
against a given article. If I could see these
keywords, that would be cool.
2. The ability to search <3 letter words.
3. The ability to search part of a word.

I understand ya'll are probably already working on
these and I'm most likely just bugging you, but I
figured "hell why Not?" so there ya go.

Posted by Alfred Heitink on Wednesday November 13 2002, @3:37am

[Delete] [Edit]

I'm really happy with the fulltext capabilities. I like to
use it as a simple search possibility on websites. At
this moment it's not possible to make a distinction
between 0 hits when you're searching or to many
hits when you're searching. In both cases you will
end up with 0 hits. It would be very nice to have
different exit codes. That would make it much easier
to inform the user how to adjust his queries.
Other things like a an adjustable stopword list is
already on the todo list. Again thanks for all the
effort.

Posted by Martyn Allan on Thursday January 30 2003, @8:08pm

[Delete] [Edit]

I would like to see in boolean mode ( the same as non boolean mode ):

Every correct word in the collection and in the query is weighted according to its significance in the query or collection.

What would also be good is weighting based on the column, so that you may weight hits on a name higher than hits on say the description ( either by a setting in the query or the full-text setup ).

Posted by Frank Mayer on Wednesday February 5 2003, @7:10pm

[Delete] [Edit]

I'd like to know if it is possible (in the future) to weight parts of the MATCH-Condition a little bit more precise. Especialliy in the case, if I have to suppress a large number of irrelevant hits. In these cases it may be more useful to give the exclusion-conditions individual weights than by simply add a "+" or a "-".

Are there any plans or possibilities?

Yours,
Frank Mayer

Add your own comment.

Top / Previous / Next / Up / Table of Contents