#title(Splitting Japanese text into words (wakachi-gaki))
#navi(.NETプログラミング研究)
#contents
*Splitting Japanese text into words (wakachi-gaki) [#i0c6d...
Sometimes you want to split text into individual words. In English, ...
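As a minimal sketch of the problem (the sample strings and the class name `SplitDemo` are illustrative assumptions, not part of the original article), splitting on spaces works for English text but hands Japanese text back as one unbroken piece:

```csharp
using System;

public static class SplitDemo
{
    // Splits text on ASCII spaces, dropping empty entries.
    public static string[] SplitOnSpaces(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // English: spaces mark the word boundaries, so this prints 6.
        Console.WriteLine(SplitOnSpaces("It is a fine day today").Length);
        // Japanese: no spaces, so the whole sentence is one piece and this prints 1.
        Console.WriteLine(SplitOnSpaces("今日はいい天気ですね。").Length);
    }
}
```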
Looking into it, writing text with the words separated is called 「分かち...
-[[わかち書き - Wikipedia>http://ja.wikipedia.org/wiki/%E...
-[[形態素解析 - Wikipedia>http://ja.wikipedia.org/wiki/%E...
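Something this difficult is more than an amateur can handle, so an easy...
+Install and use a morphological analysis tool
+Use a web service
Well-known morphological analysis tools include [[KAKASI>http://ka...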
-[[MeCabSharp - Meteor Factory>http://mf3.dotpp.net/?Soft...
-[[MeCab.NET プロジェクト日本語トップページ - SourceForge...
-[[mecab/cabocha用C#ラッパーライブラリ mutterofstar : Vec...
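The web service approach refers to Yahoo! Japan's [[日本語形...
With these approaches, an external application or an external web service...
As for approaches that perform only word segmentation, the following ones...
+Publicly available word segmentation code ([[TinySegmenter>ht...
+Use IWordBreaker
This article covers these two approaches in turn. Note that morphological analysis...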
**Porting TinySegmenter [#sfbafd90]
Performing Japanese word segmentation in JavaScript alone, [[TinySegmenter>h...
TinySegmenter has been ported to many languages, but C# and VB.NET...
-[[TinySegmenter.NET : 分かち書きを行うC#のクラス - DoboW...
-[[TinySegmenter VB.NET : 分かち書きを行うVB.NETのクラス ...
For reference, ports to other languages include the following...
-Perl: [[Text-TinySegmenter>http://search.cpan.org/dist/T...
-C++: [[tinysegmenter-cpp>http://code.google.com/p/tinyse...
-Ruby: [[TinySegmenterをRubyに移植してみた[Ruby]>http://d...
-Ruby: [[TinySegmenterをRubyに移植 - llameradaの日記>http...
-Python: [[TinySegmenterをPythonで書いてみた>http://www.p...
-Python: [[TinySegmenter in Python>http://lilyx.net/pages...
-PHP: [[PHP版TinySegmenter作ってみた>http://www.programmi...
-Java: [[TinySegmenter.java>http://code.google.com/p/cmec...
-VBA: [[Excelで自然言語処理: VBAでTinySegmenterしてみる>h...
-Objective-C: [[TinySegmenterをiPhone(Objective-C)に移...
-RegexKitLite: [[TinySegmenter.mをRegexKitLiteに対応させ...
-xyzzy lisp: [[tiny-segmenter - xyzzy Lisp だけで実装され...
**Using IWordBreaker [#c00e18e4]
Used by the Windows Index Service, [[IWordBreaker>http...
Segmenting text with IWordBreaker is a little tricky, but...
-[[Sql005 フルテキスト検索機能のワードブレーカを検証する...
-[[C#で分かち書き>http://a-tak.com/xoops2/modules/wordpre...
-[[C#からIndex Serviceを使って”分かち書き(わかちがき)”...
-[[WordBreakerで形態素解析>http://d.hatena.ne.jp/veveve/2...
-[[IWordBreaker とファイル検索>http://d.hatena.ne.jp/NyaR...
With this many articles already written, the need to cover it here as well...
First, WordBreaker.cs.
#code(csharp){{
//=======================================================...
// WordBreaker.cs
//=======================================================...
using System;
using System.Runtime.InteropServices;
namespace StemText
{
//===================================================...
// Wordbreaker stuff
//===================================================...
// Sequential codes, not bit flags, so no [Flags] attribute.
public enum WORDREP_BREAK_TYPE
{
WORDREP_BREAK_EOW = 0,
WORDREP_BREAK_EOS = 1,
WORDREP_BREAK_EOP = 2,
WORDREP_BREAK_EOC = 3
}
[ComImport]
[Guid("CC907054-C058-101A-B554-08002B33B0E6")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IWordSink
{
void PutWord([MarshalAs(UnmanagedType.U4)] int cwc,
[MarshalAs(UnmanagedType.LPWStr)] string pw...
[MarshalAs(UnmanagedType.U4)] int cwcSrcLen,
[MarshalAs(UnmanagedType.U4)] int cwcSrcPos);
void PutAltWord([MarshalAs(UnmanagedType.U4)] int...
[MarshalAs(UnmanagedType.LPWStr)] string pw...
[MarshalAs(UnmanagedType.U4)] int cwcSrcLen,
[MarshalAs(UnmanagedType.U4)] int cwcSrcPos);
void StartAltPhrase();
void EndAltPhrase();
void PutBreak(WORDREP_BREAK_TYPE breakType);
}
[ComImport]
[Guid("CC906FF0-C058-101A-B554-08002B33B0E6")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IPhraseSink
{
void PutSmallPhrase([MarshalAs(UnmanagedType.LPWS...
[MarshalAs(UnmanagedType.U4)] int cwcNoun,
[MarshalAs(UnmanagedType.LPWStr)] string pw...
[MarshalAs(UnmanagedType.U4)] int cwcModifi...
[MarshalAs(UnmanagedType.U4)] int ulAttachm...
void PutPhrase([MarshalAs(UnmanagedType.LPWStr)] ...
[MarshalAs(UnmanagedType.U4)] int cwcPhrase);
}
public class CWordSink : IWordSink
{
public void PutWord(int cwc, string pwcInBuf, int...
{
Console.WriteLine("PutWord buffer: " + pwcInB...
}
public void PutAltWord(int cwc, string pwcInBuf, ...
{
Console.WriteLine("PutAltWord buffer: " + pwc...
}
public void StartAltPhrase()
{
Console.WriteLine("StartAltPhrase");
}
public void EndAltPhrase()
{
Console.WriteLine("EndAltPhrase");
}
public void PutBreak(StemText.WORDREP_BREAK_TYPE ...
{
string strBreak;
switch (breakType)
{
case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOC:
strBreak = "EOC";
break;
case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOP:
strBreak = "EOP";
break;
case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOS:
strBreak = "EOS";
break;
case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOW:
strBreak = "EOW";
break;
default:
strBreak = "ERROR";
break;
}
Console.WriteLine("PutBreak : " + strBreak);
}
}
public class CPhraseSink : IPhraseSink
{
public void PutSmallPhrase(string pwcNoun, int cw...
int cwcModifier, int ulAttachmentType)
{
Console.WriteLine("PutSmallPhrase: " + pwcNou...
+ " , " + pwcModifier.Substring(0, cwcMo...
}
public void PutPhrase(string pwcPhrase, int cwcPh...
{
Console.WriteLine("PutPhrase: " + pwcPhrase.S...
}
}
[StructLayout(LayoutKind.Sequential)]
public struct TEXT_SOURCE
{
[MarshalAs(UnmanagedType.FunctionPtr)]
public delFillTextBuffer pfnFillTextBuffer;
[MarshalAs(UnmanagedType.LPWStr)]
public string awcBuffer;
[MarshalAs(UnmanagedType.U4)]
public int iEnd;
[MarshalAs(UnmanagedType.U4)]
public int iCur;
}
// used to fill the buffer for TEXT_SOURCE
public delegate uint delFillTextBuffer(
[MarshalAs(UnmanagedType.Struct)] ref TEXT_SOURCE...
[ComImport]
[Guid("D53552C8-77E3-101A-B552-08002B33B0E6")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IWordBreaker
{
void Init([MarshalAs(UnmanagedType.Bool)] bool fQ...
[MarshalAs(UnmanagedType.U4)] int maxTokenS...
[MarshalAs(UnmanagedType.Bool)] out bool pf...
void BreakText([MarshalAs(UnmanagedType.Struct)] ...
[MarshalAs(UnmanagedType.Interface)] IWordS...
[MarshalAs(UnmanagedType.Interface)] IPhras...
void GetLicenseToUse([MarshalAs(UnmanagedType.LPW...
}
//HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control...
//Check the WBreakerClass value of Japanese_Default under the key above, then Gu...
[ComImport]
//Windows 2000
//[Guid("80A3E9B0-A246-11D3-BB8C-0090272FA362")]
//Windows XP
//[Guid("BE41F4E6-9EAD-498f-A473-F3CA66F9BE8B")]
//Windows Vista, 7
[Guid("E1E8F15E-8BEC-45DF-83BF-50FF84D0CAB5")]
public class CWordBreaker
{
}
}
}}
#code(vbnet){{
'========================================================...
' WordBreaker.vb
'========================================================...
Imports System.Runtime.InteropServices
Namespace StemText
'====================================================...
' Wordbreaker stuff
'====================================================...
' Sequential codes, not bit flags, so no <Flags()> attribute.
Public Enum WORDREP_BREAK_TYPE
WORDREP_BREAK_EOW = 0
WORDREP_BREAK_EOS = 1
WORDREP_BREAK_EOP = 2
WORDREP_BREAK_EOC = 3
End Enum
<ComImport()> _
<Guid("CC907054-C058-101A-B554-08002B33B0E6")> _
<InterfaceType(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IWordSink
Sub PutWord(<MarshalAs(UnmanagedType.U4)> ByVal c...
<MarshalAs(UnmanagedType.LPWStr)> ByV...
<MarshalAs(UnmanagedType.U4)> ByVal c...
<MarshalAs(UnmanagedType.U4)> ByVal c...
Sub PutAltWord(<MarshalAs(UnmanagedType.U4)> ByVa...
<MarshalAs(UnmanagedType.LPWStr)> ...
<MarshalAs(UnmanagedType.U4)> ByVa...
<MarshalAs(UnmanagedType.U4)> ByVa...
Sub StartAltPhrase()
Sub EndAltPhrase()
Sub PutBreak(ByVal breakType As WORDREP_BREAK_TYPE)
End Interface
<ComImport()> _
<Guid("CC906FF0-C058-101A-B554-08002B33B0E6")> _
<InterfaceType(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IPhraseSink
Sub PutSmallPhrase(<MarshalAs(UnmanagedType.LPWSt...
<MarshalAs(UnmanagedType.U4)> ...
<MarshalAs(UnmanagedType.LPWSt...
<MarshalAs(UnmanagedType.U4)> ...
<MarshalAs(UnmanagedType.U4)> ...
Sub PutPhrase(<MarshalAs(UnmanagedType.LPWStr)> B...
<MarshalAs(UnmanagedType.U4)> ByVal...
End Interface
Public Class CWordSink
Implements IWordSink
Public Sub PutWord(ByVal cwc As Integer, _
ByVal pwcInBuf As String, _
ByVal cwcSrcLen As Integer, _
ByVal cwcSrcPos As Integer) _
Implements IWordSink.PutWord
Console.WriteLine("PutWord buffer: " & pwcInB...
End Sub
Public Sub PutAltWord(ByVal cwc As Integer, _
ByVal pwcInBuf As String, _
ByVal cwcSrcLen As Integer, _
ByVal cwcSrcPos As Integer) _
Implements IWordSink.PutAltWord
Console.WriteLine("PutAltWord buffer: " & pwc...
End Sub
Public Sub StartAltPhrase() Implements IWordSink....
Console.WriteLine("StartAltPhrase")
End Sub
Public Sub EndAltPhrase() Implements IWordSink.En...
Console.WriteLine("EndAltPhrase")
End Sub
Public Sub PutBreak(ByVal breakType As StemText.W...
Implements IWordSink.PutBreak
Dim strBreak As String
Select Case breakType
Case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOC
strBreak = "EOC"
Case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOP
strBreak = "EOP"
Case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOS
strBreak = "EOS"
Case WORDREP_BREAK_TYPE.WORDREP_BREAK_EOW
strBreak = "EOW"
Case Else
strBreak = "ERROR"
End Select
Console.WriteLine("PutBreak : " & strBreak)
End Sub
End Class
Public Class CPhraseSink
Implements IPhraseSink
Public Sub PutSmallPhrase(ByVal pwcNoun As String...
ByVal cwcNoun As Intege...
ByVal pwcModifier As St...
ByVal cwcModifier As In...
ByVal ulAttachmentType ...
Implements IPhraseSink.PutS...
Console.WriteLine("PutSmallPhrase: " & pwcNou...
" , " & pwcModifier.Substri...
End Sub
Public Sub PutPhrase(ByVal pwcPhrase As String, _
ByVal cwcPhrase As Integer) _
Implements IPhraseSink.PutPhrase
Console.WriteLine("PutPhrase: " & pwcPhrase.S...
End Sub
End Class
<StructLayout(LayoutKind.Sequential)> _
Public Structure TEXT_SOURCE
<MarshalAs(UnmanagedType.FunctionPtr)> _
Public pfnFillTextBuffer As delFillTextBuffer
<MarshalAs(UnmanagedType.LPWStr)> _
Public awcBuffer As String
<MarshalAs(UnmanagedType.U4)> _
Public iEnd As Integer
<MarshalAs(UnmanagedType.U4)> _
Public iCur As Integer
End Structure
' used to fill the buffer for TEXT_SOURCE
Public Delegate Function delFillTextBuffer( _
<MarshalAs(UnmanagedType.Struct)> ByRef pTextSour...
<ComImport()> _
<Guid("D53552C8-77E3-101A-B552-08002B33B0E6")> _
<InterfaceType(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IWordBreaker
Sub Init(<MarshalAs(UnmanagedType.Bool)> ByVal fQ...
<MarshalAs(UnmanagedType.U4)> ByVal maxT...
<MarshalAs(UnmanagedType.Bool)> ByRef pf...
Sub BreakText(<MarshalAs(UnmanagedType.Struct)> B...
<MarshalAs(UnmanagedType.[Interface...
<MarshalAs(UnmanagedType.[Interface...
Sub GetLicenseToUse(<MarshalAs(UnmanagedType.LPWS...
End Interface
'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\...
'Check the WBreakerClass value of Japanese_Default under the key above, then Gui...
'Windows 2000
'<Guid("80A3E9B0-A246-11D3-BB8C-0090272FA362")>
'Windows XP
'<Guid("BE41F4E6-9EAD-498f-A473-F3CA66F9BE8B")>
'Windows Vista, 7
<ComImport()> _
<Guid("E1E8F15E-8BEC-45DF-83BF-50FF84D0CAB5")> _
Public Class CWordBreaker
End Class
End Namespace
}}
As for the CWordBreaker class at the end, the Guid varies depending on the OS...
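Because the class ID differs between Windows versions, one alternative to hard-coding the `Guid` attribute is to create the COM object from a CLSID string at run time. The following is only a sketch under stated assumptions: `WordBreakerFactory`, `ParseClsid`, and `Create` are hypothetical names, the CLSID string is expected to be whatever value you read from the WBreakerClass registry entry mentioned in the code comments, and `Create` can work only on Windows where that COM class is registered.

```csharp
using System;

public static class WordBreakerFactory
{
    // Parses a CLSID string such as "{E1E8F15E-8BEC-45DF-83BF-50FF84D0CAB5}"
    // (braces optional) into a Guid.
    public static Guid ParseClsid(string clsidText)
    {
        return new Guid(clsidText.Trim());
    }

    // Creates the COM object registered under the given CLSID.
    // Windows only; throws if the class cannot be found.
    public static object Create(string clsidText)
    {
        Type comType = Type.GetTypeFromCLSID(ParseClsid(clsidText), true);
        return Activator.CreateInstance(comType);
    }

    public static void Main()
    {
        // The Vista/7 value quoted in the article; substitute the one from your registry.
        Guid clsid = ParseClsid("{E1E8F15E-8BEC-45DF-83BF-50FF84D0CAB5}");
        Console.WriteLine(clsid.ToString("D"));
    }
}
```

The object returned by `Create` should then be castable to the `IWordBreaker` interface declared above, taking the place of `new CWordBreaker()`.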
Next is the part that performs the segmentation.
#code(csharp){{
//=======================================================...
// Main.cs
//=======================================================...
using System;
using System.Runtime.InteropServices;
namespace StemText
{
//===================================================...
// Main class
//===================================================...
class MainClass
{
static uint pfnFillTextBuffer(ref TEXT_SOURCE pTe...
{
// return WBREAK_E_END_OF_TEXT
return 0x80041780;
}
[STAThread]
static void Main(string[] args)
{
//Text to segment
string tokStr = "今日はいい天気ですね。";
try
{
CWordBreaker wb = new CWordBreaker();
IWordBreaker iwb = (IWordBreaker)wb;
CWordSink cws = new CWordSink();
IWordSink iws = (IWordSink)cws;
CPhraseSink cps = new CPhraseSink();
IPhraseSink ips = (IPhraseSink)cps;
bool pfLicense = true;
iwb.Init(true, 1000, out pfLicense);
//string tokStr = args[0];
TEXT_SOURCE pTextSource = new TEXT_SOURCE...
pTextSource.pfnFillTextBuffer =
new delFillTextBuffer(pfnFillTextBuff...
pTextSource.awcBuffer = tokStr;
pTextSource.iCur = 0;
pTextSource.iEnd = tokStr.Length;
iwb.BreakText(ref pTextSource, iws, ips);
}
catch (System.Exception ex)
{
Console.WriteLine(ex.Message);
}
Console.ReadLine();
}
}
}
}}
#code(vbnet){{
'========================================================...
' Main.vb
'========================================================...
Imports System.Runtime.InteropServices
Namespace StemText
'====================================================...
' Main Module
'====================================================...
Module MainModule
Function pfnFillTextBuffer(ByRef pTextSource As T...
' return WBREAK_E_END_OF_TEXT
Return &H80041780UI
End Function
<STAThread()> _
Sub Main(ByVal args As String())
'Text to segment
Dim tokStr As String = "今日はいい天気ですね。"
Try
Dim wb As New CWordBreaker()
Dim iwb As IWordBreaker = DirectCast(wb, ...
Dim cws As New CWordSink()
Dim iws As IWordSink = DirectCast(cws, IW...
Dim cps As New CPhraseSink()
Dim ips As IPhraseSink = DirectCast(cps, ...
Dim pfLicense As Boolean = True
iwb.Init(True, 1000, pfLicense)
'Dim tokStr As String = args(0)
Dim pTextSource As New TEXT_SOURCE()
pTextSource.pfnFillTextBuffer = _
New delFillTextBuffer(AddressOf pfnFi...
pTextSource.awcBuffer = tokStr
pTextSource.iCur = 0
pTextSource.iEnd = tokStr.Length
iwb.BreakText(pTextSource, iws, ips)
Catch ex As System.Exception
Console.WriteLine(ex.Message)
End Try
Console.ReadLine()
End Sub
End Module
End Namespace
}}
Running this code produces the following output.
#pre{{
PutWord buffer: 今日
PutWord buffer: は
PutWord buffer: いい
PutWord buffer: 天気
PutWord buffer: です
PutWord buffer: ね
PutBreak : EOP
}}
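The sample sink only prints each word. To obtain an actual space-separated string, the words can be collected instead. The sketch below is an illustration under assumptions: `CollectingWordSink` and `ToWakachi` are hypothetical names, `PutWord` mirrors the parameter list declared in `IWordSink` above, and `Main` simulates the callbacks by hand rather than calling the real word breaker. It also assumes, as the `cwc` parameter suggests, that only the first `cwc` characters of `pwcInBuf` belong to the word.

```csharp
using System;
using System.Collections.Generic;

public class CollectingWordSink
{
    private readonly List<string> words = new List<string>();

    // Same parameter list as IWordSink.PutWord in the article.
    public void PutWord(int cwc, string pwcInBuf, int cwcSrcLen, int cwcSrcPos)
    {
        // Keep only the first cwc characters; the buffer may run past the word.
        words.Add(pwcInBuf.Substring(0, Math.Min(cwc, pwcInBuf.Length)));
    }

    // Joins the collected words with single spaces.
    public string ToWakachi()
    {
        return string.Join(" ", words.ToArray());
    }

    public static void Main()
    {
        string text = "今日はいい天気ですね。";
        int[] lengths = { 2, 1, 2, 2, 2, 1 }; // word lengths taken from the output above
        CollectingWordSink sink = new CollectingWordSink();
        int pos = 0;
        foreach (int len in lengths)
        {
            // Simulated callback: the buffer starts at the word and runs to the end.
            sink.PutWord(len, text.Substring(pos), len, pos);
            pos += len;
        }
        Console.WriteLine(sink.ToWakachi()); // 今日 は いい 天気 です ね
    }
}
```

To use it with the real word breaker, the class would additionally need to implement `IWordSink` and its remaining members (`PutAltWord`, `StartAltPhrase`, `EndAltPhrase`, `PutBreak`).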
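**Comparing TinySegmenter and IWordBreaker results [#jd797a68]
A comparison of the results of TinySegmenter and IWordBreaker can be found in "[[COM で分か...
**Coming next [#tf9a6b05]
The next article is planned to cover morphological analysis.
**Comments [#e4ba3995]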
#comment
//Do not edit below this line
#pageinfo([[:Category/.NET]] [[:Category/ASP.NET]],2010-1...