User:Zuhui/👀/Experimental Translation
="The sign is dead"=
* If language is reduced to just data, where does meaning actually come from?
* Is the difference between human translation and machine translation purely technical, or is there a deeper, more 'philosophical' aspect to it?
==MT - SMT - NMT==
'''In the early stages of machine translation, rule-based MT did not work'''<br>
Languages are too complex and diverse to be reduced to fixed rules.<br>
↓<br>
'''Algorithms based on habit: SMT'''<br>
SMT analyzes large-scale human translation data to learn patterns and calculates the likelihood of certain phrases or words being translated a specific way. This makes it more flexible and better able to reflect linguistic complexity than rule-based systems (a toy sketch of this probability idea follows below).<br>
↓<br>
'''NMT and word vectors'''<br>
NMT is a significant advance over SMT. It represents words as vectors and uses these vectors to perform translations by aligning and transforming relationships across languages.
* NMT does not simply treat a foreign language as "strange signs" to be decoded: where SMT relied on simple probability calculations over past data, NMT establishes relationships between the two languages through complex operations inside language itself, and develops those relationships through interaction.
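A minimal sketch of the SMT idea above, assuming an invented toy word-aligned corpus (real systems learn alignments from millions of sentence pairs). Translation here is nothing more than relative frequency, i.e. habit:

<syntaxhighlight lang="python">
from collections import Counter, defaultdict

# Invented toy word-aligned pairs (English -> Korean); an assumption for
# illustration only, not real training data.
aligned_pairs = [
    ("bank", "은행"), ("bank", "은행"), ("bank", "둑"),
    ("light", "빛"), ("light", "가벼운"), ("light", "빛"),
]

# Count how often each source word was translated each way.
counts = defaultdict(Counter)
for src, tgt in aligned_pairs:
    counts[src][tgt] += 1

def translation_probability(src: str, tgt: str) -> float:
    """Estimate P(tgt | src) as a relative frequency: habit, not meaning."""
    total = sum(counts[src].values())
    return counts[src][tgt] / total if total else 0.0

def most_likely_translation(src: str) -> str:
    """Pick whichever translation occurred most often in the data."""
    return counts[src].most_common(1)[0][0]

print(translation_probability("bank", "은행"))  # 0.666...
print(most_likely_translation("light"))         # '빛' (2 of 3 occurrences)
</syntaxhighlight>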
{|align=right
|{{youtube|NEreO2zlXDk}}
|}
'''Tokens and Vector Embeddings'''<br>
<br>
A '''token''' is the smallest unit into which text is broken down for processing in tasks like machine translation.<br><br>• Tokens can be words, prefixes/suffixes, or even individual characters.<br>• These tokens are then converted into numerical data that machines can process.<br>• '''[https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html#:~:text=A%20token%20is%20an%20instance,containing%20the%20same%20character%20sequence Tokenization]'''<br><br><br>'''Vector embedding''' is a technique that represents each token as coordinates in a multidimensional space.<br><br>• The machine learns the relationships between words using these coordinates.<br>• Each word is represented as a vector, which captures how it relates to other words.<br><br><br>'''Word window'''<br>A word window analyzes how often a specific token appears near other tokens within a given range of text.<br>• Usually, a word window spans 3–15 words.<br><br><br>'''Multidimensional vectors'''<br>Vectors represent the relationships between words <u>mathematically</u>. Each token is expressed as a vector in a multidimensional space. These vectors capture:<br><br>• The likelihood of a specific word appearing alongside others.<br>• The similarities and differences between words.<br><br>Vectors aren't limited to two- or three-dimensional representations. In tasks like machine translation, <u>vectors typically span hundreds of dimensions</u> (a runnable sketch of tokens, word windows, and vector similarity follows the notes below).
<br>
* In this way, vectors form a linguistic network, which makes it possible to grasp complex meanings and contexts: the vector space takes into account not just each word's relationship to one particular word, but all of the relationships it forms with other words.
<br>
* Vector embedding is thus the key technique that lets machine translation move beyond simple word substitution to understanding a word's context and semantic relationships.
<br>
* NMT is more than a simple translation operation: it opens up a new mode of translation that makes dialogue and interaction between languages possible.
<br>
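The notes above condense into a runnable sketch. Everything here is a toy assumption: whitespace splitting stands in for real subword tokenization, the word window is 2 tokens wide, and the "embeddings" are raw co-occurrence counts rather than learned vectors with hundreds of dimensions. The point is only to show how position in use, not reference, produces similarity:

<syntaxhighlight lang="python">
import math
from collections import Counter, defaultdict

# Toy corpus; whitespace splitting stands in for real subword
# tokenization (BPE, SentencePiece, etc.).
tokens = "the cat sat on the mat the dog sat on the rug".split()

WINDOW = 2  # word window: 2 tokens to each side (real ones span 3-15)

# Build one co-occurrence vector per token: count which other tokens
# fall inside its window. Each Counter is a sparse vector whose
# dimensions are the vocabulary.
vectors = defaultdict(Counter)
for i, tok in enumerate(tokens):
    for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
        if j != i:
            vectors[tok][tokens[j]] += 1

def cosine(a: Counter, b: Counter) -> float:
    """Angle-based similarity between two vectors (1.0 = same direction)."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# "cat" and "dog" never point at the same referent, but they occur in
# near-identical windows, so their vectors nearly coincide:
print(cosine(vectors["cat"], vectors["dog"]))  # ~0.87
print(cosine(vectors["cat"], vectors["on"]))   # ~0.67
</syntaxhighlight>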
<br>
===Spacey emptiness, Gayatri Spivak===
<br>
'''"spacey emptiness"''' as introduced by Gayatri Spivak refers to the '''gaps, voids, or untranslatable spaces between languages''' that cannot be bridged by simple word-for-word translations.<br><br>
<br>
'''Why does this gap exist?'''<br>Languages are products of unique cultural, historical, and social contexts. These contexts shape how meaning is expressed, and they often have no exact parallels in other languages.<br><br>
<br>
'''Why does this gap HAVE to exist?'''<br>Spivak argues that trying to eliminate the gap between languages entirely risks suppressing diversity. Instead, the "spacey emptiness" should be seen as an opportunity for richer, more creative interaction.<br>
<br>
์Šคํ”ผ๋ฐ•์€ ๋ฒˆ์—ญ์„ ๋‹จ์ˆœํ•œ ๋ณ€ํ™˜์ด ์•„๋‹ˆ๋ผ ์–ธ์–ด์™€ ์–ธ์–ด ๊ฐ„์˜ ๋Œ€ํ™”๋กœ ๊ฐ„์ฃผํ•œ๋‹ค. ์ด๋Š” NMT๊ฐ€ ์ด๋Š” ๊ณตํ—ˆํ•œ ๊ณต๊ฐ„์„ ์–ต์ง€๋กœ ์ง€์šฐ๋Š” ๋Œ€์‹ , ๊ฐ ์–ธ์–ด์˜ ๋…ํŠนํ•œ ์˜๋ฏธ ์ฒด๊ณ„์™€ ๊ตฌ์กฐ๋ฅผ ์กด์ค‘ํ•˜๋Š” ๋ฐฉ์‹๊ณผ ๋งž๋‹ฟ์•„ ์žˆ๋‹ค. NMT๋˜ํ•œ ๊ณต๋ฐฑ์†์—์„œ ์ƒํ˜ธ์ž‘์šฉํ•œ๋‹ค๋Š” ์ ์—์„œ, ๊ทธ๋ฆฌ๊ณ  ํ•œ ์–ธ์–ด์˜ ์˜๋ฏธ ์ฒด๊ณ„๋ฅผ ๋‹ค๋ฅธ ์–ธ์–ด๋กœ ๋‹จ์ˆœํžˆ ๋ณต์‚ฌํ•˜์ง€ ์•Š๊ณ , ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ์ƒˆ๋กญ๊ฒŒ ํ˜•์„ฑํ•œ๋‹ค๋Š” ์ ์—์„œ ์Šคํ”ผ๋ฐ•์˜ ๊ณตํ—ˆํ•œ ๊ณต๊ฐ„๊ณผ ์˜๋ฏธ๊ฐ€ ๋น„์Šทํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Œใ….
<br>
<br>
===Allison Parrish===
{|align=right
|{{youtube|L3D0JEA1Jdc}}
|}
<blockquote>
Allison Parrish uses colors to show the same principle, adding vectors for red and blue together to get purple. <br><br><u>'''This blows up any model for language that is thinking of the meaning of language as a relationship of referents to an external (or internal) reality, since meaning is produced by vector space:''' the plotting of tokens on a matrix according to where they fall in language use—and not in relation to what they represent.</u><br><br> '''But language still represents, and organic bodies are still feeling it in space-times other than vector space, and what do you do with that?'''
</blockquote>
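A minimal sketch of the color arithmetic Parrish demonstrates, assuming a toy 3-dimensional RGB space (the coordinates are illustrative assumptions, not her data). Adding the vectors for red and blue and then looking for the nearest named vector lands on purple purely by position in the space:

<syntaxhighlight lang="python">
# Toy 3-dimensional "color space"; the RGB coordinates are illustrative
# assumptions, not Allison Parrish's actual data.
colors = {
    "red":    (255, 0, 0),
    "blue":   (0, 0, 255),
    "green":  (0, 255, 0),
    "purple": (128, 0, 128),
}

def add(u, v):
    """Component-wise vector addition."""
    return tuple(a + b for a, b in zip(u, v))

def nearest(vec):
    """Name of the stored color vector closest to vec (squared Euclidean distance)."""
    return min(colors, key=lambda name: sum((a - b) ** 2
                                            for a, b in zip(colors[name], vec)))

# Average the red and blue vectors, then ask which named color is closest:
mixed = tuple(a / 2 for a in add(colors["red"], colors["blue"]))
print(mixed)           # (127.5, 0.0, 127.5)
print(nearest(mixed))  # 'purple': meaning produced by position, not reference
</syntaxhighlight>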
==Critique of translational norms==
===Global English and machine translation===
="Experimental" as in fallible force=
="Experimental" as in fallible force=
